THE EFFECT OF FITTING A UNIDIMENSIONAL IRT MODEL TO MULTIDIMENSIONAL DATA IN CONTENT-BALANCED COMPUTERIZED ADAPTIVE TESTING

By

Tian Song

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

2010

ABSTRACT

THE EFFECT OF FITTING A UNIDIMENSIONAL IRT MODEL TO MULTIDIMENSIONAL DATA IN CONTENT-BALANCED COMPUTERIZED ADAPTIVE TESTING

By Tian Song

This study investigates the effect of fitting a unidimensional IRT model to multidimensional data in content-balanced computerized adaptive testing (CAT). Unconstrained CAT with the maximum information item selection method is chosen as the baseline, and the performances of three content balancing procedures, the constrained CAT (CCAT), the modified multinomial model (MMM), and the modified constrained CAT (MCCAT), are evaluated in terms of measurement precision, item pool utilization, and item exposure control. Three simulation factors are considered: (1) multidimensional structure; (2) ability distribution; and (3) difficulty level of content areas. Simulation results show that, overall, the content balancing methods are similar to or even better than the maximum information method in terms of measurement precision, especially when the content areas have uneven difficulty levels. However, there is no significant difference in item pool usage and item exposure control. Finally, the three content balancing methods overall perform very similarly, but MMM has the most efficient item pool usage.

Dedicated to my parents, and my husband Chen

ACKNOWLEDGEMENTS

There are many people who have helped me tremendously with my dissertation. I would like to express the deepest appreciation to my advisor, Professor Mark Reckase, for his kindness, support, and exceptional guidance. I also thank the other members of my dissertation committee, Professor Tenko Raykov, Professor Sharif Shakrani, and Professor Amelia Wenk Gotwals, for their valuable and constructive comments and suggestions. Without their help, this work would not have been possible. I also thank Dr. Steven Pierce, Dr. Connie Page, Dr. Brian Silver, and the Center for Statistical Training and Consulting at Michigan State University, who financially supported me during my last two years of doctoral study. I have benefited greatly from conversations with them about my consulting work, my research, and my career development. Finally, I would like to thank my parents for making this all possible. They taught me the value of knowledge and hard work. They have encouraged and supported me in pursuing my dreams over the years. I also thank my husband for his love and unconditional support.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1  Introduction
  1.1 Dimensionality
  1.2 Content Balancing
  1.3 Purpose of the Study

CHAPTER 2  Computerized Adaptive Testing, Content Balancing Procedures, and Multidimensional IRT Models
  2.1 Computerized Adaptive Testing
    2.1.1 Overview
    2.1.2 Components of computerized adaptive testing
  2.2 Content Balancing
  2.3 Multidimensional Item Response Theory (MIRT)

CHAPTER 3  Methods
  3.1 Item Pool
  3.2 Simulation Factors
  3.3 Simulation Procedure
  3.4 Evaluation Criteria

CHAPTER 4  Results
  4.1 Simulated item parameters
  4.2 Estimation of unidimensional item parameters
  4.3 Measurement Precision
    4.3.1 Two-dimensional case
    4.3.2 Three-dimensional case
    4.3.3 Conditional bias and MSE
  4.4 Content Balancing
  4.5 Item pool usage
  4.6 Percentages of underexposed and overexposed items

CHAPTER 5  Conclusions and Discussions
  5.1 Conclusions
  5.2 Future research

REFERENCES

LIST OF TABLES

Table 3.1 Three ability distributions used in the simulation study
Table 4.1 The mean simulated item parameters by content areas in the two-dimensional case
Table 4.2 The mean simulated item parameters by content areas in the three-dimensional case
Table 4.3 The overall chi-square indices of fit from BILOG-MG calibrations
Table 4.4 Summary statistics for the estimated unidimensional item parameters in the two-dimensional case
Table 4.5 Summary statistics for the estimated unidimensional item parameters in the three-dimensional case
Table 4.6 Measurement precision for the maximum information method and the three content balancing methods in the two-dimensional case
Table 4.7 Measurement precision for the maximum information method and the three content balancing methods in the three-dimensional case
Table 4.8 Violation rate of the content balancing requirement
Table 4.9 The mean number of items selected from each content area for the maximum information method
Table 4.10 Item pool usage for the four methods
Table 4.11 Percentages of underexposed and overexposed items for the four methods in the two-dimensional case
Table 4.12 Percentages of underexposed and overexposed items for the four methods in the three-dimensional case
Table 5.1 Comparison between the maximum information method and the three content balancing methods
Table 5.2 Comparison between the three content balancing methods

LIST OF FIGURES

Figure 2.1 A flowchart describing computerized adaptive testing
Figure 2.2 Representation of the characteristics of 45 items in a two-dimensional space: item arrows and reference composite
Figure 3.1 Dimensional structures
Figure 4.1 Item vector plots in the two-dimensional case
Figure 4.2 Item vector plots in the three-dimensional case
Figure 4.3 Conditional biases for the four methods, Difficulty=(0, 0, 0), two-dimensional case
Figure 4.4 Conditional biases for the four methods, Difficulty=(-0.6, 0.6, 0), two-dimensional case
Figure 4.5 Conditional MSEs for the four methods, Difficulty=(0, 0, 0), two-dimensional case
Figure 4.6 Conditional MSEs for the four methods, Difficulty=(-0.6, 0.6, 0), two-dimensional case
Figure 4.7 Conditional biases for the four methods, Difficulty=(0, 0, 0), three-dimensional case
Figure 4.8 Conditional biases for the four methods, Difficulty=(-0.6, 0.6, 0), three-dimensional case
Figure 4.9 Conditional MSEs for the four methods, Difficulty=(0, 0, 0), three-dimensional case
Figure 4.10 Conditional MSEs for the four methods, Difficulty=(-0.6, 0.6, 0), three-dimensional case

Chapter 1
Introduction

Over the last few decades, interest in computerized adaptive testing (CAT) has grown considerably. As an alternative to a conventional paper-and-pencil test, CAT uses a computer to present test items and score responses. In CAT, each examinee is presented with an individually tailored test. Generally, an adaptive test begins with an item of medium difficulty. If the examinee answers it correctly, he gets a more challenging item; otherwise he gets an easier item. After each response, the examinee's ability is estimated, and the next item that is most appropriate for the examinee is selected based on the current ability estimate. This process continues until there is enough information to place the person on the ability scale with a specified accuracy, or until a fixed number of items have been administered (Green, Bock, Humphreys, Linn & Reckase, 1984).

A major advantage of CAT is that it provides more efficient and precise ability, or latent trait (θ), estimates (Weiss, 1982). In current CAT applications, however, there are also a number of challenging issues, such as dimensionality (Green et al., 1984; Liu, 2007; Weiss & Suhadolnik, 1982), content balancing (Kingsbury & Zara, 1991), and item overexposure (Chang & Ying, 1999; Wainer, 2000). This study attempts to address the first two issues.

1.1 Dimensionality

Item response theory (IRT) is a family of mathematical models in which the interactions of a person with test items can be adequately represented by a probabilistic expression. It plays a central role in almost every aspect of CAT, such as item pool calibration, item selection, and proficiency estimation. Most IRT models assume that examinees' responses to the items on a test can be accounted for by a single latent trait (Lord, 1980).
However, this assumption may rarely hold, since most sets of items are not strictly unidimensional and require multiple abilities to obtain a correct response (Reckase, 1985). For example, a mathematical story problem requires reading skills to translate the word problem into equations as well as mathematical knowledge to solve the equations.

Due to their popularity and simplicity, most computerized adaptive testing programs use unidimensional IRT models. However, when the unidimensionality assumption is violated, the application of CAT could be seriously affected. If an item pool is composed of items that require a complex of abilities to answer correctly, examinees may be administered different sets of items that measure completely different combinations of skills. Weiss and Suhadolnik (1982) examined the robustness of adaptive testing to the violation of the unidimensionality assumption. The authors used a factor analysis model to generate multidimensional data and then performed unidimensional adaptive tests with the maximum information item selection strategy. The results showed that as multidimensionality increased, the estimated ability parameters deviated more from their true (first-factor) θ values.

1.2 Content Balancing

Content balancing is a practical consideration in CAT. Unlike traditional paper-and-pencil tests, which are built on a test blueprint or content specifications, adaptive tests do not follow content specifications during item selection. Therefore, examinees may be administered different distributions of items across content areas. For example, in a math test, one examinee might receive a test consisting entirely of arithmetic items, and another might receive a test consisting entirely of geometry items. This lack of content comparability could pose a threat to the validity of scores, and may not be acceptable to test takers and test score users. In addition, in licensure and certification testing, if the items administered do not cover all the content areas the test plan requires, it may bring legal challenges to the test (Kingsbury and Zara, 1991).

Since Green et al. (1984) first noted the need to content balance adaptive tests, a number of procedures have been proposed to control the content specifications. Wainer and Kiely (1987) suggested using testlets that are content balanced beforehand instead of individual items; Kingsbury and Zara (1989) proposed a constrained CAT (CCAT), which selects the most informative item from the content area farthest below its ideal administration percentage; Chen and Ankenmann (2004) developed a modified multinomial model (MMM) to satisfy the practical constraint of content balancing; and Leung, Chang and Hau (2000) proposed a modified constrained CAT (MCCAT) based on Kingsbury and Zara's method.

1.3 Purpose of the Study

The purpose of this study is to investigate, with control of content specifications, the effect of fitting a unidimensional IRT model to multidimensional data in CAT. Unconstrained CAT with the maximum information item selection method is considered as the baseline, and the performances of three content balancing procedures, CCAT, MMM, and MCCAT, are evaluated. Specifically, the research questions are:

(1) What is the effect of fitting a unidimensional IRT model to multidimensional data in a content-balanced CAT? Does the estimation of ability become more or less accurate when content balancing procedures are applied?
(2) Which content balancing procedure performs the best in terms of ability recovery, item pool usage, and item exposure control?

The present study contributes to the literature in three important ways. First, the robustness of CAT to the violation of the unidimensionality assumption and content balancing are jointly considered. A previous study by Ackerman (1991) investigated the effects of fitting a unidimensional IRT model to two-dimensional data in an unconstrained CAT. The results suggested that the estimated unidimensional discrimination values increased as an item's θ1, θ2 composite became similar to the composite of the unidimensional calibrated θ scale. These items thus had a greater chance of being selected and administered in an adaptive test using the maximum-information item selection strategy. The study also found that if a CAT item pool consisted of items from several content areas measuring dissimilar θ1, θ2 composites, examinees at different ability levels might receive different proportions of items from the content areas. Since a balance across content areas is a requirement in practical CAT programs, it is interesting to see how examinees' proficiency would be recovered under this practical constraint. Moreover, the previous literature has found that, after imposing the content constraint in a unidimensional CAT, measurement precision is comparable to the unconstrained maximum information method, with mean squared errors (MSE) of θ slightly higher (Leung, Chang & Hau, 2000; Cheng, Chang & Yi, 2007). When the assumption of unidimensionality is violated, however, applying a content balancing procedure might improve measurement efficiency by ensuring adequate representation of each dimension. Therefore, it is interesting to investigate the joint effects of those two issues.

Second, the three content balancing methods, CCAT, MMM, and MCCAT, are compared in a different context from previous studies. Most existing studies (e.g., Leung, Chang and Hau, 2003a; Leung, Chang and Hau, 2003b) focus on unidimensional data. The results generally showed that the three methods had similar effects on measurement efficiency and item pool utilization. The present study extends the comparison to multidimensional data, where items are assumed to require multiple abilities to answer correctly.

Third, most of the studies on multidimensionality focus on the simple two-dimensional case. In our study, we start with the two-dimensional case, and then turn to the more complicated three-dimensional case. In this way, the results might be generalized to higher dimensional spaces.

Chapter 2
Computerized Adaptive Testing, Content Balancing Procedures, and Multidimensional IRT Models

This chapter introduces the background knowledge and concepts involved in the current project. The mechanism of computerized adaptive testing is described in detail in section 2.1. Three commonly used content balancing procedures are discussed in section 2.2. Multidimensional IRT models and item characteristics are described in section 2.3, where special attention is also given to the orientation of the unidimensional θ-scale in a multidimensional space.

2.1 Computerized Adaptive Testing

2.1.1 Overview

Computerized adaptive testing (CAT) is a method for administering tests that match the examinee's ability level.
Several large-scale testing programs now use CAT as an alternative to paper-and-pencil tests, for example, the Graduate Record Examinations (GRE; Eignor, Stocking, Way & Steffen, 1993), the Test of English as a Foreign Language (TOEFL; Educational Testing Service, 2007), and the Armed Services Vocational Aptitude Battery (ASVAB; U.S. Department of Defense, 1982).

The idea of adapting the difficulty of a test to each individual examinee first appeared in Alfred Binet's (1905) intelligence test in the context of one-on-one administration. Since the 1970s, with the development of item response theory and the breakthrough in computer technology, the idea has been refined and developed into the current CAT procedures for large-scale testing. In CAT, items are selected adaptively on the basis of the examinee's responses to the items previously administered. Figure 2.1 shows the structure of a CAT procedure in a flowchart. It begins with the first item based on an initial estimate of proficiency. After each item response, a new proficiency is estimated and the next optimal item is selected. This process is repeated until certain stopping rules are met, for instance, the precision of proficiency is adequate or a fixed number of items have been administered.

[Figure 2.1 A flowchart describing computerized adaptive testing: (1) begin with an initial ability estimate; (2) select the optimal item; (3) evaluate the response; (4) estimate the new ability; (5) if the stopping rule is met, end; otherwise return to step 2.]

Compared to a paper-and-pencil test, a CAT offers many advantages. The biggest advantage is that it gives more precise estimates of examinees' ability levels with fewer items (Wainer, 1993). This is because the most informative items are selected and administered, and items outside of an examinee's ability range are excluded during the CAT procedure. In addition, each examinee is presented a test with an appropriate range of difficulty, neither too easy nor too difficult, which reduces the measurement errors induced by confusion, frustration, or boredom. CAT also provides flexible testing schedules and immediate feedback for the examinee after the test.

2.1.2 Components of computerized adaptive testing

A basic CAT application consists of four primary components: item pool, item selection procedure, scoring procedure, and test termination rules (Reckase, 1989).

Item pool

An item pool (Footnote 1) is a collection of items from which the adaptive test is selected. Items in an item pool are written based on test specifications, and calibrated and linked to a common measurement scale using IRT. To give every examinee precise and efficient measurement, there must be high-quality items with a wide span of difficulty levels. The pool also needs a sufficient number of items in each content area. For the appropriate size of item pools, six to twelve times the test length is suggested (Luecht, 1998; Patsula & Steffen, 1997; Stocking, 1998). In practice, more items are needed because of item exposure, item retirement, and other issues.

Footnote 1: There are two types of item pools in practice: a master pool and an operational pool. A master pool consists of a collection of items at various stages of development. An operational pool is a pool of items from which individual tests are assembled (van der Linden, 2005a). This study focuses on the operational item pool.

Item selection procedure

The two most widely used item selection procedures are the maximum information method (Weiss, 1982) and the maximum expected precision method (Owen, 1975).
The maximum information strategy selects the item that provides the maximum amount of item information, I_i(θ̂_j), at the examinee's current ability estimate θ̂_j:

I_i(\hat{\theta}_j) = \frac{[P_i'(\hat{\theta}_j)]^2}{P_i(\hat{\theta}_j)[1 - P_i(\hat{\theta}_j)]}            (2.1)

where θ̂_j is the ability estimate for examinee j after n preceding responses, P_i(θ̂_j) is the probability of a correct response to item i given the current ability estimate θ̂_j, and P_i'(θ̂_j) is the first derivative of P_i(θ̂_j) with respect to θ evaluated at θ̂_j. For the unidimensional three-parameter logistic IRT model, the equation becomes

I_i(\hat{\theta}_j) = \frac{D^2 a_i^2 (1 - c_i)}{\left(c_i + e^{D a_i(\hat{\theta}_j - b_i)}\right)\left(1 + e^{-D a_i(\hat{\theta}_j - b_i)}\right)^2}            (2.2)

where a_i is the item discrimination parameter, b_i is the difficulty parameter, c_i is the pseudo-guessing parameter, and D is the scaling constant (typically 1.7). From Equation (2.2), we can see that the item information increases as a_i increases, as b_i approaches θ, and as c_i approaches 0 (Hambleton, Swaminathan, & Rogers, 1991). Therefore, in CAT, items with large discrimination values and difficulty parameters close to the current estimate of θ are usually desirable: they yield more information and have a higher probability of being selected when the maximum information method is used.

Owen's maximum expected precision method uses a Bayesian approach. In this procedure, the item that minimizes the expected posterior variance of the θ estimate is selected. Owen developed the mathematical formulas for the posterior mean and variance of θ (see Owen (1975) for the detailed formulas). Compared to the maximum information method, which is based on iterative numerical methods, the computational burden is smaller for Owen's procedure. However, the θ estimate from Owen's procedure depends on the order in which the items are administered. That is, if two examinees are presented the same items and give the same answers, but in different orders, their θ estimates are different. Because of this disadvantage, Owen's procedure is now much less widely used (Wainer, 2000).

In operational CAT programs, these item selection procedures usually need to be modified for practical considerations, such as item exposure control and content balancing.
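As a concrete illustration of Equations (2.1)-(2.2) and the maximum information rule, the following minimal Python sketch computes the 3PL item information for every remaining item at the current ability estimate and returns the most informative one. It is an editor's illustration, not code from this study; the function names, the NumPy array layout, and the toy parameter values are assumptions.

import numpy as np

def item_information_3pl(theta, a, b, c, D=1.7):
    # Eq. (2.2): Fisher information of 3PL items at ability theta
    z = D * a * (theta - b)
    return (D ** 2 * a ** 2 * (1.0 - c)) / ((c + np.exp(z)) * (1.0 + np.exp(-z)) ** 2)

def select_max_info_item(theta_hat, a, b, c, administered):
    # return the index of the most informative item not yet administered
    info = item_information_3pl(theta_hat, a, b, c)
    info[list(administered)] = -np.inf   # exclude items already given
    return int(np.argmax(info))

# toy pool of 5 items
a = np.array([0.8, 1.2, 1.0, 1.5, 0.6])
b = np.array([-1.0, 0.0, 0.5, 1.0, -0.5])
c = np.full(5, 0.2)
print(select_max_info_item(0.3, a, b, c, administered=[1]))

In an operational CAT the same computation would simply be repeated after each updated ability estimate, with exposure-control or content-balancing rules layered on top, as discussed below.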
Scoring procedure

In CAT, after each response, the examinee's proficiency is estimated. Based on this estimate, the next item most appropriate for the examinee is selected. Two commonly used estimation procedures are the maximum likelihood method and the Bayesian method (Bejar & Weiss, 1979).

Maximum likelihood estimation (MLE)

Maximum likelihood estimation finds the estimate that results in the highest likelihood for the observed string of item responses. Given a response string and a set of items with known parameters, the likelihood function is

L(U_j|\theta_j) = \prod_{i=1}^{n} P(u_{ij}|\theta_j)            (2.3)

where L(U_j|θ_j) is the likelihood of response string U_j for a person j located at θ_j; u_ij is the item response on item i by person j (1 for a correct response and 0 for an incorrect response); and P(u_ij|θ_j) is the probability of response u_ij to item i by a person j located at θ_j. The maximum likelihood estimate of an examinee's ability, θ̂_j, is the value that maximizes this likelihood function. In practice, we set the derivative of the log-likelihood function (with respect to θ_j) to zero and then solve the equation. The maximum likelihood method has desirable properties such as asymptotic unbiasedness. However, problems can arise at the early stage of CAT, since it cannot provide finite estimates for responses to single items or for response patterns that are all correct or all incorrect. To solve this problem, we can either constrain θ to a reasonable range (e.g., -4 to 4) or use an alternative estimation method, the Bayesian estimation procedure.

Bayesian ability estimation

In Bayesian estimation, we use information about the population ability distribution. The initially assumed distribution is called the prior distribution. In CAT, we usually assume that the population ability is normally distributed with a mean of 0 and a standard deviation of 1. Given the prior distribution, after the examinee answers the first item, the posterior distribution of θ is given by Bayes' theorem,

h(\theta|U_j) = \frac{L(U_j|\theta)f(\theta)}{\int_{\theta} L(U_j|\theta)f(\theta)\, d\theta}            (2.4)

where f(θ) is the prior probability density function for θ, U_j is the item response string for person j, L(U_j|θ) is the probability of the item response string given θ (the likelihood function), and h(θ|U_j) is the posterior probability density of θ given the item response string. This posterior distribution then becomes the new prior distribution for the next item. As the test proceeds, this process continues in a sequential fashion. There are three common Bayesian-based approaches: expected a posteriori (EAP), maximum a posteriori (MAP), and Owen's method.

EAP. The EAP method uses the mean of the posterior distribution, h(θ|U_j), as the ability estimate:

\hat{\theta} = E(\theta|U_j) = \int_{-\infty}^{\infty} \theta\, h(\theta|U_j)\, d\theta            (2.5)

Instead of computing the integral directly, we can approximate it using Gauss-Hermite quadrature points (Stroud & Sechrest, 1966),

\hat{\theta} = E(\theta|U_j) \approx \frac{\sum_{k=1}^{q} X_k L_i(U_j|X_k) W(X_k)}{\sum_{k=1}^{q} L_i(U_j|X_k) W(X_k)}            (2.6)

where X_k is one of q quadrature points, W(X_k) is the weight associated with that point, and L_i(U_j|X_k) is the likelihood function after i items evaluated at X_k.

MAP. The MAP method proposed by Samejima (1969) uses the mode of the posterior distribution as the ability estimate, that is, the point that maximizes the posterior probability density. It can be obtained by setting the derivative of the posterior probability density, h(θ|U_j), to zero and solving the equation.

Owen's method. Owen (1975) used a normal approximation to the true posterior distribution, which allows the mean and variance of the posterior distribution to be derived in closed form (see Owen (1975) for the detailed formulas). The mean of the posterior distribution is then used as the examinee's ability estimate.

Among these three Bayesian estimation methods, EAP provides the most stable estimates, although the estimates are biased except at the population mean (Bock and Mislevy, 1982). Intuitively, it is better than Owen's method because it evaluates the posterior distribution directly instead of using a normal approximation. This is also confirmed by Wang and Vispoel (1998), in which Owen's method yielded the worst performance. In addition, Lord (1986) and Warm (1989) suggested that MAP estimates could be seriously biased in CAT. Therefore, the EAP estimate is adopted at the early stage of the test in this study, when the maximum likelihood method does not yield a finite estimate.
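The following sketch shows one way Equations (2.5)-(2.6) can be approximated numerically. It is only an illustration under stated assumptions: it uses an evenly spaced grid with an (unnormalized) N(0,1) prior rather than true Gauss-Hermite points, 3PL item response functions, and function names chosen by the editor.

import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    # 3PL probability of a correct response
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def eap_estimate(responses, a, b, c, n_points=61):
    # Eqs. (2.5)-(2.6) on an evenly spaced grid; the normalizing constant of
    # the N(0,1) prior cancels in the ratio, so an unnormalized density is used
    grid = np.linspace(-4.0, 4.0, n_points)      # quadrature points X_k
    weights = np.exp(-0.5 * grid ** 2)           # prior weights W(X_k)
    likelihood = np.ones_like(grid)
    for u, ai, bi, ci in zip(responses, a, b, c):
        p = p_3pl(grid, ai, bi, ci)
        likelihood *= p if u == 1 else 1.0 - p
    posterior = likelihood * weights
    return float(np.sum(grid * posterior) / np.sum(posterior))

# toy example: three items answered correct, incorrect, correct
print(eap_estimate([1, 0, 1], a=[1.0, 1.2, 0.8], b=[-0.5, 0.2, 0.8], c=[0.2, 0.2, 0.2]))

Because the posterior is evaluated directly on the grid, the same routine returns a finite estimate even for all-correct or all-incorrect response patterns, which is why EAP is used at the start of the adaptive test in this study.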
Segall, Moreno, & Hetter (1997) pointed out that in a variable-length test, examinees with extreme proficiency levels tended to have long tests. It may cause fatigue and raise the chance of careless errors, and each additional item provides little information about the examinee’s ability. On the other hand, a fixed-length test is easy to implement and constrains the test time to a reasonable range. 2.2 Content Balancing Whether to balance the content of items administrated to examinees is one of the first issues that must be addressed in developing a CAT application. By the nature of CAT the examinees receive different items in the same test, and each should get the same number of items from each content area for fairness. l4 Green et al. (1984) first commented on the need of content balancing in adaptive tests. They noticed that in Bock and Mislevy (198l)’s study, on a test of general science, males performed better than females on natural science items, while females performed better than males on health and nutrition items. In an adaptive test, if a male examinee is administrated all health and nutrition items, he might be disadvantaged and the validity of the score would be threatened. Therefore, content balancing could reduce the impact of subgroup differences. Kingsbury and Zara (1989) also pointed out that administrating a test that covered all the content areas in a test blueprint gave an adequate assessment of the examinee’s ability, and reduced the legal challenges to the test (e.g., licensure tests and admission tests). Previous research on content balancing has developed a number of methods. Kingsbury and Zara (1989) proposed a constrained CAT (CCAT) procedure. In this procedure, the selection of the next optimal item is restricted to the content area that is the farthest below its target percentage. Detailed steps are described as follows: 1. Calculate the target percentages of content areas for the test based on the test blueprint; 2. Estimate the examinee’s provisional proficiency after he answers an item; 3. Calculate the percentage of items already administrated in each content area for this examinee; 4. Compare the empirical percentages to the target percentages, and select the content area with the largest discrepancy; 5. Within this selected content area, select and administrate the item with the maximum information at the provisional ability estimate. 15 In this way, the adaptive test would have any desired content distribution. However, Chen and Ankenmann (2004) argued that this method could lead to high predictability of the content area. Instead, they developed a modified multinomial model to meet the content requirement: 1. Form a cumulative multinomial distribution based on the target percentages of content areas; 2. Generate a random number from the uniform distribution U(0,1) and use it to find the corresponding content area in the cumulative distribution; 3. Within this selected content area, select and administrate the item with the maximum information; 4. This process continues until a content area has reached its target percentage. A new multinomial distribution is formed by adjusting the unfilled percentages of the remaining content areas. Leung, Chang and Han (2000) also proposed a modified version of CCAT procedure to eliminate the undesirable order effect. The procedure is similar to CCAT, except that items can be selected from all the content areas for which target percentages are not reached. 
Leung, Chang and Hau (2003b) compared these three content balancing methods in a CAT using the maximum information item selection strategy. With simulated unidimensional data they demonstrated that using content balancing methods caused some loss in measurement efficiency. In addition, the three methods had similar effects on measurement efficiency and item pool utilization, but the MMM method had the fewest overexposed items. The present study also compares the three methods, but in a different context, where items are assumed to require multiple abilities to answer correctly.

2.3 Multidimensional Item Response Theory (MIRT)

As described in section 2.1, item response theory plays a central role in computerized adaptive testing, from item pool calibration and item selection to ability estimation. Most computerized adaptive testing programs use unidimensional IRT models, which assume that examinees' responses to test items can be accounted for by a single latent trait. However, the cognitive and psychological processes of responding to test items are very complex, and many researchers believe that multiple skills influence performance on a test (Ip, 2010; Reckase, 1985; Reckase, Ackerman and Carlson, 1988; Traub, 1983; Walker and Beretvas, 2003). Multidimensional item response theory is a collection of mathematical models that describe the interaction between persons and test items when more than one ability is required to account for test performance.

There are two major types of multidimensional IRT models: compensatory and noncompensatory. The compensatory model is based on a linear combination of ability dimensions, and a high ability on one dimension can compensate for a low ability on another dimension. For example, the compensatory form of the multidimensional three-parameter logistic model is given by (McKinley and Reckase, 1983)

P(u_{ij} = 1|\boldsymbol{\theta}_j, \mathbf{a}_i, d_i, c_i) = c_i + (1 - c_i)\,\frac{e^{D(\mathbf{a}_i'\boldsymbol{\theta}_j + d_i)}}{1 + e^{D(\mathbf{a}_i'\boldsymbol{\theta}_j + d_i)}}            (2.7)

where P(u_ij = 1|θ_j, a_i, d_i, c_i) is the probability of a correct response to item i by person j; u_ij is the response on item i by person j (1 is correct and 0 is incorrect); θ_j is a column vector (m by 1) of person j's abilities in an m-dimensional space; a_i is a column vector (m by 1) of discrimination parameters for item i; d_i is a scalar related to item difficulty; c_i is the guessing parameter, or lower asymptote, for item i; and D is the scaling constant (typically 1.7). In the multidimensional version, θ_j and a_i are vectors instead of scalars. From Equation (2.7), the exponent of e is a linear function of the θs plus the intercept term d_i: a_i'θ_j + d_i. The additivity of the θs implies the compensatory nature of the model.
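A minimal sketch of Equation (2.7), vectorized over items, is given below. It is an editor's illustration, assuming the discriminations are stored as the rows of a NumPy matrix A and using hypothetical names and toy parameter values.

import numpy as np

def p_m3pl(theta, A, d, c, D=1.7):
    # Compensatory multidimensional 3PL, Eq. (2.7)
    # theta : (m,) ability vector for one person
    # A     : (n_items, m) discrimination matrix (a_i' in the rows)
    # d     : (n_items,) intercepts; c : (n_items,) lower asymptotes
    z = D * (A @ theta + d)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

# toy example: two items in a two-dimensional space
A = np.array([[1.1, 0.1],
              [0.6, 0.6]])
d = np.array([0.2, -0.3])
c = np.array([0.2, 0.2])
print(p_m3pl(np.array([0.5, -0.5]), A, d, c))

Because the abilities enter only through the linear combination a_i'θ_j + d_i, raising one coordinate of θ can offset lowering another, which is exactly the compensatory behavior described above.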
Sympson (1978) argued that the compensatory model is not realistic for certain types of items. For example, for a mathematics item that requires both arithmetic computation skills and reading skills, if an examinee's reading skills are very low, he might not understand the problem and hence cannot solve it even if he has high arithmetic computation skills. For this situation, Sympson developed a noncompensatory model:

P(u_{ij} = 1|\boldsymbol{\theta}_j, \mathbf{a}_i, \mathbf{d}_i, c_i) = c_i + (1 - c_i)\prod_{k=1}^{m} \frac{e^{D a_{ik}(\theta_{jk} - d_{ik})}}{1 + e^{D a_{ik}(\theta_{jk} - d_{ik})}}            (2.8)

where m is the number of dimensions, a_ik and d_ik are the discrimination and difficulty parameters, respectively, for item i and dimension k, and the other parameters are defined as before. In this model, the test item is decomposed into individual components, and the probability of a correct answer is the product of the probabilities of doing each component correctly. Because of the multiplicative nature of the model, the probability of a correct response to an item cannot exceed the lowest probability in the product. The probability does increase as ability increases on one dimension, but only up to a limit set by the lowest probability in the product. Therefore, this model is also called the partially compensatory model.

Some researchers believe that the partially compensatory model is more theoretically sound, but less realistic. For example, Ansley (1984) pointed out that data generated with this model did not resemble real test data. Bolt and Lall (2003) also compared the fit of compensatory and partially compensatory models to a common data set from a test of English usage, and found that the compensatory model fit the data better than the partially compensatory model. In addition, estimation difficulty for the partially compensatory model hinders its development and application. As a result, the compensatory model is more prevalent in the current literature, and we do not deal with the partially compensatory model further in this study.

In a compensatory MIRT model, the parameters sometimes lack intuitive meaning, so Reckase (1985) and Reckase and McKinley (1991) developed two statistics to interpret the characteristics of items for compensatory models: multidimensional discrimination (MDISC) and multidimensional difficulty (MDIFF). They are defined as

MDISC_i = \sqrt{\mathbf{a}_i'\mathbf{a}_i} = \left(\sum_{k=1}^{m} a_{ik}^2\right)^{1/2}            (2.9)

MDIFF_i = \frac{-d_i}{MDISC_i}            (2.10)

where the parameters are defined as before. These two statistics are analogous to the discrimination and difficulty parameters from unidimensional IRT models. MDISC_i is the slope of the item response surface at the steepest point, and indicates the discriminating power of the item. MDIFF_i is the distance from the origin to the point of the steepest slope. It represents the multidimensional difficulty of the item: high values indicate difficult items and low values indicate easy items. In addition, the direction of the steepest slope from the origin of the space is given by

\cos\alpha_{ik} = \frac{a_{ik}}{\left(\sum_{k=1}^{m} a_{ik}^2\right)^{1/2}}            (2.11)

where α_ik is the angle between the kth coordinate axis and the line from the origin to the point that has the greatest slope overall. The cosines above are often called direction cosines.

Using the concepts of multidimensional discrimination, multidimensional difficulty, and direction cosines, items can be displayed graphically in the ability space. Each item is represented by an arrow. The base of the arrow is at the point of maximal slope, and the length of the arrow indicates the discrimination of the item, MDISC_i. The distance from the origin to the base of the arrow represents the difficulty of the item, MDIFF_i, and the direction of the arrow, α_i, is derived from the direction cosines of the item. Figure 2.2 shows an item vector plot of 45 items in a two-dimensional space.

[Figure 2.2 Representation of the characteristics of 45 items in a two-dimensional space: item arrows and reference composite.]
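Equations (2.9)-(2.11) are simple functions of an item's a-vector and intercept, as the short sketch below shows. It is an editor's illustration with assumed names and toy values, not code from the study.

import numpy as np

def item_vector_stats(A, d):
    # MDISC, MDIFF, and direction angles for each item (Eqs. 2.9-2.11)
    # A : (n_items, m) discrimination matrix; d : (n_items,) intercepts
    mdisc = np.sqrt((A ** 2).sum(axis=1))        # Eq. (2.9)
    mdiff = -d / mdisc                           # Eq. (2.10)
    cosines = A / mdisc[:, None]                 # Eq. (2.11)
    angles = np.degrees(np.arccos(cosines))      # angles with the coordinate axes
    return mdisc, mdiff, angles

A = np.array([[1.1, 0.1],
              [0.1, 1.1],
              [0.8, 0.8]])
d = np.array([0.2, -0.3, 0.0])
for name, value in zip(("MDISC", "MDIFF", "angles"), item_vector_stats(A, d)):
    print(name, np.round(value, 2))

These are the quantities plotted in the item vector plots: MDIFF fixes the base of each arrow, MDISC its length, and the direction cosines its orientation.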
Orientation of the unidimensional θ-scale in the multidimensional space

Wang (1995, 1996) showed that if a unidimensional model is fitted to a multidimensional test, the orientation of the unidimensional θ-scale is related to the matrix of discrimination parameters from the compensatory MIRT model. Specifically, this unidimensional line is defined as the eigenvector of the a'a matrix associated with the largest eigenvalue, and is called the reference composite of the test. In Figure 2.2, the reference composite of the 45 items is represented by the bold dashed arrow. The projection of a θ-point in the multidimensional space onto the reference composite gives an estimate of the unidimensional θ that would result if the response data from the test items were analyzed using a unidimensional IRT model. Formally, it is given by

\theta^* = RC'\boldsymbol{\theta}            (2.12)

where θ* is the projected unidimensional θ, RC is the reference composite vector, and θ is the multidimensional ability vector.
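The following sketch computes the reference composite as the principal eigenvector of a'a and projects multidimensional abilities onto it, as in Equation (2.12). It is an illustrative sketch only: the sign convention, function names, and toy values are the editor's assumptions, and the rescaling of the projected θs described later in Chapter 3 is not included.

import numpy as np

def reference_composite(A):
    # unit eigenvector of A'A associated with the largest eigenvalue
    evals, evecs = np.linalg.eigh(A.T @ A)
    rc = evecs[:, np.argmax(evals)]
    return rc if rc.sum() >= 0 else -rc   # flip sign so the loadings are positive

def project_theta(theta, rc):
    # Eq. (2.12): projection of multidimensional abilities onto the composite
    return theta @ rc

A = np.array([[1.1, 0.1],
              [0.1, 1.1],
              [0.8, 0.8]])
rc = reference_composite(A)
theta = np.array([[1.0, -0.5],
                  [0.2,  0.4]])
print(rc)
print(project_theta(theta, rc))

In this study, it is these projected (and standardized) values that serve as the "true" unidimensional abilities against which the CAT estimates are later compared.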
The logs of item discrimination parameters (MDISCi) are randomly drawn from a normal distribution with a mean of 0 and a standard deviation of 24 ............................ \\\ I .. '\ ' \ l I ' I ‘w‘ ‘ I -~‘ I -—' . ‘ I ‘- ‘ I ." | “ b' I u" I ‘ 'T ‘s‘ I I " ‘. ’_a' s‘ I "-T \ -" ‘\ I ’1’-\‘ " “‘ ‘ \\\ I —"H . W -">“. \ ‘\‘I ~ .. x on e .—~-- h n ‘ -' \ T. ,v' ‘. ’_' \‘ 5 ,— ‘ _- \ ‘. ‘4 , “ '0 N )(’ ‘s o“ ‘ §‘ §‘ ‘n‘ .‘~ ,x. . ‘ Three content areas in a two-dimensional space Content 1 r ............................. '. _._,.__._,__,_J.____,,_,_,.___H ? I I I I I 5 .-"'- J 1‘. _.-’ I _.L I T .‘T- I I- I ."I I .' r ‘ ~..-’ . . _,-—’ I ' -Content3- - : r’ I I ,—r'1 I ' I ---' I I I _--' I I ' .' 9 -L . . . ' l s . l I 1 . I I I I I l I . .I I I I" .L ..2 Figure 3.1 Dimensional Structures 25 0.5, N (0, 0.52). Difficulty parameters (MDIFFi) are drawn from N (O, 0.752) (Fang, 2008). For simplicity, all items have the same low asymptote value (c-parameters) of 0.2. Given MDISCi, MDIF F1, and the angles, the parameters a,- and d,- for the compensatory MIRT model are calculated by aik = MDISCi * cosai,C (3.1) d,- = —MDIFF,- .. M0156,- (3.2) 3.2 Simulation factors 1. Dimensional Structure Two dimensional structures are considered in this study: a) three content areas in a two-dimensional space; b) three content areas in a three-dimensional space. The two-dimensional case is the simplest situation of multidimensionality to start with, and then the more complicated three-dimensional case is examined. By studying these two representative structures, results might be generalized to higher dimensions. Ability Distribution Examinees’ abilities are simulated from multivariate normal distributions with zero mean vector and three different variance—covariance matrices. Table 3.1 shows mean vector (u), variance-covariance matrices (z), and correlation coefficients between abilities (p). Three levels of ability correlations are used: 0, 0.4, and 0.8. A correlation of zero implies that there is no correlation between 26 multiple abilities; 0.4 indicates a moderate correlation; and 0.8 represents a high correlation. In the three-dimensional case, pairwise correlations are slightly varied to produce a more realistic relationship between multidimensional abilities. Table 3.1 Three ability distributions used in the simulation study Two dimensions Three dimensions D b 0 1 0 0 istri ution l _ 0 __ 1 0 p g[:121= = [0 1 0] p=0 ”’IoI'z‘Io 1 o o o 1 Distribution 2 — [0] — 1 0'4 04' 0'3 ”‘ 0'2’I04 1I 0.35 p = 0'4 0.3 o. 35 1 D- -b - 3 __ o _ 1 0.8 0.7 mm ”no“ I‘ " I0] ,2: '" I0 3 1 I 0 75 p = 0'8 o. 7 0.75 1 3. Difficulty levels for content areas Two cases are examined: a) The average difficulty levels are same for all three content areas: the mean of difficulties (MDIFFs) are assumed to be zero; b) Content 1 has less difficult items, and Content 2 has more difficult items: the MDIFFs are decreased by 0.6 for items in Content 1, increased by 0.6 for Content 2, and kept unchanged for Content 3. We choose 0.6 to represent a moderate to high change of item difficulty (Swaminathen and Rogers, 1990). 4. Content balancing methods 27 Three content balancing methods, CCAT, MMM and MCCAT, are compared. Unconstrained CAT with the maximum information item selection strategy is used as the baseline. In total, the four simulation factors yield 2*3*2*4=48 conditions. 3.3 Simulation procedure The simulation procedure involves the following steps: 1. 
3.2 Simulation Factors

1. Dimensional structure. Two dimensional structures are considered in this study: (a) three content areas in a two-dimensional space; and (b) three content areas in a three-dimensional space. The two-dimensional case is the simplest situation of multidimensionality to start with, and then the more complicated three-dimensional case is examined. By studying these two representative structures, results might be generalized to higher dimensions.

2. Ability distribution. Examinees' abilities are simulated from multivariate normal distributions with a zero mean vector and three different variance-covariance matrices. Table 3.1 shows the mean vectors (μ), variance-covariance matrices (Σ), and correlation coefficients between abilities (ρ). Three levels of ability correlation are used: 0, 0.4, and 0.8. A correlation of zero implies that there is no correlation between the multiple abilities; 0.4 indicates a moderate correlation; and 0.8 represents a high correlation. In the three-dimensional case, pairwise correlations are slightly varied to produce a more realistic relationship between multidimensional abilities.

Table 3.1 Three ability distributions used in the simulation study

                  Two dimensions                         Three dimensions
Distribution 1    μ = (0, 0)', Σ = I, ρ = 0              μ = (0, 0, 0)', Σ = I, ρ = 0
Distribution 2    μ = (0, 0)',                           μ = (0, 0, 0)',
                  Σ = [1 0.4; 0.4 1], ρ = 0.4            Σ = [1 0.4 0.3; 0.4 1 0.35; 0.3 0.35 1], ρ ≈ 0.4
Distribution 3    μ = (0, 0)',                           μ = (0, 0, 0)',
                  Σ = [1 0.8; 0.8 1], ρ = 0.8            Σ = [1 0.8 0.7; 0.8 1 0.75; 0.7 0.75 1], ρ ≈ 0.8

3. Difficulty levels for content areas. Two cases are examined: (a) the average difficulty levels are the same for all three content areas, with the means of the difficulties (MDIFFs) assumed to be zero; and (b) Content 1 has less difficult items and Content 2 has more difficult items, with the MDIFFs decreased by 0.6 for items in Content 1, increased by 0.6 for Content 2, and kept unchanged for Content 3. We choose 0.6 to represent a moderate to high change in item difficulty (Swaminathan and Rogers, 1990).

4. Content balancing methods. Three content balancing methods, CCAT, MMM and MCCAT, are compared. Unconstrained CAT with the maximum information item selection strategy is used as the baseline.

In total, the four simulation factors yield 2 × 3 × 2 × 4 = 48 conditions.

3.3 Simulation Procedure

The simulation procedure involves the following steps:

1. We simulate an item pool of 400 items for each combination of dimensional structure and difficulty level. (More precisely, we simulate the items for each dimensional structure with even content difficulty levels, and then alter the difficulty values for the items in Contents 1 and 2 to obtain the items for the condition with uneven content difficulty levels. The purpose is to reduce random noise when comparing the two difficulty-level conditions.) We also simulate the ability parameters for 2000 examinees using the distributions described above.

2. Given the item and ability parameters, we generate item responses to all items in the given item pool for the 2000 examinees. Using the compensatory three-parameter multidimensional IRT model (Eq. 2.7), the probability of a correct answer for a given item and a given examinee (p) is calculated. The 0/1 response is obtained by comparing p to a random number (x) from a uniform distribution U(0,1): if p > x, a correct response is recorded; otherwise, an incorrect response is recorded. In this way, a 2000 by 400 item response matrix is generated.

3. Based on all the items in the given item pool, we calculate the reference composite from the a'a matrix and the projections of the θ-points in the multidimensional space onto the reference composite. To be consistent with the scaling of the estimates from the unidimensional IRT program (BILOG-MG), the projected θs are scaled to have a mean of 0 and a standard deviation of 1. This gives a theoretical estimate of the unidimensional θ that would result if the response data from the test items were analyzed using a unidimensional IRT model. The resulting θs are treated as the true θs when the recovery of θ is evaluated.

4. The response data generated in Step 2 are calibrated using BILOG-MG (Zimowski et al., 2003) to estimate unidimensional item parameters. The unidimensional three-parameter logistic model with scaling constant D = 1.7, P_i(θ_j) = c_i + (1 − c_i) exp(1.7a_i(θ_j − b_i)) / (1 + exp(1.7a_i(θ_j − b_i))), where a_i, b_i, and c_i are the discrimination, difficulty, and guessing parameters for item i and θ_j is the ability parameter for person j, is applied. In BILOG-MG, the convergence criterion is set to 0.005, and the number of quadrature points for the EM algorithm is set at the default value.

5. Unidimensional CAT is conducted. In this study, the test consists of 30 items. The first item is randomly selected from the 100 items with medium difficulty. The corresponding response is read from the item response matrix generated in Step 2. Based on this response, the provisional ability is estimated. The expected a posteriori (EAP) method is adopted at the beginning of the test, assuming N(0,1) as the prior distribution. The maximum likelihood estimation (MLE) method is used once at least five items have been administered and the response pattern contains both 0 and 1 (Cheng, Chang, and Yi, 2007). After each item is administered, the next most appropriate item for the examinee is selected using the maximum information method. The process continues until a 30-item test has been administered. During this procedure, the three-parameter logistic model is assumed, and the unidimensional parameters from Step 4 are used for both ability estimation and item selection.

6. Repeat Step 5 with the content balancing methods applied during the item selection procedure.
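The response-generation rule in Step 2 amounts to one vectorized comparison of model probabilities with uniform draws. The sketch below illustrates this under the compensatory M3PL of Equation (2.7); the function name, seed, and toy dimensions are the editor's assumptions, not the study's code.

import numpy as np

rng = np.random.default_rng(42)

def generate_responses(Theta, A, d, c, D=1.7):
    # 0/1 response matrix under the compensatory M3PL (Eq. 2.7)
    # Theta : (N, m) abilities; A : (J, m) discriminations; d, c : length J
    z = D * (Theta @ A.T + d)               # (N, J) matrix of logits
    p = c + (1.0 - c) / (1.0 + np.exp(-z))  # probabilities of a correct answer
    return (p > rng.uniform(size=p.shape)).astype(int)

# toy example: 4 examinees, 3 items, 2 dimensions
Theta = rng.normal(size=(4, 2))
A = np.array([[1.1, 0.1], [0.1, 1.1], [0.8, 0.8]])
d = np.array([0.2, -0.3, 0.0])
c = np.full(3, 0.2)
print(generate_responses(Theta, A, d, c))

In the actual design, Theta would be a 2000 × m matrix and A, d, c would describe the 400-item pool, producing the 2000 × 400 response matrix passed on to BILOG-MG in Step 4.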
In addition, to examine the conditional measurement precision, simulations are also conducted at a limited number of points in the ability space, with 50 replications at each ability point. Five equally spaced values of θ from -2 to +2 (θ = -2, -1, 0, 1, 2) are used on each dimension. Hence, 25 fixed points (5 × 5) are evaluated in the two-dimensional case, and 125 points (5 × 5 × 5) in the three-dimensional case.

3.4 Evaluation Criteria

1. Measurement precision. The recovery of ability is assessed by the overall bias, the mean square error (MSE) of θ, and the correlation between θ and its estimate (ρ_θθ̂). Overall bias and MSE are calculated by

\mathrm{Bias} = \frac{1}{N}\sum_{j=1}^{N} (\hat{\theta}_j - \theta_j)            (3.3)

and

\mathrm{MSE} = \frac{1}{N}\sum_{j=1}^{N} (\hat{\theta}_j - \theta_j)^2            (3.4)

where θ̂_j is the estimated ability of the jth examinee from the unidimensional CAT, θ_j is the true ability of the jth examinee, and N is the number of examinees. Bias and MSE both provide a good indication of the quality of the recovery of examinees' abilities: the smaller the absolute bias and the MSE, the better the abilities are measured. In addition to these overall statistics, conditional measurement precision is calculated at the fixed ability points.

2. Content balancing. The number of items administered from each content area is recorded. If the target percentage of a content area is over- or under-fulfilled, the test fails to satisfy the content balancing constraint. The percentage of tests violating the content constraint is reported. By design, the CAT with content balancing is expected to have a zero percentage of violations.

3. Item pool usage. The three content balancing methods and the maximum information method are also compared in terms of item pool usage. To achieve maximum item pool usage, a uniform exposure rate distribution is desirable. Chang and Ying (1999) proposed a scaled chi-square statistic to evaluate the skewness of the exposure rate distribution:

\chi^2 = \sum_{i=1}^{M} \frac{(ER_i - L/M)^2}{L/M}            (3.5)

where M is the size of the item pool, L is the test length, and ER_i is the observed exposure rate for the ith item. This chi-square statistic quantifies the efficiency of item pool usage and indicates the discrepancy between the observed and ideal item exposure rates. A low chi-square value is preferred, as it implies more efficient item pool usage.

4. Percentages of underexposed and overexposed items. The exposure rate of an item is defined as the ratio of the number of times the item is administered to the number of examinees. A low exposure rate indicates that the item is rarely used. If there is a large proportion of items with low exposure rates in the item pool, the cost-effectiveness of developing the items might not be achieved. On the other hand, if an item is over-selected, it might become known to prospective examinees and test security is threatened. Therefore, an item with either a very low or a very high exposure rate is not desirable in CAT programs. In this study, following the literature (Cheng, Chang & Yi, 2007), an item is considered underexposed if its exposure rate is less than 0.02, and overexposed if its exposure rate is larger than 0.2. To evaluate the effectiveness of each content balancing method, the percentages of underexposed and overexposed items are reported.
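The evaluation criteria above reduce to a few summary statistics over the simulated tests. The sketch below illustrates the bias and MSE of Equations (3.3)-(3.4), the exposure chi-square of Equation (3.5), and the underexposure/overexposure percentages; function names and example values are the editor's assumptions.

import numpy as np

def bias_mse(theta_hat, theta_true):
    # overall bias and MSE of the ability estimates (Eqs. 3.3-3.4)
    err = np.asarray(theta_hat) - np.asarray(theta_true)
    return err.mean(), (err ** 2).mean()

def exposure_chi_square(exposure_rates, test_length):
    # scaled chi-square index of exposure-rate skewness (Eq. 3.5)
    er = np.asarray(exposure_rates)
    ideal = test_length / er.size          # L / M
    return float(np.sum((er - ideal) ** 2 / ideal))

def exposure_extremes(exposure_rates, low=0.02, high=0.2):
    # percentages of underexposed (< low) and overexposed (> high) items
    er = np.asarray(exposure_rates)
    return 100.0 * (er < low).mean(), 100.0 * (er > high).mean()

rates = np.random.default_rng(1).uniform(0, 0.3, size=400)   # fake exposure rates
print(bias_mse([0.1, -0.2, 0.4], [0.0, -0.1, 0.5]))
print(exposure_chi_square(rates, test_length=30))
print(exposure_extremes(rates))

In the study these statistics are computed per condition and per item selection method, which is what Tables 4.6 through 4.12 report.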
Chapter 4
Results

The simulation results are discussed in this chapter. Section 4.1 summarizes the descriptive statistics of the simulated item parameters. Sections 4.2 and 4.3 discuss the estimation of the item parameters and the person parameters. Sections 4.4 to 4.6 evaluate the maximum information method and the three content balancing methods in terms of the percentage of tests violating the content-balancing requirements, the item pool usage, and the percentages of underexposed and overexposed items.

4.1 Simulated item parameters

The descriptive statistics of the simulated multidimensional item parameters for the two-dimensional case are given in Table 4.1. Along with the standard item parameters (discriminations a1 and a2, difficulty d), the generalized discrimination and difficulty indices (MDISC and MDIFF) and the angles with the coordinate axes (α1 and α2) are also shown. Generally, items are sensitive to differences on a single dimension if they have high discrimination parameters for that dimension and small angles with the corresponding coordinate axis. In Table 4.1, for Content 1, the mean discrimination value for the first dimension (a1) is 1.07, and the mean angle with the first dimension (α1) is 7.8 degrees. It is clear that items in Content 1 are mostly sensitive to the first dimension. Similarly, items in Content 2 are mostly sensitive to the second dimension. With roughly equal a parameters and angles with the θ1 and θ2 axes, items in Content 3 measure a combination of the two dimensions. These three distinct sets of items are also shown in Figure 4.1.

Table 4.1 also shows the change of difficulty for the three content areas across conditions. In Panel A, the average MDIFFs for the three content areas are -0.09, -0.01 and -0.02, respectively. In Panel B, the difficulty is decreased by 0.6 for Content 1 and increased by 0.6 for Content 2, resulting in an average MDIFF of -0.69 for Content 1 and 0.59 for Content 2. The change of difficulty is also illustrated in Figure 4.1. In the vector plot of items, the distance from the origin to the base of an arrow indicates the difficulty, MDIFF. From Figure 4.1a to 4.1b, the distance changes by -0.6 and 0.6 for Content 1 and 2, respectively.

Similarly, the descriptive statistics of the simulated item parameters and the item vector plots for the three-dimensional case are given in Table 4.2 and Figure 4.2. They demonstrate a clear dimensional structure for the three content areas: items in Content 1 measure predominantly along θ1, items in Content 2 measure predominantly along θ2, and items in Content 3 measure an equally weighted combination of θ1, θ2, and θ3.

Table 4.1 The mean simulated item parameters by content areas in the two-dimensional case

Panel A: Equal difficulty levels across the three content areas: Difficulty = (0, 0, 0)
Content    a1     a2     d      MDISC   MDIFF   α1      α2
1          1.07   0.14   0.15   1.08    -0.09    7.80   82.20
2          0.15   1.08   0.00   1.09    -0.01   81.97    8.03
3          0.78   0.78   0.03   1.10    -0.02   44.97   45.03

Panel B: Unequal difficulty levels across the three content areas: Difficulty = (-0.6, 0.6, 0)
Content    a1     a2     d      MDISC   MDIFF   α1      α2
1          1.07   0.14   0.80   1.08    -0.69    7.80   82.20
2          0.15   1.08  -0.65   1.09     0.59   81.97    8.03
3          0.78   0.78   0.03   1.10    -0.02   44.97   45.03

Table 4.2 The mean simulated item parameters by content areas in the three-dimensional case

Panel A: Equal difficulty levels across the three content areas: Difficulty = (0, 0, 0)
Content    a1     a2     a3     d      MDISC   MDIFF   α1      α2      α3
1          1.07   0.08   0.11   0.15   1.08    -0.09    7.80   85.76   84.01
2          0.06   1.08   0.11   0.00   1.09    -0.01   86.85    6.97   84.32
3          0.63   0.64   0.62   0.03   1.10    -0.02   54.70   53.99   56.07

Panel B: Unequal difficulty levels across the three content areas: Difficulty = (-0.6, 0.6, 0)
Content    a1     a2     a3     d      MDISC   MDIFF   α1      α2      α3
1          1.07   0.08   0.11   0.80   1.08    -0.69    7.80   85.76   84.01
2          0.06   1.08   0.11  -0.65   1.09     0.59   86.85    6.97   84.32
3          0.63   0.64   0.62   0.03   1.10    -0.02   54.70   53.99   56.07
Figure 4.1 Item vector plots in the two-dimensional case: (a) Difficulty = (0, 0, 0); (b) Difficulty = (-0.6, 0.6, 0)

Figure 4.2 Item vector plots in the three-dimensional case

4.2 Estimation of unidimensional item parameters

Following the simulation procedure described in Chapter 3, the item response matrices are generated using the compensatory three-parameter multidimensional IRT model (Eq. 2.7). In order to obtain the unidimensional item parameters for the CAT procedure, the data are calibrated using the unidimensional three-parameter logistic model (see footnote 3). In BILOG-MG, all the calibration runs converge for both the EM steps and the Newton steps and reach the convergence criterion of 0.005, which indicates that the item and person parameter estimates reach a good accuracy.

Table 4.3 reports the overall chi-square indices of fit from BILOG-MG when the unidimensional model is fitted to the data. Under all simulation conditions, the test shows a good fit of the unidimensional model, with p-values close or equal to 1. It is surprising that the unidimensional model fits well when the dimensionality is not one and the correlation between abilities is zero. This might be due to the multidimensional structure used in the simulation study, for two possible reasons. First, by design, items in Content 3 measure an equally weighted combination of all constructs, and hence are very close to the estimated unidimensional θ scale. Second, although items in Content 1 and 2 each measure a single construct, they are not very far from the estimated unidimensional θ scale. For example, in the two-dimensional case, the angle between a Content 1 item vector and the first dimension varies from 0° to 15°; since the estimated unidimensional θ scale lies near the 45° reference composite, Content 1 items differ from that scale by about 37.5° on average.

Table 4.4 presents the summary statistics of the estimated unidimensional item parameters across content areas for the two-dimensional case. As expected, Content 3 has a larger estimated discrimination value (a) than Content 1 and 2. This is because the orientation of the items in this content area is more aligned with the orientation of the reference composite. This finding is also consistent with the results from Ackerman (1991).

Table 4.3 The overall chi-square indices of fit from BILOG-MG calibrations

                              Two-dimensional case              Three-dimensional case
                              Chi-square   DF     P-value       Chi-square   DF     P-value
Difficulty = (0, 0, 0)
  ρ = 0                       2545.8       3502   1             2726.1       3518   1
  ρ = 0.4                     2486.3       3476   1             2649.3       3480   1
  ρ = 0.8                     2551.4       3447   1             2376.7       3395   1
Difficulty = (-0.6, 0.6, 0)
  ρ = 0                       3066.8       3491   1             3453.0       3545   0.8631
  ρ = 0.4                     2586.5       3491   1             2817.5       3463   1
  ρ = 0.8                     2467.9       3398   1             2394.1       3345   1

The estimation of the item difficulty is overall satisfactory. When all three content areas have an average MDIFF of 0, the mean of the unidimensional difficulty values (b) is generally close to 0.
When MDIFF changes by -0.6 and 0.6 for Content 1 and 2, the mean of the b values also changes in the same direction. For example, with a medium ability correlation of 0.4, the mean unidimensional difficulties are 0.025, 0.070, and 0.023 for the three content areas in Panel A, while they change to -0.637, 0.773, and 0.083 in Panel B. Table 4.4 also shows that as the correlation between true abilities increases, the overall recovery of difficulty improves. For example, in Panel A, with an even difficulty level and an ability correlation of 0.8, the mean unidimensional difficulties are -0.042, 0.016 and 0.006, respectively, which are the closest to the MDIFF values. Finally, most of the guessing parameters are close to 0.2 with small standard deviations.

Similarly, the summary statistics for the estimated unidimensional item parameters in the three-dimensional case are provided in Table 4.5. The observations are largely consistent with those in the two-dimensional case. For example, Content 3 still has the largest mean discrimination value under all conditions.

Table 4.4 Summary statistics for the estimated unidimensional item parameters in the two-dimensional case

                                     a                b                c
Ability correlation   Content   mean     std      mean     std      mean    std
Panel A: Difficulty = (0, 0, 0)
0                     1         0.739    0.227    -0.031   0.826    0.240   0.050
0                     2         0.738    0.245    0.142    0.868    0.252   0.053
0                     3         1.131    0.544    0.005    0.689    0.220   0.037
0.4                   1         0.940    0.299    0.025    0.736    0.240   0.046
0.4                   2         0.886    0.331    0.070    0.758    0.225   0.043
0.4                   3         1.326    0.651    0.023    0.584    0.214   0.034
0.8                   1         1.107    0.426    -0.042   0.701    0.222   0.039
0.8                   2         1.098    0.456    0.016    0.676    0.218   0.037
0.8                   3         1.488    0.744    0.006    0.534    0.216   0.033
Panel B: Difficulty = (-0.6, 0.6, 0)
0                     1         0.754    0.276    -0.789   0.810    0.201   0.045
0                     2         1.222    0.425    0.928    0.537    0.318   0.087
0                     3         1.216    0.554    0.114    0.658    0.253   0.053
0.4                   1         0.896    0.345    -0.637   0.764    0.211   0.041
0.4                   2         1.094    0.418    0.773    0.658    0.264   0.059
0.4                   3         1.331    0.624    0.083    0.582    0.229   0.044
0.8                   1         1.095    0.440    -0.599   0.716    0.225   0.043
0.8                   2         1.159    0.519    0.607    0.656    0.221   0.033
0.8                   3         1.483    0.743    0.004    0.531    0.211   0.034

Table 4.5 Summary statistics for the estimated unidimensional item parameters in the three-dimensional case

                                     a                b                c
Ability correlation   Content   mean     std      mean     std      mean    std
Panel A: Difficulty = (0, 0, 0)
0                     1         0.622    0.170    -0.053   0.944    0.232   0.049
0                     2         0.712    0.203    0.257    0.902    0.287   0.053
0                     3         1.079    0.488    0.030    0.680    0.228   0.047
0.4                   1         0.842    0.258    -0.013   0.793    0.228   0.046
0.4                   2         0.855    0.299    0.066    0.775    0.229   0.043
0.4                   3         1.370    0.639    0.023    0.539    0.219   0.035
0.8                   1         1.088    0.411    -0.057   0.660    0.210   0.040
0.8                   2         1.154    0.460    0.058    0.665    0.229   0.040
0.8                   3         1.696    0.790    0.003    0.456    0.211   0.031
Panel B: Difficulty = (-0.6, 0.6, 0)
0                     1         0.658    0.226    -0.864   0.898    0.197   0.049
0                     2         1.025    0.321    1.100    0.615    0.336   0.091
0                     3         1.134    0.503    0.115    0.652    0.253   0.057
0.4                   1         0.804    0.287    -0.747   0.773    0.197   0.048
0.4                   2         1.142    0.389    0.827    0.614    0.278   0.068
0.4                   3         1.444    0.649    0.071    0.512    0.234   0.044
0.8                   1         1.106    0.442    -0.614   0.657    0.211   0.044
0.8                   2         1.200    0.496    0.628    0.603    0.235   0.042
0.8                   3         1.717    0.783    0.018    0.445    0.215   0.029

4.3 Measurement precision

4.3.1 Two-dimensional case

Table 4.6 presents the estimated bias, mean squared error (MSE) and the correlation between θ and its estimate (ρ_θθ̂) for the four methods in the two-dimensional case. When the three content areas have the same difficulty level, the maximum information method tends to result in lower biases and MSEs than the three content balancing methods, except for a high ability correlation of 0.8. However, the three content balancing methods also yield good measurement precision.
The biases and MSEs are close to those from the maximum information method, and the correlations between θ and its estimate are around 0.96. On the other hand, when the three content areas have uneven difficulty levels, the maximum information method tends to perform worse. It yields higher MSEs and lower correlations ρ_θθ̂ than the three content balancing methods. For example, when there is no correlation between true abilities, the maximum information procedure produces an MSE of 0.173 and a correlation of 0.937. In contrast, the three content balancing methods give MSEs around 0.13 and correlations around 0.95. However, the differences in MSE between the maximum information method and the content balancing methods become smaller as the ability correlation increases. With a correlation of 0.8, the four methods perform comparably. Intuitively, a high ability correlation reduces multidimensionality. When the ability correlation approaches 1, the multidimensionality diminishes and the data reduce to a simple unidimensional case. Previous research (Cheng, Chang & Yi, 2007) has suggested that in the unidimensional context, content balancing methods yield measurement precision close to that of the maximum information method. Therefore, our finding is consistent with previous research on the unidimensional case.

The difficulty levels of the content areas seem to affect the measurement precision of the four methods, particularly the maximum information method. For example, when the ability correlation is 0, the MSE for the maximum information method increases from 0.131 to 0.173 as the difficulty levels become uneven. By design, Content 3 is closer to the orientation of the calibrated unidimensional θ-scale, or the reference composite. Therefore, its items have larger unidimensional discriminations and hence have a greater probability of being selected when the maximum information method is used. However, when the difficulty levels of the content areas become uneven, for examinees with very low or high ability the highly discriminating items from Content 3 may no longer be the optimal items to select, while items from Content 1 or 2 whose difficulty is closer to those ability levels become more informative. As a result, the maximum information method selects a different mixture of items across content areas, which may deviate more from the reference composite and hence affect the measurement precision.

Table 4.6 Measurement precision for the maximum information method and the three content balancing methods in the two-dimensional case

Ability correlation   Method            Bias     MSE     ρ_θθ̂
Panel A: Difficulty = (0, 0, 0)
0                     Max Information   0.024    0.131   0.968
0                     CCAT              0.026    0.146   0.964
0                     MMM               0.031    0.149   0.961
0                     MCCAT             0.033    0.145   0.962
0.4                   Max Information   0.052    0.138   0.967
0.4                   CCAT              0.047    0.144   0.968
0.4                   MMM               0.050    0.152   0.965
0.4                   MCCAT             0.049    0.141   0.967
0.8                   Max Information   -0.001   0.189   0.954
0.8                   CCAT              0.017    0.196   0.965
0.8                   MMM               0.013    0.197   0.965
0.8                   MCCAT             0.015    0.178   0.966
Panel B: Difficulty = (-0.6, 0.6, 0)
0                     Max Information   0.038    0.173   0.937
0                     CCAT              0.039    0.132   0.954
0                     MMM               0.040    0.133   0.952
0                     MCCAT             0.040    0.135   0.951
0.4                   Max Information   0.067    0.153   0.951
0.4                   CCAT              0.063    0.138   0.961
0.4                   MMM               0.061    0.143   0.959
0.4                   MCCAT             0.064    0.136   0.961
0.8                   Max Information   0.029    0.124   0.961
0.8                   CCAT              0.038    0.126   0.971
0.8                   MMM               0.035    0.124   0.970
0.8                   MCCAT             0.039    0.122   0.971

Finally, the three content balancing methods perform comparably. They yield similar biases, MSEs and correlations between θ and its estimate.
However, in terms of MSE, MCCAT tends to have slightly lower values than MMM and CCAT, especially when the three content areas have the same difficulty level.

4.3.2 Three-dimensional case

Table 4.7 summarizes the overall measurement precision for the three-dimensional case. Unlike the two-dimensional case, the maximum information method does not perform the best when the three content areas have the same difficulty level. The three content balancing methods yield biases, MSEs and correlations between θ and its estimate that are overall similar to those of the maximum information method. Among the three content balancing methods, MCCAT results in the smallest MSEs.

When the three content areas have uneven difficulty levels, the performances of the four methods follow patterns similar to the two-dimensional case. Clearly, as evidenced by larger correlations between θ and its estimate and smaller MSEs, the three content balancing methods give better recovery of the person parameter than the maximum information method. In addition, as the correlation between true abilities increases, the bias becomes smaller and the correlation between θ and its estimate becomes larger. For example, for CCAT, the absolute mean bias decreases from 0.052 to 0.031, and the correlation between θ and its estimate increases from 0.944 to 0.966, when the ability correlation increases from 0 to 0.8. This is not surprising, since with a high correlation between true abilities the effect of multidimensionality becomes smaller and the estimation procedure becomes more accurate. Meanwhile, the difference in MSE between the maximum information method and the three content balancing methods decreases as the correlation between true abilities increases.

Table 4.7 Measurement precision for the maximum information method and the three content balancing methods in the three-dimensional case

Ability correlation   Method            Bias    MSE     ρ_θθ̂
Panel A: Difficulty = (0, 0, 0)
0                     Max Information   0.037   0.181   0.950
0                     CCAT              0.039   0.161   0.955
0                     MMM               0.031   0.178   0.949
0                     MCCAT             0.039   0.151   0.955
0.4                   Max Information   0.028   0.172   0.962
0.4                   CCAT              0.038   0.174   0.963
0.4                   MMM               0.029   0.172   0.962
0.4                   MCCAT             0.038   0.170   0.962
0.8                   Max Information   0.016   0.163   0.964
0.8                   CCAT              0.034   0.184   0.967
0.8                   MMM               0.029   0.197   0.963
0.8                   MCCAT             0.035   0.169   0.968
Panel B: Difficulty = (-0.6, 0.6, 0)
0                     Max Information   0.052   0.238   0.915
0                     CCAT              0.059   0.162   0.944
0                     MMM               0.050   0.182   0.936
0                     MCCAT             0.051   0.193   0.935
0.4                   Max Information   0.038   0.167   0.944
0.4                   CCAT              0.042   0.135   0.959
0.4                   MMM               0.036   0.138   0.956
0.4                   MCCAT             0.041   0.155   0.951
0.8                   Max Information   0.023   0.151   0.953
0.8                   CCAT              0.031   0.147   0.966
0.8                   MMM               0.030   0.154   0.962
0.8                   MCCAT             0.040   0.136   0.968

4.3.3 Conditional bias and MSE

In addition, the conditional biases and MSEs at the fixed points are presented in Figures 4.3 to 4.10. Five equally spaced values of θ from -2 to +2 (θ = -2, -1, 0, 1, 2) are used for each dimension. Hence, 25 fixed points (5 × 5) are evaluated in the two-dimensional case, and 125 (5 × 5 × 5) points in the three-dimensional case.

Figures 4.3 and 4.4 show the conditional biases for the four methods in the two-dimensional case. Clearly, the maximum information method and the three content balancing methods yield similar conditional biases. The figures demonstrate that the examinees located at the two ends of the θ distribution have more volatile biases, while those who are in the middle have biases close to 0.
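Computationally, the conditional statistics behind Figures 4.3 through 4.10 amount to grouping simulees by the fixed true-ability grid point at which they were generated and computing the bias and MSE within each group. A minimal sketch follows; it assumes a scalar true value per simulee (whatever composite Chapter 3 compares the unidimensional estimate against) and uses illustrative array names.

import numpy as np

def conditional_precision(theta_hat, theta_ref, point_id):
    # theta_hat: unidimensional CAT estimates, one per simulee
    # theta_ref: scalar true values the estimates are compared against
    # point_id:  index of the fixed grid point each simulee was generated at
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta_ref = np.asarray(theta_ref, dtype=float)
    point_id = np.asarray(point_id)
    out = {}
    for p in np.unique(point_id):
        err = theta_hat[point_id == p] - theta_ref[point_id == p]
        out[p] = (err.mean(), (err ** 2).mean())   # (conditional bias, conditional MSE)
    return out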
Generally, with unidimensional data, the estimated θ values from the three-parameter logistic model have larger measurement errors for high- and low-ability examinees than for middle-ability examinees. In particular, high-ability examinees tend to be underestimated and low-ability examinees tend to be overestimated. However, the shape of the conditional biases in this study does not follow that pattern strictly. This is mainly because additional estimation errors are introduced when a unidimensional model is fit to multidimensional data.

Figures 4.5 and 4.6 show the conditional MSEs for the four methods. Again, the four methods perform very similarly. The MSEs are small in the middle and large at the two ends. Also, the MSEs at the lower end are larger than those at the upper end, which is due to the guessing parameter in the three-parameter IRT model. In addition, the difference between the two ends shrinks when the difficulty levels for the three content areas become uneven.

The conditional biases and MSEs for the three-dimensional case generally follow the same patterns as those in the two-dimensional case. As shown in Figures 4.7 and 4.8, the four methods yield similar conditional biases. Since the estimated θ values are restricted to the range -4 to 4, the conditional biases are bounded between -4 - θ and 4 - θ. In addition, in the three-dimensional case, the differences between the maximum information method and the three content balancing methods are more apparent in terms of MSE. In Figures 4.9 and 4.10, the MSEs at the lower end of the θ distribution for the maximum information method are larger than those for the content balancing methods. This difference is magnified when the difficulty levels for the content areas become uneven. The three content balancing methods perform similarly, although MCCAT gives slightly higher MSEs than CCAT and MMM at the lower end of θ, especially when the content areas have the same difficulty level.

Figure 4.3 Conditional biases for the four methods, Difficulty = (0, 0, 0), two-dimensional case

Figure 4.4 Conditional biases for the four methods, Difficulty = (-0.6, 0.6, 0), two-dimensional case

Figure 4.5 Conditional MSEs for the four methods, Difficulty = (0, 0, 0), two-dimensional case

Figure 4.6 Conditional MSEs for the four methods, Difficulty = (-0.6, 0.6, 0), two-dimensional case

Figure 4.7 Conditional biases for the four methods, Difficulty = (0, 0, 0), three-dimensional case

Figure 4.8 Conditional biases for the four methods, Difficulty = (-0.6, 0.6, 0), three-dimensional case
Figure 4.9 Conditional MSEs for the four methods, Difficulty = (0, 0, 0), three-dimensional case

Figure 4.10 Conditional MSEs for the four methods, Difficulty = (-0.6, 0.6, 0), three-dimensional case

4.4 Content balancing

Table 4.8 presents the percentage of tests violating the content-balancing requirement for each method. As expected, the three content balancing methods perform very well and there is no content violation. In contrast, the maximum information method yields a large number of unbalanced tests. The violation rate ranges from 98.75% to 100% across the conditions. In other words, almost all tests fail to satisfy the content balancing requirement when the maximum information method is used.

Table 4.8 Violation rate of the content-balancing requirement

                                        Violation Rate (%)
Difficulty        Correlation   Max Information   CCAT   MMM   MCCAT
Panel A: Two-dimensional case
(0, 0, 0)         0             100               0      0     0
                  0.4           98.75             0      0     0
                  0.8           98.95             0      0     0
(-0.6, 0.6, 0)    0             100               0      0     0
                  0.4           100               0      0     0
                  0.8           100               0      0     0
Panel B: Three-dimensional case
(0, 0, 0)         0             100               0      0     0
                  0.4           99.1              0      0     0
                  0.8           99.55             0      0     0
(-0.6, 0.6, 0)    0             100               0      0     0
                  0.4           100               0      0     0
                  0.8           100               0      0     0

Table 4.9 reports the average number of items selected from each content area for the maximum information method. Because the two- and three-dimensional structures provide very similar results, only the results for the two-dimensional case are discussed.

First, in the two-dimensional case, Content 3 clearly dominates. For example, with even difficulty and uncorrelated abilities, there are on average 2.87 items selected from Content 1, 3.92 items from Content 2, and 23.21 items from Content 3. Intuitively, Content 3 measures the composite of the two dimensions, and hence it is closer to the orientation of the calibrated unidimensional θ-scale. Therefore, the unidimensional discrimination estimates are higher for Content 3 items, which makes those items more likely to be administered when the maximum information method is used.

Table 4.9 The mean number of items selected from each content area for the maximum information method

Difficulty        Correlation   Content 1   Content 2   Content 3
Panel A: Two-dimensional case
(0, 0, 0)         0             2.87        3.92        23.21
                  0.4           4.39        5.56        20.04
                  0.8           6.31        6.36        17.33
(-0.6, 0.6, 0)    0             4.32        6.71        18.97
                  0.4           4.81        6.23        18.96
                  0.8           5.67        7.51        16.82
Panel B: Three-dimensional case
(0, 0, 0)         0             2.19        3.47        24.34
                  0.4           3.62        4.36        22.02
                  0.8           5.30        6.41        18.29
(-0.6, 0.6, 0)    0             3.79        5.45        20.76
                  0.4           4.53        6.58        18.90
                  0.8           5.48        6.82        17.70

Second, when the three content areas have uneven difficulty levels, relatively more items are chosen from Content 1 and 2, although Content 3 still dominates.
For example, with an ability correlation of 0, there are on average 4.32 items selected from Content 1, 6.71 items from Content 2, and 18.97 items from Content 3. To understand this change, recall that the mean difficulty is decreased by 0.6 for Content 1 and increased by 0.6 for Content 2. Therefore, items from Content 1 or 2 become more informative for examinees with very low or high ability and thus have a higher chance of being selected. Finally, as the abilities become more correlated, items are selected more evenly across the content areas.

4.5 Item pool usage

Table 4.10 compares the item pool usage for the four methods. In the two-dimensional case, when the difficulty levels for the three content areas are the same, the maximum information method results in the highest scaled χ² statistic and hence has the most unbalanced item pool utilization. For instance, it yields a scaled χ² statistic of 103.13 when the correlation between true abilities is zero. In comparison, the three content balancing methods produce scaled χ² statistics no larger than 94.11. This is not surprising, because imposing content constraints forces items to be selected more evenly from all three content areas. As a result, more items are likely to be used. Among the three content balancing methods, MMM performs the best with the lowest scaled χ² value, while CCAT does the worst.

The results are different when the difficulty levels for the three content areas are uneven. The maximum information method now becomes more efficient in item pool usage. It yields scaled χ² values similar to MMM except for the condition with a high ability correlation. This is consistent with the previous finding that uneven difficulty leads to more items being selected from Content 1 and 2. Among the three content balancing methods, MMM still performs the best. In addition, when the true abilities become more correlated, the scaled χ² statistics drop for all methods, which implies a more even item pool utilization when multidimensionality is reduced.

Panel B presents the results for the three-dimensional case, which show similar patterns as the two-dimensional case. In particular, the maximum information method leads to a more balanced item pool usage when the three content areas have uneven difficulty levels. It yields even lower scaled χ² statistics than the three content balancing methods, although the differences are relatively small.

Table 4.10 Item pool usage (scaled χ²) for the four methods

Difficulty        Correlation   Max Information   CCAT     MMM     MCCAT
Panel A: Two-dimensional case
(0, 0, 0)         0             103.13            94.11    89.03   90.3
                  0.4           89.45             83.15    77.82   79.62
                  0.8           76.68             71.45    67.81   70.74
(-0.6, 0.6, 0)    0             88.85             94.56    88.53   90.15
                  0.4           84.02             89.6     85.3    87.42
                  0.8           75.71             74.67    70.71   74.44
Panel B: Three-dimensional case
(0, 0, 0)         0             105.3             101.17   94.08   95.25
                  0.4           86.93             84.4     78.78   80.44
                  0.8           70.07             69.42    65.62   67.7
(-0.6, 0.6, 0)    0             98.79             106.32   99.81   101.43
                  0.4           79.93             88.28    83.94   87.5
                  0.8           68.01             74.36    70.37   73.02

4.6 Percentages of underexposed and overexposed items

In this study, we also calculate the exposure rate of each item for the four methods. An item is classified as underexposed if its exposure rate is less than 0.02, and overexposed if its exposure rate is larger than 0.2. The percentages of underexposed and overexposed items for the different simulation conditions are summarized in Tables 4.11 and 4.12. The observations for the two dimensional structures are very similar. For brevity, only the results for the two-dimensional case are discussed.
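The exposure-rate bookkeeping behind Tables 4.11 and 4.12 is simple to state in code. The sketch below applies the 0.02 and 0.2 cutoffs defined above; the array names are illustrative and not taken from the study's software.

import numpy as np

def exposure_percentages(admin_counts, n_examinees, low=0.02, high=0.2):
    # Exposure rate of each item = times administered / number of examinees.
    rates = np.asarray(admin_counts, dtype=float) / n_examinees
    pct_under = 100.0 * (rates < low).mean()    # percent of pool rarely used
    pct_over = 100.0 * (rates > high).mean()    # percent of pool over-selected
    return pct_under, pct_over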
In the two-dimensional case, when the three content areas have the same difficulty level, the maximum information method tends to produce more underexposed and overexposed items than the three content balancing methods. For example, with a zero ability correlation, 66.5% of items are underexposed for the maximum information method, while the percentage drops to 64.5% for CCAT, 63% for MMM, and 63.5% for MCCAT. Among the three content balancing methods, MMM seems to be the best, although the differences are rather small. For the other conditions, there are no significant differences among the four methods. Generally, about 60% of items are underexposed and about 15% of items are overexposed in the item pool. The high percentages of underexposed and overexposed items are mostly due to the fact that no exposure control technique is employed in this study.

Table 4.11 Percentages of underexposed and overexposed items for the four methods in the two-dimensional case

Correlation   Method            Underexposed   Overexposed
Panel A: Difficulty = (0, 0, 0)
0             Max Information   66.5           15.5
0             CCAT              64.5           14.3
0             MMM               63.0           14.0
0             MCCAT             63.5           14.8
0.4           Max Information   64.8           17.0
0.4           CCAT              61.5           15.5
0.4           MMM               60.3           15.3
0.4           MCCAT             61.3           17.0
0.8           Max Information   61.3           15.8
0.8           CCAT              58.8           15.5
0.8           MMM               57.3           15.8
0.8           MCCAT             58.0           16.5
Panel B: Difficulty = (-0.6, 0.6, 0)
0             Max Information   62.0           15.5
0             CCAT              64.5           15.0
0             MMM               63.3           15.3
0             MCCAT             65.0           16.3
0.4           Max Information   63.0           16.3
0.4           CCAT              63.3           14.3
0.4           MMM               61.5           15.0
0.4           MCCAT             63.5           15.0
0.8           Max Information   61.3           17.0
0.8           CCAT              60.0           15.0
0.8           MMM               58.8           14.8
0.8           MCCAT             60.0           15.3

Table 4.12 Percentages of underexposed and overexposed items for the four methods in the three-dimensional case

Correlation   Method            Underexposed   Overexposed
Panel A: Difficulty = (0, 0, 0)
0             Max Information   68.5           15.5
0             CCAT              65.8           14.5
0             MMM               64.5           14.8
0             MCCAT             65.5           15.3
0.4           Max Information   64.3           15.0
0.4           CCAT              60.5           16.3
0.4           MMM               61.3           15.5
0.4           MCCAT             61.5           16.3
0.8           Max Information   59.0           17.5
0.8           CCAT              57.5           17.3
0.8           MMM               56.5           17.3
0.8           MCCAT             57.8           16.5
Panel B: Difficulty = (-0.6, 0.6, 0)
0             Max Information   64.8           14.5
0             CCAT              65.8           15.3
0             MMM               64.0           14.5
0             MCCAT             65.3           15.0
0.4           Max Information   62.0           15.5
0.4           CCAT              62.5           15.8
0.4           MMM               61.0           15.3
0.4           MCCAT             62.5           16.3
0.8           Max Information   59.3           16.5
0.8           CCAT              60.0           15.5
0.8           MMM               58.5           14.5
0.8           MCCAT             59.0           15.8

Chapter 5
Conclusions and Discussions

Most current CAT programs are based on the assumption that a unidimensional IRT model represents the interactions between persons and test items. However, many researchers have argued that this assumption rarely holds in the real world and that multiple abilities are required to account for performance on a test. Meanwhile, content balancing is also a practical consideration in CAT, since a lack of content comparability could pose a threat to the validity of scores and may not be acceptable to test takers and test score users. The purpose of this study is to investigate the effect of fitting a unidimensional IRT model to multidimensional data in a content-balanced CAT. Specifically, unconstrained CAT with the maximum information item selection method is chosen as the baseline, and the performances of the three content balancing procedures, the constrained CAT (CCAT), the modified multinomial model (MMM), and the modified constrained CAT (MCCAT), are evaluated in terms of measurement precision, item pool utilization and item exposure control.
5.1 Conclusions

Prior research has shown that when test data are unidimensional, unconstrained CAT with the maximum information method gives the best measurement precision (Kingsbury and Zara, 1991; Cheng, Chang, & Yi, 2007). In contrast, the use of content balancing increases the acceptance of the tests by practitioners, but may cause some loss in measurement accuracy. However, the present study shows that when test data are multidimensional, the content balancing methods actually result in similar or even better accuracy than the maximum information method. The reason might be that controlling the percentages of items from individual content areas ensures adequate representation of each dimension of the data, and hence improves the measurement precision.

The results also show that the difficulty level of the content areas is a significant factor affecting the performances of the four methods. When the content areas have the same difficulty level, the content balancing methods yield measurement precision comparable to the maximum information method. In particular, they produce similar biases, MSEs, and correlations between the true and estimated ability. In addition, the content balancing methods tend to result in more efficient item pool utilization and slightly lower percentages of underexposed items. On the other hand, when the content areas have uneven difficulty levels, the content balancing methods outperform the maximum information method in terms of measurement precision. However, the differences shrink as the correlation between true abilities increases. Moreover, the maximum information method becomes relatively more efficient in item pool usage. In terms of the percentages of underexposed and overexposed items, there is no significant difference between the four methods. The study also shows that the results for the two dimensional structures (two- and three-dimensional) are generally consistent, which indicates that the results might generalize to a higher-dimensional space.

Finally, there is no significant difference between the three content balancing methods. They perform similarly in terms of measurement precision and item exposure rate. However, MMM appears to have the most efficient item pool utilization. It yields the smallest scaled χ² statistic among the three methods across all conditions.

Table 5.1 Comparison between the maximum information method and the three content balancing methods (a check indicates the better-performing approach; "similar" indicates no notable difference)

Difficulty        Measurement precision   Content balancing      Item pool usage        Underexposed           Overexposed
(0, 0, 0)         similar                 content balancing ✓    content balancing ✓    content balancing ✓    similar
(-0.6, 0.6, 0)    content balancing ✓     content balancing ✓    similar                similar                similar

Table 5.2 Comparison between the three content balancing methods

Difficulty        Measurement precision     Item pool usage   Underexposed   Overexposed
(0, 0, 0)         MCCAT ✓ (smaller MSE)     MMM ✓             similar        similar
(-0.6, 0.6, 0)    similar                   MMM ✓             similar        similar

Tables 5.1 and 5.2 summarize the conclusions of this study. Overall, the content balancing methods are better than the maximum information method, especially for tests with low correlations among the constructs. They not only produce content-balanced tests for examinees and increase the acceptance of the adaptive test by practitioners, but also improve the measurement efficiency, particularly when the content areas have uneven difficulties. On the other hand, the three content balancing methods perform similarly, but MMM gives the most efficient item pool usage.
The current study has an important practical implication for CAT. Previous literature has shown that using content balancing may induce a loss in measurement precision for unidimensional data. In contrast, we show that it may improve measurement precision when a unidimensional model is fit to multidimensional data. In real testing programs, unidimensional IRT models are often used because of their simplicity and popularity. Meanwhile, many studies have found evidence of multidimensionality in real data. For example, Reckase et al. (1988) examined a test from the ACT Assessment Battery and showed that the test was clearly multidimensional. Therefore, if the context in this study resembles reality more closely, then content balancing is recommended for its improved acceptance by practitioners and better measurement precision.

5.2 Future research

There are several potential extensions to this simulation study. First, we only examine a limited set of item pool structures, and the item parameters are simulated from commonly used distributions. Alternatively, one could examine more general and realistic item pool structures generated from real test data. Such "real data" evidence can serve as a good complement to our "pure simulation" results.

Second, the results from this study only apply to fixed-length CATs. Fixed-length CATs are easy to implement in practice, but they might lead to aberrant response patterns (Chen & Ankenmann, 2004). Therefore, examining variable-length CATs can provide an important robustness check.

Third, in the current simulation study, the three content balancing methods, CCAT, MMM, and MCCAT, are fixed content balancing methods. That is, the number of items from each content area is fixed. However, flexible content balancing is used in several large-scale CAT programs. It allows the number of items from each content area to fall between a lower bound and an upper bound (Stocking & Swanson, 1993). Many methods have been developed to handle flexible content-balancing control, such as the weighted deviation model (WDM) (Stocking and Swanson, 1993), the shadow test approach (van der Linden, 2005b; van der Linden and Chang, 2003), and the weighted penalty model (Shin, Chien, Way, & Swanson, 2009). These methods can handle many practical constraints, including item content and item type. It would be interesting to investigate how these flexible content balancing methods perform in the current context.

Finally, this study yields a large percentage of underexposed and overexposed items because no exposure control is applied. However, in addition to content balancing, item exposure control is another important practical consideration in CAT. With item selection based purely on maximum information, some items may be administered too frequently and become known to test takers. As a result, test security and reliability can be threatened. At the same time, when a small proportion of items are over-selected, a large number of items in the pool are rarely used. Therefore, to increase test efficiency and security, mechanisms need to be imposed on the item selection procedure to control the exposure rate of items.
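One widely used mechanism of this kind, cited below, is the Sympson and Hetter (1985) procedure: each item carries a pre-calibrated exposure-control parameter k_i, and when the item is selected as most informative it is actually administered only with probability k_i; otherwise the next candidate is considered. A minimal sketch of that administer-or-pass decision (the candidate list and the parameters k are assumed to exist; names are illustrative):

import random

def sympson_hetter_pick(candidates, k, rng=None):
    # candidates: item indices ordered from most to least informative for the current examinee
    # k: exposure-control parameters in [0, 1], assumed calibrated in advance
    rng = rng or random.Random()
    for item in candidates:
        if rng.random() < k[item]:
            return item              # administer this item
    return candidates[-1]            # simplification: fall back to the last candidate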
This issue has been addressed in great detail in the literature (Chang & Ansley, 2003; Georgiadou, Triantafillou & Economides, 2007; Hetter & Sympson, 1997; Pastor, Dodd & Chang, 2002; Stocking & Lewis, 1998, 2000; Way, 1998), and a number of strategies for controlling item exposure have been developed (e.g., the alpha-stratified design, Chang, Qian & Ying, 2001; the Sympson and Hetter method, 1985). A natural extension would therefore be to incorporate these item exposure control methods into the study. This would make the CAT procedure more realistic, and it would also be interesting to examine the interactions between content balancing and item exposure control.

REFERENCES

Ackerman, T.A. (1991). The use of unidimensional parameter estimates of multidimensional items in adaptive testing. Applied Psychological Measurement, 15(1), 13-24.
Ansley, T.N. (1984). Using a unidimensional latent trait model with multidimensional data: An empirical investigation of robustness. Unpublished doctoral dissertation, University of Iowa, Iowa City, IA.
Bejar, I.I., & Weiss, D.J. (1979). Computer programs for scoring test data with item characteristic curve models (Research Rep. No. 79-1). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.
Binet, A., & Simon, T. (1905). Méthode nouvelle pour le diagnostic du niveau intellectuel des anormaux. L'Année Psychologique, 11, 191-244.
Bock, R.D., & Mislevy, R.J. (1981). Data quality analysis of the Armed Services Vocational Aptitude Battery. Chicago: National Opinion Research Center.
Bock, R.D., & Mislevy, R.J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Measurement in Education, 6, 431-444.
Bolt, D.M., & Lall, V.F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27(6), 395-414.
Chang, H., & Ying, Z. (1999). A-stratified computerized adaptive testing. Applied Psychological Measurement, 23, 211-222.
Chang, H., Qian, J., & Ying, Z. (2001). A-stratified computerized adaptive testing with b-blocking. Applied Psychological Measurement, 24, 333-341.
Chang, S.W., & Ansley, T.N. (2003). A comparative study of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 40(1), 71-103.
Chen, S., & Ankenmann, R.D. (2004). Effects of practical constraints on item selection rules at the early stages of computerized adaptive testing. Journal of Educational Measurement, 41(2), 149-174.
Cheng, Y., Chang, H., & Yi, Q. (2007). Two-phase item selection procedure for flexible content balancing in CAT. Applied Measurement in Education, 31(6), 467-482.
Eignor, D.R., Stocking, M.L., Way, W.D., & Steffen, M. (1993). Case studies in computerized adaptive test design through simulation (Research Rep. No. RR-93-56). Princeton, NJ: Educational Testing Service.
Educational Testing Service (2007). Test and score data summary for TOEFL computer-based and paper-based tests. Princeton, NJ: Educational Testing Service.
Fang, Y. (2008). Using a projection method to estimate subscores from tests with multidimensional structures. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.
Georgiadou, E., Triantafillou, E., & Economides, A. (2007). A review of item exposure control strategies for computerized adaptive testing developed from 1983 to 2005. Journal of Technology, Learning and Assessment, 5(8). Retrieved [date] from http://www.jtla.org
Green, B.G., Bock, R.D., Humphreys, L.G., Linn, R.L., & Reckase, M.D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21, 347-360.
Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hetter, R.D., & Sympson, J.B. (1997). Item exposure control in CAT-ASVAB. In W.A. Sands, B.K. Waters & J.R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington, DC: American Psychological Association.
Ip, E.H. (2010). Empirically indistinguishable multidimensional IRT and locally dependent unidimensional item response models. British Journal of Mathematical and Statistical Psychology, 63, 395-416.
Kingsbury, G.G., & Zara, A.R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359-375.
Kingsbury, G.G., & Zara, A.R. (1991). A comparison of procedures for content-sensitive item selection in computerized adaptive tests. Applied Measurement in Education, 4(3), 241-261.
Leung, C.K., Chang, H., & Hau, K. (2000). Content balancing in stratified computerized adaptive testing designs. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans, LA.
Leung, C.K., Chang, H., & Hau, K. (2003a). Incorporation of content balancing requirements in stratification designs for computerized adaptive testing. Educational and Psychological Measurement, 63, 257-270.
Leung, C.K., Chang, H., & Hau, K. (2003b). Computerized adaptive testing: A comparison of three content balancing methods. Journal of Technology, Learning, and Assessment, 2(5). Available from http://www.jtla.org
Liu, J. (2007). Comparing multi-dimensional and uni-dimensional computer adaptive strategies in psychological and health assessment. Unpublished doctoral dissertation, University of Pittsburgh, Pittsburgh, PA.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Lord, F.M. (1986). Maximum likelihood and Bayesian parameter estimation in item response theory. Journal of Educational Measurement, 23, 157-162.
Luecht, R.M. (1998). A framework for exploring and controlling risks associated with test item exposure over time. Paper presented at the annual meeting of the National Council on Measurement in Education.
McKinley, R.L., & Reckase, M.D. (1983). An extension of the two-parameter logistic model to the multidimensional latent space (Research Rep. No. 83-2). Iowa City, IA: ACT, Inc.
Owen, R.J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351-356.
Pastor, D.A., Dodd, B., & Chang, H.H. (2002). A comparison of item exposure selection techniques and exposure control mechanisms in CATs using the generalized partial credit model. Applied Measurement in Education, 26(2), 147-163.
Patsula, L.N., & Steffen, M. (1997). Maintaining item and test security in a CAT environment: A simulation study. Paper presented at the annual meeting of the National Council on Measurement in Education.
Reckase, M.D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9(4), 401-412.
Reckase, M.D. (1989). Adaptive testing: The evolution of a good idea. Educational Measurement: Issues and Practice, 8(3), 11-15.
Reckase, M.D. (2009). Multidimensional item response theory. New York: Springer.
Reckase, M.D., Ackerman, T.A., & Carlson, J.E. (1988). Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25(3), 193-203.
Reckase, M.D., & McKinley, R.L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15(4), 361-373.
Roussos, L.A., Stout, W.F., & Marden, J.L. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35(1), 1-30.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph, 17.
Segall, D.O., Moreno, K.E., & Hetter, R.D. (1997). Item pool development and evaluation. In W.A. Sands, B.K. Waters & J.R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 117-130). Washington, DC: American Psychological Association.
Shin, C.W., Chien, Y.M., Way, W.D., & Swanson, L. (2009). Weighted penalty model for content balancing in CAT. Iowa City, IA: Pearson.
Stocking, M.L. (1998). A framework for comparing adaptive test designs. Unpublished manuscript.
Stocking, M.L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23(1), 57-75.
Stocking, M.L., & Lewis, C. (2000). Methods of controlling the exposure of items in CAT. In W.J. van der Linden & C.A.W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 163-182). Boston: Kluwer.
Stocking, M.L., & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17, 277-292.
Stroud, A.H., & Sechrest, D. (1966). Gaussian quadrature formulas. Englewood Cliffs, NJ: Prentice-Hall.
Swaminathan, H., & Rogers, H.J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370.
Sympson, J. (1978). A model for testing with multidimensional items. In D.J. Weiss (Ed.), Proceedings of the 1977 computerized adaptive testing conference. Minneapolis, MN: University of Minnesota.
Sympson, J.B., & Hetter, R.D. (1985). Controlling item-exposure rates in computerized adaptive testing. In Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973-977). San Diego, CA: Navy Personnel Research and Development Center.
Traub, R.E. (1983). A priori considerations in choosing an item response model. In R.K. Hambleton (Ed.), Applications of item response theory (pp. 57-70). Vancouver, British Columbia: Educational Research Institute of British Columbia.
U.S. Department of Defense. (1982). Armed Services vocational aptitude battery. North Chicago, IL: U.S. Military Entrance Processing Command.
van der Linden, W.J. (2005a). Linear models for optimal test design. New York: Springer-Verlag.
van der Linden, W.J. (2005b). A comparison of item-selection methods for adaptive tests with content constraints. Journal of Educational Measurement, 42(3), 283-302.
van der Linden, W.J., & Chang, H. (2003). Implementing content constraints in alpha-stratified adaptive testing using a shadow test approach. Applied Psychological Measurement, 27(2), 107-120.
Wainer, H. (1993). Some practical considerations when converting a linearly administered test to an adaptive format. Educational Measurement: Issues and Practices, 12, 15-20.
Wainer, H. (2000). Computerized adaptive testing: A primer (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Wainer, H., & Kiely, G.L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 6, 473-492.
Walker, C.M., & Beretvas, S.N. (2003). Comparing multidimensional and unidimensional proficiency classifications: Multidimensional IRT as a diagnostic aid. Journal of Educational Measurement, 40(3), 255-275.
Wang, M. (1985). Fitting a unidimensional model to multidimensional item response data: The effect of latent space misspecification on the application of IRT (Research Rep. No. NW: 6-24-85). Iowa City, IA: University of Iowa.
Wang, M. (1986). Fitting a unidimensional model to multidimensional item response data. Paper presented at the ONR contractors conference, Gatlinburg, TN.
Wang, T., & Vispoel, W.P. (1998). Properties of ability estimation methods in computerized adaptive testing. Journal of Educational Measurement, 35(2), 109-135.
Warm, T.A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427-450.
Way, D.W. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17, 17-27.
Weiss, D.J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.
Weiss, D.J., & Suhadolnik, D. (1982). Robustness of adaptive testing to multidimensionality. In D.J. Weiss (Ed.), Proceedings of the 1982 item response theory and computerized adaptive testing conference. Minneapolis, MN: University of Minnesota, Department of Psychology, Computerized Adaptive Testing Laboratory.
Zimowski, M.M., Muraki, E., Mislevy, R.J., & Bock, R.D. (2003). BILOG-MG for Windows. Lincolnwood, IL: Scientific Software International, Inc.