This is to certify that the dissertation entitled

Designing Optimal Item Pools for Computerized Adaptive Tests with Exposure Control

presented by Lixiong Gu has been accepted towards fulfillment of the requirements for the Ph.D. degree in Counseling, Educational Psychology, and Special Education.

Major Professor's Signature / Date

MSU is an Affirmative Action/Equal Opportunity Institution

DESIGNING OPTIMAL ITEM POOLS FOR COMPUTERIZED ADAPTIVE TESTS WITH EXPOSURE CONTROLS

By

Lixiong Gu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

2007

ABSTRACT

DESIGNING OPTIMAL ITEM POOLS FOR COMPUTERIZED ADAPTIVE TESTS WITH EXPOSURE CONTROLS

By Lixiong Gu

Computerized adaptive testing requires a well-designed item pool containing an appropriate number of items to build an individualized test that matches the examinee's ability level. An optimal item pool can be defined as a pool consisting of appropriate items for each individual test that is capable of reaching the desired level of precision.
It also contains well-balanced items that will achieve optimal item usage and lower the cost of item creation. One method for developing an optimal item pool is Reckase's (2003) method, a Monte Carlo procedure for determining the properties of an optimal item pool. This study extends the method to item pools calibrated with the three-parameter logistic (3PL) model and applies it to situations in which no exposure control, the Sympson-Hetter procedure, or the a-stratified procedure is imposed to control the item exposure rate. The procedures for designing the item pool and two approaches to simulating test items are presented. The performance of each optimal item pool is evaluated alongside that of the operational item pools.

DEDICATION

To my wife Yanxuan, my parents, and my sister Lishu

ACKNOWLEDGEMENTS

I am deeply indebted to Professor Mark D. Reckase for his guidance in academics, the dissertation work, and my career path. Without his constant support and insightful comments, this work would not have been possible. I would also like to thank the three other members of my dissertation committee, Dr. Ken Frank, Dr. Richard Houang, and Dr. Hua-Hua Chang, for their helpful suggestions on this study. I am extremely grateful to Dr. Linda Chard for her critiques of the writing and assistance in editing the dissertation. Thanks also go to Dr. Mary Pommerich and Dr. Daniel Segall, who provided two operational item pools for this study; to Wei He, who shared her MATLAB programs on designing item pools for the 1PL model; and to Raymond Mapuranga for his comments on an early version of my dissertation. I am also grateful to Dr. Dianne Henderson-Montero and Dr. Venessa Lall, who supported me in balancing my time between operational work at Educational Testing Service and the work on my dissertation. My deep gratitude goes to my wife Yanxuan for her love and support, and to my parents and my sister, for their understanding and encouragement.
TABLE OF CONTENTS

LIST OF TABLES ........ vii
LIST OF FIGURES ........ x
Chapter I Introduction ........ 1
1.1 Research Context ........ 6
1.2 Summary ........ 13
Chapter II Item Pool Design and Components of Computerized Adaptive Testing ........ 15
2.1 Brief History of Computerized Adaptive Testing ........ 15
2.2 Pros and Cons of CAT ........ 16
2.3 Components of Computerized Adaptive Testing ........ 17
2.3.1 Item Pool ........ 18
2.3.2 Scoring Procedure ........ 20
2.3.3 Item Selection Procedure ........ 23
2.3.4 Stopping Rule ........ 26
2.4 Practical Constraints in Item Selection ........ 26
2.5 Exposure Control Methods ........ 30
2.5.1 Sympson-Hetter Exposure Control ........
31
2.5.2 a-Stratified Adaptive Testing ........ 34
2.6 Item Pool Design and Its Relationship with Other Components of CAT ........ 37
Chapter III Reckase's Simulation Method and Extensions to 3PL ........ 40
3.1 Basic Concepts of Reckase's Simulation Method ........ 40
3.2 Reckase's Method for Optimal Item Pool Calibrated with 1PL ........ 43
3.3 Reckase's Method Applied to 3PL ........ 48
3.3.1 Extending the "Bin" Concept ........ 50
3.3.2 Strategies to Generate Items for Item Pool Simulation with 3PL ........ 55
3.3.2.1 Prediction Model (PM) Strategy ........ 57
3.3.2.2 Minimum Test Information (MTI) Strategy ........ 58
3.3.3 Post-simulation Adjustment ........ 60
3.4 Design Adjustments to Different Exposure Control Methods ........ 65
3.4.1 Item Pool Design without Exposure Control ........ 65
3.4.2 Item Pool Design with Sympson-Hetter Exposure Control ........ 65
3.4.3 Item Pool Design with a-Stratified Exposure Control ........ 66
Chapter IV Methods ........ 69
4.1 Operational Item Pools ........ 69
4.2 Simulation Procedure ........
70
Step 1: Modeling CAT Procedures ........ 71
Step 2: Generating Examinee Population ........ 71
Step 3: Generating Item Parameters ........ 71
Step 4: Generating Response Data ........ 72
Step 5: Post-Simulation Adjustment ........ 73
4.3 Control Variables ........ 73
4.4 Evaluating Simulated and Operational Item Pools ........ 74
Chapter V The Performance of the Item Pools without Exposure Control ........ 80
5.1 Item Pools for Tests without Content Balance ........ 80
5.2 Item Pools for Tests with Content Balance ........ 89
5.3 Summary ........ 97
Chapter VI The Performance of the Item Pool with Sympson-Hetter Exposure Control ........ 99
6.1 Item Pools for Tests without Content Balance ........ 99
6.2 Item Pools for Tests with Content Balance ........ 107
6.3 Summary ........ 117
Chapter VII The Performance of the Item Pools with a-Stratified Exposure Control ........ 118
7.1 Item Pools for Tests without Content Balance ........ 118
7.2 Item Pools for Tests with Content Balance ........
126
7.3 Summary ........ 136
Chapter VIII Discussion ........ 137
8.1 A Revisit to the Definition of "Optimal" ........ 137
8.2 Implications on the Practice of Item Pool Development ........ 139
8.3 Implications on Item Pool Management ........ 141
8.4 Reckase's Method versus the Mathematical Programming Method ........ 142
8.5 Limitations and Future Studies ........ 143
APPENDIX ........ 145
REFERENCES ........ 174

LIST OF TABLES

Table 4.1 Simulation Design ........ 74
Table 5.1 Item Pool Size and Item Parameter Statistics for Arithmetic Reasoning without Exposure Control ........ 83
Table 5.2 Summary Statistics of the Performance of the Item Pools ........ 83
Table 5.3 Item Pool Size and Item Parameter Statistics for General Science without Exposure Control ........ 91
Table 5.4 Summary Statistics of the Performance of the Item Pools ........ 92
Table 5.5 Number of Over- and Under-Exposed Items by Content ........ 92
Table 6.1 Item Pool Size and Item Parameter Statistics for Arithmetic Reasoning with Sympson-Hetter Exposure Control ........ 102
Table 6.2 Summary Statistics of the Performance of the Item Pools ........ 102
Table 6.3 Item Pool Size and Item Parameter Statistics for General Science with Sympson-Hetter Exposure Control ........ 110
Table 6.4 Summary Statistics of the Performance of the Item Pools ........ 111
Table 6.5 Number of Over- and Under-Exposed Items by Content ........ 111
Table 7.1 Item Pool Size and Item Parameter Statistics for Arithmetic Reasoning with a-Stratified Exposure Control ........ 121
Table 7.2 Summary Statistics of the Performance of the Item Pools ........ 121
Table 7.3 Item Pool Size and Item Parameter Statistics for General Science with a-Stratified Exposure Control ........ 129
Table 7.4 Summary Statistics of the Performance of the Ideal Item Pools ........ 130
Table 7.5 Percentage of Over- and Under-Exposed Items by Content ........ 130
Table A.1 Item Distribution for the Operational Item Pool - Arithmetic Reasoning ........ 146
Table A.2 Item Distribution for Item Pool Designed by MTI Method and without Exposure Control - Arithmetic Reasoning ........ 147
Table A.3 Item Distribution for Item Pool Designed by PM Method and without Exposure Control - Arithmetic Reasoning ........ 148
Table A.4 Item Distribution for Item Pool Simulated with MTI Method and with Sympson-Hetter Exposure Control - Arithmetic Reasoning ........ 149
Table A.5 Item Distribution for Item Pool Simulated with PM Method and with Sympson-Hetter Exposure Control - Arithmetic Reasoning ........ 150
Table A.6 Item Distribution for Item Pool Simulated with MTI Method and with a-Stratified Exposure Control - Arithmetic Reasoning ........ 151
Table A.7 Item Distribution for Item Pool Simulated with PM Method and with a-Stratified Exposure Control - Arithmetic Reasoning ........ 152
Table A.8 Item Distribution for the Operational Item Pool - General Science Content 1 ........ 153
Table A.9 Item Distribution for the Operational Item Pool - General Science Content 2 ........ 154
Table A.10 Item Distribution for the Operational Item Pool - General Science Content 3 ........ 155
Table A.11 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control - General Science Content 1 ........ 156
Table A.12 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control - General Science Content 2 ........ 157
Table A.13 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control - General Science Content 3 ........ 158
Table A.14 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control - General Science Content 1 ........ 159
Table A.15 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control - General Science Content 2 ........ 160
Table A.16 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control - General Science Content 3 ........ 161
Table A.17 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control - General Science Content 1 ........ 162
Table A.18 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control - General Science Content 2 ........ 163
Table A.19 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control - General Science Content 3 ........ 164
Table A.20 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control - General Science Content 1 ........ 165
Table A.21 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control - General Science Content 2 ........ 166
Table A.22 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control - General Science Content 3 ........ 167
Table A.23 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control - General Science Content 1 ........ 168
Table A.24 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control - General Science Content 2 ........ 169
Table A.25 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control - General Science Content 3 ........ 170
Table A.26 Item Distribution for the Optimal Item Pool Designed by PM and with a-Stratified Exposure Control - General Science Content 1 ........ 171
Table A.27 Item Distribution for the Optimal Item Pool Designed by PM and with a-Stratified Exposure Control - General Science Content 2 ........ 172
Table A.28 Item Distribution for the Optimal Item Pool Designed by PM and with a-Stratified Exposure Control - General Science Content 3 ........ 173

LIST OF FIGURES

Figure 1.1 Steps of computerized adaptive testing ........ 2
Figure 3.1 Demonstration of determining bin width ........ 42
Figure 3.2 Items used for two individual examinees ........ 45
Figure 3.3 Item pool for two examinees ........ 46
Figure 3.4 Item pool for 5000 examinees ........ 47
Figure 3.5 Item information provided by two different items ........ 49
Figure 3.6 Bins defined by both a- and b-parameters ........ 51
Figure 3.7 Item distribution by b-Bins and ab-Bins ........ 54
Figure 3.8 Bivariate plot of b-parameter and a-parameter for operational item pool ........ 56
Figure 3.9 Demonstration of items in one bin offering more information than items in another bin ........ 61
Figure 3.10 Items in the order of information provided most in each b-bin ........ 63
Figure 3.11 Item usage in the order of information provided most in each b-bin ........ 64
Figure 3.12 Item distribution for optimal item pool before adjustment ........ 64
Figure 3.13 Item distribution for optimal item pool after post-simulation adjustment ........ 65
Figure 5.1 Item distribution for item pools without exposure control ........ 81
Figure 5.2 Test-retest overlap rate conditional on θ ........
84
Figure 5.3 Item exposure rate by difficulty level ........ 85
Figure 5.4 Average test information conditional on true θ ........ 86
Figure 5.5 Conditional standard error of measurement (CSEM) ........ 87
Figure 5.6 Conditional bias ........ 88
Figure 5.7 Conditional mean square error (CMSE) ........ 88
Figure 5.8 Item distribution for item pools with content balancing and without exposure control ........ 89
Figure 5.9 Test-retest overlap rate conditional on θ ........ 93
Figure 5.10 Item exposure rate by difficulty level ........ 94
Figure 5.11 Average test information conditional on true θ ........ 95
Figure 5.12 Conditional standard error of measurement (CSEM) ........ 96
Figure 5.13 Conditional bias ........ 96
Figure 5.14 Conditional mean square error (CMSE) ........ 97
Figure 6.1 Item distributions for item pools with Sympson-Hetter exposure control ........ 99
Figure 6.2 Test-retest overlap rate conditional on θ ........ 103
Figure 6.3 Item exposure rate by difficulty level ........ 104
Figure 6.4 Average test information conditional on true θ ........ 105
Figure 6.5 Conditional standard error of measurement (CSEM) ........
106
Figure 6.6 Conditional bias ........ 106
Figure 6.7 Conditional mean square error (CMSE) ........ 107
Figure 6.8 Item distribution for item pools with Sympson-Hetter exposure control ........ 108
Figure 6.9 Test-retest overlap rate conditional on θ ........ 112
Figure 6.10 Item exposure rate by difficulty level ........ 113
Figure 6.11 Average test information conditional on true θ ........ 115
Figure 6.12 Conditional standard error of measurement (CSEM) ........ 116
Figure 6.13 Conditional bias ........ 116
Figure 6.14 Conditional mean square error (CMSE) ........ 117
Figure 7.1 Item distribution for item pools without content balancing and with a-stratified exposure control ........ 118
Figure 7.2 Test-retest overlap rate conditional on θ ........ 122
Figure 7.3 Item exposure rate by difficulty level ........ 123
Figure 7.4 Average test information conditional on true θ ........ 124
Figure 7.5 Conditional standard error of measurement (CSEM) ........ 125
Figure 7.6 Conditional bias ........ 125
Figure 7.7 Conditional mean square error (CMSE) ........
126
Figure 7.8 Item distribution for item pools with content balancing and without exposure control ........ 127
Figure 7.9 Test-retest overlap rate conditional on θ ........ 131
Figure 7.10 Item exposure rate by difficulty level ........ 132
Figure 7.11 Average test information conditional on true θ ........ 133
Figure 7.12 Conditional standard error of measurement (CSEM) ........ 134
Figure 7.13 Conditional bias ........ 135
Figure 7.14 Conditional mean square error (CMSE) ........ 135

Chapter I Introduction

Since its introduction in the early 1970s, computerized adaptive testing (CAT) has been used extensively in educational and psychological assessments (Lord, 1971; Reckase, 1974; Weiss, 1976). The objective of adaptive testing is to build individualized tests by selecting items based on the examinee's current ability estimate. Test takers do not receive questions that are either too difficult or too easy, but are always challenged by appropriate items during the entire course of the testing. From the test administrator's perspective, when examinees are given the questions that maximize the information about their ability levels obtained from each item response, reduced standard errors and satisfactory measurement precision can be achieved with only a handful of properly selected items. This potentially leads to more efficient item usage and more accurate ability estimates (Weiss, 1976). Adaptive tests are often administered with computers, which can quickly update the estimate of the examinee's ability after each item and then select subsequent items based on the estimate. Therefore,
when adaptive tests are mentioned, they are often referred to as CAT.

Figure 1.1 shows a typical CAT process. It begins with an item pool (also called an item bank) that contains an adequate number of items calibrated using Item Response Theory (IRT) (Lord, 1980). The CAT algorithm is most commonly an iterative process involving a procedure for obtaining ability estimates based on candidate item performance and an algorithm for sequencing the set of test items to be administered to candidates. A new ability estimate is computed based on the responses to all of the administered items, and the "best" next item is administered. This process is repeated until it meets a certain stopping criterion, such as a time limit, the number of items administered (test length limit), the change in the ability estimate, content coverage, a precision indicator such as the standard error, or a combination of factors.

[Figure 1.1 Steps of computerized adaptive testing: select an item from the pool using the initial ability estimate, administer the item, obtain the response, and update the provisional trait estimate; if fewer items than the test length have been administered, select the next item, otherwise compute the final score.]

Computerized adaptive tests offer many advantages over paper-and-pencil tests. Examinees get flexible testing schedules and the opportunity to obtain their scores immediately following administration of the test. More importantly, more accurate ability estimates and higher test reliability and validity are achieved. Test developers may be able to adopt new item formats in computerized administration that are not possible with paper-and-pencil tests. CAT has also dramatically changed the way tests are administered. Traditional paper-and-pencil tests are usually administered in classrooms or a large auditorium, where desks and chairs are the only requirements for a testing site. CAT, however, needs more expensive equipment (a computer system) and a more individualized space to ensure privacy and security.
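As a rough illustration of the loop in Figure 1.1, the sketch below pairs maximum-information item selection with a simple provisional ability update under the three-parameter logistic (3PL) model. This is not the operational algorithm studied in this dissertation: the item parameters are randomly generated, and the step-halving score update is a crude stand-in for maximum likelihood or Bayesian scoring.

```python
import math
import random

def p3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def info(theta, a, b, c):
    """Fisher information of a 3PL item at theta."""
    p = p3pl(theta, a, b, c)
    return (1.7 * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def simulate_cat(pool, true_theta, test_length=10):
    """Fixed-length CAT: repeatedly pick the unused item with maximum
    information at the provisional theta, score it against the simulated
    examinee, and nudge theta up or down with a shrinking step."""
    theta, used, step = 0.0, set(), 1.0
    for _ in range(test_length):
        # maximum-information selection over unadministered items
        item = max((i for i in range(len(pool)) if i not in used),
                   key=lambda i: info(theta, *pool[i]))
        used.add(item)
        a, b, c = pool[item]
        correct = random.random() < p3pl(true_theta, a, b, c)
        theta += step if correct else -step
        step *= 0.7  # crude stand-in for an MLE/EAP update
    return theta, used

random.seed(1)
# hypothetical pool: (a, b, c) triples spread across difficulty
pool = [(round(random.uniform(0.5, 2.0), 2),
         round(random.uniform(-3.0, 3.0), 2), 0.2) for _ in range(200)]
est, used = simulate_cat(pool, true_theta=1.0)
```

Even this toy version exhibits the usage pattern discussed later in this chapter: highly discriminating items near the provisional ability estimate are chosen first, so a handful of items in the pool absorb most of the exposure.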
Specialized test centers with appropriate computer equipment have been set up that accommodate only a few examinees at a time. However, tests can be offered on a nearly continuous basis, with multiple administrations per day throughout the week, so that large testing volumes can still be accommodated. Continuous testing gives test administrators and examinees more freedom, yet it also poses difficulties in test development and test security.

Computerized adaptive testing administers different items to different examinees according to their ability levels. It thus requires a large number of diversely distributed items to measure each person with precision and efficiency (Embretson, 2001). Items are needed at the extreme levels of difficulty, although it may be arduous for even an experienced item writer to produce such items. The high cost of item development becomes one of the major impediments to cost-effective implementation of computerized adaptive testing. A quality item usually needs to go through a lengthy and costly procedure, including item writing, content and editorial review, pretesting, item analysis, and final review, before it can be chosen for an operational test. In particular, item writing relies on experienced human item writers, who may rather slowly produce items of varying levels of quality. Some of the items that are produced may fail to meet screening criteria or may be unable to achieve adequate psychometric properties based on an empirical tryout.

Until the Internet became widely used, CAT was considered more secure than paper-and-pencil tests because different examinees receive different test items. It is difficult to artificially promote one's score by merely studying a few items (Wainer, 1990). One would have to learn a large portion of the item pool in order for pre-knowledge to have any impact on an examinee's score, because the test is individually tailored to his or her ability level.
However, the increasingly popular Internet makes it easy for examinees to share test-related materials with each other. Presently, a student taking a test on a future date can obtain information from a friend who took the test on the previous day (Davis, 2002).

Item leakage and the high costs of item development have been the primary driving forces for research on designing and maintaining CAT item pools for better item usage. One of the solutions is to design high-quality items more efficiently. A promising technique is item cloning, also called model-based item generation. This approach starts with a formal description of a set of "parent items" along with algorithms to derive families of clones from them (Bejar, 1993, 1996; Bejar & Yocom, 1991; Hively, Patterson, & Page, 1968; Osburn, 1968). These parents are also known as "item forms," "item templates," or "item shells." More recent item generation research has tried to model the relationship between the parts of an item and its psychometric properties, such as item difficulty and item discrimination (Bejar, Lawless, Morley, Wagner, & Bennett, 2003; Glas & van der Linden, 2003; Graf, Peterson, Steffen, & Lawless, 2005). Items with expected content coverage and psychometric properties (e.g., item difficulty and discrimination) can be produced by varying the essential parts of an item (Deane, Graf, Higgins, Futagi, & Lawless, 2006; Singley & Bennett, 2002). Comprehensive reviews of item-cloning techniques are given in Bejar (1993).

Item cloning offers some hope of producing a large number of items in a very short time, but developing the item models required for cloning can be very difficult and costly. A more realistic solution is to optimize the blueprint for item pool development. In common computerized adaptive testing practice, items are selected to maximize item information at the estimated ability level, so that the most is learned about an examinee's ability (Wainer, 1990).
This practice produces variability in the frequency with which items are used, because items differ in the desirability of their characteristics for measuring an examinee's ability level. For example, items of average difficulty will be selected for administration more often because of the assumed normal distribution of examinee ability. Additionally, more discriminating items will be selected more often because they tend to be more informative. If item writers know how many items with certain psychometric properties are needed for an item pool, they will not waste resources on developing too many items that are rarely used while creating too few items that may be frequently used.

This project investigates the application of a simulation method developed by Reckase (2003, 2004) to designing the blueprint for a CAT item pool. Because an optimal item pool will be different under different test situations, this project also explores item pool design when different exposure control methods are used, specifically the Sympson-Hetter and a-stratified methods. The first chapter briefly introduces the background against which this project was developed. The second chapter reviews the literature on computerized adaptive testing, exposure control, and the various methods developed to design the blueprint that optimizes item pool construction. Chapter 3 illustrates Reckase's method in detail and provides extensions of the method to applications with the three-parameter logistic IRT model. Following these sections, simulation studies are conducted to demonstrate the item pool design process and investigate the effectiveness of Reckase's method compared to a real item pool. Finally, the implications of this study and future directions are presented.

1.1 Research Context

The idea of an item pool is not new. It evolved long before computerized adaptive testing became popular.
Even in conventional paper-and-pencil tests, a well-designed item pool provides test developers and teachers a convenient yet powerful tool for producing high-quality tests. Item pools have been called by such terms as "item banks," "question banks," "item collections," "item reservoirs," and "test item libraries." Although distinctions among some of these terms can be made, they all refer to a relatively large collection of easily accessible test questions (Millman & Arter, 1984). In other words, an item pool has to include a number of items that exceeds by several times the number to be used in any one test. Items in the pool are indexed, structured, or otherwise assigned information that can be used to facilitate their selection for a test. Millman and Arter (1984) categorized this information into assigned item characteristics (e.g., keywords or other subject matter classifiers) and measured item characteristics (e.g., item difficulty or discrimination). The latter are also known as psychometric characteristics.

The concept of an item pool is expanded in computerized adaptive testing. Two kinds of item pools are distinguished in a typical CAT program. One is often called the master pool, which includes as many items as can possibly be created for testing use. The other kind is the operational item pool, a smaller subset of the master pool that, by design, has to be small enough that the computer may easily retrieve items and item exposure can be minimized, yet large enough to provide items with the required characteristics. Due to the continuous nature with which many CATs are administered, the useful life of an operational item or the entire operational item pool is limited. After a certain number of uses, items are retired and put back into the master pool. Some items can be reused only after a reasonably long time.
One question often asked during item pool design is how many items should be in a pool. Ideally, the more items the better, because a large pool allows more choices in test assembly, and seldom do the same items appear in tests repeatedly. With larger pools, it is difficult for examinees to memorize answers. This matters in situations where learners have access to the item pool. Larger pools also mean that more items matching the content, item format, and statistical requirements are available (Millman & Arter, 1984). The caveats, however, are: (a) the items added to the pool should be well written, content valid, and statistically fit; and (b) the total number of items should be manageable and easily retrievable.

In paper-and-pencil test situations, test items are not reused as often. Millman and Arter (1984) and Prosser (1974) suggested the rule of thumb that an item pool contain 10 items for each one that could be used on a testing occasion and 50 items for each class hour of presented material. In computerized adaptive testing, Luecht (1998) suggests that between 3,800 and 21,000 items may be needed to begin a CAT program when sufficient pool size, multiple pools, and item pretesting are taken into consideration. Guidelines that have been suggested for the appropriate size of the operational item pool are 150-200 items, or from six to twelve times the test length of an operational form (Luecht, 1998; Patsula & Steffan, 1997; Stocking, 1994; Weiss, 1985). However, issues of item exposure, item retirement, and pool rotation may require this number to be much larger. An often overlooked issue in item pool design is how to construct a blueprint that outlines the optimal composition of items with desirable assigned and psychometric characteristics.
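As a quick back-of-envelope check, the rules of thumb above translate directly into concrete pool sizes. The 20-item test and the 45-hour course used below are hypothetical values chosen only for illustration.

```python
# Pool sizes implied by the guidelines cited above, for a hypothetical
# 20-item adaptive test and a hypothetical 45-hour course.
test_length = 20

paper_pool = 10 * test_length                 # 10 items per test slot
course_pool = 50 * 45                         # 50 items per class hour
cat_pool_range = (6 * test_length,            # six to twelve times the
                  12 * test_length)           # operational test length
```

Even these modest guideline values show why exposure control and pool rotation push the required counts far higher in practice.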
The blueprint, as the outcome of item pool design, can tell item writers to write items not only by format (multiple-choice or constructed-response) and content coverage, but also by the desired psychometric characteristics of the items. The blueprint is optimal in that it consists of appropriate items for each individual test that is capable of reaching the desired level of precision. An optimal blueprint also contains well-balanced items to achieve optimal item usage and lower the cost of item creation.

Optimizing an item pool may not be an important issue in paper-and-pencil tests, where item exposure is not much of a concern. Usually such an item pool may require only a few items in each assigned item characteristic, more moderately difficult items, and some extremely easy and extremely difficult items. In computerized adaptive testing, items have more chances to be overexposed, and the costs of developing new items are so high that an operational item pool needs a more balanced item composition in order to reduce the exposure of often-used items and increase the exposure of less-used items. A better way to address this problem is to design and develop item pools in a more systematic and empirical manner.

The item-writing process is usually guided by appropriately designed test specifications that outline the content attributes and their distributions. Requirements for statistical attributes, such as the range of difficulty, may be provided but are often difficult to satisfy, simply because the values of statistical attributes for individual items are not easily predicted. However, at the item pool level they often show persistent patterns of correlation with content attributes. These patterns can be used to minimize the item-writing effort.
Through careful modeling of the CAT procedure, test specifications for the item pool can be developed with computer simulations that forecast the number of items needed with specific attributes (van der Linden, 1999; Reckase, 2003). The methods compared here are for the design of a single item pool and can serve as tools for monitoring the item-writing process.

Only a few empirical studies on optimal item pool design have been documented for computerized adaptive testing. Among them there are two bodies of research. One assumes the existence of a master item pool, with research focusing on the best way to allocate items into multiple operational item pools. The other focuses on the design of the operational item pool independent of existing items. These studies assume no existing items and focus on the design of a blueprint for item pool construction in order to provide precise ability estimation and minimize item exposure.

Stocking and Swanson's (1998) system of rotating item pools assumed the presence of a master item pool from which several smaller operational pools were generated. The number of operational pools each item is included in can be manipulated so that items with higher exposure rates are assigned to a smaller number of pools and items with lower rates to a larger number of pools. By randomly rotating the operational pools during testing, uniformly distributed exposure rates for the test items can be achieved. Ariel, Veldkamp, and van der Linden (2004) also presented a mathematical method to calculate the optimal way to allocate items from a master item pool into multiple operational pools. The effectiveness of the rotating-pools method, however, relies on the quality of the master pool. Items have to be available in the master pool to be assigned into the smaller operational pools. Apparently, if the master pool contains difficult items only, even the optimally assembled rotating pools would not have easy items.
The fundamental issue in item pool design is how to design an item pool without assuming the existence of the master item pool, in order to explore the ideal characteristics, such as the size and the item distribution, that the item pool should have to function efficiently. Research concerning the design of the item pool from scratch focuses on developing a blueprint for an item pool: a document that specifies the attributes of the items needed in a new pool or an extension of an existing pool. The blueprint is designed to allow for the assembly of a prespecified number of test forms from the bank, each with its own set of specifications. The resulting item pool would allow the CAT procedure to generate adequate measurement precision for a majority of the test takers, even with the constraints of exposure control or content balancing. A favorable consequence is that the number of unused items in the bank is also minimized. As will become clear, the blueprint specifies not only the number of items with each content coverage, but also the number of items with certain psychometric properties, particularly the ranges of the IRT parameters.

Boekkooi-Timminga (1991) used integer programming to calculate the number of items needed for future test forms. She used a sequential approach that maximized the test information function (TIF) under the one-parameter logistic (Rasch) model. These results were then used to improve the composition of an existing item bank. Subsequently, several methods for the construction of rotating item pools have been demonstrated in empirical studies, some achieving the design goal with integer programming methods (for a review of these methods, see Ariel, Veldkamp, & van der Linden, 2004). Veldkamp and van der Linden (1999) described five steps to design an optimal blueprint for a CAT item pool with a mathematical programming method.
First, a set of specifications for the CAT is analyzed and all item attributes figuring in the specifications are identified. Second, using the specifications, an integer programming model for the assembly of the shadow tests in the CAT simulation is formulated. Third, the population of examinees is identified and an estimate of its ability distribution is obtained, for example, from historical data. Fourth, the CAT simulation is carried out using the integer programming model for the shadow tests and sampling simulees from the ability distribution. Counts of the number of times items are drawn from the cells in the classification table are collected. Fifth, the blueprint is calculated from these counts, adjusting them to obtain optimal projections of the item exposure rates.

Because the basic idea of mathematical programming is to optimize resource allocation under the assumption of limited resources, in this case the optimization is constrained by the prespecified resources. It thus assumes a design space that is defined as the Cartesian product of all item attributes figuring in the specifications of the tests in the program. Each combination of the attributes represents a virtual item. For example, a combination of content coverage and a-, b-, and c-parameters represents a virtual item available for the optimal item pool. Because item parameters are real numbers, there are infinitely many combinations of item attributes. To simplify the problem, discrete values are chosen to represent the possible values of the item parameters. For example, b = -3.0, -2.9, ..., 2.9, 3.0 represent the possible b-parameter values and a = 0.1, 0.2, ..., 2.9, 3.0 represent the possible values for the a-parameters. Any combination of the a- and b-parameters represents a virtual item. Veldkamp and van der Linden (1999) demonstrated that modeling the constraints for the CAT version of the GMAT led to a design space containing 12,096 items.
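The design-space construction described above can be sketched in a few lines of Python. The discretization steps, the content areas, and the resulting count below are illustrative assumptions, not the GMAT specification discussed in the text.

```python
from itertools import product

# Hypothetical discretization, following the example in the text:
# b-parameters from -3.0 to 3.0 and a-parameters from 0.1 to 3.0,
# both in steps of 0.1.
b_values = [round(-3.0 + 0.1 * k, 1) for k in range(61)]   # 61 values
a_values = [round(0.1 + 0.1 * k, 1) for k in range(30)]    # 30 values
content_areas = ["arithmetic", "algebra", "geometry"]      # illustrative only

# Each (content, a, b) combination is one "virtual item" in the design space.
design_space = list(product(content_areas, a_values, b_values))
# 3 * 30 * 61 = 5490 virtual items for this toy discretization
```

Finer grids or additional attributes (c-parameters, item format, and so on) multiply the size of the space, which is why the simulation can become computationally arduous.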
Formulating constraints is an important step in the mathematical programming method. Van der Linden (1998) laid out three kinds of constraints based on their mathematical types: categorical item attributes, quantitative item attributes, and inter-item dependencies. Categorical item attributes are attributes that characterize the content, format, or author of items. Quantitative item attributes are quantitative properties items have, such as word counts, difficulty parameters, and discrimination indices. Inter-item dependencies deal with possible relations of exclusion and inclusion between the items in the pool, such as items in so-called enemy sets, in which items cannot be included in the same test.

The advantage of the mathematical programming method is that it is able to model complicated test specifications. Once the constraints are identified and transformed into numerical constraints, special software is available to simulate the optimal item pool. However, item pool design with the mathematical programming method is closely tied to the shadow test procedure in item selection and requires knowledge of special optimization software. Depending on the way item attributes are partitioned, the design space can be very large and the simulation process becomes computationally arduous.

Reckase (2003, 2004) took a slightly different approach and avoided using mathematical programming. This approach does not assume pre-existing items. Instead, items are simulated (in terms of their IRT parameters) to match the current ability estimates and provide optimal information. Reckase's method first partitions the target item pool into smaller ones based on different non-statistical attributes, such as content. Then the CAT process is simulated to construct the small item pools simultaneously. The simulation starts with an examinee randomly drawn from the expected examinee distribution to receive the adaptive test.
Each item is simulated to be the optimal item based on the current ability estimate. The same procedure is repeated for subsequent examinees, and the items needed to support a large sample of examinees are tallied and become the optimal item pool. Exposure control rules can be built into the simulation to decide how many times an item can be reused. This procedure has been demonstrated successfully with widely available programming software in the design of CAT item pools for the TABE and NCLEX.

1.2 Summary

The present study reports on the development of optimal item pools for computerized adaptive tests in an investigation of the relative merits of two different strategies. A modified version of Reckase's method is applied to designing optimal item pools calibrated with the three-parameter logistic model. Sympson-Hetter and a-stratified exposure control methods, as well as content balancing, are investigated.

For the purposes of this research, it was desirable to have an operational pool of items measuring an empirically significant dimension of ability while balancing the content areas measured. Operational item pools for two sections of the CAT-ASVAB were chosen as the design target in this study. The final item pools were designed to meet the criteria described by van der Linden (2000): (1) each would be sufficiently large to allow several thousand overlapping subtests to be drawn from its items; (2) the items would span the entire range of item difficulty relative to the population of interest; and (3) each would consist of an appropriate mix of high- and low-discriminating items to lower the item creation cost while meeting the needs of test precision. This study compared simulated optimal item pools to operational item pools on item distribution and on performance for examinees randomly sampled from the expected examinee distribution.
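The core of the Monte Carlo design logic used in this study can be sketched in miniature under several simplifying assumptions: a standard normal examinee distribution, fixed a- and c-parameters, a crude step-size update standing in for a full MLE or Bayesian scoring procedure, and no exposure control. All numerical settings below are hypothetical.

```python
import math
import random
from collections import Counter

random.seed(1)
D = 1.7  # scaling constant of the 3PL model

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def simulate_blueprint(n_examinees=500, test_length=20,
                       a=1.2, c=0.2, bin_width=0.4):
    """Tally the b-parameters of the 'optimal' items generated while
    simulating adaptive tests; the tallies form the pool blueprint."""
    bins = Counter()
    for _ in range(n_examinees):
        theta = random.gauss(0.0, 1.0)   # expected examinee distribution
        theta_hat = 0.0                  # start at the population mean
        for j in range(1, test_length + 1):
            b = theta_hat                # optimal item matches the estimate
            bins[round(b / bin_width) * bin_width] += 1
            if random.random() < p3pl(theta, a, b, c):
                theta_hat += 2.0 / j     # crude shrinking-step update in
            else:                        # place of MLE/Bayesian scoring
                theta_hat -= 2.0 / j
    return bins

blueprint = simulate_blueprint()
```

The resulting counts per b-parameter bin indicate how many items with that difficulty the pool would need; adding content partitions or an exposure cap on each bin follows the same tallying idea.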
The simulation study took into consideration the distribution of the examinee population, content balancing, and the expected precision of ability estimates. The following research questions are investigated in this study:

1. What does the optimal item pool designed for a computerized adaptive test look like when the item selection procedure imposes no exposure control, when it incorporates the Sympson-Hetter method, or when it incorporates the a-stratified method?
2. What do the optimal item pools designed for a computerized adaptive test look like when the test does not need content balancing, and when the test needs content balancing?
3. Do optimal item pools designed by Monte Carlo simulation perform better than the real operational item pools in terms of empirical criteria?

Chapter II
Item Pool Design and Components of Computerized Adaptive Testing

An optimal item pool design is based on an in-depth understanding of the mechanism of CAT. This chapter introduces computerized adaptive testing, its history, its pros and cons, and the components of the CAT procedure, which include the item pool, the item selection procedure, the ability estimation method, and the stopping rules. Special attention will be given to exposure control procedures, which are an integral part of the item selection procedure, and to how the design of the item pool should be based on the analysis of its relationship with the other CAT components.

2.1 Brief History of Computerized Adaptive Testing

Computerized adaptive testing (CAT), as its name suggests, is adaptive testing delivered by computers. While CAT has only recently become a major force in measurement practice, the idea of adaptive testing is not new. It has always been recognized that items that are too easy or too difficult contribute little to the information about an examinee's ability level. By eliminating the need to administer items of inappropriate difficulty, adaptive testing can shorten testing time,
increase measurement precision, and reduce measurement error due to boredom, frustration, or guessing (Wainer, 1990).

The first adaptive test is known to be Alfred Binet's (1905) intelligence test, which is still in use today in a more modern version. Since the concern was with the diagnosis of the individual child, rather than the group, Binet realized he could tailor the test to the individual by a simple strategy: first rank-ordering the items in terms of difficulty, then starting to test the child with a subset of items targeted at his approximation of the candidate's ability. If the child gave correct answers, harder item subsets were administered until the child answered a few questions in a row incorrectly. If the child failed the initial item subset, then easier item subsets would be administered until the child succeeded frequently. From this information, the child's ability level could be estimated.

Lord's (1980) Flexilevel testing procedure and its variants, such as Henning's (1987) Step procedure and Lewis and Sheehan's (1990) Testlets, are a refinement of Binet's method. The items are stratified by difficulty level, and several subsets of items are formed at each level. The test then proceeds by administering subsets of items and moving up or down in accord with the success rate on each subset. After the administration of several subsets, the final ability estimate is obtained. Though a crude approach, these methods can produce approximately the same results as more sophisticated CAT techniques (Yao, 1991).

The use of computers facilitated a further advance in adaptive testing, with convenient administration and selection of single items. Reckase's (1974) study is an early example of this methodology of computerized adaptive testing. Since the mid-1990s, many well-known large-scale, high-stakes testing programs, such as the GRE, GMAT, NCLEX, and LSAT, have switched from paper-and-pencil tests to computerized adaptive tests.
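The subset-branching logic shared by Binet's procedure and the Flexilevel-style methods could be expressed as a single update rule. The success-rate thresholds and level bounds below are hypothetical choices for illustration, not values taken from the cited sources.

```python
# Subsets are stratified into difficulty levels 0 (easiest) .. n_levels - 1.
def next_level(level, n_correct, subset_size, n_levels):
    """After a subset is administered, move up one difficulty stratum on a
    high success rate, down on a low one, and otherwise stay put."""
    rate = n_correct / subset_size
    if rate >= 0.7:          # hypothetical "succeeded frequently" threshold
        level += 1
    elif rate <= 0.3:        # hypothetical "failed the subset" threshold
        level -= 1
    return max(0, min(n_levels - 1, level))
```

Iterating this rule over several subsets and reading off the final level is, in essence, the crude ability estimate these pre-computer procedures produced.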
2.2 Pros and Cons of CAT

The advantages of and cautions about CAT have been well documented (e.g., Rudner, 1998; Wainer, 1990; Wainer & Eignor, 2000). One advantage is that, in general, CAT greatly increases the flexibility of test management (e.g., Grist, Rudner, & Wise, 1989; Weiss & Kingsbury, 1984). Tests are individually paced, so an examinee does not have to wait for others to finish before going on to the next section. Self-paced administration also offers extra time for examinees who need it, potentially reducing one source of test anxiety. A number of options for timing and formatting, notably interactive audio and video, can be offered. Therefore, CAT has the potential to accommodate a wider range of item types.

Besides flexibility, CAT increases test efficiency by shortening the test length without reduction in measurement precision. A shorter test reduces examinee frustration, boredom, and fatigue, factors that can significantly affect an examinee's test results. Since examinees see only those items appropriate for their ability level, they remain engaged and challenged. In addition, individualized tests allow easy removal of faulty items. With CAT, a poorly performing or incorrect item would affect only a segment of the test takers, and even for those, the self-correcting nature of CAT would make it unlikely there would be any impact on the pass-fail decision.

2.3 Components of Computerized Adaptive Testing

Reckase (1989) listed four major components of a computerized adaptive test: the item pool, the item selection procedure, the scoring (ability estimation) procedure, and the stopping rule. Item exposure control and content balancing have recently been extensively studied to constrain the item selection so that items are selected not only by their statistical appeal but also by content specifications and security concerns.
An optimal item pool should be determined by the other components of the CAT, namely the test length, the expected distribution of the examinee population, the ability estimation and item selection procedures, and the target item exposure and overlap rates (Bergstrom & Lunz, 1999).

2.3.1 Item Pool

The adaptive feature of CAT makes it unnecessary to use pre-designed test forms like those of a paper-and-pencil test. Instead, it requires an item pool from which all tests will be drawn. In practice, there are two kinds of item pools: one is called the master pool and the other is called the operational pool. The master pool is an inventory of test items maintained to supply the testing program. An operational pool is a pool of items from which individual adaptive tests are actually assembled. Typically, a master pool is much less structured than an operational pool. Its items may be in various stages of development, while the items in an operational item pool have passed all preparatory stages and the pool is ready for test assembly. The focus of this study is to investigate methods for designing an optimal operational item pool.

Ideally, the item pool would have a sufficient number of high-quality items to allow several thousand overlapping subtests to be drawn from its items. It would have a sufficient number of items in each desired content area to meet the test specifications. It would span a wide range of item difficulties relative to the population of interest to allow the CAT to estimate ability levels for a broad range of examinees (Urry, 1977). In addition, care must be taken to ensure that the item pool consists of appropriate items to reduce the over- and under-exposure rates while meeting the test precision requirement (Davis, 2002; Wainer, 1990).
Guidelines for the appropriate size of the item pool are 150-200 items, or from six to twelve times the test length of an operational form (Luecht, 1998; Patsula & Steffan, 1997; Stocking, 1994; Weiss, 1985). However, issues of item exposure, item retirement, and pool rotation may require this number to be much larger. Due to the continuous nature with which many CATs are administered, the useful life of an item or an item pool is limited. Luecht (1998) suggests that between 3,800 and 21,000 items may be needed to begin a CAT program when sufficient pool size, multiple pools, and item pretesting are taken into consideration. Strategies to extend the life of a pool, such as drawing multiple overlapping pools from an item vat (Patsula & Steffan, 1997), have been proposed. However, the cost and effort to create and maintain a CAT item pool remain formidable and far exceed those of paper-and-pencil testing, which makes the optimal design of the item pool more important.

An item pool is not only a reservoir of items, but also an organized list of items with clearly defined attributes attached to them. Van der Linden (2000) distinguished three types of item attributes: quantitative, categorical, and logical. Quantitative attributes are item attributes that take on numerical values. Examples of quantitative attributes are word counts, expected response times, statistics such as item p-values and IRT parameters, and the frequency of previous item or stimulus usage. Categorical attributes divide or partition the item pool into subsets of items with the same attribute. Examples of categorical attributes include content category, response format of items (e.g., constructed response or multiple-choice), and use of auxiliary material (e.g., a graph or table). Logical attributes differ from quantitative and categorical attributes in that they are not properties of single items or tests but of pairs,
triples, and so forth. The logical attributes involve relations of exclusion and inclusion between items or tests. For example, a relation of exclusion between items exists if they cannot be selected for the same test because one has a clue to the solution of the other (so-called "enemy items"). A relation of inclusion exists if items belong to a set with a common stimulus and the selection of any item implies the selection of more than one.

2.3.2 Scoring Procedure

One of the advantages of CAT is the ability to administer items suitable to an examinee's ability level. This is achieved by repeatedly estimating the ability level after each item is administered. At the beginning of the test administration, an initial value for the ability level is arbitrarily provided, since no information is known about an examinee and no item has been administered. This value is commonly the expected mean ability level of the testing population or a random number around the expected mean ability level. If prior information is available, it may help determine an initial value that is closer to the examinee's real ability level, thus facilitating the subsequent estimations. After each item is administered, an examinee's ability level is re-estimated based on his or her responses to all previously answered items.

Maximum likelihood estimation (MLE) and Bayesian estimation approaches are the two commonly used ability estimation methods. MLE determines the most likely ability level for an examinee, given the response string to items with specified parameters, by multiplying together the individual probabilities of a correct or incorrect response given theta to compute a joint probability with the function

L(\mathbf{u} \mid \theta) = \prod_{i=1}^{n} P_i(u_i \mid \theta, a_i, b_i, c_i),    (3)

where P_i(u_i \mid \theta, a_i, b_i, c_i) is the probability of getting response u_i on item i given an examinee's true ability \theta and item parameters a_i, b_i, and c_i, and n is the number of items.
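A minimal sketch of scoring with the likelihood of Equation (3), assuming made-up 3PL item parameters and a bounded grid search over theta in place of an iterative maximizer:

```python
import math

D = 1.7  # scaling constant of the 3PL model

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log of the likelihood in Equation (3)."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = p3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def mle_theta(responses, items, lo=-4.0, hi=4.0, step=0.01):
    """Bounded grid-search MLE: all-correct or all-incorrect response
    strings return a boundary value instead of diverging."""
    n = int(round((hi - lo) / step))
    grid = [lo + step * k for k in range(n + 1)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical three-item pool: (a, b, c) triples.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
theta_hat = mle_theta([1, 1, 0], items)
```

Because the grid is bounded at plus or minus 4, degenerate response strings simply return a boundary value, which is one of the practical workarounds for unbounded estimates.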
The maximum likelihood estimate of an examinee's true ability \theta is \hat{\theta}, the value that maximizes the likelihood function (or, equivalently, the log-likelihood function). Mathematically, this can be done by taking the derivative of the likelihood function, setting the result equal to zero, and solving for \theta. Iterative numerical methods such as the Newton-Raphson method (Wainer, 1990) are typically used to solve this equation. MLE ability estimates are popular in CAT contexts due in part to their desirable theoretical properties, such as asymptotic consistency and asymptotic normality. Problems, however, occur in solving the likelihood equation when examinees get all items correct or all items incorrect, because such response patterns yield unbounded ability estimates. These problems are often handled in CAT by setting arbitrary minimum and maximum ability estimates for such response patterns (e.g., -4 and +4) or by using a Bayesian-based ability estimate until the examinee answers at least one item correctly and one item incorrectly.

Owen's (1969) Bayesian sequential ability estimation technique was proposed as part of his adaptive testing strategy, which selects items that minimize the expected value of the Bayesian posterior variance. This ability estimation procedure, however, has proven useful in adaptive testing strategies using other item selection criteria as well. Owen's Bayesian method begins with a prior distribution of ability, in effect an assumption that the examinee is a member of a population with a normal distribution of ability with known mean and variance. After each test question, the mean and variance are updated using a statistical procedure that combines the information in the prior distribution with the observed score (right or wrong) on the most recent test question, and the parameters of that question's IRT model.
The updated values of the ability distribution parameters specify a normal "posterior" distribution, which is used as the prior distribution for the next question. This process continues until the end of the test. At that point, the posterior mean is used as the estimate of the examinee's ability scale location. Owen's formula for updating the prior mean is as follows:

\mu(\theta_j \mid u_i) = \frac{\int \theta \, P(u_i \mid \theta) \, h(\theta) \, d\theta}{\int P(u_i \mid \theta) \, h(\theta) \, d\theta}.    (4)

Owen (1975) showed that after each item is administered, closed-form approximations can be used to update the estimates of the posterior mean and variance.

2.3.3 Item Selection Procedure

Maximum information (MI) item selection chooses the item that provides the most Fisher information at the provisional ability estimate,

I_j(\theta) = \frac{[P_j'(\theta)]^2}{P_j(\theta) Q_j(\theta)},    (1)

where P_j(\theta) is the probability of a correct response, given \theta, and Q_j(\theta) is the probability of an incorrect response. Plugging the item parameters into Equation (1), it can be simplified for the dichotomous three-parameter logistic item response model (Hambleton, Swaminathan, & Rogers, 1991; Lord, 1980):

I_j(\theta) = \frac{D^2 a_j^2 (1 - c_j)}{(c_j + e^{D L_j})(1 + e^{-D L_j})^2},    (2)

where L_j = a_j(\theta - b_j), D = 1.7, a_j is the item discrimination parameter, b_j is the item difficulty parameter, and c_j is the pseudo-chance-level parameter (i.e., the probability of a very low-\theta examinee correctly answering the item). Equation (2) indicates that the item information increases as b_j approaches \theta, as a_j increases, and as c_j approaches 0 (Hambleton et al., 1991). Unconstrained maximum information selection chooses the item that maximizes the Fisher information evaluated at \hat{\theta}, the provisional proficiency estimate for the examinee after the preceding items. When the items that constitute a CAT are selected using MI, the precision of \hat{\theta} increases as each item is administered (Hambleton et al., 1991).

In practice, maximum information item selection is often based on a previously computed table in which items are sorted by the information they provide at each of a number of proficiency values (an "Info Table"). Item selection is equivalent for all \theta's in an interval around a tabulated value.
Rather than evaluating the Fisher information for each item in the pool at the current value of \theta each time the next item is to be selected, the information need only be evaluated once for each item at each tabulated point. Item selection based on an Info Table is slightly less efficient but less computationally burdensome than maximum information item selection.

The MPP (maximum posterior precision) strategy, also called Owen's Bayesian method, selects the item that maximizes the expected posterior precision of the ability estimate or, equivalently, minimizes the expected posterior variance of the ability estimate. At the initial stages of a test, MPP item selection might differ from MI item selection due to its use of the posterior distribution. However, at later stages, the posterior variance approaches the reciprocal of the test information (Chang & Stout, 1993). Therefore, the MPP strategy might provide results that are similar to those of the MI strategy. MPP is computationally easier than MI because of its quick approximation to the posterior ability distribution (Davey & Parshall, 1995). This advantage made MPP more popular when computing power was limited. However, increases in available computing power and the fact that with the MPP strategy the estimated ability level varies as a function of item order have made maximum information more widely used (Wainer, 1990). A hybrid strategy was also developed, using Owen's Bayesian sequential procedure to update ability after each item response and selecting items sequentially by referring to precomputed item information lookup tables. This strategy was evaluated by several researchers and seemed to retain the advantages of MI and MPP while avoiding their disadvantages (Wetzel & McBride, 1986). These statistically motivated item selection procedures can be tempered by practical considerations such as item exposure rates and content balancing, both of which will be described in detail later in this chapter.
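The 3PL information function of Equation (2) and the two selection strategies built on it, direct maximum information and the precomputed Info Table, can be sketched together. The three-item pool and the tabulated theta points below are hypothetical.

```python
import math

D = 1.7  # scaling constant of the 3PL model

def item_information(theta, a, b, c):
    """Item information function of Equation (2) for the 3PL model."""
    L = a * (theta - b)
    return (D ** 2 * a ** 2 * (1 - c)) / (
        (c + math.exp(D * L)) * (1 + math.exp(-D * L)) ** 2)

# Hypothetical three-item pool: (a, b, c) triples.
pool = [(0.8, -1.5, 0.2), (1.4, 0.0, 0.2), (1.0, 1.5, 0.2)]

def pick_max_info(theta_hat, administered):
    """Unconstrained MI selection: the unused item with the largest
    information at the provisional ability estimate."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta_hat, *pool[i]))

# "Info Table" variant: tabulate information once for every item at a few
# theta points, then select from the column nearest the provisional estimate.
theta_points = [-2.0, -1.0, 0.0, 1.0, 2.0]
info_table = {t: [item_information(t, *it) for it in pool]
              for t in theta_points}

def pick_from_table(theta_hat, administered):
    t = min(theta_points, key=lambda p: abs(p - theta_hat))
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: info_table[t][i])
```

The table version trades a little selection accuracy near the boundaries between tabulated points for a large reduction in per-item computation, which is exactly the trade-off described above.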
2.3.4 Stopping Rule

There are two methods to determine when to stop administering items in a CAT. One is “fixed length,” which requires all examinees to take the same number of items. With the same test length for all examinees, test-taking time is similar across examinees, making test administration more predictable. However, measurement precision will differ across ability levels, which causes difficulty in reporting test reliability. The other method, the “variable length” method, requires that examinees continue to take items until reaching a pre-specified level of precision. It is, however, hard to explain to non-experts why different numbers of items are administered. In terms of item pool use, variable-length tests tend to perform better than fixed-length tests as they minimize test length (Bergstrom & Lunz, 1999). However, simulation studies showed that fixed-length testing was more efficient because highly informative items were typically concentrated over a restricted range of examinee ability. In variable-length testing, examinees falling outside a certain range of ability tend to receive long tests, with each additional item providing very little information and fatigue influencing test precision (Segall, Moreno, & Hetter, 1997). In practice, some adjustments are made to compensate for the shortcomings of either method. For example, in some certification tests, every examinee has to take a minimum number of items to cover enough content. If the desired precision level cannot be reached for an examinee whose ability is close to the cut-score, more items are given until the precision level or a maximum test length is reached.

2.4 Practical Constraints in Item Selection

Without additional constraints, item selection algorithms select a single best item at each step of testing for each examinee.
In practice, some constraints that play no formal role in an IRT model are often imposed on item selection algorithms to achieve desirable patterns of item usage. For example, a given examinee should not receive a particular item twice; the number of items in each content area must not exceed the proportions specified by the test specification; and the same item should not be administered more than a certain number of times. Among these practical considerations, item exposure control and content balancing are two of the most frequently addressed in CAT operations. The exposure rate of an item is defined as the ratio between the number of times the item is administered and the total number of examinees. Because CATs are administered frequently to small groups of examinees, there is a risk that items with high exposure rates might become known to examinees (Mills & Stocking, 1996). Thus, high item exposure rates can decrease test security. Item selection based purely on a statistical model, such as maximum information or maximum posterior precision, is the primary cause of high item exposure rates. For example, if no auxiliary information is used to start the CAT, every examinee sees the same item first and one of a single pair of items second. Those three items would soon become public knowledge for any widely used CAT. Research has demonstrated that, typically, CAT item selection results not only in the pool’s most informative items being administered most often, but also in a very small percentage of the pool’s items accounting for a very large percentage of administered items (Wainer, 2000; Wainer & Eignor, 2000). In other words, the most popular items are popular by a large margin, resulting in an exponential decline in item usage as rank increases. Furthermore, with maximum information item selection, two examinees with the same ability estimate will likely see the same item (Hetter & Sympson, 1997).
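The exposure-rate definition above (administrations divided by total examinees) can be tallied directly from an administration log. A minimal sketch, with the log format assumed:

```python
from collections import Counter

def exposure_rates(administration_log, n_examinees):
    """administration_log: one entry per administration, each entry an
    item id. Returns each administered item's exposure rate, i.e. the
    number of administrations divided by the number of examinees."""
    counts = Counter(administration_log)
    return {item: n / n_examinees for item, n in counts.items()}
```

Running this over a simulated CAT's log is also how the skew described above (a few items carrying most administrations) is typically diagnosed.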
While rarely used items seem to be a waste of resources spent on item creation, frequently exposed items may cease to be a valid measure of ability because examinees may have prior knowledge of items, either from taking a pretest or from friends who took the test before (Parshall, Davey, & Nering, 1998). The probability that the same examinee would take items he or she saw in a pretest can be minimized if items that are pretested together are put into separate item pools (Stocking & Lewis, 2000). It is hard, however, to deal with the threat from what Luecht (1998) called “examinee collaboration networks (ECNs),” which are global groups of examinees who pool their resources and test experience to discover a sufficient number of test items from an item pool to artificially increase scores (Davis, 2002). This situation is exacerbated by the fact that CATs are often administered continuously (Stocking & Lewis, 2000). To lower the risk of overexposing test items, mechanisms are imposed on the item selection function to control the item exposure rate. Estimation efficiency is thus traded off against a more evenly distributed item exposure, a benefit from the point of view of test security. Another consideration in item selection is the balance of item content. Content areas in a single test are often associated with the notion of multidimensionality. There have been extensive debates on whether a test composed of different content areas should be treated as unidimensional or multidimensional. If unidimensional IRT is used in the item selection and scoring procedures of a CAT, one of the major assumptions is that performance on items within a given content area can be characterized by a unidimensional IRT ability. Violations of the unidimensional adaptive testing model may have serious implications for validity and testing fairness (Segall, Moreno, & Hetter, 1997).
When content areas are disparate or introduce additional dimensionality, a logical option is to use a multidimensional model and estimate separate ability levels within a single test (Parshall, Davey, & Nering, 1998). Another option is to split the item pool by administering separate tests, with separate ability estimation for each content area (Segall, Moreno, & Hetter, 1997). However, it is often not practical to divide a single test into several separate ones, given the goals of the testing program (Thissen & Mislevy, 2000). When content areas are shown to measure a single ability dimension, it is possible to design the item pool with item quantities proportional to the desired content coverage for each examinee’s test (Green, Bock, Humphreys, Linn, & Reckase, 1984; Segall, Moreno, & Hetter, 1997). For example, in a test of general science it might be desirable to constrain the 15-item adaptive test to include seven life science items, seven physical/earth science items, and a single chemistry item, all in prespecified ordinal positions in the test. As with item exposure control, programmatic restrictions on the item selection procedure are necessary to ensure the desired content coverage during the CAT. In a fixed-length CAT, the ordinal positions for each item type or content may be specified a priori, or a spiraling scheme rotating through the various kinds of items may be used. With a variable-length CAT, the algorithm rotates through the various types of items so that balance is approximately maintained at each possible stopping point. Like item exposure control, one possible drawback of any content balancing method is that the most informative item in the selected content area may not be the most informative item available in the item pool. This can threaten measurement precision (in a fixed-length test) and can result in longer tests (in a variable-length test) due to the administration of sub-optimal items.
Some alternatives have been proposed to balance the content areas while maintaining test efficiency. The first is to balance fixed proportions of information based on the same fixed target content percentages of items required by the test specifications. The second is to balance the proportions of information based on target content percentages of total test information that are conditional on estimated ability (Davey & Thomas, 1996; Thomasson, 1997).

2.5 Exposure Control Methods

Way (1998) classified exposure control procedures into two categories: randomization (Davis & Dodd, 2001; Kingsbury & Zara, 1989; Lunz & Stahl, 1998; McBride & Martin, 1983) and conditional selection procedures. Instead of always administering the most informative item, randomization procedures select several items near the maximum information level and then randomly administer one of the selected items. Although relatively easy to implement, randomization procedures make it hard to specify a maximum exposure rate for specific items. Conditional selection procedures address this problem by assigning exposure control parameters to each item and adjusting the administration rate accordingly. Conditional selection procedures, however, require a time-consuming iterative process to obtain the exposure control parameters. If the item pool or the ability distribution of the examinee population changes, the same process must be repeated to reset the exposure control parameters. In addition to the randomization and conditional selection procedures, Chang and Ying (1996) developed the a-stratified procedure, in which items with low discrimination are administered first, followed by items with high discrimination as more accurate estimates of the examinees’ ability levels are obtained.

2.5.1 Sympson-Hetter Exposure Control

The Sympson-Hetter exposure control procedure (Sympson & Hetter, 1985) is one of the most commonly used conditional selection procedures.
This procedure assigns to each item an exposure control parameter value that is based on the frequency of item selections during an iterative CAT simulation. Items with high administration frequencies are assigned smaller exposure control parameters, which range from zero to one. During test operations, the exposure control parameter of the selected item is compared to a random number, which also ranges from zero to one. If the exposure control parameter is larger than the random number, the item is administered. If it is smaller, the item is put back into the item pool and the same process is applied to the next best item. The item exposure control parameter acts as a threshold. By controlling the thresholds, the S-H method limits the administration of frequently used items in CAT and ensures a maximum item exposure rate for less often used items. The exposure control parameters in the S-H method are usually set by a series of iterative simulations of real CAT administrations. Simply put, each parameter is the ratio of the target exposure rate to the probability of the item being selected in testing. How it works can be shown as follows: Let S_j denote the selection of item j for a randomly sampled examinee, and let A_j denote the administration of that item. The exposure rate for item j can be interpreted as P(A_j), the probability that item j is administered to a randomly sampled examinee. The S-H method separates item administration from item selection via the probability relation P(A_j) = P(A_j | S_j) P(S_j) and controls P(A_j) by controlling P(A_j | S_j), the proportion of selections that lead to administration. For any given exposure rate r_j > 0, P(A_j) ≤ r_j can be achieved by setting P(A_j | S_j) ≤ r_j / P(S_j). If P(S_j) is known or can be approximated, this method can easily be implemented by generating a uniform (0,1) random variable. Hetter and Sympson (1997) described the steps for setting the exposure control parameter K_i for the items.
The probability of administering an item depends on the relationship between a random number k and the item’s K value. Given that an item is selected, a random number k is generated and compared to K_i: if k < K_i the item is administered; otherwise the item is retained in the pool, the next highest information item is selected, and the same procedure is applied. There are five steps to set the exposure control parameters:

Step 1. Generate the first set of K_i values, which are 1.0 for every item. This results in an n-by-one vector for n items. Denote its i-th element as K_i, associated with item i.

Step 2. Administer adaptive tests to a random sample of simulees. At each step of a test, identify the most informative item i available at the examinee’s current ability estimate, then generate a pseudo-random number x from the uniform distribution (0,1). Administer item i if x is less than or equal to the corresponding K_i. Whether or not item i is administered, it is excluded from further administration for the remainder of this examinee’s test. Note that for the first simulation, all the K_i’s are equal to 1.0 and every item is administered, if selected.

Step 3. Keep track of the number of times each item in the pool is selected (NS) and the number of times it is administered (NA) in the total simulee sample. When the complete sample has been tested, compute for each item P(S), the probability that the item is selected, and P(A), the probability that the item is administered:

P(S) = NS/NE, P(A) = NA/NE,

where NE = total number of examinees.

Step 4. Use the value of the expected exposure rate r and the P(S) values computed above to compute new K_i as follows: If P(S) > r, then new K_i = r/P(S). If P(S) ≤ r, then new K_i = 1.0. Make sure that there are at least n items in the item pool that have new K_i = 1.0.

Step 5. Given the new K_i, go back to Step 2.
Use the same examinees and repeat Steps 2, 3, and 4 until the maximum value of P(A) obtained in Step 3 approaches a limit slightly above r and then oscillates in successive simulations. The K_i obtained from the final round of computer simulations are the exposure-control parameters to be used in real testing. The S-H method effectively limits the exposure rates of all items. However, because items that are not selected cannot be administered, items with small probabilities of being selected will still have small exposure rates; thus, the S-H method does not increase exposure rates for underexposed items. In addition, while the exposure of an item across θ levels may be controlled, the same control may not hold for examinees at a particular level of ability. For instance, even though the exposure of an item may be controlled such that it is administered to no more than 30% of the examinees overall, it may be administered to examinees of high ability 100% of the time. Furthermore, implementation of this method requires knowledge about P(S_j), which is associated with the θ distribution of the examinee population. Hence, it is necessary to specify this distribution a priori and then approximate the value of P(S_j) using simulation. Many variations of the S-H technique were proposed afterwards. Parshall, Davey, and Nering (1998) developed the conditional Sympson-Hetter procedure, in which the exposure control parameters are determined based on ability level. Stocking and Lewis (1995) extended the technique to utilize a multinomial model and proposed another version of the technique (Stocking & Lewis, 1998) that conditions the exposure control parameter not only on the frequency with which the item is selected but also on θ level.
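The five-step calibration can be sketched as follows. This is a minimal illustration, not Hetter and Sympson's operational code: the fixed per-examinee item rankings stand in for a full CAT simulator, which would re-rank items as each provisional θ estimate evolves.

```python
import random

def calibrate_sh(rankings, target_r, n_items, n_rounds=10, seed=1):
    """Iteratively set Sympson-Hetter K parameters (Steps 1-5 above).
    `rankings`: one information-ranked item list per simulated examinee
    (a stand-in assumption for the CAT simulator)."""
    rng = random.Random(seed)
    K = [1.0] * n_items                           # Step 1: all K_i = 1.0
    for _ in range(n_rounds):                     # Step 5: iterate
        NS = [0] * n_items                        # selection tallies
        for ranked in rankings:                   # Step 2: one simulee each
            for j in ranked:
                NS[j] += 1                        # item j was selected
                if rng.random() < K[j]:           # S-H screen passed:
                    break                         # administer j, stop here
        P_S = [ns / len(rankings) for ns in NS]   # Step 3
        K = [target_r / p if p > target_r else 1.0
             for p in P_S]                        # Step 4
    return K
```

An item that tops every examinee's ranking has P(S) = 1, so its K settles at the target rate r, which is exactly the r/P(S) adjustment described in the steps.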
This addition to the S-H technique (often referred to as the conditional Sympson-Hetter technique, or CSH, when a multinomial model is not used) is desirable because it overcomes the major disadvantage of the S-H method by establishing an exposure control parameter for each item at a number of different θ levels.

2.5.2 a-Stratified Adaptive Testing

The use of stratification in testing based on item response theory is not new. Poststratification, in which stratification is applied according to an examinee’s test results, has been widely used in assessing differential item functioning (Dorans & Kulick, 1986; Holland & Thayer, 1988; Shealy & Stout, 1993). Weiss (1973) proposed a stratified CAT design in which stratification was performed according to item difficulty. Chang and Ying’s (1999) a-stratified adaptive testing (STR) design was proposed primarily to address the persistent concern of overusing items with high discrimination indices in item pools. A CAT item selection procedure based on maximum item information tends to select items with higher discrimination parameters more often in the beginning stage of the test than items with lower discrimination parameters. Because estimation of θ tends to be quite inaccurate early in the test, it seems wasteful to use highly discriminating items at this point (Chang et al., 2003). With the STR method, the item pool is divided into a number of strata based on the values of the discrimination parameters of the items. During the test, item selection is always constrained to one stratum, selecting items with maximum information (van der Linden & Pashley, 2000) or with the smallest distance between the value of their difficulty parameter b_j and the current estimate of θ (Chang et al., 1999). Early in the test, items are administered from the stratum with the lowest values of the discrimination parameter. As the test progresses, strata with higher values are used.
As a consequence, a-stratification forces a more balanced exposure for all items, particularly if the strata in the item pool are chosen to have equal size and n_r ≤ n/R (Chang et al., 2003). A simple a-stratified selection method can be described as follows:

1. Partition the item bank into K levels according to the item a values;
2. Partition the test into K stages;
3. In the k-th stage, select n_k items from the k-th level based on the similarity between b and θ̂, then administer the items (note that n_1 + ... + n_K equals the test length);
4. Repeat Step 3 for k = 1, 2, ..., K.

Note that item selection with the a-stratified design is based on matching the b-parameter with θ̂ rather than maximizing item information (Chang & Ying, 1999). This simpler criterion is used because the a values are similar within a level. Thus, for the 2PLM, maximizing item information is equivalent to matching b with θ. For the more general 3PLM, matching b with θ when item a values are the same very closely approximates maximizing item information (Chang & Ying, 1996). Thus, this simpler selection method should maintain high efficiency. It should also result in more evenly distributed item exposure rates. In practice, the strata consisting of items with high a values tend to have high b values. A shortage of low-b items in those strata could cause the available low-b items to be selected more frequently (Chang, Qian, & Ying, 2001; Parshall, Davey, & Nering, 1998). A refined STR selection method is a-stratified with b blocking (BSTR), which balances the distributions of b values among all strata. The basic idea of the BSTR method is to force each stratum to have a balanced distribution of b values to ensure a good match to θ for different examinees. In most of the stratified designs (e.g., Chang et al., 1999), four strata have been used.
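A minimal sketch of the stratify-then-match-b steps above; the (a, b) item representation and both function names are assumptions for illustration:

```python
def stratify_by_a(items, K):
    """Step 1: partition item indices into K strata of near-equal size
    by ascending discrimination. Each item is an (a, b) pair."""
    order = sorted(range(len(items)), key=lambda j: items[j][0])
    size = len(items) // K
    return [order[k * size:(k + 1) * size] for k in range(K)]

def pick_item(stratum, items, theta, used):
    """Step 3: within the current stratum, choose the unused item whose
    b-parameter is closest to the provisional theta (b-matching, not
    maximum information)."""
    available = [j for j in stratum if j not in used]
    return min(available, key=lambda j: abs(items[j][1] - theta))
```

Early stages draw from the low-a strata and later stages from the high-a strata, so the highly discriminating items are saved for when the θ estimate is more accurate.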
Hau, Wen, and Chang’s (2002) simulation study shows that the optimal number of strata for a specific application depends on the item pool structure, test length, and other testing conditions. There is a diminishing return: dividing the pool into too many strata can lead to small strata in which there are no items of close difficulty for a particular examinee. Their study also shows that when item difficulty is normally distributed in an item pool, the optimal strata are quite independent of the pool size and of the correlation between item discrimination and difficulty.

2.6 Item Pool Design and Its Relationship with Other Components of CAT

Parshall, Davey, and Nering (1998) discuss three often-conflicting goals of item selection in CAT. First, item selection must maximize measurement precision by selecting the item maximizing information or posterior precision at the examinee’s current ability level. Second, item selection must seek to protect the security of the item pool by limiting the degree to which items may be exposed. Third, item selection must ensure that examinees receive a content-balanced test. Stocking and Swanson (1998) add a fourth goal to this list, stating that item selection must also maximize item usage so that all items in a pool are used, thereby ensuring good economy of item development. Stocking and Lewis (2000) portray the item selection problem as a balloon: pushing in on one side will cause a bulge to appear on another. An optimally designed item pool seeks the best compromise among the conflicting goals. To allow several thousand overlapping subtests to be drawn from its items, the item pool must have a sufficient number of high-quality items. This is partly decided by the number of examinees the item pool serves and the distribution of those examinees. With item security considerations, the more examinees taking the test, the more items should be in the item pool.
The CAT item selection procedure picks items with a difficulty level approximately comparable to the ability estimates of the examinees; therefore, it is expected that items in the pool have a difficulty distribution similar to the examinee ability distribution. It is desirable to have items in the pool span a wide range of item difficulty relative to the population of interest so that the CAT can estimate ability levels for a broad range of examinees (Urry, 1977). Test length, which is closely tied to the stopping rules in CAT, also plays an important part in determining the number of items needed in an item pool. For a fixed-length test, if the tests for individuals have no overlapping items, the number of items in a bank must be exactly the number of items in each form multiplied by the number of test takers. In reality, items can be used repeatedly within certain security constraints. Even with item overlap, it is expected that the more items a test requires, the more items are needed in an item pool. Stocking (1994) recommended that the item pool have at least 12 times as many items as the length of the test. Variable-length CAT usually reduces the number of items needed for individual examinees. In this case, the number of items needed for an item pool is correlated with the distribution of the test takers, i.e., the number of examinees at each ability level. For the same item response pattern, different estimation methods may lead to slightly different ability estimates and, in turn, influence the choice of the most suitable item. Different item selection rules, such as picking the item that maximizes the information or minimizes the posterior variance at the current ability estimate, may choose different items as the most appropriate one for the examinee. Both situations cause different item usage and require different items in an optimal item pool.
Requirements on content balancing also call for different compositions of the items in an item pool. For example, if the test blueprint for a 40-item math test requires 20 arithmetic reasoning items and 20 problem-solving items, the optimal item pool would contain a similar number of items for both content areas. The goal is to have a sufficient number of items in each desired content area to assemble an individual test with the balanced content coverage required by the test design. In addition, care must be taken to ensure that the item pool consists of the appropriate items to reduce over- and under-exposure while meeting the test precision requirement. Item overuse causes security concerns because the more examinees take the same item, the more likely that item will be disclosed to the public. Item underuse potentially increases item development costs. It has been commonly recognized that a tradeoff exists between test efficiency and item exposure control. A choice needs to be made that maximizes efficiency within the limits of security constraints, and that is essentially a matter of optimization. An optimally designed item pool should be able to compensate for the exposure control and cause very little decrease in the efficiency of ability estimation.

Chapter III Reckase’s Simulation Method and Extensions to 3PL

This chapter first introduces the key concepts of Reckase’s simulation method for optimal item pool design. It then discusses the simulation procedure when the method is applied to items calibrated with the one-parameter logistic model (1PL) and the potential problems with the three-parameter logistic model (3PL). Finally, the extensions of the method and their applications in situations where exposure control methods are built into the item selection process are discussed in detail.

3.1 Basic Concepts of Reckase’s Simulation Method

An item pool can be described by a list of item parameters for the items in the pool.
The basic idea of Reckase’s method is to determine the item parameters using randomly sampled examinees from the expected examinee distribution. Simulated computerized adaptive tests are administered to the examinees, assuming that each item administered has the item parameters best suited to the provisional ability estimate. After a certain number of examinees have taken the test, the union of the “virtual” items is the optimal item pool for the CAT program. Theoretically, every θ estimate is unique and the items optimally suited to that estimate have unique item parameters. The simulation process described above would therefore lead to as many items in the item pool as the total number of items administered to examinees, equal to the test length multiplied by the number of examinees. In practice, however, items whose parameters differ by a small amount function very similarly. Such items are redundant in the item pool in that any one of them could be used to estimate the ability level of a person with very small loss in precision. The concept of a “bin” is introduced to account for the redundancy of items with similar parameters. A bin is an item reservoir whose boundary is defined by numerical attributes of the items, so that the items within a bin have similar attributes and are exchangeable in use. If items are calibrated with the 1PL, the item difficulty parameter (b-parameter) is the main feature that controls the selection of test items. The bins are, therefore, defined as ranges on the IRT θ-scale. For example, two consecutive bins with width 0.2 on the θ-scale are denoted as (0.0, 0.2) and (0.2, 0.4). Items with b-parameters 0.11 and 0.13 are considered exchangeable in CAT item selection because they both belong to the bin (0.0, 0.2). The item pool, therefore, can be considered as a list of “bins” containing items with similar properties.
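The binning and union-of-items ideas can be sketched as follows, assuming 1PL items represented only by their b-parameters. Bin width 0.2 follows the example above; because items are reusable across examinees, each bin of the pool needs only the largest count any single simulated examinee drew from it. The b-values below are illustrative:

```python
import math
from collections import Counter

def bin_of(b, width=0.2):
    """Bin index for a b-parameter: (0.0, 0.2) -> 0, (0.2, 0.4) -> 1,
    (-0.2, 0.0) -> -1, and so on."""
    return math.floor(b / width)

def pool_union(per_examinee_bins):
    """Merge per-examinee bin tallies: since an item can be reused for
    later examinees, each bin needs only the maximum count any one
    examinee required, not the sum."""
    pool = Counter()
    for tally in per_examinee_bins:
        for k, n in tally.items():
            pool[k] = max(pool[k], n)
    return pool

# Administered b-parameters for two simulated examinees (illustrative):
A = Counter(bin_of(b) for b in [0.11, -0.07, 0.31])
B = Counter(bin_of(b) for b in [0.13, 0.05, 0.31, 0.45])
```

Here b = 0.11 and b = 0.13 land in the same bin and are treated as exchangeable; the merged pool holds 5 items rather than the naive 7, which is the same reuse logic that shrinks 30 items to 23 in the two-examinee example later in this chapter.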
The bins that define an item pool should have a width that is sufficiently small that all items in a bin can be considered equally good for estimating the ability level of an examinee. If the bin width is too large, items in the same bin may vary in their usefulness for estimating the ability level. The approach taken here to determine bin width is to identify the range on the θ-scale that includes the maximum of the item information function and the region around the maximum where the information is not much lower. “Not much lower” is arbitrarily defined as 98% of the maximum. Certainly, an argument could be made for using 96% or 97% as well.

Figure 3.1 Demonstration of determining bin width

The end product of the optimal item pool design is an array of integers (x_1, x_2, ..., x_B), one for each of the B bins, which tells how many items are needed in each bin to assemble all tests in a program. If no exposure control is used, the integers are bounded between zero and the test length L, because items in each bin can be reused and no single test requires more than L items from any bin. When item exposure control is assumed, some bins may contain more items so that the shared exposure rates for items from the highly exposed bins are below the target exposure rate.

3.2 Reckase’s Method for Optimal Item Pools Calibrated with the 1PL

When items are calibrated with the 1PL, item difficulty is the only psychometric factor that decides whether an item provides the most information at the θ estimate. Therefore, when designing optimal item pools calibrated with the 1PL, Reckase’s (2003) method focuses on matching the item b-parameters to the provisional θ estimates. Reckase’s method consists of four steps. The first step is to understand clearly the characteristics of the CAT program, because the item pool design must model the test procedure as closely as possible.
It is important to identify the distribution of the expected examinee population, the test length, and the type of items the test uses. For the CAT process, item exposure control, the scoring algorithm, the ability estimation procedure, and the stopping rules should be clearly specified and strictly followed during the item pool simulations. The second step is to identify the categorical attributes required for the items, such as content area, and divide the item pool into smaller pools according to these attributes. If a test has more than one categorical attribute requirement, each separate attribute introduces a partition of the item pool. This step simplifies the simulation procedure by focusing on determining the optimal item through quantitative attributes such as its psychometric characteristics. The third step of the process for determining the optimal item pool is to administer a simulated CAT to examinees randomly sampled from the expected ability distribution. If ability follows a standard normal distribution, the initial ability level for the examinee is zero on the θ metric. The first item is the same for all examinees: an item with maximum information at an ability level of zero. The next optimal item is based on the examinee’s responses to previous items and the estimate of the examinee’s ability level. Subsequent items are selected to have maximum information at the most recent ability estimate. If items are calibrated with the 1PL, the optimal item is the one with a b-value equal to the current θ estimate. As the test items are selected and administered, they are tallied in bins based on their b-values. Assuming the bin width is 0.25, the histograms in Figure 3.2 show the distribution of items across bins for two individual examinees, A and B, with true ability levels of -0.095 and 0.032, taking a 15-item fixed-length test.
Figure 3.2 Items used for two individual examinees

As the number of test administrations increases, the number of items needed in the ideal pool increases as well, but 15 items are not added each time because many of the needed items are already in the pool. Figure 3.2 shows that examinees A and B took some items from different bins, but also took some items from the same bins. For example, one item from bin (-0.25, 0) is administered to examinee A while two items from the same bin are given to examinee B. Because the one item administered to examinee A can be reused for examinee B, only one additional item in that bin is added to the optimal item pool. Therefore, instead of 30 items being needed for an optimal item pool to support two examinees, only 23 items are needed if other constraints are not taken into account. Figure 3.3 displays the distribution of the items in the pool for two examinees.

Figure 3.3 Item pool for two examinees

To determine the required size of an ideal item pool, a large number of tests are administered and the required item pool is tallied. Additional items are added to the item pool as more examinees take the test. In the fourth step, when the expected number of examinees has been administered the test, the union of the items forms a distribution of items that represents an optimal item pool. The sum of the number of items in the bins is the total number of items needed. Figure 3.4 shows an example of an optimal item pool for 5000 examinees taking a 15-item fixed-length test.
Figure 3.4 Item pool for 5000 examinees

This strategy works well for item pools calibrated with 1PL, where item difficulty is the only factor determining the amount of information an item provides. In this case, items with b-parameters the same as an ability estimate will always provide maximum information at that ability estimate. Therefore, they are always the optimal items at the ability estimate compared to items with b-parameters different from the ability estimate. When items are calibrated with 2PL or 3PL, they may differ in the amount of information they provide even with the same b-parameters, simply because they have different a- or c-parameters. Extensions to Reckase's method, therefore, are needed to account for the differences in a- and c-parameters in designing item pools calibrated with 3PL.

3.3 Reckase's Method Applied to 3PL

As mentioned above, determining the optimal item pool calibrated with 3PL is more complicated than with 1PL, because the information an item can provide at an ability level is determined by the combination of three parameters: the discrimination parameter a, the difficulty parameter b, and the pseudo-guessing parameter c. An item could provide an arbitrarily large amount of information at any ability level, given that the b-parameter is close to the theta level and the a-parameter is extremely large. Although it is impossible to have items with infinitely large a-parameters, it is common to have items vary widely in their a-parameters. This implies that at a certain ability level, an item reaching the maximum information it could provide is not necessarily the item providing maximum information at that theta level.
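The dependence of information on all three parameters can be made concrete with the Fisher information function of a 3PL item. The sketch below assumes the standard form with scaling constant D = 1.7; with b and c held fixed, information at θ = b grows with the square of a:

```python
import math

# Standard 3PL response probability and Fisher item information (a sketch;
# the dissertation's exact computational details are not shown here).
D = 1.7

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info3pl(theta, a, b, c):
    """3PL item information at theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

high_a = info3pl(0.0, 1.2, 0.0, 0.2)   # same b and c, larger a
low_a = info3pl(0.0, 0.8, 0.0, 0.2)
```

Evaluating both items at θ = 0 shows the higher-a item providing roughly twice the information of the lower-a item, even though their b- and c-parameters are identical.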
On the other hand, an item providing its highest information at one ability level may provide more information than any other item in the item pool over a range of ability levels. As demonstrated in Figure 3.5, an item with parameters a = 1.2, b = 0.0, and c = 0.2 provides more information at ability level -0.28 than an item with parameters a = 0.8, b = -0.5, c = 0.2, even though the latter reaches its peak in the amount of information it can provide at this ability level.

Figure 3.5 Item information provided by two different items (a = 1.2, b = 0.0, c = 0.2 versus a = 0.8, b = -0.5, c = 0.2)

Therefore, the optimal item for an ability level should not be defined as the item providing its most information at that ability level. In addition, it is unrealistic to define the optimal item pool as one that contains items with the highest possible a-parameters. Instead, the optimal item pool should contain items with a range of discrimination parameters so that tests assembled from it provide the precision the testing program requires. This study explores two strategies proposed to simulate a realistically optimal item pool. One focuses on simulating items that meet the minimum precision needed for an examinee taking the test. The other takes into consideration the relationship between the a-parameters and b-parameters in real operational items, so that the simulated item parameters stay within realistic boundaries.

Before introducing both strategies, it is important to extend the "bin" concept to fit the three-parameter IRT model.

3.2.1 Extending the "Bin" Concept

Under the framework of the three-parameter IRT model, the maximum amount of information an item can provide is determined by all three parameters.
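The Figure 3.5 comparison can be replicated numerically. The sketch below (assuming D = 1.7 and the standard 3PL information form) applies maximum-information selection to the two items from the figure and confirms that the a = 1.2 item wins at θ = -0.28 even though the a = 0.8 item peaks near that ability level:

```python
import math

# Maximum-information item selection over a small pool (a sketch).
D = 1.7

def info3pl(theta, a, b, c):
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def pick_max_info(theta, items):
    """items is a list of (a, b, c) tuples; return the most informative one."""
    return max(items, key=lambda it: info3pl(theta, *it))

items = [(1.2, 0.0, 0.2), (0.8, -0.5, 0.2)]   # the two items of Figure 3.5
chosen = pick_max_info(-0.28, items)           # the a = 1.2 item is selected
```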
An item with high discrimination (i.e., a high a-value) generally provides more information than one with low discrimination. However, Chang and Ying (1999) demonstrated that it may provide less information at a theta estimate that is far from the examinee's true theta. An item with a smaller c-parameter provides more information at its maximum level, but c-parameters usually vary only slightly across items, so they have little influence on the amount of information items give. Therefore the a- and b-parameters are the two primary factors determining how much information an item is capable of providing at an ability level. Items that function similarly have similar a- and b-parameters. This leads to the extension of the "bin" concept introduced in item pool simulation with 1PL, where a bin is defined as the interval of b-parameter values within which items provide similar amounts of information over a range of ability levels. With 3PL, the boundary of a "bin" is defined by both the a- and b-parameters. This forms a grid partitioning the plane formed by values of a and b. As illustrated graphically in Figure 3.6, each cell defined by a range of a- and b-parameters is denoted as an ab-bin, whereas the marginal total across each row is denoted as an a-bin and the marginal total across each column is denoted as a b-bin. Items with parameters within the boundary of any cell defined by both a- and b-parameters provide similar information over the entire range of ability levels, and provide maximum information at ability levels around the boundary of the bin in which they are located.
Figure 3.6 Bins defined by both a- and b-parameters (a-boundaries at 0.89, 1.26, 1.55, 1.79, 2.00, and 2.19; b-boundaries from -1.40 to 1.68 in steps of 0.28)

While the boundaries of the b-bins are determined by dividing the θ-metric (or, equivalently, the metric of the b-parameters) into equal intervals, the widths of the a-bins are set to be different, because the maximum amount of information an item can provide is proportional to a quadratic function of the a-parameter, assuming the c-parameter is constant (Lord, 1980). Equation (7) shows the relationship between the a-parameter and the maximum information, M_i, an item provides:

M_i = [D² a_i² / (8(1 − c_i)²)] [1 − 20c_i − 8c_i² + (1 + 8c_i)^{3/2}]    (7)

It can further be shown that the difference between the maximum information values (ΔM) for items with different a-parameters is

ΔM = {D² [1 − 20c − 8c² + (1 + 8c)^{3/2}] / (8(1 − c)²)} Δa²    (8)

Plugging in the average c-parameter of the existing items, which is around 0.2, the resulting constant is 0.5; therefore

ΔM = 0.5 Δa²    (9)

Therefore, the boundaries of the a-bins within which changes in the a-parameters cause little information change can be calculated. The grid defined by a-parameter intervals and b-parameter intervals becomes the boundary of the ab-bins. If 0.4 is considered a small information change and 0.28 a small b-parameter change, the bins defined by both a- and b-parameters are shown in Figure 3.7. For simplicity, an ab-bin is denoted by its b-parameter boundaries and a-parameter boundaries: (b_lower bound : b_upper bound, a_lower bound : a_upper bound). For example,
items with a-parameters between 0.89 and 1.26 and b-parameters between 0.00 and 0.28 are in the ab-bin (0.00:0.28, 0.89:1.26). They are considered interchangeable in item selection.

Distinctions are made, however, between the functions of b-bins and a-bins. As mentioned above, the closeness of the b-parameter to the ability level determines where an item performs best and provides the most information. On the other hand, the value of the a-parameter determines how much information an item can provide around the ability level where it functions best. With the maximum-information item selection approach, if an item with high information at an ability level is available, it will be picked over the low-information items. An optimally designed item pool, thus, should provide sufficient items within each b-bin, and make sure that items with adequately high a-parameters are available when needed. In other words, b-bins tally the number of items needed that perform best over the ability levels around the b-bin. Within each b-bin, the a-bins record at most how many highly discriminating items are needed. The item pool simulation produces an array of integers x = (x_1, x_2, ..., x_B), which tells how many items are needed in each b-bin, and a matrix X = (X_1, X_2, ..., X_B), where each element X_B is an integer vector (y_B1, y_B2, ..., y_BA) indicating at most how many items are needed in each ab-bin within a b-bin. In both cases, B is the number of b-bins and A is the number of ab-bins within each b-bin. The reason they are recorded in two different structures is that x_B is usually not the same as the sum of X_B in the early stage of the item pool design. After the CAT simulation, the y_B's from ab-bins with the lowest item discrimination are set to zero so that Σ y_BA = x_B and only the highest-discriminating items required by the simulation are in the optimal item pool blueprint.
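Equations (7) through (9) and the post-simulation zeroing step can be sketched together in code. D = 1.7 and the pool-average c = 0.2 are assumed here, and the data layout for the ab-bin counts is ours:

```python
import math

D = 1.7

def max_info(a, c):
    """Equation (7): the maximum information a 3PL item can provide."""
    return (D ** 2 * a ** 2) / (8.0 * (1.0 - c) ** 2) * (
        1.0 - 20.0 * c - 8.0 * c ** 2 + (1.0 + 8.0 * c) ** 1.5)

K = max_info(1.0, 0.2)          # the constant multiplying a^2; roughly 0.5

def a_bin_boundaries(delta_m, n_bounds):
    """Boundaries whose successive maximum-information values differ by delta_m.

    Inverting equation (9): equal information steps are equal steps in a^2.
    """
    step = delta_m / K
    return [math.sqrt(i * step) for i in range(1, n_bounds + 1)]

def trim_to_total(y, x_b):
    """Zero counts from the lowest-a ab-bins of one b-bin until sum(y) == x_b."""
    y = list(y)                 # ordered from lowest to highest discrimination
    excess = sum(y) - x_b
    i = 0
    while excess > 0 and i < len(y):
        drop = min(y[i], excess)
        y[i] -= drop
        excess -= drop
        i += 1
    return y

bounds = a_bin_boundaries(0.4, 6)         # close to Figure 3.6's 0.89, 1.26, ...
trimmed = trim_to_total([4, 3, 5, 2], 8)  # keep only the 8 highest-a items
```

With ΔM = 0.4, the computed boundaries land close to the 0.89, 1.26, 1.55, 1.79, 2.00, 2.19 sequence shown in Figure 3.6, which is consistent with those boundaries having been generated from equation (9).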
Visual displays of the two matrices are shown in Figure 3.7, where the plot on top shows how many items in each b-bin are needed for the optimal pool and the plot on the bottom distinguishes the ab-bins with gray scales and shows the number of items needed for each ab-bin within a b-bin.

Figure 3.7 Numbers of items needed per b-bin (top) and per ab-bin within each b-bin (bottom)

P_ij(θ_j) = c_i + (1 − c_i) / (1 + exp(−D a_i (θ_j − b_i)))

where P_ij(θ_j) is the probability that a person j = 1, ..., J with an ability parameter θ_j gives a correct response to an item i = 1, ..., I; a_i is the value of the discrimination parameter, b_i of the difficulty parameter, and c_i of the guessing parameter of item i. Because the examinee's true θ was known in the simulation, P_ij was computed after each item administered to the examinee was simulated. Then a random number m_ij was drawn from a uniform distribution U(0,1) and compared to P_ij. If m_ij was equal to or less than P_ij, then a 1 was assigned as the response; otherwise a 0 was assigned.

Step 4: Post-Simulation Adjustment

Five replications were conducted for each combination of methods and control variables so that a relatively stable approximation of the optimal item pool could be obtained. The blueprints and the item exposure counts from the five replications were averaged before a post-simulation adjustment was done.

4.3 Control Variables

Two independent variables were controlled for in all item pool designs: design method and exposure control method. Both AR and GS have the same target exposure rate (1/3 for the Sympson-Hetter method), which is the same target the operational procedure uses. For simplicity, four strata were assumed for the a-stratified method. Each of the two item pools, AR and GS, had its unique control variables. Specifically,
(1) there was no content balancing for AR, while three contents were administered in a fixed order for GS; (2) each item pool defined the bin width differently on the θ-metric, depending on the characteristics of the operational item pools. The simulation design is illustrated in Table 4.1.

Table 4.1 Simulation Design

  Item pools:            Arithmetic Reasoning (AR); General Science (GS)
  Test length:           15
  Examinee distribution: N(0,1)
  Exposure control:      no exposure control; Sympson-Hetter (target exposure rate 1/3); a-stratified (4 strata)
  Design method:         Prediction Model; Minimum Test Information
  Bin width:             b-bin: 0.20 for AR and 0.26 for GS; a-bin: Δa² = 0.8
  Content balancing:     single content for AR; three contents for GS

4.4 Evaluating Simulated and Operational Item Pools

Two types of distribution were considered in the item pool evaluation: (a) 6,000 θ's were simulated from N(0,1), and these values were treated as the true abilities for the examinees, and (b) 65 fixed values ranging from -4 to 4 with an interval of 0.125 were selected (i.e., θ = -4.0, -3.875, ..., 3.875, 4.0). Five hundred examinees were set to have an identical latent ability at each θ level. The former is to evaluate general performance, and the latter is to compute statistics conditional on θ.

The item pool evaluation criteria used by Chang and Ying (1999) and Reckase (2005) were adopted for this study. Precision-of-proficiency-estimation criteria include the average test information at each theta level, bias, mean square error (MSE), and the correlation coefficient between estimated and true person parameters. Test security indicators include the skewness of the item exposure rate distribution, the percentage of overexposed items, the item overlap rate, and the percentage of underexposed items.

Conditional Test Information

Test information is the sum of all the Fisher item information in the test. In a fixed-length CAT, it can be taken as an index of test efficiency.
The larger the amount of information a test provides, the more efficient the test is.

Conditional Standard Error of Measurement (CSEM)

At each fixed θ point, the standard error of measurement (SEM) was calculated by the formula:

SEM(θ_i) = sqrt[ Σ_{j=1}^{N_i} (θ̂_ij − θ̄_i)² / (N_i − 1) ]

where N_i = 500 is the number of replications (i.e., the number of adaptive tests administered) at each fixed θ point, and θ̄_i = (1/N_i) Σ_{j=1}^{N_i} θ̂_ij is the mean of the ability estimates over the N_i replications at θ_i.

Bias and Mean Square Error (MSE)

These quantities are defined as follows:

Bias = (1/N) Σ_{j=1}^{N} (θ̂_j − θ_j)    (19)

and

MSE = (1/N) Σ_{j=1}^{N} (θ̂_j − θ_j)²    (20)

where N is the number of simulees, and θ̂_j is the estimator for the jth simulee with ability level θ_j.

Conditional Bias and Conditional Mean Square Error (CMSE)

These quantities are defined as

Conditional Bias = (1/N_i) Σ_{j=1}^{N_i} (θ̂_ij − θ_i)    (21)

and

CMSE = (1/N_i) Σ_{j=1}^{N_i} (θ̂_ij − θ_i)²    (22)

where θ_i = -4.0, -3.875, ..., 3.875, 4.0 for i = 1, 2, ..., 65, respectively, and θ̂_ij (j = 1, 2, ..., 500) is the corresponding estimator of θ_i. These values are estimated as the conditional averages of the errors and squared errors in the final estimates of θ_i in the simulations. As additional overall measures of the quality of the final estimates of θ, the estimates of the bias and MSE functions in (19) and (20) were averaged over all simulated values of θ in the study. They give a picture of item pool performance at individual ability levels.

Skewness of the Item Exposure Rate Distribution

A χ² statistic proposed by Chang and Ying (1999) is used to measure the skewness of the item exposure rate distribution. It is defined as follows:

χ² = Σ_{i=1}^{n} (r_i − L/n)² / (L/n)    (23)

where r_i is the observed exposure rate for the ith item, L is the test length, and n is the number of items in the item pool. Equation (23) captures the discrepancy between the observed and the ideal item exposure rates, and it quantifies the efficiency of item pool usage.
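Equations (19) through (23) translate directly into code. This is a sketch with our own variable names: theta_hat and theta_true are parallel lists of estimates and true abilities, and rates holds the observed exposure rates for an n-item pool used with L-item tests:

```python
import math

def bias(theta_hat, theta_true):
    """Equation (19): mean estimation error."""
    return sum(h - t for h, t in zip(theta_hat, theta_true)) / len(theta_true)

def mse(theta_hat, theta_true):
    """Equation (20): mean squared estimation error."""
    return sum((h - t) ** 2 for h, t in zip(theta_hat, theta_true)) / len(theta_true)

def csem(theta_hats):
    """Conditional SEM at one fixed theta: SD of the estimates over replications."""
    n = len(theta_hats)
    mean = sum(theta_hats) / n
    return math.sqrt(sum((h - mean) ** 2 for h in theta_hats) / (n - 1))

def chi_sq_skewness(rates, L):
    """Equation (23): discrepancy of observed exposure rates from the ideal L/n."""
    ideal = L / len(rates)
    return sum((r - ideal) ** 2 / ideal for r in rates)

b = bias([0.1, -0.1, 0.2], [0.0, 0.0, 0.0])
balanced = chi_sq_skewness([0.25, 0.25, 0.25, 0.25], L=1)   # ideal usage gives 0
```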
A low χ² value implies that most of the items are fully used. The ratio of two χ² measures follows an F distribution and can be used to compare the exposure rates of two methods:

F_{method1, method2} = χ²_{method1} / χ²_{method2}    (24)

If F < 1, then method 1 is regarded as superior to method 2 in terms of the overall balance of item exposure rates.

Percentage of Overexposed Items

The exposure rate of an item can be defined as the ratio of the observed number of item administrations to the total number of examinees. A moderate level of item exposure is generally desired. A high exposure rate for an item means an increased risk of the item being known by prospective examinees. If so, both test security and validity are threatened by the high item exposure rate. Therefore, the percentage of overexposed items is taken as an important criterion for evaluating the success of a CAT program. The expected exposure rate was set equal to 1/3 for both tests in the ASVAB (Segall, Moreno, & Hetter, 1997).

Percentage of Underexposed Items

A low item exposure rate means that the item is rarely used. An item pool with too many low-exposure items is a sign of underutilization of the pool. Both the cost-effectiveness of developing the items and the appropriateness of the item selection method are challenged by low item exposure rates. In this study, an item with an exposure rate lower than .02 is considered underexposed.

Test Overlap Rate and Conditional Test Overlap Rate

The test overlap rate is the expected number of common items encountered by two randomly selected examinees divided by the expected test length. Ideally, the number of common items between any two randomly sampled examinees should be minimized. The test overlap rate can be calculated by (1) counting the number of common items for each of the N(N − 1)/2 pairs of examinees, (2) summing all the N(N − 1)/2 counts, and (3) dividing the total count by LN(N − 1)/2 (Chang & Ying, 1999).
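The test-security indicators just described can be sketched together. The function names are ours; the thresholds (the 1/3 target for overexposure, the .02 floor for underexposure) come from the text, and the overlap computation uses per-item administration counts rather than enumerating examinee pairs, which gives the same result:

```python
def f_ratio(chi_sq_method1, chi_sq_method2):
    """Equation (24): F < 1 favors method 1's overall exposure balance."""
    return chi_sq_method1 / chi_sq_method2

def exposure_summary(rates, target=1.0 / 3.0, floor=0.02):
    """Proportions of over- and under-exposed items in a pool."""
    n = len(rates)
    over = sum(1 for r in rates if r > target) / n
    under = sum(1 for r in rates if r < floor) / n
    return over, under

def overlap_rate(m_counts, N, L):
    """Expected proportion of shared items between two random examinees.

    m_counts[i] is the number of times item i was administered across all
    N fixed-length CATs of length L.
    """
    return sum(m * (m - 1) for m in m_counts) / (L * N * (N - 1))

over, under = exposure_summary([0.40, 0.30, 0.01, 0.20])
# Toy check: two 2-item tests {1, 2} and {1, 3} drawn from a 3-item pool
# share exactly one item, so the overlap rate is 1/2.
rate = overlap_rate([2, 1, 1], N=2, L=2)
```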
The following equation summarizes the calculation (Chen, Ankenmann, & Spray, 1999):

T = [ Σ_{i=1}^{n} m_i(m_i − 1) ] / [ LN(N − 1) ]    (25)

where N denotes the number of fixed-length CATs administered, L is the number of items in each of the CATs, n is the number of items in the pool, and m_i is the number of times item i was administered across all N CATs.

The conditional test overlap rate computes the test overlap rates for the 500 tests administered at each of the sixty-five fixed θ's. The same procedure Chang and Ying (1999) described in Equation (25) can be used to compute the test overlap rate for the tests administered at each fixed θ. In this case, N is 500 and m_i is the number of times item i was administered across all 500 CATs. The conditional test overlap rate gives a more accurate picture of test overlap at a particular ability level, instead of the average across all ability levels.

Chapter V
The Performance of the Item Pools without Exposure Control

5.1 Item Pools for Tests without Content Balance

Figure 5.1 compares the distributions of the operational item pool and the two optimal item pools designed by MTI and PM, assuming no exposure control. Table 5.1 presents the sizes and the summary statistics of the item parameters for the three pools. The optimal item pools consist of the fewest items. This is not surprising, partly because both assume no exposure control, while the operational pool is designed for tests with Sympson-Hetter exposure control. Table 5.1 indicates that all item pools have items that span a wide range of difficulty levels, roughly from -2.5 to 2.5. However, the items in the optimal item pools have slightly smaller ranges. The operational pool has a large number of items with b-parameters between 0.0 and 1.5, while the optimal pools display a more even distribution across b-bins. The MTI pool consists of the fewest items, and their a-parameters are more concentrated, ranging from 1.275 to 1.781.
The PM pool shows characteristics of the item parameters similar to those of the operational pool, in which difficult items tend to have high a-parameters and easy items tend to have moderate to low a-parameters.

Figure 5.1 Item distribution for item pools without exposure control (a. Operational item pool, Content Area 1: 137 items; b. Item pool designed by MTI, Content Area 1: 82 items; c. Item pool designed by PM, Content Area 1: 101 items)

The overview of the evaluation results for these item pools is presented in Table 5.2. The ability estimates from all pools exhibit a certain level of positive bias; however, the magnitudes of the bias are negligible. The MSEs from the optimal item pools are smaller than that from the operational pool. The MTI pool and the PM pool resulted in higher correlation coefficients than the operational pool.

Table 5.2 Summary Statistics of the Performance of the Item Pools
Table 5.1 Item Pool Size and Item Parameter Statistics for Arithmetic Reasoning

Table 5.2 also shows that the optimal item pools have a smaller test-retest overlap rate despite having fewer items. This indicates that the magnitude of the item overlap rate may not be related to the pool size when the combination of items in the pool is optimal. The plots of the conditional test-retest overlap rate in Figure 5.2 reveal that the optimal item pools have higher overlap rates for θ levels below approximately -2.00 and above 2.00. However, in practice, there are very few examinees at these ability levels.

Figure 5.2 Test-retest overlap rate conditional on θ

Both optimal item pools have significantly smaller percentages of underexposed items. Although the MTI pool has a higher percentage of overexposed items, this is reasonable given that it is the smallest pool and no exposure control was imposed. Increasing the pool size reduced the item overlap rate.

Figure 5.3 Item exposure rate by difficulty level (a. Operational item pool; b. Item pool designed by MTI; c. Item pool designed by PM)

Figure 5.3 plots the item exposure rate for individual items in the order of their difficulty levels. Extremely easy and extremely difficult items tend to have smaller exposure rates, but underexposed items occur across all difficulty levels, especially in the operational item pool. Table 5.2 indicates that the MTI pool has the fewest underexposed items, and Figure 5.3 shows that items with extreme difficulty levels are utilized more often in the MTI pool. As shown in Figure 5.4, the three item pools resulted in quite different average test information plots at the various fixed θ levels.
The plot for the PM item pool looks similar to the one for the operational pool, but it provides more information over most ability levels. The MTI item pool provides significantly less information over ability levels between approximately -1.5 and 2.0, but the amount of information it provides over a long range of ability levels exceeds the target information, which is 10.0 between ability levels ±2.0 and 8.0 beyond ability levels ±2.0.

Figure 5.4 Average test information conditional on true θ

Figures 5.5 to 5.7 present the CSEM, conditional bias, and CMSE for the three item pools. Figure 5.6 shows a significant increase in the bias of ability estimation, which is positive for ability levels below around -2.0 and negative for ability levels above around 2.0. This is not surprising because of the short test length and the Bayesian estimation method. The charts show that MTI performs better for ability levels below -2.0 and PM performs better for ability levels over 2.5.

Figure 5.5 Conditional standard error of measurement (CSEM)

Figure 5.6 Conditional bias

Figure 5.7 Conditional mean square error (CMSE)

5.2 Item Pools for Tests with Content Balance

As can be seen in Figure 5.8, the distribution of the item pools with content balancing shares a very similar pattern with the distribution of the item pools without content balancing.
Both optimal pools show a more even distribution of difficulty levels, but the PM pool has more highly discriminating items among the difficult items and very few among the easy items. The MTI pool has mostly moderately large a-parameters, regardless of the difficulty level. Table 5.3 lists the item pool sizes and the summary statistics of the items in the pools. Compared to the operational pool, on average the MTI pool and the PM pool have slightly higher a-parameters and lower b-parameters, but the ranges of the item parameters are smaller.

Figure 5.8 Item distribution for item pools with content balancing and without exposure control (a. Operational item pool: 59, 58, and 13 items in Content Areas 1-3; b. Item pool designed by MTI: 46, 48, and 13 items; c. Item pool designed by PM: 41, 46, and 12 items)

Table 5.3 Item Pool Size and Item Parameter Statistics for Tests with Content Balance

Both optimal pools consist of fewer items in the two larger contents
but have similar numbers of items in the content with only one item in the test. The MTI pool has the fewest items.

Table 5.4 Summary Statistics of the Performance of the Item Pools

Table 5.5 Number of Over- and Under-Exposed Items

The overviews of the evaluation results for these item pools are presented in Tables 5.4 and 5.5. The ability estimates from all pools exhibit slightly positive bias. The MSEs from the optimal item pools are smaller than that from the operational pool. In addition, both optimal item pools result in higher correlation coefficients. The results show that the optimal item pools have smaller test-retest overlap rates despite having fewer items. Figure 5.9 also shows that the MTI item pool has the smallest test-retest overlap rates at most ability levels.

Figure 5.9 Test-retest overlap rate conditional on θ

Figure 5.10 plots the item exposure rate for individual items in each item pool by content level. Regardless of the content, both optimal item pools have significantly smaller numbers of underexposed items. On the other hand,
Table 5.4 shows that the percentages of overexposed items are similar for all item pools, although the optimal item pools have fewer items.

Figure 5.10 Item exposure rate by difficulty level (a. Operational item pool; b. Item pool designed by MTI; c. Item pool designed by PM)

Figure 5.11 displays the plot of conditional test information for the item pools. It can be seen that the PM pool results in a plot similar to the operational pool but provides more information at most ability levels. The MTI item pool provides a similar amount of information, which exceeds the target information, over ability levels between approximately -2.0 and 2.0. The CSEM, conditional bias, and CMSE for the three item pools are presented in Figures 5.12 to 5.14. The charts show that at most ability levels, the three item pools perform very similarly. The MTI item pool results in the smallest bias and CMSE for ability levels below approximately -1.5.

Figure 5.11 Average test information conditional on true θ

Figure 5.12 Conditional standard error of measurement (CSEM)

Figure 5.13 Conditional bias
Figure 5.14 Conditional mean square error (CMSE)

5.3 Summary

The results suggest that regardless of the constraints of content balancing, the optimal item pools perform better than the operational item pool in terms of pool size, test security, and measurement accuracy, although each design method has its preferable features. The operational item pool performs better over a certain range of ability levels because a large number of items, including very discriminating items, are clustered around these levels. Optimal item pools, especially ones designed with MTI, provide information more evenly over most ability levels and provide sufficient measurement precision with a minimum number of items. All optimal pools, compared to the operational pools, save about 20 or more items and yield better correlations. In addition, the optimal pools have a significantly lower percentage of items with exposure rates below 0.02. With or without content balancing, the PM item pools resulted in the highest correlation and the lowest item overlap rate. Overall, it seems that an item pool designed with the MTI method performs the best, which indicates that the optimal item pool needs the fewest items to achieve desirable precision if all the items have moderate item discrimination and are distributed roughly uniformly over a wide range of difficulty levels.

Chapter VI
The Performance of the Item Pool with Sympson-Hetter Exposure Control

6.1 Item Pools for Tests without Content Balance

The blueprints of the optimal item pools with SH exposure control are based on the blueprints of those without exposure control. Specifically, more items are added in the bins where items tend to be selected more often than the desired exposure rates.
This relationship is reflected in the item distributions illustrated in Figure 6.1: compared to the optimal pools without exposure control, there are noticeably more items with b-parameters between -1.0 and 1.0 in the optimal pools designed by both MTI and PM.

Figure 6.1 Item distributions for item pools with Sympson-Hetter exposure control (panels: a. Operational Item Pool, Content Area 1: 137 items; b. Item Pool Designed by MTI, Content Area 1: 95 items; c. Item Pool Designed by PM, Content Area 1: 120 items)

Table 6.1 shows the item pool size and the summary statistics of the item parameters within each pool. The MTI pool consists of the fewest items, and their a-parameters vary between 1.307 and 1.777, a smaller range than in the other two pools. The optimal item pool designed by PM has more items with high a-parameters. However, the maximum a-parameter value in the operational pool is 3.141, compared to 1.777 in the MTI pool and 2.633 in the PM pool. The MTI pool has 13 more items and the PM pool 19 more items than the corresponding pools without exposure control, but either pool is still smaller than the operational pool. The added items are mostly highly discriminating items, because such items tend to have higher exposure rates; this leads to a slightly higher average a-parameter for the optimal item pools.

Table 6.2 gives a performance overview of the item pools. On average, all three pools yielded slightly positive bias in the ability estimates. The operational pool displayed the smallest bias, but the difference from the optimal pools is negligible. Both optimal pools exhibited better performance on all other criteria.
The PM pool resulted in the highest correlation coefficient and the lowest mean square error. The MTI pool, however, consists of the fewest items: 42 fewer than the operational pool and 25 fewer than the PM pool.

Table 6.1 Item pool size and item parameter summary statistics

Table 6.2 Summary statistics of the performance of the item pools

As shown in Figure 6.2, all three pools exhibit smaller test-retest overlap rates compared to the item pools without exposure control. This is expected because item selection with the exposure control method tends to utilize more items. Of the item pools with exposure control, the operational pool clearly has the smallest test-retest overlap rate at ability levels above 2.0. However, Table 6.2 shows that, on average, the optimal item pools have slightly smaller test-retest overlap rates despite having fewer items.

Figure 6.2 Test-retest overlap rate conditional on θ

Figure 6.3 shows the item exposure rate for individual items in each pool, ordered by item difficulty. It can be seen that the exposure control mechanism works very well, with the exposure rates for all individual items at or below the target exposure rate. The MTI pool seems to utilize items more evenly and has the fewest under-exposed items.
The operational item pool seems to have large numbers of difficult items under-exposed.

Figure 6.3 Item exposure rate by difficulty level (panels: a. Operational Item Pool; b. Item Pool Designed by MTI; c. Item Pool Designed by PM)

A closer look at measurement precision at individual ability levels is provided by the conditional test information plots in Figure 6.4. The plots for the item pools with exposure control look very similar to those without exposure control. Because of the added items, the optimal pools with SH exposure control yield more information at some ability levels and closely match the information provided at the other levels by the optimal pools without exposure control. The operational pool, on the other hand, produces less information at ability levels between -0.5 and 0.75 when SH exposure control is used.

Figure 6.4 Average test information conditional on true θ

The plots for conditional SEM, conditional bias, and conditional MSE are presented in Figures 6.5 to 6.7. Smaller values indicate better accuracy in the ability estimates. The plots indicate that all item pools yield similar performance at ability levels between -2.0 and 2.5. The MTI pool performs better at ability levels below -2.0, and the PM pool performs better at ability levels above 2.5.
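The Sympson-Hetter procedure evaluated in this chapter attaches a control parameter k_i to each item: after an item is selected by maximum information, it is actually administered only with probability k_i, and the k_i are recalibrated over repeated simulations until no item's administration rate exceeds the target. A simplified single-cycle sketch (the function names and the adjustment rule shown are a common textbook form, not necessarily the exact variant used in this study):

```python
import random

def sh_administer(ranked_items, k, rng):
    """Walk the candidate list (ranked by information) and administer the
    first item that passes its exposure-control probability experiment."""
    for item in ranked_items:
        if rng.random() <= k[item]:
            return item
    return ranked_items[-1]  # fall back if every experiment fails

def sh_adjust(k, selection_rate, target):
    """One Sympson-Hetter adjustment cycle: throttle items selected more
    often than the target rate; reset all others to 1."""
    for item, p_sel in selection_rate.items():
        k[item] = min(1.0, target / p_sel) if p_sel > target else 1.0
    return k

# Toy demonstration with three items and a 1/3 target exposure rate.
k = {0: 1.0, 1: 1.0, 2: 1.0}
k = sh_adjust(k, {0: 0.9, 1: 0.2, 2: 0.1}, target=1.0 / 3.0)
print(k)  # item 0 is throttled to about 0.37; items 1 and 2 stay at 1.0
```

This also explains why the SH blueprints above need extra items in the popular bins: throttling a high-information item forces the selection routine to fall through to the next-best item, which must exist in the pool.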
Figure 6.5 Conditional standard error of measurement (CSEM)

Figure 6.6 Conditional bias

Figure 6.7 Conditional mean square error (CMSE)

6.2 Item Pools for Tests with Content Balance

As shown in Figure 6.8, the optimal item pools for tests with content balancing display the same pattern as those without content balancing: items are added to the bins where items tend to exceed the desired exposure rates. Because Content 3 appears only once in a test, no additional items are added in comparison to the pools without exposure control. An interesting fact is that, in the optimal item pools, Content 2 has slightly more items than Content 1, although both contents appear in a test with an equal number of items.

Figure 6.8 Item distribution for item pools with Sympson-Hetter exposure control (panels: a. Operational Item Pool, Content Areas 1/2/3: 59/58/13 items; b. Item Pool Designed by MTI, Content Areas 1/2/3: 52/54/13 items; c. Item Pool Designed by PM, Content Areas 1/2/3: 51/54/12 items)
Figure 6.8 (Cont'd) Item distribution for item pools with Sympson-Hetter exposure control

Table 6.3 shows the item pool size and the summary statistics of the item parameters within each pool. The total numbers of items in the two optimal item pools are similar, and both are smaller than the operational pool. Compared to the item pools without exposure control, the MTI pool has 12 more items and the PM pool has 18 more items. On average, items in the optimal pools have higher a-parameters and relatively lower b-parameters than items in the operational pool.

Table 6.3 Item pool size and item parameter statistics by content area

Table 6.4 Summary statistics of the performance of the item pools

Table 6.5 Numbers of over- and under-exposed items by content area
Tables 6.4 and 6.5 display the overview of the evaluation results for these item pools. The performance shows the same pattern as for the item pools without content balancing. The MSE from the optimal item pools is smaller than that from the operational pool, and the PM item pool results in the highest correlation coefficient.

Table 6.4 also shows that the optimal item pools have smaller test-retest overlap rates despite having fewer items. The plots of conditional test-retest overlap rates in Figure 6.9 reveal that the operational pool has the lowest overlap rates at ability levels below -1.5 and the MTI pool has the lowest overlap rates at levels beyond approximately 1.0. This is consistent with the findings from the item pools without exposure control, but different from those for the item pools without content balancing, where the operational pool has the lowest overlap rates at the two extremes of the ability range.

Figure 6.9 Test-retest overlap rate conditional on θ

Figure 6.10 Item exposure rate by difficulty level (panels: a. Operational Item Pool; b. Item Pool Designed by MTI; c. Item Pool Designed by PM)

Tables 6.4 and 6.5 also show that, compared to the operational pool, both optimal item pools have significantly smaller numbers of under-exposed as well as over-exposed items. The number of under-exposed items is similar to the pools without exposure control, but the number of over-exposed items decreases substantially.
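The test-retest overlap rate plotted here and in Figure 6.2 can be read as the proportion of items that reappear when a simulee is retested from the same pool; a minimal sketch (the averaging over many simulees per ability level used in the study is omitted):

```python
def overlap_rate(first_test, retest):
    """Proportion of retest items that already appeared in the first test."""
    return len(set(first_test) & set(retest)) / len(retest)

# Two 5-item tests drawn for the same simulee share two items.
print(overlap_rate([1, 2, 3, 4, 5], [4, 5, 6, 7, 8]))  # 0.4
```

A deterministic maximum-information rule with no exposure control would give an overlap of 1.0 at every θ, which is why adding any exposure control pushes these curves down.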
Figure 6.10 shows the item exposure rate for individual items, ordered by difficulty level, in each item pool. Extremely easy and extremely difficult items tend to have lower exposure rates, but under-exposed items appear across all difficulty levels, especially in the operational item pool. As indicated in Table 6.4, the MTI item pool has the fewest under-exposed items and utilizes items at the extreme difficulty levels more often.

Figure 6.11 presents the conditional test information plots. The plots for the item pools with exposure control look very similar to those without exposure control. The optimal pools with exposure control yield similar or more information than the pools without exposure control, except that the MTI pool with exposure control yields a smaller amount of information at ability levels between 0 and 1.5. The operational pool produces less information at ability levels between -1.0 and 1.0 when compared to the same pool without exposure control.

Figure 6.11 Average test information conditional on true θ

Figure 6.12 presents the CSEM for the three item pools. It shows that the MTI pool performs better at ability levels below -2.0 and the PM pool performs better at ability levels above 0.0. The operational pool has the largest SEM at ability levels between -2.0 and 1.5. Figures 6.13 and 6.14 show that the MTI pool yields the smallest bias and the smallest MSE at most ability levels. The operational pool and the PM pool produce similar MSE.

Figure 6.12 Conditional standard error of measurement (CSEM)

Figure 6.13 Conditional bias
Figure 6.14 Conditional mean square error (CMSE)

6.3 Summary

The results suggest that all optimal pools, compared to operational pools, save about 10 or more items while performing better in terms of pool size, test security, and measurement accuracy. Tests assembled from optimal pools have smaller test-retest overlap rates. In addition, optimal pools have significantly lower percentages of items with exposure rates below 0.02.

Chapter VII
The Performance of the Item Pools with a-Stratified Exposure Control

7.1 Item Pools for Tests without Content Balance

The a-stratified exposure control selects items by the closeness of the b-parameter to the provisional ability estimate instead of by maximum information. Therefore, it is not necessary to adjust the item pool after the CAT simulations. The item distributions for the OP, MTI, and PM pools are displayed in Figure 7.1.

Figure 7.1 Item distribution for item pools without content balancing and with a-stratified exposure control (panels: a. Operational Item Pool, Content Area 1: 137 items; b. Item Pool Designed by MTI, Content Area 1: 158 items; c. Item Pool Designed by PM, Content Area 1: 240 items)

As can be seen, the shapes of the distributions look similar for the MTI pool and the PM pool, although the MTI pool is much smaller. The PM pool has more difficult items with high a-parameters, while the MTI pool has more moderately difficult items with high a-parameters.
Both optimal pools seem to require more items with moderate to low b-parameters and fewer items with high b-parameters.

Table 7.1 presents the sizes of the three pools and the summary statistics for the item parameters within each pool. The size of either optimal item pool is larger than the size of the operational pool by 40 or more items. The PM pool has the smallest b-parameter range and the largest average a-value, with a maximum of 3.520 and minimum of 1.146. The range of a-parameter values is 1.275 to 3.394 for items in the MTI pool and 0.746 to 3.141 for the operational pool.

The overview of the evaluation results for these item pools is presented in Table 7.2. The average bias of the ability estimates is positive for the operational pool and the MTI pool and negative for the PM pool, but the magnitudes of the bias are negligible. The optimal item pools yield smaller MSE and higher correlation coefficients than the operational pool does. Of the item pools, the MTI pool results in the highest correlation coefficient and the smallest MSE. Table 7.2 also shows that the MTI pool has the lowest test-retest overlap rate. The PM pool, however, has the highest overlap rate despite having the largest pool size. The plots of conditional test-retest overlap rate shown in Figure 7.2 reveal that the MTI pool has the lowest overlap rate at most ability levels.

Table 7.1 Item pool size and item parameter summary statistics

Table 7.2 Summary statistics of the performance of the item pools
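The a-stratified selection rule described at the opening of this chapter can be sketched as follows: the pool is partitioned into strata of ascending a-parameters, and within the stratum assigned to the current test stage the unused item whose b-parameter is closest to the provisional ability estimate is chosen. The stratum count and item parameters below are illustrative, not the study's:

```python
def build_strata(items, n_strata):
    """Sort the pool by a-parameter and cut it into n_strata equal slices,
    from least to most discriminating."""
    ranked = sorted(items, key=lambda it: it["a"])
    size = len(ranked) // n_strata
    return [ranked[i * size:(i + 1) * size] for i in range(n_strata)]

def pick_item(stratum, theta_hat, used):
    """Within the active stratum, pick the unused item whose b-parameter is
    closest to the provisional ability estimate."""
    available = [it for it in stratum if it["id"] not in used]
    return min(available, key=lambda it: abs(it["b"] - theta_hat))

# Illustrative 20-item pool: a rises with id, b cycles over -2..2.
pool = [{"id": i, "a": 0.5 + 0.1 * i, "b": float(i % 5) - 2.0} for i in range(20)]
strata = build_strata(pool, n_strata=4)
first = pick_item(strata[0], theta_hat=0.0, used=set())
print(first["id"])  # 2: the low-a item with b closest to 0
```

Because low-a strata are consumed early and high-a strata are reserved for late stages, highly discriminating items are no longer grabbed first at every administration, which evens out the exposure rates shown in Figure 7.3.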
Figure 7.2 Test-retest overlap rate conditional on θ

As can be seen in Table 7.2, the MTI pool has the lowest percentage of items with an exposure rate below 0.02 and the lowest percentage of items with an exposure rate higher than the target rate of 1/3. The PM pool has a lower percentage of over-exposed items, but has over half of its items under-exposed. Individual item exposure rates are shown in Figure 7.3. It is clear that items in the MTI pool are used more evenly. By contrast, a few items in the PM item pool are used much more often than most items, leading to a highly skewed item exposure rate distribution for the PM pool.

Figure 7.3 Item exposure rate by difficulty level (panels: a. Operational Item Pool; b. Item Pool Designed by MTI; c. Item Pool Designed by PM)

Figure 7.4 shows the average test information at fixed ability levels. The plots for the PM pool and the operational pool look similar in shape, but the PM pool provides more information at most ability levels. The MTI pool provides a similar amount of information between ability levels of -1.5 and 1.5, and provides information exceeding 10 over the range between -2.0 and 2.0.

Figure 7.4 Average test information conditional on true θ

Figures 7.5 to 7.7 present the CSEM, conditional bias, and CMSE for the three item pools. The charts show that all three item pools yield similar SEM, bias, and MSE over ability levels between -2.0 and 2.0, and the results are mixed for ability levels beyond ±2.0.
Figure 7.5 Conditional standard error of measurement (CSEM)

Figure 7.6 Conditional bias

Figure 7.7 Conditional mean square error (CMSE)

7.2 Item Pools for Tests with Content Balance

The item distributions for the OP, MTI, and PM pools are displayed in Figure 7.8. The results are similar to those without content balancing. The shapes of the item distributions look similar for the MTI pool and the PM pool. The PM pool has more highly discriminating and difficult items in both Content 1 and Content 2, while the MTI pool has more moderately difficult items with high a-parameters in all three content areas. Unlike the operational pool, which has more items with both high a-parameters and high b-parameters, both optimal pools seem to require more items with moderate to low b-parameters and fewer items with high b-parameters.

Figure 7.8 Item distribution for item pools with content balancing and without exposure control (panels: a. Operational Item Pool, Content Areas 1/2/3: 59/58/13 items; b. Item Pool Designed by MTI, Content Areas 1/2/3: 66/75/14 items; c. Item Pool Designed by PM, Content Areas 1/2/3: 87/98/17 items)
Figure 7.8 (Cont'd) Item distribution for item pools with content balancing and without exposure control

Table 7.3 presents the item pool sizes for the three pools and the summary statistics for the item parameters within each pool. The results are similar to those where no content balancing was present. The sizes of both optimal item pools are larger than the size of the operational pool by 40 or more items. The MTI pool has the largest average a-parameter values, with a maximum of 4.046 and minimum of 1.244. The operational pool has the smallest b-parameter range. All three item pools have similar numbers of items in Content 3. Interestingly, although Content 1 and Content 2 appear the same number of times in a test, the optimal pools require more items in Content 2 than in Content 1.

Table 7.3 Item pool size and item parameter statistics by content area

Table 7.4 Summary statistics of the performance of the item pools

Table 7.5 Percentage of over- and under-exposed items by content
The overview of the evaluation results for these item pools is presented in Table 7.4. The results are similar to those without content balancing. The PM pool shows slightly negative bias for the ability estimates. The average bias from the operational pool and the MTI pool is positive, but the magnitudes of the bias are negligible. The optimal item pools yield smaller MSE and higher correlation coefficients than the operational pool does. Of all the pools, the MTI pool resulted in the highest correlation coefficient and the smallest MSE.

The plots of conditional test-retest overlap rate shown in Figure 7.9 draw a picture consistent with the plots for the item pools without content balancing. The MTI pool has the lowest overlap rate at all ability levels. The summary statistics in Table 7.4 also show that the MTI pool has the lowest average test-retest overlap rate. The PM pool, however, has the highest overlap rate despite having the largest pool size.

Figure 7.9 Test-retest overlap rate conditional on θ

Figure 7.10 Item exposure rate by difficulty level (panels: a. Operational Item Pool; b. Item Pool Designed by MTI; c. Item Pool Designed by PM)

Table 7.5 also shows that the MTI pool has the lowest percentage of items with an exposure rate below 0.02, the lowest percentage of items with an exposure rate higher than the target rate of 1/3, and the lowest skewness of the item exposure rates.
The PM pool has over half of its items under-exposed and is the most skewed pool in terms of item usage. Individual item exposure rates are shown in Figure 7.10. It is clear that items in the MTI pool are used more evenly, while only a few items in the PM item pool are used often and the others are used less.

Figure 7.11 shows the average test information over ability levels ranging from -4.0 to 4.0. The amount of information the operational pool provides seems quite low; even at its peak levels, it is below 10. The plot for the PM item pool looks similar to that for the operational pool, but the PM pool provides more information over a range of ability levels. The MTI pool provides a similar amount of information between ability levels -2.0 and 1.5, and provides information exceeding 10 in the range between -2.0 and 2.0.

Figure 7.11 Average test information conditional on true θ

Figures 7.12 to 7.14 present the CSEM, conditional bias, and CMSE for the three item pools. The charts show that all three item pools yield similar SEM, bias, and MSE over ability levels between -2.0 and 2.0. The MTI pool performs better at ability levels below -2.0 and above 2.0.

Figure 7.12 Conditional standard error of measurement (CSEM)
Tests assembled from MTI pools have smaller test-retest overlap rates and significantly lower percentages of over- and under-exposed items. The item pool designed by the PM did not perform as well as the operational pool and. the pool designed by MTI, despite having more items. One possible reason is that the way items are stratified is slightly different from the way items are simulated. It might yield better results if items are stratified with Chang and van der Linden’s (2003) 0-1 linear programming method during the evaluation. stage. 136 Chapter VIII Discussion This chapter first revisits the definition of “optimal” in item pool design and discusses how this study successfully addresses the criteria of optimality the definition implies. It then presents the implication of Reckase’s method on the practice of item pool development and maintenance. Finally, the limitations of this study and the expected future research are discussed. 8.] A Revisit to the Definition of “Optimal /-. This study investigated two approaches to design optimal blueprints for CAT item po’o‘l_‘s‘.\Except for the item pool designed for a-stratified exposure control by the PM method, optimal pools designed by either method perform better than the operational pools no matter whether exposure control and content balancing are considered or not. Optimal item pool design looks for the most desirable or favorable combination of items to form an item pool that would support the assembly of a large number of individualized computerized adaptive tests. There is, however, no single pool that is absolutely optimal, as it is constrained by a number of factors and different compositions of the items that may yield similar measurement precisions. That is why the two “optimal” pools look quite different and each is still optimal in some sense. 
A general objective for an optimal item pool design is to meet the three criteria described by van der Linden (1999): 1) sufficiently large to allow several thousand overlapping subtests to be drawn from its items; 2) consisting of items spanning the entire range of item difficulty relative to the population of interest; and 3) consisting of an appropriate mix of high and low discriminating items to lower the item creation cost 137 while meeting the needs of test precision. It is not hard to meet the first criteria, where minimum size can be translated to the test length divided by the target exposure rate (1.0 if no exposure rate is considered). Simulation studies can easily realize the first two requirements by randomly sampling a large number of examinees from the expected examinee population. simulating a CAT administration to them, and tallying the number of items needed in different difficulty levels. All optimal item pools designed in this study are at least five times the test length and span a wide range of item difficulties. Acceptable precision can be achieved with difficulty ranges slightly smaller than the items in the operational pools in that the ranges of the item difficulty level for optimal pools are all smaller. The third criterion. item creation cost, was not estimated directly in this study. However. mechanisms to indirectly approximate the item writing cost and the effort to minimize it were utilized in Reckase’s item pool design method. Item generations with both MTI and PM methods imply the relationship between item parameters and the cost of item creations. The difference between PM and MTI can be primarily the difference between the assumptions. The MTI method assumes that highly discriminating items (i.e., items with high a-parameters) are more difficult to write and cost more to create, therefore, the MTI design tries to limit the number of high discrimination items by simulating items to meet the minimum test information requirement. 
The PM method assumes that the more the items with certain characteristics among the existing items, the less expensive to create items with the same characteristics. Therefore, it minimizes the item creation cost by modeling the item characteristics (i.e., the relationship between IRT parameters) and simulating items that are more like existing items. 138 Overall, the optimally designed item pools perform better than the operational pools obtained from CAT-ASVAB. The results show that the MTI design generally leads to smaller pools and contain items with lower a-parameters. The PM pools, on the other hand, maintain the correlation between item parameters but do not perform as well as the MTI pool. The operational pools, on the other hand, provide more measurement precision over some range of latent ability levels. A closer look at the Arithmetic Reasoning pool finds more highly discriminating items at the range of b-parameters between 0 and 1.5. In practice, when operational item pools retire frequently, such high discriminating items may be difficult to replace. It introduces doubts on whether or not the same performance over similar ability levels can be easily duplicated. Item pools designed with Reckase’s method have more items evenly distributed over a wider range of ability levels. As a result, optimal pools perform better than the operational ASVAB item pools at most latent ability levels. The results imply that improvement may be made to the operational item pools by adjusting the item distributions to make them closer to what optimal item pool blueprints demand. For example. the arithmetic reasoning pool may perform better by adding more moderately discriminating items in the lower end of the latent ability levels and taking out some highly discriminating items in the higher end of the latent ability levels for future use. 
8.2 Implications on the Practice of Item Pool Development

This study is based on the assumption that examinees are normally distributed with a population mean ability of 0 and variance of 1. However, in reality, examinee distributions are not always normal, and the expected distribution may not match the actual examinee distribution, which can only be determined once the tests are administered. The question raised is how robust the design is to violations of the distributional assumption. There are two situations, and thus two treatments are required. In the case where the expected distribution is not normal, it is possible to sample the examinees from a predefined examinee distribution, which can be constructed from previous test administrations. On the other hand, since this is a simulation study, violation of the assumptions may threaten the validity of the study and affect the results. The extent of the potential impact could be a topic for future research. The end product of the item pool design is a blueprint listing the number of needed items in each bin, that is, items with a- and b-parameters in a certain range. Similar to the function of a test blueprint for a paper-and-pencil test, the item pool blueprint serves as a guide for item selection or item creation for the item pool. It portrays the optimal item composition an item pool should have and, therefore, is a target item developers should try to match. Items with the desired content coverage and statistics can either be selected from previously written ones or created by item writers. This method has not been tested in practice. In this study, all items required by the design method are assumed to be available when comparing the optimally designed item pools to the operational pools. It seems hard to produce items with exactly the same item parameters required by the item pool design blueprint.
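The blueprint's bins can be realized by cutting the a- and b-scales into intervals and counting items per cell. A minimal sketch, in which the bin edges are illustrative rather than the ones used in this study:

```python
import bisect

# Illustrative bin edges; not the study's actual cut points.
A_EDGES = [0.00, 0.89, 1.26, 1.55, 1.79, 2.00]
B_EDGES = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

def bin_of(value, edges):
    """Index of the half-open interval [edges[i], edges[i+1]) holding value."""
    return bisect.bisect_right(edges, value) - 1

def blueprint_counts(items):
    """Count items per (a-bin, b-bin) cell; items are (a, b) pairs."""
    counts = {}
    for a, b in items:
        cell = (bin_of(a, A_EDGES), bin_of(b, B_EDGES))
        counts[cell] = counts.get(cell, 0) + 1
    return counts

items = [(1.3, -0.5), (1.3, -0.4), (0.9, 1.2)]
print(blueprint_counts(items))  # {(2, 2): 2, (1, 4): 1}
```

Comparing such counts for an operational pool against the blueprint's required counts shows, bin by bin, where items must be written and where a surplus can be banked.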
However, with advances in item modeling research, it will become more and more feasible to create large numbers of similar items with the desired psychometric properties. Because the PM method takes into account the correlation between a- and b-parameters, the blueprint designed by the PM method may be easier to fulfill. The MTI pools achieve acceptable measurement precision with a minimum number of items, but it is uncertain how hard it would be to find or create the proper items. On the other hand, improvements to the design method, such as combining the two methods to take advantage of the good features of each, would make the design more practical. In addition, it should be pointed out that by defining the width of the "bins," the blueprint requires similar items within a certain range rather than items with exact parameter values. Future studies are needed to investigate how hard it is to supply the items the blueprint requires.

8.3 Implications on Item Pool Management

In practice, operational item pools are not static. In most testing programs, tests are administered from the bank and new items are pretested on a continuous basis. Obsolete items are removed from time to time. Thus, monitoring item usage and replenishing new items are two important tasks of item pool management (van der Linden & Veldkamp, 2000). The item pool design methods presented here can easily be adapted for use in item pool management, both at the master pool level and at the operational pool level. The master item pool is a union of operational item pools. The distribution of the optimal master pool could simply be a number of replications of the operational pool distribution. In other words, if the master pool will support ten smaller operational pools, the optimal number of items in each bin of the master pool is simply ten times the number in that bin of the optimal pool designed by the simulation method.
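The replication rule, and the exposure-rate refinement of it discussed in this section, can be sketched per bin. The bin labels, R, and the exposure rates below are hypothetical; rounding up to whole items is an added assumption:

```python
import math

def master_pool_bins(operational_bins, R, exposure_rates=None):
    """Master-pool item counts per bin. Plain replication multiplies
    each bin count X_AB by R; with expected exposure rates r_AB, the
    count becomes max(ceil(R * r_AB * X_AB), X_AB), so heavily exposed
    bins receive more items and lightly exposed bins receive fewer."""
    if exposure_rates is None:
        return {k: R * x for k, x in operational_bins.items()}
    return {
        k: max(math.ceil(R * exposure_rates[k] * x), x)
        for k, x in operational_bins.items()
    }

bins = {"bin1": 5, "bin2": 5}
print(master_pool_bins(bins, R=10))  # {'bin1': 50, 'bin2': 50}
print(master_pool_bins(bins, R=10,
                       exposure_rates={"bin1": 0.30, "bin2": 0.05}))
# {'bin1': 15, 'bin2': 5}
```

The max with X_AB acts as a floor: no bin of the master pool is ever smaller than the corresponding bin of a single operational pool, even when its expected exposure is very low.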
Alternatively, the union method can take into consideration the expected exposure rates for the items in each bin, in which case the number of items needed in each bin of the master pool can be expressed as

X*_AB = max(R × X_AB × r_AB, X_AB),

where R is the number of operational item pools the master item pool must support, X_AB is the number of items in bin AB of the optimal operational pool, and r_AB is the expected exposure rate for the items in that bin. In this way, the master item pool has more items in the most exposed bins and fewer items in the least exposed bins.

8.4 Reckase's Method versus the Mathematical Programming Method

The results show that the extensions to Reckase's method work well in designing optimal item pools in situations where items are calibrated with the 3PL model. Compared to the mathematical programming method, Reckase's method simulates the CAT procedure directly and, therefore, is more flexible in accommodating different item selection and ability estimation processes and is easier to implement. Constraints on non-statistical attributes (e.g., content balancing) are absorbed into the first stage of the design by partitioning the target pool into smaller ones. No special software is needed. The mathematical programming method is more mathematically structured, quantifying all the constraints and searching for the optimal solutions with linear programming, but it also requires the use of a "shadow test" item selection approach in the CAT simulation. Reckase's method emphasizes the randomness of the item parameters in simulation, while the mathematical programming method focuses on optimizing predefined "pseudo" items. In the end, since both model the same CAT process, the simulation results should be similar. While taking different approaches, Reckase's method and the mathematical programming method are similar in many ways. One of the important similarities is between the PM item simulation approach and the mathematical programming method in the way item costs are minimized in the item pool design process.
The mathematical programming method defines a cost function that is the inverse of the number of real items with a certain combination of attributes, including IRT parameters. It assumes that the more real items there are with a given combination of item parameters, the less costly it is to create items with that combination. The idea is essentially the same as in the PM method, in which the simulation is more likely to generate items along the regression line of b-parameters on a-parameters, where more real items are clustered. Either method may be able to borrow ideas from the other to improve the item pool design. No literature has described the design of item pools with a-stratified exposure control by the mathematical programming method. Chang and van der Linden (2003) described a 0-1 linear programming method to optimize the stratification of a-parameters for an existing item pool, but not for "pseudo items." As explored in this study, it may be possible to simulate the item pool design by varying the target information at different stages of the test.

8.5 Limitations and Future Studies

Due to limited resources, the prediction models in this study were based on one operational item pool. In practice, it is possible to use multiple recent item pools to obtain a more accurate estimate of the attributes of the items written for the testing programs. Previous research showed that the bin width might influence the number of items required in the optimal pool. With the post-simulation adjustment utilized in this study, unnecessary items in the bins are trimmed, so the bin width might not influence the size of the final pool. However, future studies are needed to investigate the impact. The optimal item pool blueprints designed for computerized adaptive testing with a-stratified exposure control appear to require more items than those designed for CAT with Sympson-Hetter exposure control.
One of the reasons is that the post-simulation adjustment was not applied, because of the different item selection procedure a-stratified exposure control uses. Future research is expected to explore appropriate post-simulation adjustments for item pools whose item selection procedures differ from maximum-information-based ones. While this study investigated the optimal item pool design with the PM and MTI methods separately, both methods have shortcomings. The MTI method tends to result in items with low correlations between their a- and b-parameters. The PM method, while maintaining a correlation similar to that of the items in the operational pools, tends to perform better over some ability levels than others. It is important for future research to explore ways to combine the two design methods so that item generation takes into account the item parameter correlations while meeting the minimum information requirement.

APPENDIX

Table A.1 Item Distribution for the Operational Item Pool — Arithmetic Reasoning [table values illegible in scan]
Table A.2 Item Distribution for Item Pool Designed by MTI Method and without Exposure Control — Arithmetic Reasoning [table values illegible in scan]

Table A.3 Item Distribution for Item Pool Designed by PM Method and without Exposure Control — Arithmetic Reasoning [table values illegible in scan]

Table A.4 Item Distribution for Item Pool Simulated with MTI Method and with Sympson-Hetter Exposure Control — Arithmetic Reasoning [table values illegible in scan]

Table A.5 Item Distribution for Item Pool Simulated with PM Method and with Sympson-Hetter Exposure Control — Arithmetic Reasoning [table values illegible in scan]
Table A.6 Item Distribution for Item Pool Simulated with MTI Method and with a-Stratified Exposure Control — Arithmetic Reasoning [table values illegible in scan]

Table A.7 Item Distribution for Item Pool Simulated with PM Method and with a-Stratified Exposure Control — Arithmetic Reasoning
(rows: b-parameter intervals; columns: a-parameter intervals)

b \ a         0.00  0.89  1.26  1.55  1.79  2.00  2.19  2.37  2.53  2.68  2.83  2.97  Total
-inf  -3.00      0     0     0     0     0     0     0     0     0     0     0     0      0
-3.00 -2.80      0     0     0     0     0     0     0     0     0     0     0     0      0
-2.80 -2.60      0     0     0     0     0     0     0     0     0     0     0     0      0
-2.60 -2.40      0     6     0     0     0     0     0     0     0     0     0     0      6
-2.40 -2.20      0     2     5     0     0     0     0     0     0     0     0     0      7
-2.20 -2.00      0     2     6     0     0     0     0     0     0     0     0     0      8
-2.00 -1.80      0     0     9     0     0     0     0     0     0     0     0     0      9
-1.80 -1.60      0     0    10     0     0     0     0     0     0     0     0     0     10
-1.60 -1.40      0     0     7     3     0     0     0     0     0     0     0     0     10
-1.40 -1.20      0     0     6     4     0     0     0     0     0     0     0     0     10
-1.20 -1.00      0     0     3     7     0     0     0     0     0     0     0     0     10
-1.00 -0.80      0     0     1     8     2     0     0     0     0     0     0     0     11
-0.80 -0.60      0     0     0     7     3     0     0     0     0     0     0     0     10
-0.60 -0.40      0     0     0     5     5     0     0     0     0     0     0     0     10
-0.40 -0.20      0     0     0     0     8     2     0     0     0     0     0     0     10
-0.20  0.00      0     0     0     0     8     3     0     0     0     0     0     0     11
 0.00  0.20      0     0     0     0     3     5     2     0     0     0     0     0     10
 0.20  0.40      0     0     0     0     1     7     2     0     0     0     0     0     10
 0.40  0.60      0     0     0     0     0     3     5     2     0     0     0     0     10
 0.60  0.80      0     0     0     0     0     2     6     2     0     0     0     0     10
 0.80  1.00      0     0     0     0     0     0     5     4     1     0     0     0     10
 1.00  1.20      0     0     0     0     0     0     1     6     2     0     0     0      9
 1.20  1.40      0     0     0     0     0     0     0     3     4     2     0     0      9
 1.40  1.60      0     0     0     0     0     0     0     1     5     3     0     0      9
 1.60  1.80      0     0     0     0     0     0     0     0     2     4     2     0      8
 1.80  2.00      0     0     0     0     0     0     0     0     0     2     4     2      8
 2.00  2.20      0     0     0     0     0     0     0     0     0     0     4     3      7
 2.20  2.40      0     0     0     0     0     0     0     0     0     0     2     4      6
 2.40  2.60      0     0     0     0     0     0     0     0     0     0     1     4      5
 2.60  2.80      0     0     0     0     0     0     0     0     0     0     0     4      4
 2.80  3.00      0     0     0     0     0     0     0     0     0     0     0     0      0
 3.00   inf      0     0     0     0     0     0     0     0     0     0     0     0      0
Total            0    10    47    34    30    22    21    18    14    11    13    17    237

Table A.8 Item Distribution for the Operational Item Pool — General Science Content 1 [table values illegible in scan]

Table A.9 Item Distribution for the Operational Item Pool — General Science Content 2 [table values illegible in scan]
[Values in Tables A.10 through A.27 are illegible in the scan; only the captions are recoverable.]

Table A.10 Item Distribution for the Operational Item Pool — General Science Content 3
Table A.11 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control — General Science Content 1
Table A.12 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control — General Science Content 2
Table A.13 Item Distribution for the Optimal Item Pool Designed by MTI and without Exposure Control — General Science Content 3
Table A.14 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control — General Science Content 1
Table A.15 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control — General Science Content 2
Table A.16 Item Distribution for the Optimal Item Pool Designed by PM and without Exposure Control — General Science Content 3
Table A.17 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control — General Science Content 1
Table A.18 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control — General Science Content 2
Table A.19 Item Distribution for the Optimal Item Pool Designed by MTI and with Sympson-Hetter Exposure Control — General Science Content 3
Table A.20 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control — General Science Content 1
Table A.21 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control — General Science Content 2
Table A.22 Item Distribution for the Optimal Item Pool Designed by PM and with Sympson-Hetter Exposure Control — General Science Content 3
Table A.23 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control — General Science Content 1
Table A.24 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control — General Science Content 2
Table A.25 Item Distribution for the Optimal Item Pool Designed by MTI and with a-Stratified Exposure Control — General Science Content 3
Table A.26 Item Distribution for the Optimal Item Pool Designed by PM and with a-Stratified Exposure Control — General Science Content 1
Table A.27 Item Distribution for the Optimal Item Pool Designed by PM and with a-Stratified Exposure Control — General Science Content 2
0.52 O 0.52 0.78 0 0.78 1.04 0 1.04 1.30 0 1.30 1.56 O 1.56 1.82 0 1.82 2.08 0 2.08 2.34 0 2.34 2.60 O 2.60 2.86 0 2.86 00 0 0 Total OOOOOOOOOOOOOOO—‘wwO‘x-bbboo OOOOOOOOOO—N-fi-MONONWNOOOOOO OOOOOOONwwmAN—OOOOOOOOOO OOOO——WWWNOOOOOOOOOOOOOO OONWWWNOOOOOOOOOOOOOOOOO OOOOOOOOOOOOOOOOOOOOOOOOO OOOOOOOOOOOOOOOOOOOOOOOOO OOOOOOOOOOOOOOOOOOOOOOOOO OOOOOOOOOOOOOOOOOOOOOOOOO OCOOOOOOOOOOOOOOOOOOOOOOO OOOOOOOOOOOOOOOOOOOOOOOOO OCONWA-BMUIGUIGONQO‘QNQMON-bhboo to kit to O to O m U.) C 172 Table A28 Item Distribution for the Optimal Item Pool Designed by PM and with a- Stratified Exposure Control — General Science Content 3 \0.00 0.89 1.26 1.55 [.79 2.00 2.|9 2.37 2.53 2.68 2.83 2.97 b 0.89 126 1.55 l.79 2.00 2.19 2.37 2.53 2.68 2.83 2.97 00 Total -00 -2.86 -2.86 -2.60 -2.60 -2.34 -2.34 -2.08 -2.08 -l.82 -l.82 -l.56 -I.56 -l.30 -l.30 -l.04 -l.04 -O.78 -0.78 -O.52 -0.52 -0.26 -026 0.00 0.00 0.26 0.26 0.52 0.52 0.78 0.78 1.04 1.04 1.30 1.30 1.56 1.56 1.82 1.82 2.08 2.08 2.34 2.34 2.60 2.60 2.86 2.86 00 Total ooooooooooocococooooooooo 4;ooooooooooooooooo———~ooo Aoooooooooocoo————ooooooo Aooooooooo————ooooooooooo moooo—————ooooooooooooooo ooooooooooooooooocooooooo ooooooocoooooooooocooooco ooooooooooooooooooooooooo coooooooooocooooooooooooo ooooooooooooooooooooooooo ooooooooooooooooooocooooo ooooooooooooooooooooooooo oooo—-—-——-——-—-—-—~—-—-—-——u—-—-—-—-—-ooo \l 173 REFERENCES 174 REFERENCES Ariel, A., Veldkamp, B. P., & van der Linden, W. J. (2004). Constructing rotating item pools for constrained adaptive testing. Journal of Educational Measurement, 41(4), 345- 360. 'v Bejar, I. I. (1993.). A generative approach to psychological and educational measurement. In N. Frederiksen, R. J. Mislevy & I. I. Bejar (Eds. ), Test theory for a new generation of tests (pp. 323-359). Hillsdale, NJ: Erlbaum. Bejar I. I. “(1 996). Generative response modeling. Leveraging the computer as a test ' delivery medium (ETS RR- 96 13). Princeton, NJ: ETS. Bejar, I 1., Lawless R. R..Morley,M. E. 
Wagner M E., Bennett,R. E. ,& Revuelta, J. (2003). A feasibility study of on- -the fly item generation in adaptive testing. The Journal of Technology Learning and Assessment 2(3). http: //www.jtla. org. .HM-m _,‘--n'--‘. a" Bergstrom, B. A., & Lunz, M. E. (1999). CAT for certification and licensure. In F. Drasgow & J. Olson-Buchanan (Eds.), Innovations in Computerized Assessment (pp. 67-91). Mahwah, NJ: Lawrence Erlbaum Associates, Inc. Binet, A., & Simon, Th. A. (1905). Méthode nouvelle pour le diagnostic du niveau intellectuel des anorrnaux. L'Année Psychologique, 11, 191-244. I, Boekkooi-Timminga, E. (1991). A method for designing Rasch Model-based item banks. Paper presented at the annual meeting of the Psychometric Society, Princeton, NJ. Brown, J. M., & Weiss, D. J. (1977). An adaptive testing strategy for achievement test batteries (No. 77-6). Minneapolis: University of Minnesota, Psychometric Methods Program. Chang, H. H. (2004). Understanding computerized adaptive testing: From Robins-Monro to Lord and beyond. In David Kaplan (Ed.) The Sage handbook of quantitative methodology for the social sciences. (pp. 117-136). Thousand Oaks, CA: Sage Publications, Inc. Chang, S. W., Ansley, T. N., & Lin, S. H. (2000). Performance of item exposure control methods in computerized adaptive testing: Further explorations. Paper presented at the Annual Meeting of the American Educational Research Association. Chang, H. H., Qian, J ., & Ying, Z. (2001). Alpha-stratified multistage computerized adaptive testing with beta blocking. Applied Psychological Measurement, 25(4), 333-341. 175 Chang, H. H., & Stout, W. (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58, 37-52. Chang, H. H., & van der Linden, W. J. (2003). Optimal stratification of item pools in a- stratified computerized adaptive testing. Applied Psychological Measurement, 27(4), 262-274. Chang, H. H., & Ying, Z. (1999). 
Alpha—stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23(3), 211-222. Chang, H. H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20(3), 213-229. C hen. S. Y., Ankenmann, R. D., & Spray, J. A. (1999). Exploring the relationship between item exposure rate and test overlap rate in computerized adaptive testing (N o. ACT-RR-99-5): American College Testing Program, Iowa City, IA. Davey, T., & Parshall, C. G. (1995). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA. Davey, T., & Thomas, L. (1996). Constructing adaptive tests to parallel conventional programs. Paper presented at the annual meeting of the American Educational Research Association, New York. Davis, L. L. (2002). Strategies for controlling item exposure in computerized adaptive testing with polytomously scored items. Unpublished doctoral dissertation, University of Texas, Austin. Davis, L. L., & Dodd, B. G. (2001). An examination of testlet scoring and item exposure constraints in the verbal reasoning section of the MCA T. MCAT Monograph Series: Association of American Medical Colleges. ' Deane, P. ,,Graf E. A. Higgins, D. ,Futagi, Y., Lawless, R. (2006). Model analysis and modefcreatzon Capturing the task-model structure of quantitative item domains (No. ETS- RR-06— 11). Educational Testing Service, Princeton, NJ. Dorans, N. J ., & Kulick, E. M. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement. 23, 355-368. ------- “"“x Embretson, S. E. (2001’). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P. C. Kyllonen (Eds) Item generation for test development. Mahwah, NJ: Erlbaum Publishers. 
Forsythe, G. E., Malcolm, M. A., & Moler, C. B. (1976). Computer methods for mathematical computations. Englewood Cliffs, NJ: Prentice-Hall.

Graf, E. A., Peterson, S., Steffen, M., & Lawless, R. (2005). Psychometric and cognitive analysis as a basis for the design and revision of quantitative item models (ETS RR-05-25). Princeton, NJ: Educational Testing Service.

Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21(4), 347-360.

Grist, S., Rudner, L., & Wise, L. (1989). Computer adaptive tests (ERIC Digest 107). Washington, DC: American Institute for Research and ERIC Clearinghouse on Tests, Measurement, and Evaluation. (ERIC No. ED 315 425)

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Hau, K. T., Wen, J. B., & Chang, H. H. (2002). Optimum number of strata in the a-stratified computerized adaptive testing design. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Henning, G. (1987). A guide to language testing. Cambridge, MA: Newbury House.

Hetter, R. D., & Sympson, J. B. (1997). Item exposure control in CAT-ASVAB. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington, DC: American Psychological Association.

Hively, W., Patterson, H. L., & Page, S. H. (1968). A "universe-defined" system of arithmetic achievement items. Journal of Educational Measurement, 5, 275-290.

Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum Associates.

Koch, W. R., & Dodd, B. G. (1989). An investigation of procedures for computerized adaptive testing using partial credit scoring. Applied Measurement in Education, 2, 335-357.

Lewis, C., & Sheehan, K. (1990). Using Bayesian decision theory to design a computerized mastery test. Applied Psychological Measurement, 14, 367-386.

Lord, F. M. (1971). The self-scoring flexilevel test. Journal of Educational Measurement, 8, 147-151.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Luecht, R. M. (1998). A framework for exploring and controlling risks associated with test item exposure over time. Paper presented at the annual meeting of the National Council on Measurement in Education.

MathWorks. (2005). MATLAB: The language of technical computing, Version 7, Release 14, Student Version [Computer software]. Natick, MA: Author.

McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New horizons in testing (pp. 223-226). New York: Academic Press.

Millman, J., & Arter, J. A. (1984). Issues in item banking. Journal of Educational Measurement, 21(4), 315-330.

Mills, C. N., & Stocking, M. L. (1996). Practical issues in large-scale computerized adaptive testing. Applied Measurement in Education, 9, 287-304.

Owen, R. J. (1969). A Bayesian approach to tailored testing (RB-69-92). Princeton, NJ: Educational Testing Service.

Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70(350), 351-356.

Parshall, C., Davey, T., & Nering, M. (1998). Test development exposure control for adaptive tests. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Patsula, L. N., & Steffen, M. (1997). Maintaining item and test security in a CAT environment: A simulation study. Paper presented at the annual meeting of the National Council on Measurement in Education.

Prosser, F. (1974). Item banking. In G. Lippey (Ed.), Computer-assisted test construction (pp. 29-66). Englewood Cliffs, NJ: Educational Technology Publications.

Reckase, M. D. (1976). An application of the Rasch simple logistic model to tailored testing. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Reckase, M. D. (1989). Adaptive testing: The evolution of a good idea. Educational Measurement: Issues and Practice, 8(3), 11-15.

Reckase, M. D. (2003). Item pool design for computerized adaptive tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Reckase, M. D., & He, W. (2004). The ideal item pool for the NCLEX-RN examination: Report to NCSBN. East Lansing, MI: Michigan State University.

Reckase, M. D., & He, W. (2005). Ideal item pool design for the NCLEX-RN exam. East Lansing, MI: Michigan State University.

Rudner, L. (1998). Item banking. Practical Assessment, Research & Evaluation, 6(4).

Segall, D. O., Moreno, K. E., & Hetter, R. D. (1997). Item pool development and evaluation. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 117-130). Washington, DC: American Psychological Association.

Shealy, R., & Stout, W. F. (1993). A model-based standardization approach that separates true bias/DIF from group differences and detects test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159-194.

Singley, M. K., & Bennett, R. E. (2002). Item generation and beyond: Applications of schema theory to mathematics assessment. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development (pp. 361-384). Mahwah, NJ: Lawrence Erlbaum Associates.

Stocking, M. L. (1994). Three practical issues for modern adaptive testing item pools (ETS RR-94-5). Princeton, NJ: Educational Testing Service.

Stocking, M. L., & Lewis, C. (1995). A new method for controlling item exposure in computer adaptive testing (Research Report 95-25). Princeton, NJ: Educational Testing Service.

Stocking, M. L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23, 57-75.

Stocking, M. L., & Lewis, C. (2000). Methods of controlling the exposure of items in CAT. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 163-182). Dordrecht, Netherlands: Kluwer Academic Publishers.

Stocking, M. L., & Swanson, L. (1998). Optimal design of item pools for computerized adaptive tests. Applied Psychological Measurement, 22, 271-279.

Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. Paper presented at the 27th annual meeting of the Military Testing Association, San Diego, CA.

Thissen, D., & Mislevy, R. J. (2000). Testing algorithms. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed., pp. 101-133). Mahwah, NJ: Lawrence Erlbaum Associates.

Thomasson, G. L. (1995). New item exposure control algorithms for computerized adaptive testing. Paper presented at the annual meeting of the Psychometric Society, Minneapolis, MN.

Urry, V. W. (1977). Tailored testing: A successful application of latent trait theory. Journal of Educational Measurement, 14(2), 181-196.

van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22, 195-211.

van der Linden, W. J. (2005). Linear models for optimal test design. New York: Springer-Verlag.

van der Linden, W. J., & Pashley, P. J. (2000). Item selection and ability estimation in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 1-25). Boston: Kluwer.

van der Linden, W. J., Scrams, D. J., & Schnipke, D. L. (1999). Using response-time constraints to control for speededness in computerized adaptive testing. Applied Psychological Measurement, 23, 195-210.

Veldkamp, B. P., & van der Linden, W. J. (1999). Designing item pools for computerized adaptive testing (Research Report 99-03). Enschede, Netherlands: University of Twente, Faculty of Educational Science and Technology.

Wainer, H. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.

Wainer, H. (2000). Rescuing computerized testing by breaking Zipf's law. Journal of Educational and Behavioral Statistics, 25(2), 203-224.

Wainer, H., & Eignor, D. (2000). Caveats, pitfalls, and unexpected consequences of implementing large-scale computerized testing. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed., pp. 271-299). Mahwah, NJ: Lawrence Erlbaum Associates.

Weiss, D. J. (1976). Adaptive testing research in Minnesota: Overview, recent results, and future directions. In C. L. Clark (Ed.), Proceedings of the first conference on computerized adaptive testing (pp. 24-35). Washington, DC: United States Civil Service Commission.

Weiss, D. J. (1985). Adaptive testing by computer. Journal of Consulting and Clinical Psychology, 53(6), 774-789.

Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.

Wetzel, C. D., & McBride, J. R. (1986). Reducing the predictability of adaptive item sequences. Paper presented at the annual conference of the Military Testing Association, San Diego, CA.

Yao, T. (1991). CAT with a poorly calibrated item bank. Rasch Measurement Transactions, 5(2), 141.