LIBRARY
Michigan State University

This is to certify that the dissertation entitled OPTIMAL ITEM POOL DESIGN FOR A HIGHLY CONSTRAINED COMPUTERIZED ADAPTIVE TEST presented by Wei He has been accepted towards fulfillment of the requirements for the Ph.D. degree in Measurement and Quantitative Methods.

Major Professor's Signature

Date

OPTIMAL ITEM POOL DESIGN FOR A HIGHLY CONSTRAINED COMPUTERIZED ADAPTIVE TEST

By

Wei He

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

2010

ABSTRACT

OPTIMAL ITEM POOL DESIGN FOR A HIGHLY CONSTRAINED COMPUTERIZED ADAPTIVE TEST

By Wei He

Item pool quality has been regarded as one important factor in realizing enhanced measurement quality for the computerized adaptive test (CAT) (e.g., Flaugher, 2000; Jensema, 1977; McBride & Weiss, 1976; Reckase, 1976, 2003; van der Linden, Ariel, & Veldkamp, 2006; Veldkamp & van der Linden, 2000; Xing & Hambleton, 2004). However, studies on how to identify the desired features of an item pool for a CAT are rare. Unlike the problem of item pool assembly, in which an item pool is assembled from an available master pool according to the desired specifications, no actual items are available yet in the problem of item pool design (van der Linden, Ariel, & Veldkamp, 2006). Since no actual items are available when designing an item pool, designing an item pool that is optimal intuitively becomes a desired goal. This study focuses on designing an optimal item pool for a CAT using the weighted deviations model (WDM; Stocking & Swanson, 1993) item selection procedure. Drawing on Reckase (2003) and Gu (2007), this study extends the bin-and-union method proposed by Reckase (2003) to a CAT with a large set of complex non-statistical constraints. The method used to generate optimal item features combines methods based on McBride and Weiss (1976) and Gu (2007) for statistical features with a sampling method based on test specifications for non-statistical features. The end product is an item blueprint describing the items' statistical and non-statistical attributes, the item number distribution, and the optimal item pool size. A large-scale operational CAT program served as the CAT template in this study. Three key factors considered to potentially impact optimal item pool features were manipulated: item generation method, expected amount of item information change, and b-bin width. Optimal item pool performance was evaluated and compared with that of an operational item pool in light of a series of criteria including measurement accuracy and precision, item pool utilization, test security, constraint violation, and classification accuracy. A demonstrative example of how to use the identified optimal item pool features for item pool assembly is provided. How to apply optimal item pool features to item pool management, operational item pool assembly, and item writing is also discussed.
To the loving memory of my grandpa Lianbi Guo

To my husband Chuan, my daughter Megan, and my parents

ACKNOWLEDGEMENTS

The completion of this dissertation marks the end of my long journey in Ph.D. study, which would not have been possible without the guidance, support, and encouragement of many people. My deepest appreciation and thanks go to my academic advisor and dissertation chair, Professor Mark Reckase. I feel grateful and blessed to have worked closely with him since the first year of my graduate study. I thank him for his unremitting mentoring and scholarly insights that kindled my enthusiasm for research; I thank him for his support and trust that motivated me to achieve the best that I can and to turn the seemingly impossible into the achievable. He has provided me with the most exceptional mentor-student relationship I could have dreamed of. I attribute most of my achievements in graduate school to his guidance, support, encouragement, and warm care.

I would like to thank the other members of my committee, Professor Richard Houang, Professor Sharif Shakrani, and Professor Alexander von Eye, for their insights, suggestions, and assistance not only with this dissertation but throughout my Ph.D. study. My special gratitude also extends to Dr. Edward Wolfe, who has provided me with support and advice during my Ph.D. study. I also feel blessed to have had the opportunity to work closely with him; the completion of my Ph.D. study owes much to his scholarly guidance and constant support. I would like to thank Lixiong for always being available to answer my questions while I worked on this dissertation and Linda for her generous help in proofreading it.

I would like to express my gratitude to my friends Shufang and Rachel. Shufang, it is such a blessing to have you as a friend and a big sister, not only in Shanghai but also here in the United States. Thank you for being such a great helper in my life! Rachel, thank you so much for your constant warm care since I landed in this country! I also thank the other friends and professors who have enriched my life as a doctoral student, including Hui Jin, Sungwom Ngudgratoke, Dipendra Subedi, Qi Chen, Qi Diao, Yunfei Wu, Muthoni, Chueh-an Hsieh, Hong Jiao, Shudong Wang, Rui Gao, Auntie Dorothy and Uncle Rex, Dr. Yeow Meng Thum, Dr. Kim Maier, and numerous others.

I am very honored and grateful for the College Board Research Fellowship, the Robert Ebel Scholarship, and the Dissertation Completion Fellowship. I would also like to thank the National Council of State Boards of Nursing for providing me with valuable opportunities to work on their research projects.

Finally, I would like to thank my parents and my sister for their immeasurable support and unquestioning love. I owe the most to my dear husband for his unconditional love and support and to my daughter Megan for being such a tremendous source of happiness and inspiration in my life!

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACRONYMS

CHAPTER I  INTRODUCTION
1.1 Background
1.2 Statement of Research Questions
1.3 Significance of the Study
CHAPTER II  LITERATURE REVIEW
2.1 Introduction to Computerized Adaptive Testing
2.2 Item Pool Features, Population Distributions, and Other CAT Components
2.3 Practical Constraints in Item Selection in CAT
2.3.1 Content-balancing techniques
2.3.2 Item exposure control procedure
2.4 Optimal Item Pool Design Methods for CAT
2.4.1 The binary integer programming method
2.4.2 The bin-and-union method and its extension

CHAPTER III  METHODOLOGY AND RESEARCH DESIGN
3.1 Methodology
3.1.1 Defining a bin map
3.1.2 Generating optimal items
3.1.3 Modeling the WDM procedure
3.1.4 Post-adjusting item pool size
3.2 Research Design
3.2.1 CAT model
3.2.2 Simulation design
3.2.3 Research procedure
3.2.4 Evaluation criteria

CHAPTER IV  RESULTS
4.1 Characteristics of Candidate ROPs
4.2 Performance of Candidate Optimal Item Pools
4.2.1 Evaluation results from using conditional θ points
4.2.2 Evaluation results from using 20,000 examinees randomly sampled from the target examinee population
4.3 A Demonstrative Example

CHAPTER V  SUMMARY, DISCUSSION, AND IMPLICATION
5.1 Summary of Research
5.2 Discussion
5.3 Implication
5.3.1 Implication for item pool management and assembly for CAT
5.3.2 Implication for item writing and development
5.4 Limitation and Future Research Direction
APPENDIX
REFERENCE

LIST OF TABLES

Table 3.1 Another view of ab-bin/ab-block
Table 3.2 Information on exam constraints and weights
Table 3.3 Simulation design
Table 4.1 Summary descriptive statistics of 24 candidate ROPs
Table 4.2 Number of tests witnessing constraint violation given by the OP and the candidate ROPs
Table 4.3 Overall performance statistics for the OP and the candidate ROPs using a random sample of 20,000 examinees
Table 4.4 Classification accuracy rates for each of four performance levels
Table 4.5 Item pool statistics for the Demo_Pool, the OP, and the ROP_21
Table 4.6 Conditional biases, MSEs, and SEs given by the OP, the Demo_Pool, and the ROP_21
Table 4.7 Overall performance statistics for the OP, the Demo_Pool, and the ROP_21 using a random sample of 20,000 examinees
Table A.1 Conditional biases given by the OP and the candidate ROPs
Table A.2 Conditional MSEs given by the OP and the candidate ROPs
Table A.3 Conditional SEs given by the OP and the candidate ROPs

LIST OF FIGURES

Figure 2.1 A flowchart of CAT administration
Figure 2.2 Increase of item pool size with the increase of number of examinees
Figure 3.1 Percentage of maximum item information conditional on the distance between b-value and θ
Figure 3.2 An illustrative example of ab-bin/ab-block
Figure 4.1 Item number distribution in each ab-block for the operational item pool
Figure 4.2 Item number distribution in each ab-block for ROP1, ROP2, ROP13, and ROP14
Figure 4.3 Item number distribution in each ab-block for ROP9, ROP10, ROP21, and ROP22
Figure 4.4 Item discrimination and difficulty parameter distributions for the OP, ROP1, ROP2, ROP13, and ROP14
Figure 4.5 Item discrimination and difficulty parameter distributions for the OP, ROP9, ROP10, ROP21, and ROP22
Figure 4.6 Item pool information for the OP and 8 candidate ROPs
Figure 4.7 Distributions of item attributes for the candidate ROP_1, ROP_2, ROP_13, and ROP_14
Figure 4.8 Distributions of item attributes for the candidate ROP_9, ROP_10, ROP_21, and ROP_22
Figure 4.9 Graphical representation of conditional bias
Figure 4.10 Graphical representation of conditional mean square error
Figure 4.11 Graphical representation of conditional standard error
Figure A.1 Item Number Distribution in each ab-block for ROP3
Figure A.2 Item Number Distribution in each ab-block for ROP4
Figure A.3 Item Number Distribution in each ab-block for ROP5
Figure A.4 Item Number Distribution in each ab-block for ROP6
Figure A.5 Item Number Distribution in each ab-block for ROP7
Figure A.6 Item Number Distribution in each ab-block for ROP8
Figure A.7 Item Number Distribution in each ab-block for ROP11
Figure A.8 Item Number Distribution in each ab-block for ROP12
Figure A.9 Item Number Distribution in each ab-block for ROP15
Figure A.10 Item Number Distribution in each ab-block for ROP16
Figure A.11 Item Number Distribution in each ab-block for ROP17
Figure A.12 Item Number Distribution in each ab-block for ROP18
Figure A.13 Item Number Distribution in each ab-block for ROP19
Figure A.14 Item Number Distribution in each ab-block for ROP20
Figure A.15 Item Number Distribution in each ab-block for ROP23
Figure A.16 Item Number Distribution in each ab-block for ROP24
Figure A.17 Item Discrimination and Difficulty Parameter Distributions for ROP3
Figure A.18 Item Discrimination and Difficulty Parameter Distributions for ROP4
Figure A.19 Item Discrimination and Difficulty Parameter Distributions for ROP5
Figure A.20 Item Discrimination and Difficulty Parameter Distributions for ROP6
Figure A.21 Item Discrimination and Difficulty Parameter Distributions for ROP7
Figure A.22 Item Discrimination and Difficulty Parameter Distributions for ROP8
Figure A.23 Item Discrimination and Difficulty Parameter Distributions for ROP11
Figure A.24 Item Discrimination and Difficulty Parameter Distributions for ROP12
Figure A.25 Item Discrimination and Difficulty Parameter Distributions for ROP15
Figure A.26 Item Discrimination and Difficulty Parameter Distributions for ROP16
Figure A.27 Item Discrimination and Difficulty Parameter Distributions for ROP17
Figure A.28 Item Discrimination and Difficulty Parameter Distributions for ROP18
Figure A.29 Item Discrimination and Difficulty Parameter Distributions for ROP19
Figure A.30 Item Discrimination and Difficulty Parameter Distributions for ROP20
Figure A.31 Item Discrimination and Difficulty Parameter Distributions for ROP23
Figure A.32 Item Discrimination and Difficulty Parameter Distributions for ROP24
Figure A.33 Distributions of item attributes for the candidate ROP_1 to ROP_4
Figure A.34 Distributions of item attributes for the candidate ROP_5 to ROP_8
Figure A.35 Distributions of item attributes for the candidate ROP_9 to ROP_12
Figure A.36 Distributions of item attributes for the candidate ROP_13 to ROP_16
Figure A.37 Distributions of item attributes for the candidate ROP_17 to ROP_20
Figure A.38 Distributions of item attributes for the candidate ROP_21 to ROP_24

(Images in this dissertation are presented in color.)

ACRONYMS

ASVAB: Armed Services Vocational Aptitude Battery
CAT: Computerized Adaptive Testing
CCAT: Constrained Computerized Adaptive Testing
EAP: Expected a Posteriori
ETS: Educational Testing Service
DP: Davey and Parshall
GMAT: Graduate Management Admission Test
GRE: Graduate Record Exam
KS: Kolmogorov-Smirnov test
MCCAT: Modified Constrained Computerized Adaptive Testing
MLE: Maximum Likelihood Estimation
MMM: Modified Multinomial Model
MPI: Maximum Priority Index
MRP: Mixed Random and Prediction
MTI: Minimum Test Information
NCSBN: National Council of State Boards of Nursing
PPT: Paper-and-Pencil Test
R: Random Procedure
ROP: Range-Optimal Item Pool
SH: Sympson-Hetter
SL: Stocking and Lewis Unconditional Multinomial Procedure
SLC: Stocking and Lewis Conditional Multinomial Procedure
STA: Shadow Test Approach
TOEFL: Test of English as a Foreign Language
WDM: Weighted Deviations Model
WPM: Weighted Penalty Model

CHAPTER I
INTRODUCTION

1.1 BACKGROUND

Computerized adaptive testing (CAT) has been demonstrated to be a mature mode of testing, as witnessed by the successful application of CAT to several large-scale educational assessment programs in the last two decades. Examples of these large-scale testing programs include the Graduate Record Examination (GRE), the Graduate Management Admission Test (GMAT), the Test of English as a Foreign Language (TOEFL), the NCLEX® exam series by the National Council of State Boards of Nursing (NCSBN), and the Armed Services Vocational Aptitude Battery (ASVAB). An appealing feature of CAT is its ability to achieve, at the individual examinee level, considerable gains in both measurement precision and efficiency with fewer items than would be required on a conventional paper-and-pencil test (Eggen & Straetmans, 2000; Lewis & Sheehan, 1990; Lord, 1977; Wainer, 2000; Weiss, 1982). This measurement advantage is realized mainly by administering each candidate an individualized test in which each item is sequentially tailored to the examinee's current ability estimate (denoted by θ hereafter). The tailoring is achieved by item selection rules and may take different forms depending on the rule. Currently, the two most popular CAT item selection algorithms are based on maximum Fisher information and on minimum posterior variance (van der Linden & Pashley, 2000). The former selects the item that maximizes Fisher information at the current θ, whereas the latter selects the item that minimizes the posterior variance evaluated at the examinee's current θ.
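To make the two selection criteria concrete, the sketch below computes Fisher information for three-parameter logistic (3PL) items and applies the maximum-information rule. It is an illustration rather than the algorithm of any particular operational program; the three-item pool, the scaling constant D = 1.7, and the current estimate of 0.1 are all hypothetical values.

```python
import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at ability theta."""
    p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

# Hypothetical three-item pool: columns are a, b, c.
pool = np.array([[1.2, -0.5, 0.2],
                 [0.8,  0.0, 0.2],
                 [1.5,  0.4, 0.2]])

theta_hat = 0.1                      # current ability estimate
infos = info_3pl(theta_hat, pool[:, 0], pool[:, 1], pool[:, 2])
print("item selected:", int(np.argmax(infos)))   # maximum-information rule
```

A posterior-variance-based rule would instead evaluate, for each candidate item, the expected posterior variance of θ after administering it and take the minimum.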
It can be easily anticipated that the optimal tailoring occurs when every item in the pool exactly matches the desired features requested by an item selection rule. For the maximum information selection rule, for example, when the Rasch model is used as the item response model, the optimal tailoring occurs when an item is available whose item difficulty parameter (denoted by b hereafter) equals the current proficiency estimate. A recent study by Reckase and He (2009b) demonstrated negligibly small bias and mean squared error (MSE) in ability estimates when there is optimal tailoring. Other CAT components being equal, a CAT that has every desired item requested by the CAT algorithm in the pool is expected to yield better measurement outcomes than a CAT that does not. In other words, the measurement outcome quality of adaptive testing, like that of any other test, is correlated with item pool quality. Flaugher (2000) discussed the relationship between the quality of the item pools and the job the adaptive algorithm can do:

Obviously, the higher the quality of the item pools, the better the job the adaptive algorithm can do. The best and most sophisticated adaptive program cannot function if it is held in check by a limited pool of items, or items of poor quality (p. 38).

Even in the early 1970s, the inception of CAT research, researchers started to notice and, either implicitly or explicitly, acknowledge that item pool characteristics may play a role in an adaptive test's achieving "the best attainable results" (McBride & Weiss, 1976, p. 9). Following this philosophy, several studies such as Jensema (1972; 1977) and McBride and Weiss (1976) created "ideal" and "perfect" item pools to explore the properties of Owen's Bayesian adaptive ability testing procedures. The characteristics of the "ideal" item pool simulated in McBride and Weiss (1976) followed those in Jensema (1972): item discrimination parameters (denoted by a hereafter) were set as high as possible, preferably exceeding .8; item guessing parameters (denoted by c hereafter) were set equal to .2; and item b parameters were evenly and uniformly distributed along the proficiency scale. A "perfect" item pool differed from an "ideal" item pool mainly in two respects. One was that a "perfect" item pool was created to behave as if it contained an unlimited number of items at any specifiable difficulty level. The other was that the items' b-values were optimal in that they were calculated from a formula given by Birnbaum (1968, p. 464), which defines the location at which an item provides its maximum information given its $a_i$ and $c_i$ and assuming $\hat{\theta}_i = \theta_i$; the items' a-values were then obtained from a linear equation regressing a-values on the optimal b-values. The importance of item pool quality in realizing CAT's measurement quality continued to be emphasized in the CAT research literature of the 1990s and 2000s, such as in Dodd et al. (1995), Embretson (2001), Flaugher (2000), Gorin et al. (2005), van der Linden (1998), Wang and Vispoel (1998), Wang and Kolen (2001), and Xing and Hambleton (2004). Guidelines on the desired features of a CAT item pool are also recommended and discussed in studies such as Luecht (1998), Patsula and Steffen (1997), Stocking (1994), and Way (1998). Stocking (1994) analyzed five operational item pools for five fixed-length CAT tests for the purpose of estimating the pool size sufficient to meet the content and statistical requirements of the CAT tests.
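For the 3PL model, the Birnbaum (1968) result referred to above places the maximum of an item's information function at θ_max = b + (1/(Da)) ln[(1 + √(1 + 8c))/2], which reduces to θ_max = b when c = 0. The small sketch below, with hypothetical parameter values, evaluates the formula and its inversion, the latter being the sense in which the b-values of a "perfect" pool were chosen:

```python
import math

def theta_max_info(b, a, c, D=1.7):
    """Ability at which a 3PL item yields maximum information
    (Birnbaum, 1968); reduces to theta = b when c = 0."""
    return b + math.log((1 + math.sqrt(1 + 8 * c)) / 2) / (D * a)

def optimal_b(theta, a, c, D=1.7):
    """Invert the relation: the b-value a 'perfect' item should have
    so that its information peaks exactly at the given theta."""
    return theta - math.log((1 + math.sqrt(1 + 8 * c)) / 2) / (D * a)

print(theta_max_info(b=0.0, a=1.0, c=0.2))   # about 0.157
print(optimal_b(theta=0.157, a=1.0, c=0.2))  # about 0.0
```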
Based on that examination, Stocking recommended as a rule of thumb that a pool size of six to eight linear forms should be adequate to support a CAT test whose length is one-half that of the linear test. Additionally, Stocking (1994) recommended that a CAT item pool containing 12 times the length of the CAT test be used for high-stakes tests. However, very little was addressed in the Stocking study about what makes a high-quality item pool from the perspective of the items' psychometric properties. In general, the studies described above are inadequate for depicting a full picture of the desired characteristics of an item pool for CAT. First, there is no universal understanding of the desired characteristics of an item pool. Second, very little attention has been paid to how the uniqueness of a particular CAT algorithm might affect item pool features. For example, "high-quality item pools" may look different for a CAT that aims at measuring examinees' abilities equally well over a certain interval and a CAT that aims at classifying examinees into pass/fail categories by using a cut score. Third, the relationship between the target examinee population and item pool features has barely received any attention. It was not until the late 1990s that item pool assembly and design for CAT developed into independent topics that began to receive wide attention from researchers. Examples of studies on item pool assembly can be seen in Ariel, Veldkamp, and van der Linden (2004), Belov and Armstrong (2009), Stocking and Swanson (1998), van der Linden, Ariel, and Veldkamp (2006), and Way, Steffen, and Anderson (1998). Compared with studies on item pool assembly, fewer studies have been conducted on item pool design; existing studies include Gu (2007), Reckase (2003), Reckase and He (2004; 2009a), and Veldkamp and van der Linden (2000). Item pool assembly and item pool design are two distinct notions. van der Linden, Ariel, and Veldkamp (2006) provided a clear explanation of their differences. According to them, in an item pool design problem, no actual items are available yet. By simulating adaptive test administrations, an optimal item pool blueprint is generated in which the distribution of the numbers of items over the space of all possible combinations of the relevant statistical and non-statistical attributes of the items is described. In other words, designing an item pool is similar to painting on a sheet of white paper, which may require careful planning before setting to work with paint brushes in order to produce a good piece of artwork. In an item pool assembly problem, however, the actual items are already available in a master pool, and what needs to be done is to assemble an item pool from this available master pool according to the desired specifications. This study focuses on item pool design only. Terms such as item pool design, development, and generation are used interchangeably. Since there are no actual items available when designing an item pool, designing an item pool that is optimal intuitively becomes a desired goal. Then a series of questions arises: What constitutes the desired features of an optimal item pool, both statistical and non-statistical? How can we identify the desired features of an optimal item pool? How large an item pool is considered appropriate for a CAT program, and how can we estimate the optimal size?
1.2 STATEMENT OF RESEARCH QUESTIONS

Intuitively, we would expect optimal item pool features for different CAT programs to differ due to various factors in the design of an adaptive test. These factors include the item selection algorithm, constraints on item content, the exposure control procedure, the termination rule, overlap restrictions, and the target examinee population. For example, it is reasonable to speculate that a long adaptive test may require more items in the item pool than a short one, or that, for the same adaptive test, implementation of an item exposure control procedure may require more items in the item pool than no exposure control at all. Despite the expected differences in optimal features, when it comes to optimal item pool design, our ultimate goal is to design an item pool that can accommodate the uniqueness of a specific CAT algorithm and contribute to the realization of the best attainable measurement outcomes no matter what the adaptive algorithm is. This expectation is reflected in several definitions of the optimal CAT item pool, which, from different perspectives, capture the key factors that should be considered in item pool design. For example, van der Linden, Ariel, and Veldkamp (2006) defined an optimal item pool as one

consist(ing) of a maximal number of combinations of items that (a) meet all content specifications for the test and (b) are most informative at a series of ability levels reflecting the shape of the distribution of the ability estimates for a population of examinees (p. 82).

Veldkamp and van der Linden (2000) pointed out that the optimal blueprint, the product of the item pool design effort, specifies what attributes the items in the CAT pool should have and how many items of each type are needed. Reckase (2003) defined an optimal CAT item pool as one that always has an item available for selection that matches the characteristics specified by the item selection rule. For example, the maximum item information selection method for the Rasch model requires an item with a b-value equal to the current proficiency estimate. Reckase further pointed out that the characteristics of the optimal pool depend on the item selection rule, the stopping rule, the examinee population, and the item exposure control procedure. Because an optimal item pool would be prohibitively large if items were available for every possible proficiency estimate, the notion of the optimal item pool was later modified into the range-optimal item pool (ROP; Reckase & He, 2009b). How a ROP is developed will be discussed shortly. To summarize, the definitions described above point to at least three basic elements that need to be considered when designing an optimal item pool. That is, no matter what the adaptive algorithm is, the optimal item pool features should address statistical features, non-statistical features, and item pool size. Statistical features may include item parameters, whereas non-statistical features may include content specifications, key distribution, and cognitive skills, to name a few. The previous literature on optimal CAT item pool design is focused on two major approaches, discussed respectively in Veldkamp and van der Linden (2000) and in Reckase (2003), Reckase and He (2004; 2009a), and Gu (2007). They represent two lines of approaches: mathematical programming and heuristic.
For the former, the CAT is administered by the shadow-test approach (STA; van der Linden & Reese, 1998), realized by 0-1 (binary) linear integer programming in which an objective function is maximized subject to a set of specific constraints. The key point differentiating the STA from other approaches is that, in the STA, items are not selected directly from the pool but from a shadow test, i.e., a full-size test assembled prior to selecting each item in the adaptive test. A detailed description of the STA is given in the second chapter. This line of research argues that an item pool designed by means of the STA should be optimal because adaptive testing with shadow tests can guarantee test administrations that always meet all content specifications while the item selected at each step is optimal for ability estimation. However, as Chang (2007) and Robin et al. (2005) pointed out, one potential limitation of this method is that commercial software such as CPLEX or LINDO has to be counted on to obtain the solution. As a result, the source code may not be accessible to end users, posing difficulty for practitioners in that they have no control over the program's refinement or modification if needed. What is more, the solution may not always be feasible.

The approach developed by Reckase (2003) can be viewed as heuristic in nature. Unlike the binary linear programming method, Reckase's method is straightforward and easy to handle. The basic idea behind Reckase's method in the case of the Rasch model uses a set of "bins", each of which covers a certain width on the proficiency scale and collects items, and a "union" mechanism, which is used to determine the item pool size. To design an item pool, first of all, the item pool is partitioned into smaller ones according to a non-statistical attribute, such as content area. The simulation starts with an examinee who is randomly sampled from the expected population and administered the target CAT test. Each item administered is assumed to be optimal because it satisfies not only the statistical but also the non-statistical constraints. The items administered to this examinee are allocated to the bins, and the same procedure is then repeated for subsequent examinees. Items in each bin are treated as equivalent in use, and the bins are treated as mutually exclusive. Because items selected for one person can be used for another, the ideal item pool is the union of the item sets administered to the individual examinees. Using a large number of examinees from the expected examinee population, it can be anticipated that the number of items that need to be added will diminish as more examinees are sampled and that, ultimately, the pool size will asymptote to a value that satisfies the requirements of all sampled examinees. Thus, the end products of the above procedures include the item pool size, the item number distribution, and the items' psychometric properties. The successful application and extension of this method to operational CAT programs are reported in Reckase and He (2004; 2005; 2009a) and Gu (2007). However, those applications are restricted to a CAT test with only one non-statistical constraint, i.e., the attribute used to partition the item pool. Further research is needed if a CAT is expected to satisfy a complex set of content constraints, a common practice in operational adaptive testing programs. For example, a verbal adaptive test in Stocking and Swanson (1993) had 41 non-statistical constraints to satisfy.
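The asymptote argument is easy to reproduce in simulation. The following sketch is a minimal, hypothetical rendering for a Rasch-based, fixed-length CAT with a single content area: bins 0.4 wide on the θ scale, EAP scoring over a grid, and a union formed by keeping, for each bin, the largest count that any single examinee required. None of these settings are taken from the operational template used in this study; they are placeholders that show the mechanism.

```python
import numpy as np

rng = np.random.default_rng(7)
WIDTH, TEST_LEN, N_EXAMINEES = 0.4, 30, 2000
GRID = np.linspace(-4, 4, 81)
PRIOR = np.exp(-0.5 * GRID ** 2)            # standard normal prior

def rasch_p(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

pool_needs = {}     # bin index -> number of items the pool must hold
growth = []         # running pool size after each simulated examinee
for _ in range(N_EXAMINEES):
    theta_true = rng.normal()
    like, theta_hat, used = PRIOR.copy(), 0.0, {}
    for _ in range(TEST_LEN):
        j = round(theta_hat / WIDTH)        # nearest b-bin
        used[j] = used.get(j, 0) + 1        # one more item drawn from bin j
        b = j * WIDTH                       # bin centre acts as the item's b
        u = rng.random() < rasch_p(theta_true, b)
        like *= rasch_p(GRID, b) if u else 1 - rasch_p(GRID, b)
        theta_hat = float(np.sum(GRID * like) / np.sum(like))   # EAP update
    for j, n in used.items():               # union step across examinees
        pool_needs[j] = max(pool_needs.get(j, 0), n)
    growth.append(sum(pool_needs.values()))

print("pool size after union:", growth[-1])
```

Plotting growth against the examinee index reproduces the leveling-off pattern of pool size described above.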
In adaptive testing, an item has to be sequentially selected that provides maximum information at the updated θ and at the same time meets the non-statistical constraint requirements. A common solution for realizing this goal is to force the item selection algorithm to combine the objective of maximum information with a strategy that imposes the same set of non-statistical constraints on the item selection for each examinee (van der Linden, 2005a). So far, four approaches have been developed to deal with complex content constraints: the shadow test approach, the weighted deviations model (WDM; Stocking & Swanson, 1993), the weighted penalty model (WPM; Shin et al., 2009), and the maximum priority index (MPI; Cheng & Chang, 2009). Among these four methods, the WDM is widely used in operational testing programs (Buyske, 2005); examples include the GRE and ACCUPLACER™. Therefore, this study intends to extend Reckase (2003) and Gu (2007) to design an optimal item pool for a CAT using the WDM item selection approach. Using an operational CAT program as a template, the following research questions are addressed:

Q1. What are the desired features of the optimal item pools for a CAT test using the WDM item selection procedure? The desired features include the optimal pool size, the distribution of the numbers of items, the items' statistical (i.e., psychometric) properties, and the items' non-statistical attribute distributions.

Q2. How do the optimal item pools perform in comparison to the operational item pool (OP) in light of pool size, pool utilization, constraint management, measurement accuracy and precision, and classification accuracy?

1.3 SIGNIFICANCE OF THE STUDY

Any CAT program can be viewed as unique due to the interaction among many factors such as the CAT algorithm and the target examinee population. As a result, the desired pool features may vary across CAT programs. The methodology presented in this study is very easy to implement and extremely helpful in identifying the desired item pool features. The end product of this methodology can be viewed as an item pool specifically tailored to the target CAT program and is therefore expected to ensure high-quality measurement outcomes. Once the desired item pool features are identified, they can serve multiple purposes. First, they can shed light on the best attainable measurement outcomes that a CAT algorithm can achieve. This objective can easily be achieved through simulation in which the CAT in question is administered to the same target examinee sample using the optimal and the operational item pools respectively, and the results are then compared. Second, they can serve as a template for future item pool assembly and at the same time provide meaningful guidance for monitoring item writing and item pool maintenance. The methodology discussed in this study extends Reckase (2003) and Gu (2007) to CAT programs with a complex set of non-statistical attributes, as is common in operational CAT programs. By manipulating several key elements that are expected to affect the application of the proposed methodology, this study also provides detailed guidance on the effective application and adaptation of this method to other operational CAT programs.
CHAPTER II
LITERATURE REVIEW

2.1 INTRODUCTION TO COMPUTERIZED ADAPTIVE TESTING

Adaptive testing provides a feasible solution to the problem that very little information about examinees' abilities can be learned if the items are either far too difficult or far too easy for them. By matching item difficulty to the examinee's ability level, much more information about the examinee's ability can be gained, thereby improving measurement precision and efficiency. In fact, the idea of adaptive testing is not new. Its original use can be dated back to Alfred Binet's (1905) intelligence test, whose administration employed an adaptive strategy. Focusing on diagnosing an individual's rather than a group's intelligence, the Binet-Simon test, with items sorted according to mental group or age group, was administered in such a way that the examinee started the test with the item set deemed appropriate for his or her age group; depending on the examinee's responses, the item sets administered were adjusted until the examinee's appropriate mental group could be identified with sufficient certainty. Testing formats such as Lord's flexilevel test (1971b) and Weiss's stradaptive test (1973) can also be viewed as variants of adaptive testing. The application of CAT to large-scale testing programs in real time can be attributed to two major factors: one is the theoretical foundations laid out by researchers such as Lord (1970, 1971a, 1977, 1980) and Weiss (1976, 1978), including item response theory and item selection strategies borrowed from the bioassay field; the other is the rapid development of computer technology in the 1980s, which enables instantaneous computation.

The potential advantages of CAT have been well addressed in several studies (e.g., Wainer, 2000; Way, 1998). In summary, they include shorter tests, enhanced accuracy and efficiency in trait estimation, immediate score reporting, greater flexibility in test scheduling and management, and easier adoption of innovative item formats. Since the end of the last century, however, test security has become a very thorny and challenging issue in CAT, mainly due to the nature of CAT as a type of continuous testing in which items tend to be used for a certain period of time, leaving many opportunities for item theft. Once operational items become known through venues such as being posted on a website, as tends to occur these days, later examinees are very likely to benefit from item pre-knowledge by receiving inflated scores.

To administer a CAT, at least six components are necessary: 1) a pre-calibrated item pool; 2) a psychometric (e.g., item response theory) model;
3) an item selection rule, i.e., the method used to adaptively select an item for administration; examples include maximum Fisher information and Owen's Bayesian item selection procedure (Owen, 1975); 4) a starting point, i.e., the location where an examinee is assumed to be on the proficiency scale before the test starts; 5) a scoring rule, i.e., the method used to sequentially update the examinee's interim ability estimate; examples include maximum likelihood estimation (MLE; Birnbaum, 1968) and expected a posteriori (EAP; Bock & Mislevy, 1982); and 6) a termination rule, i.e., the criteria used to stop administration of a test; examples include the fixed-length test, in which all examinees are administered a test of the same length, and the variable-length test, in which, in general, the level of measurement precision of the final trait estimate is used to determine whether to stop the test. In addition, item exposure control and content balancing procedures are often two important components that need to be considered by a CAT program. The former is meant to ensure test security, whereas the latter is meant to ensure test validity. The presence of these two components imposes constraints on CAT, resulting in the so-called constrained CAT (CCAT). How these two constraints are realized in item selection will be discussed shortly. CAT works in a very simple and straightforward way. Figure 2.1 presents a flowchart for a typical CAT administration. As can be observed in this figure, CAT administration is an iterative process in which the provisional ability estimate is updated immediately after an item is administered, and this process continues until the test termination rule is satisfied.

Figure 2.1 A flowchart of CAT administration (select the first item for administration; score the item; update θ; evaluate against the termination rule; if the rule is not satisfied, select another item based on the updated θ)
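The loop in Figure 2.1 maps directly onto code. Below is a minimal sketch that assumes a Rasch item pool, maximum-information selection (which, under the Rasch model, amounts to choosing the unused item whose b-value is closest to the current estimate), simulated responses, EAP scoring over a grid, and a fixed-length termination rule; the pool and all settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
pool_b = np.linspace(-3, 3, 200)        # hypothetical Rasch item pool
GRID = np.linspace(-4, 4, 81)
PRIOR = np.exp(-0.5 * GRID ** 2)

def cat_once(theta_true, test_len=20):
    theta_hat, like, seen = 0.0, PRIOR.copy(), set()
    for _ in range(test_len):                 # termination rule
        # item selection rule: unused item with b closest to the estimate
        i = min((k for k in range(len(pool_b)) if k not in seen),
                key=lambda k: abs(pool_b[k] - theta_hat))
        seen.add(i)
        # score the (simulated) response under the Rasch model
        u = rng.random() < 1 / (1 + np.exp(-(theta_true - pool_b[i])))
        # scoring rule: EAP update of the interim ability estimate
        p = 1 / (1 + np.exp(-(GRID - pool_b[i])))
        like *= p if u else 1 - p
        theta_hat = float(np.sum(GRID * like) / np.sum(like))
    return theta_hat

print(cat_once(theta_true=0.8))
```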
2.2 ITEM POOL FEATURES, POPULATION DISTRIBUTIONS, AND OTHER CAT COMPONENTS

The item pool serves as a resource for the creation of a CAT whose goals, according to Parshall, Davey, and Nering (1998), are three-fold: 1) to maximize measurement precision by selecting the item that maximizes the information or the posterior precision at the examinee's current ability level; 2) to ensure that tests measure the same traits for each examinee by administering a content-balanced test; and 3) to protect the security of the item bank by controlling the rates at which items are administered. These three goals are more often than not in conflict with one another. Stocking and Lewis (2000) compared the item selection problem in CAT to an inflated balloon: pushing against one side may address one issue but will cause a bulge on another side of the balloon. With these goals in conflict with one another, optimal item pool design for a CAT should seek to balance them.

For the first goal, maximizing measurement precision, the maximization in practice tends to occur over a targeted range of ability levels. This goal requires item pools to have characteristics that make them appropriate measurement tools for the targeted range of ability levels. First, items should be of high quality in the sense of carrying characteristics that match the item selection rule. For example, when the maximum item information criterion is used for item selection under the three-parameter item response theory model, high-quality items can be defined by their discriminating power. In general, the higher the discriminating power an item has, the more information the item provides. The desired item characteristics may be different if another item selection criterion is used. Second, the item pool characteristics should take the targeted examinee population into account so as to provide maximum measurement precision where the examinees are located. For example, it is expected that a CAT item pool used to select high-ability examinees for gifted programs may not provide a satisfactory level of measurement precision for an exam that is used to place low-ability examinees in remedial courses. For the former, an item pool with most of the items located at the high-ability level might be more appropriate, whereas for the latter, an item pool with most of the items located at the low-ability level might be more appropriate. Dodd, Koch, and De Ayala (1993) indicated that trait estimates in CAT are more accurate when the item pool characteristics and the latent trait distribution of the examinees match each other. In addition to statistical suitability, items should also provide sufficient coverage of the content specifications. An optimal item pool is expected to support assembling a content-balanced CAT for each individual examinee according to the target test specification. Research (e.g., Chang & Ying, 1999; Cheng & Chang, 2009; Way, 1998) has documented that maximizing measurement precision may come at the cost of overexposing certain items. For example, when the three-parameter item response theory model is applied, the CAT algorithm tends to capitalize on the differential discriminating power of the items in the pool, resulting in disproportionate usage of the item pool. To equalize item usage, an item exposure control procedure is often implemented for the purpose of protecting item pool integrity. Some studies (Chang & Ansley, 2003) have documented the trade-off between item exposure control and measurement precision. Therefore, an optimal item pool is expected to ease this tension by protecting test security without compromising measurement precision.

To summarize, an optimally designed item pool for a CAT should soothe the tension among these conflicting goals so that they can be realized in a well-balanced and satisfactory manner. An optimal item pool is expected to provide desirable measurement accuracy and precision, make efficient use of items, ensure balanced content coverage, and protect test security.
2.3 PRACTICAL CONSTRAINTS IN ITEM SELECTION IN CAT

Most CAT programs are constrained in that an item selected for administration is expected not only to maximize statistical information at the current θ but also to satisfy pre-specified non-statistical constraints, typically a set of content specifications defined in terms of the combinations of attributes the items in the test should have. In addition, item exposure issues are important factors that are always attended to in CAT for the sake of test security.

2.3.1 CONTENT-BALANCING TECHNIQUES

In a conventional paper-and-pencil test (PPT), the requirement that each individual test have the same content specification is easily met, since every examinee is administered the same test. For a CAT, however, this requirement has to be realized by forcing the item selection algorithm to combine the objective of maximizing information with a strategy that imposes the same set of content specifications on the items selected for administration (van der Linden, 2005a). The procedure for ensuring the same set of content specifications for each individual CAT is called content balancing. Several approaches have been proposed to ensure content balancing. These approaches include Kingsbury and Zara's (1991) constrained CAT method, the weighted deviations model (WDM) approach (Stocking & Swanson, 1993), the shadow-test approach (STA; van der Linden & Reese, 1998), the modified multinomial model (MMM; Chen & Ankenmann, 2004), the modified CCAT (MCCAT; Leung, Chang, & Hau, 2003b), the two-phase item selection procedure for flexible content balancing (Cheng, Chang, & Yi, 2007), the weighted penalty model (WPM; Shin et al., 2009), and the maximum priority index (MPI) method (Cheng & Chang, 2009). Comparative studies of the performance of some of these methods can be found in studies such as those by Cheng, Chang, and Yi (2007), Cheng and Chang (2009), Leung, Chang, and Hau (2003a), and van der Linden (2005). Among the above methods, CCAT, MCCAT, and MMM can be viewed as methods along the same line, in that the item pool is partitioned into several sub-pools by a key attribute such as content area and items are spirally selected across the different sub-pools to meet the pre-specified objective. This line of methods is limited to the situation in which an item carries only one attribute, i.e., the one used to partition the item pool. Comparatively speaking, the STA, the WDM, the WPM, and the MPI are more flexible in dealing with a large set of item constraints. The STA is a mathematical programming method, whereas the other three are heuristic. Among these four methods, the WDM is widely used in several operational testing programs (Buyske, 2005), and the STA is a method that has been widely researched in the CAT literature. Provided below are descriptions of the WDM and the STA.

2.3.1.1 WEIGHTED DEVIATIONS MODEL

The weighted deviations model (WDM) method, originally developed by Stocking and Swanson (1993) out of concern about possibly poor-quality item pools in large-scale test assembly, is perhaps one of the most popular heuristic methods. The WDM explicitly accounts for non-statistical and statistical item properties, with the desired balance between measurement and construct concerns reflected by the weights selected by the test designers. Unlike in the STA, the content specifications in the WDM are formulated as goals rather than constraints. Deviations from the content targets are weighted and incorporated into the objective function together with the distance of the current item information from the target value. In CAT, the WDM approach sequentially selects the item with the smallest sum of weighted deviations. The WDM heuristic essentially consists of three steps when selecting an item. First, for every item not already in the test, the deviation for each of the constraints is computed as if the item were added to the test. Second, the weighted deviations across all constraints are summed. Finally, the item with the smallest weighted sum of deviations is selected. The formal statement of the WDM is provided below.
Let $N$ denote the number of items in the item pool, $K$ the number of constraints, $w_k$ the weight assigned to constraint $k$, and $L_k$ and $U_k$ the lower and upper bounds for constraint $k$, respectively. Let $d_{L_k}$ and $d_{U_k}$ denote the deficit from the lower bound and the surplus over the upper bound, respectively; let $e_{L_k}$ and $e_{U_k}$ denote the excess over the lower bound and the deficit from the upper bound, respectively; and let $d_\theta$ and $e_\theta$ denote the deficit from and the excess over the target information $I_0$. $g_{ik}$ is 1 if item $i$ has property $k$ and 0 otherwise. $x_i$ is a binary decision variable: it equals 1 if the $i$th item is included in the test and 0 otherwise. The model is formulated in [1]:

Minimize
$$\sum_{k=1}^{K} w_k d_{L_k} + \sum_{k=1}^{K} w_k d_{U_k} + w_\theta d_\theta \qquad [1]$$

subject to
$$\sum_{i=1}^{N} g_{ik} x_i + d_{L_k} - e_{L_k} = L_k, \quad k = 1, \ldots, K, \quad \text{for the lower bounds;}$$
$$\sum_{i=1}^{N} g_{ik} x_i - d_{U_k} + e_{U_k} = U_k, \quad k = 1, \ldots, K, \quad \text{for the upper bounds;}$$
$$\sum_{i=1}^{N} I_i(\theta) x_i + d_\theta - e_\theta = I_0;$$
$$d_{L_k}, d_{U_k}, e_{L_k}, e_{U_k} \geq 0, \quad k = 1, \ldots, K;$$
$$d_\theta, e_\theta \geq 0;$$
$$x_i \in \{0, 1\}, \quad i = 1, \ldots, N.$$

Ideally, when the WDM is used in CAT for sequential item selection, the item selected for administration is as informative as possible at the examinee's estimated ability level while at the same time contributing as much as possible to the satisfaction of all other constraints. The WDM has several advantages. Like any other heuristic approach, the WDM can always ensure a feasible solution in item selection. Test specialists can assign the weights flexibly, based on the priority placed on the content specifications and on measurement precision. The disadvantage of this algorithm is the uncertainty in balancing the content specifications. Studies have documented that the WDM may violate some of the constraints (Cheng & Chang, 2009; Robin et al., 2005; van der Linden, 2005). As an added concern, the WDM usually takes a considerable amount of time to adjust the heuristic, i.e., to find the best weights, for a new problem.
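A minimal sketch of a single WDM selection step is given below. It simplifies the Stocking and Swanson (1993) heuristic in one respect: the projected deficit for a constraint is counted only when the lower bound could no longer be reached even if all remaining slots were devoted to that constraint. The incidence matrix, bounds, weights, and information values in the usage example are hypothetical.

```python
import numpy as np

def wdm_pick(cands, g, counts, L, U, w, infos, cur_info, target, w_info,
             slots_left):
    """One WDM selection step: return the candidate item with the
    smallest weighted sum of projected deviations (simplified)."""
    best, best_dev = None, np.inf
    for i in cands:
        dev = 0.0
        for k in range(len(L)):
            n_k = counts[k] + g[i, k]    # count for k if item i were added
            # deficit: lower bound unreachable even using remaining slots
            dev += w[k] * max(0, L[k] - n_k - (slots_left - 1))
            dev += w[k] * max(0, n_k - U[k])          # surplus over U
        dev += w_info * max(0, target - (cur_info + infos[i]))
        if dev < best_dev:
            best, best_dev = i, dev
    return best

# Hypothetical two-constraint example.
g = np.array([[1, 0], [0, 1], [1, 1]])    # item-by-property incidence
print(wdm_pick(cands=[0, 1, 2], g=g, counts=[0, 0], L=[1, 1], U=[2, 2],
               w=[1.0, 1.0], infos=[0.3, 0.5, 0.4], cur_info=0.0,
               target=6.0, w_info=1.0, slots_left=4))
```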
In general, the way that the STA works falls into a category called constrained sequential optimization which typically includes two types of test specifications: objectives and constraints. In the STA, the statistical information from the test items at the current ability estimate can be viewed as the objective function to be optimized and all other specifications can be treated as constraints subject to which the optimization has to take place (van der Linden, 1998; van der Linden, Ariel, & Veldkamp, 2006; Veldkamp & van der Linden, 2000). One of the STA’s merits is that it can guarantee non-violation of test specification. It is very flexible as well. However, there is a tradeoff between the speed and optimality of its solution. For larger problems, exact solution may not be possible in realistic times (van der Linden, 1998). In addition, implementation of STA has to rely on the commercial software and may pose difficulty to practitioners if they want to modify and refine the source codes (Chang, 2007). Robin et al. (2005) conducted a comparative study on the performance of the WDM and the STA using item pools from three existing CAT programs at Educational Testing Service (ETS). Their results indicated that, in general, the STA does not produce dramatically better results than the WDM. The STA does not violate any content objective whereas the WDM has low rates of violations for some minor constraints. In 22 terms of psychometric quality and resource usage, both the STA and the WDM perform very similarly. Robin and his collaborators (2005) concluded that there was no compelling reason to believe that many of the practical issues that have arisen in the past few years in CAT can be cured simply by switching algorithms. Robin and his collaborators also discussed the concerns over using the STA at the end of their study. 2.3.2 ITEM EXPOSURE CONTROL PROCEDURE The implementation of item exposure control procedures in CAT aims at maintaining test security by constraining the administration of more popular items that would otherwise become compromised due to repeated administration. In CAT, the item selection rule generally seeks an item that can provide the maximum information at the current ability estimate, which, tends to pick certain items too often causing the issue of overexposure. When items become overexposed, examinees may become familiar with these items even before the actual test and have inflated test scores as a result; those overexposed items may become decreasingly less difficult. A detailed summary of the CAT item exposure control procedures developed between 1983 and 2005 is described in Georgiadou, Triantafillou, and Economides generally grouped into five categories: 1) randomization, 2) conditional selection, 3) stratified, 4) combined, and 5) multiple stage adaptive test design procedures. In randomized item selection, the next item to be selected is randomly chosen out of a group of N most optimal items. Procedures such as the 5-4-3-2-1 proposed by McBride and Martin (1983) and the randomesque procedure (Kingsbury & Zara, 1991) belong to the first category. In conditional item selection, the probability that a selected item is 23 administered is conditioned on the frequency with which the item is selected within a particular targeted population. The most fundamental, perhaps also the most commonly- used conditional selection procedure, is the Sympson-Hetter (SH) procedure (Hetter & Sympson, 1997; Sympson & Hetter, 1985). 
Based on the SH, a series of other conditional selection procedures has been developed, such as the Davey and Parshall (DP) procedure (Davey & Parshall, 1995; Parshall et al., 1998), the Stocking and Lewis unconditional multinomial (SL) procedure (Stocking & Lewis, 1995), and the Stocking and Lewis conditional multinomial (SLC) procedure (Stocking & Lewis, 1998). Chang and Ying (1999) proposed the a-stratified procedure and indicated that this procedure can satisfactorily control item exposure by better balancing item use rates. This procedure has been further explored by incorporating additional elements such as the SH item exposure control procedure and content balancing (Chang et al., 2001; Leung et al., 2003). As its name suggests, a combined strategy attempts to combine different methods to develop more robust strategies that can perform better than an individual strategy alone. Examples of combined strategies include the Progressive Restricted strategy (Revuelta & Ponsoda, 1998), Nering, Davey, and Thompson's Hybrid strategy (Nering, Davey, & Thompson, 1998), and content constraints in a-stratified adaptive testing using a shadow-test approach (van der Linden & Chang, 2005b). Several studies (e.g., Chang & Ansley, 2003; Chang & Twu, 1998; Chang et al., 2000; Chang et al., 2003; Davey & Parshall, 1995; French & Thompson, 2003; Revuelta & Ponsoda, 1998) have been conducted to evaluate the performance of different item exposure strategies. The study by Chang and Ansley (2003), which systematically compared the performance of five exposure control algorithms, reported that the SLC procedure best serves the purposes of controlling the observed exposure rates to the desired values as well as producing the lowest test overlap rate. In addition, they reported tradeoffs between item exposure control and measurement precision.

2.4 OPTIMAL ITEM POOL DESIGN METHODS FOR CAT

As discussed in the first chapter, item pool design, as distinct from item pool assembly, aims at generating an optimal blueprint which can guide item pool assembly. So far, two major methods have been proposed to design optimal item pools for CAT. One is the mathematical linear programming approach presented in Veldkamp and van der Linden (2000), and the other is the bin-and-union method originally presented in Reckase (2003) and later extended by Gu (2007).

The approach discussed in Boekkooi-Timminga (1990) can be viewed in some sense as a prototype of applying linear programming to design IRT-based item pools for both linear tests and CAT. A brief description of this approach is provided here because, as will be revealed shortly, the optimal item pool design approaches presented in both Veldkamp and van der Linden (2000) and Reckase (2003) borrow some elements from Boekkooi-Timminga (1990). To determine the characteristics of the desired item pool, Boekkooi-Timminga partitioned the ability continuum into several intervals (also called clusters), assuming the items in the same interval to have equal information functions. Using a sequential approach, Boekkooi-Timminga calculated the numbers of items needed for the test forms by maximizing their information functions based on the Rasch model. Since items were collected in mutually exclusive clusters, the item number distribution could be determined. Boekkooi-Timminga also demonstrated how optimal item features could be used to determine whether an existing item pool can meet test construction requirements.
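The clustering logic is easy to make concrete: under the Rasch model the item information function depends only on the distance between θ and b, so items whose difficulties fall in the same narrow interval have nearly identical information functions and can be counted together. The interval edges below are hypothetical and only illustrate the bookkeeping.

```python
import math

def rasch_info(theta, b):
    """Rasch item information, I(theta) = P(1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def cluster_index(b, edges):
    """Index of the ability interval (cluster) containing difficulty b."""
    for k in range(len(edges) - 1):
        if edges[k] <= b < edges[k + 1]:
            return k
    return None  # outside the partitioned range

# Hypothetical partition of the ability continuum into five clusters.
edges = [-3.0, -1.8, -0.6, 0.6, 1.8, 3.0]
```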
2.4.1 THE BINARY INTEGER PROGRAMMING METHOD

The method used in Veldkamp and van der Linden (2000) to design an item pool for CAT can be viewed as a CAT version of the method described in van der Linden, Veldkamp, and Reese (2000), in that both studies adopted a binary integer programming method to produce an optimal blueprint. In general, the whole process involves four major steps. First, the set of specifications for the CAT is analyzed, and all item attributes are identified and formulated in a series of classification tables. These classification tables are set up for categorical and quantitative attributes respectively. Categorical attributes are partitioned by their Cartesian products. Quantitative attributes, such as item parameters, are partitioned into several clusters; for example, the item difficulty parameter can be divided into intervals like $(-\infty, -2.5), (-2.5, 2), (2, 2.5), (2.5, \infty)$, and the same approach is used for the item discrimination parameter. Consequently, a blueprint having C-by-Q cells is formulated, where C denotes the number of cells from the categorical classification tables and Q the number from the quantitative tables. Second, using this table, an integer programming model to assemble the shadow tests in the CAT simulation is formulated. Third, the CAT is administered to simulees from the target examinee population, with the integer programming model used for the shadow tests. The ability distribution of the simulees can be obtained from historic data. Finally, the numbers of times items in each cell of the classification table are administered are counted and collected. These counts are then adjusted to obtain optimal projections of the item exposure rates, and the adjusted counts constitute the final blueprint. In Veldkamp and van der Linden's study, the three-parameter item response theory (3PL IRT) model was used and $c_i$ was fixed at a common value. The Cartesian product of all categorical and quantitative attributes yielded 96 × 126 = 12,096 cells.

2.4.2 THE BIN-AND-UNION METHOD AND ITS EXTENSION

Reckase's bin-and-union approach, originally documented in Reckase (2003), was proposed out of the motivation to identify the "desired features of an item pool" (p. 1). Unlike the integer programming approach used in Veldkamp and van der Linden (2000) for item pool design, this method is very flexible, straightforward, and easy to handle, with end-products including the optimal item pool size, the distribution of item numbers, and the items' psychometric properties. To implement the bin-and-union method, five major procedures are required for each non-statistical attribute area, for example, content area: 1) specify the CAT procedure; 2) simulate the CAT with examinees from the expected population; 3) determine the ROP for each examinee, which is composed of the items administered to that individual examinee; these items are allocated into different bins based on the items' psychometric properties, items in each bin are considered equivalent in use, and the bins are mutually exclusive; 4) find the union of the ROPs over examinees; and 5) if the union of ROPs is formed sequentially after each examinee is sampled, the number of items will asymptote to the ROP pool size given the use of a large sample of examinees. In the Reckase method, the items are collected and allocated in a set of bins. As in Boekkooi-Timminga (1990), all items in each bin are considered equivalent in use and the bins are mutually exclusive.
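One common reading of the union step is as a union of multisets: for each bin, the pool must hold at least as many items as the most demanding single examinee required, so the running pool size grows and then levels off as new examinees stop adding bins. The sketch below assumes each ROP is represented as a Counter mapping bin label to item count; all names are illustrative.

```python
from collections import Counter

def union_of_rops(rops):
    """Sequentially union ROPs (multisets of bin -> item count).

    rops -- iterable of Counter objects, one per simulated examinee
    Returns the final pool (bin -> items needed) and the running pool
    size after each examinee, which should asymptote for large samples.
    """
    pool = Counter()
    growth = []
    for rop in rops:
        for bin_label, n in rop.items():
            pool[bin_label] = max(pool[bin_label], n)  # multiset union
        growth.append(sum(pool.values()))              # current pool size
    return pool, growth
```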
Figure 2.2 illustrates how the item pool increases in size as the number of sampled examinees increases.

Figure 2.2 Increase of item pool size with the increase of the number of examinees. [Figure: item pool size (y-axis, 0 to 600) plotted against the number of examinees (x-axis, 1,000 to 10,000); the curve rises steeply at first and then levels off.]

The successful application of this method can be found in Reckase and He (2005; 2009), in which this methodology was used to design the optimal item pool for an operational CAT program. The results of those two studies indicated that the item pool features identified by this approach, including item pool size, item number distribution, and statistical properties, can sustain the successful implementation of the operational exam by allowing better measurement precision and maintaining test security. The bin width used to collect items, the exposure control procedure implemented by a particular CAT, and content balancing play a key role in the resulting item pool features.

By using the CAT algorithm implemented by the Computerized Adaptive Testing-Armed Services Vocational Aptitude Battery (CAT-ASVAB) as a template, Gu (2007) successfully developed optimal pools that perform better than the operational item pool. Gu extended the Reckase method by using the 3PL IRT model and incorporating the Sympson-Hetter and a-stratified item selection procedures. Specifically, Gu worked out several key components required to determine the optimal pool size and optimal item features, including the bin map constituted by a certain number of blocks, with the items in each of these blocks treated as equivalent. In addition, Gu developed two methods that can generate items with the required features defined by the CAT algorithm. Gu's results indicated that the optimal item pools perform better than the operational item pool in terms of having a smaller item pool size, better measurement accuracy and test security, and more efficient item use. In comparison, the implementation of the binary linear programming method requires special knowledge and software, whereas the bin-and-union method allows practitioners full control over the whole process. Existing research related to Reckase's bin-and-union method, however, is limited to the situation in which the item pool is partitioned into several sub-pools by one key attribute only. If a CAT has to deal with a more complex set of constraints, the bin-and-union method needs to be further extended, and this is the focus of the current study.

CHAPTER III
METHODOLOGY AND RESEARCH DESIGN

3.1 METHODOLOGY

This section documents several key components required to design an optimal item pool based on the bin-and-union method. They include how to define a bin map, how to generate optimal items, how to simulate non-statistical attributes, how to model the WDM procedure, and how to post-adjust the pool size. Descriptions of some of these components can also be found in Gu (2007).

3.1.1 DEFINING A BIN MAP

IRT provides a powerful item selection method in CAT through the use of the item information function. The item information function depicts the contribution that an item makes to ability estimation at points along the proficiency scale. The concept of a 'bin' is used to collect and tally items in Reckase (2003), where the one-parameter logistic IRT model is used.
A 'bin' takes a certain width on the b-parameter or θ scale, and items collected in the same bin can be used interchangeably since there is only a negligible difference in the item information that they can provide. In the case of the one-parameter logistic IRT model, for example, a small distance between the b-parameter and θ causes only a slight reduction in the maximum item information. Figure 3.1 depicts the percentage of the maximum item information that an item with a b-value at a certain distance from θ can provide, as opposed to that provided when the b-value is well matched to θ, under the Rasch model.

Figure 3.1 Percentage of maximum item information conditional on the distance between the b-value and θ. [Figure: percentage of maximum item information (y-axis) plotted against the distance between item difficulty and ability (x-axis, -2.8 to 2.8).]

However, when other IRT models, for example, the two- or three-parameter models, are used, the story is somewhat different, in that the item information is largely determined by an item's discrimination power (i.e., the magnitude of the a-parameter; the item information is generally higher when a takes a high value), while it is the item's difficulty, i.e., the b-parameter, that decides the location at which the contribution of item information is realized. Birnbaum (1968) demonstrated that an item provides its maximum information at $\theta_{max}$, where

$$\theta_{max} = b_i + \frac{1}{Da_i}\ln\left[.5\left(1 + \sqrt{1 + 8c_i}\right)\right] \qquad [2]$$

If $c_i$ equals 0, i.e., in the cases of the one- and two-parameter IRT models, an item provides its maximum information at the ability level exactly equal to its difficulty. If $c_i$ is greater than 0, i.e., in the case of the three-parameter IRT model, an item provides its maximum information at an ability level slightly higher than its difficulty. In summary, among items that have the same or very similar b-values, those having higher a-values can provide more information, given that the c-value is a constant.

The above discussion suggests that when the idea of a 'bin' is used to design the optimal item pool for a CAT program employing the three-parameter IRT model, the contribution of different a-values conditional on a certain range of b must be considered. In other words, the creation of a bin should take not only the b- but also the a-parameter into consideration. We call a bin of this type an ab-block/ab-bin. Lord (1980) further illustrated that the highest information that a logistic item with $a_i$ and $c_i$ can provide is a quadratic function of the a-parameter, given that the c-parameter is a constant. The relationship between the maximum information that an item can provide and the a-parameter is depicted in the following equation:

$$M_i = \frac{D^2 a_i^2}{8(1-c_i)^2}\left[1 - 20c_i - 8c_i^2 + (1+8c_i)^{3/2}\right] \qquad [3]$$

If we slightly rearrange [3] and add a Δ denoting change, [3] becomes [4], which indicates the change in item information produced by a change in the a-parameter:

$$\Delta M = \frac{D^2\left[1 - 20c_i - 8c_i^2 + (1+8c_i)^{3/2}\right]}{8(1-c_i)^2}\,\Delta a^2 \qquad [4]$$

where D equals 1.7. A c-value of .25 leads to $\Delta M = .447\,\Delta a^2$. Since the a-value is conventionally set starting from zero, the boundaries of the a-parameter conditional on the same range of b can be determined step by step given the expected amount of information change. Figure 3.2 provides an illustrative example of a bin map developed by the methods discussed above.
This bin map is composed of ab-bins/ab-blocks, the basic units for collecting and tallying items when the 3PL IRT model is employed. The blocks in this figure were determined by a .4 change in item information when the b-range was set as .4. As a result, a total of 128 ab-blocks were used to collect and tally items. How ab-bins/ab-blocks and b-bins differ is also visually demonstrated in this figure. For example, the shaded block bounded by the b-range between -3.2 and -2.8 and the a-range between 0 and .89443 represents an ab-block in which items with b- and a-parameters falling into these two ranges are considered to be equivalent in use by providing similar item information. Another view of the ab-blocks is presented in Table 3.1. A b-bin can be viewed as a collection of multiple ab-blocks. As in the case of the Rasch model, in which a series of b-bins is used to estimate the items needed in a test, the set-theoretic "union" mechanism is still used to determine the number of items in each ab-block and the item pool size.

Figure 3.2 An illustrative example of ab-bins/ab-blocks. [Figure: grid of ab-blocks with b-boundaries at -3.2, -2.8, ..., 3.2 on the horizontal axis and a-boundaries at 0, 0.89443, 1.2649, 1.5492, 1.7889, 2, 2.1909, 2.3664, and infinity on the vertical axis. Note. Inf = infinite.]

Table 3.1 Another view of ab-bins/ab-blocks

ID    b(lb)   b(ub)   a(lb)     a(ub)
1     -∞      -3.2    0         0.89443
2     -∞      -3.2    0.89443   1.2649
3     -∞      -3.2    1.2649    1.5492
4     -∞      -3.2    1.5492    1.7889
5     -∞      -3.2    1.7889    2
6     -∞      -3.2    2         2.1909
7     -∞      -3.2    2.1909    2.3664
8     -∞      -3.2    2.3664    ∞
9     -3.2    -2.8    0         0.89443
10    -3.2    -2.8    0.89443   1.2649
11    -3.2    -2.8    1.2649    1.5492
12    -3.2    -2.8    1.5492    1.7889
13    -3.2    -2.8    1.7889    2
14    -3.2    -2.8    2         2.1909
15    -3.2    -2.8    2.1909    2.3664
16    -3.2    -2.8    2.3664    ∞

Note. lb = lower bound; ub = upper bound.

3.1.2 GENERATING OPTIMAL ITEMS

The quality of the items in an item pool is an important determinant of the success of a CAT program. Three approaches based on Gu (2007) and McBride and Weiss (1976) were used to generate optimal item features for this study. The first is referred to as the random procedure (R), the second is the mixed random and prediction procedure (MRP), and the third is the minimum test information procedure (MTI). In order for the optimal item features to be useful and realistic for the target operational CAT program, the historic data were analyzed for the information necessary for the simulation. The historic data included the distributions of the operational item parameters, the relationships among the item parameters, the test reliability, and the examinees' ability estimates. According to the analysis of all 314 operational items, the a- and b-parameters of the operational items were not statistically correlated. The a-parameter was normally distributed with a mean of .97938 and a standard deviation of .394903. This normal distribution was confirmed by the Kolmogorov-Smirnov (KS) test, which retained the null hypothesis with p = .2. For the c-parameter, the results indicated that a beta distribution, i.e., c ~ beta(2.734, 15.839), described its distribution better than others. Therefore, a beta distribution was used to generate the c-parameter.
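Returning to the bin map in Figure 3.2 and Table 3.1: since [4] makes ΔM proportional to Δ(a²), successive a-boundaries can be generated by stepping a² in equal increments of ΔM divided by the quadratic coefficient. The sketch below treats the reference c-value as an input, because the exact value used to set the published boundaries is not stated; as the closing comment notes, a coefficient of .5 reproduces the cutpoints in Table 3.1.

```python
import math

def a_boundaries(delta_m, c, n_cuts, D=1.7):
    """Successive a-parameter cutpoints implied by equation [4].

    delta_m -- expected amount of item information change per bin
    c       -- reference c-value at which the coefficient is evaluated
    n_cuts  -- number of boundaries to generate
    """
    k = D**2 * (1 - 20*c - 8*c**2 + (1 + 8*c)**1.5) / (8 * (1 - c)**2)
    # From [4], delta_M = k * delta(a^2): each bin widens a^2 by delta_m / k.
    return [math.sqrt(j * delta_m / k) for j in range(1, n_cuts + 1)]

# c = .25 gives k ~ .447, as in the text; a coefficient of k = .5 with
# delta_m = .4 reproduces the Table 3.1 cutpoints exactly:
# 0.89443, 1.26491, 1.54919, 1.78885, 2.00000, 2.19089, 2.36643
```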
3.1.2.1 RANDOM PROCEDURE (R)

The R procedure can be applied to the situation in which the a- and b-parameters are statistically independent. To generate an optimal item using the R procedure, the following steps were followed. For a specific item, 1) Generated both $a_i$ and $c_i$ from their respective target distributions. 2) Given that both $a_i$ and $c_i$ were already known, calculated $b_i$ using [2], where $\theta_{max}$ was the current ability estimate.

3.1.2.2 MIXED RANDOM AND PREDICTION PROCEDURE (MRP)

As its name suggests, the MRP is a mixed method. The procedure for the random part followed 3.1.2.1, whereas the prediction part primarily followed McBride and Weiss (1976), in which a "perfect" item pool was simulated with optimal item parameters identified by regressing the a- on the b-parameters. To generate the regression equation, all operational items were first divided into three groups based on the magnitude of their b-values. The correlation between the a- and b-parameters for each group was calculated with SPSS, and only the low b-parameter group (i.e., b-values lower than -1.2103) had a statistically significant correlation between the a- and b-parameters. Therefore, a simple regression was run for this particular group, predicting a from b. The regression equation can be written as

$$a_i = 1.414 + .294\,b_i + e_i$$

where $e_i$ is a random component following a normal distribution $N(0, \sigma_e^2)$ and $\sigma_e$ is calculated by $\sigma_e = s_a\sqrt{1 - r_{ab}^2} = .400133$, following McBride and Weiss (1976). To generate an optimal item using the MRP, the following steps were followed. For a specific item, if its $b_i$ value, which can be approximated by the current ability estimate at each step of test administration, was above -1.2103, then the R procedure described in 3.1.2.1 was used to generate the optimal item features. Otherwise, the following procedure was adopted: 1) Generated $c_i$ from the target distribution. 2) Generated $a_i$ with $a_i = 1.414 + .294\,b_i + e_i$, where $b_i$ can be approximated by the ability estimate obtained at each item selection step and $e_i$ was drawn from $N(0, .400133^2)$. 3) Recalculated $b_i$ with [2].

3.1.2.3 MINIMUM TEST INFORMATION PROCEDURE (MTI)

To implement the minimum test information approach, the first step was to specify the target test information. Based on the historic information on the test and the distribution of ability estimates, the target minimum test information can be specified via the following two equations:

$$S_e = S_0\sqrt{1 - r_{xx'}} \qquad [5]$$

$$I_{\hat\theta} = \frac{1}{S_e^2} \qquad [6]$$

where $S_0$ represents the standard deviation of ability estimates, $S_e$ represents the standard error of estimate, $r_{xx'}$ represents the test reliability, and $I_{\hat\theta}$ represents the test information. Once $I_{\hat\theta}$ was known, the expected information that each item should provide could be obtained by dividing $I_{\hat\theta}$ by the test length. Because the actual information that an item provides conditional on the current ability estimate may not be exactly as expected, the target item information needed to be updated once an item was administered. The following formula was used to update the target item information:

$$I_i = \frac{T_{target} - T_{admin}}{L_{target} - L_{admin}} \qquad [7]$$

where T represents test information and L represents test length. According to the historic data, the reliability of the operational CAT program is .91, and examinee ability is distributed with a mean of -.3813 and a standard deviation of .9768. Therefore, the target minimum test information is approximately 12.8.
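Equations [5]-[7] reduce to a few lines of arithmetic, sketched below with hypothetical function names. One caveat: with S0 = .9768 the result is about 11.6, whereas the stated 12.8 is reproduced when the target-population standard deviation of .9318 reported in Section 3.2.1 is used as S0.

```python
import math

def target_test_information(s0, reliability):
    """Target minimum test information via [5] and [6]."""
    se = s0 * math.sqrt(1 - reliability)  # [5] standard error of estimate
    return 1.0 / se**2                    # [6] information = 1 / SE^2

def next_item_target(t_target, t_admin, l_target, l_admin):
    """Updated per-item information target via [7]."""
    return (t_target - t_admin) / (l_target - l_admin)

print(target_test_information(0.9318, 0.91))       # ~12.8
print(target_test_information(0.9318, 0.91) / 20)  # ~0.64 per item at the start
```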
Note that in this study the target test information was set differently for different examinees. For examinees with true abilities between -1.6245 and 1.088235, the target test information was set as 12.8; for examinees with true abilities between 1.088235 and 2.5 or between -1.6245 and -2.5, the target test information was set as 9.8; and for the rest of the examinees, the target test information was set as 6.8. The two numbers -1.6245 and 1.088235, along with -.05397, were the three cut scores used in this study to place examinees into different proficiency levels. To generate an optimal item using the MTI approach, the following steps were followed. For a specific item, 1) Generated $c_i$ from the target distribution. 2) Calculated $a_i^2$ from [8], which was derived by rearranging [3]; $M_i$ can be replaced by $I_i$ from [7]:

$$a_i^2 = \frac{8(1-c_i)^2 M_i}{D^2\left[1 - 20c_i - 8c_i^2 + (1+8c_i)^{3/2}\right]} \qquad [8]$$

3) Given that both $a_i$ and $c_i$ were already known, calculated $b_i$ using [2], where $\theta_{max}$ was the current ability estimate.

3.1.3 MODELING THE WDM PROCEDURE

3.1.3.1 SIMULATING ITEM ATTRIBUTES

For the operational CAT mimicked in this study, Table 3.2 presents the relevant information on the types of constraints, weights, and minimum and maximum bounds of the properties that were expected to be modeled.

Table 3.2 Information on exam constraints and weights

Category            Constraint            Code   Weight   Minimum   Maximum
Item Format         Sentence Correction   C1     10       10        10
                    Construction Shift    C2     10       10        10
Errors              Comma                 C3     10       6         8
                    Coordination          C4     10       6         8
                    Sentence Logic        C5     10       6         8
Content Area        Arts                  C6     5        2         5
                    Practical Affairs     C7     5        5         8
                    Social Science        C8     5        2         5
                    Science               C9     5        2         5
                    Human Sources         C10    5        2         5
Content Diversity   Male Reference        C11    10       0         1
                    Female Reference      C12    10       0         1
                    White                 C13    10       0         1
                    Non-white             C14    10       0         1
Keys                                      C15    1        3         7
                                          C16    1        3         7
                                          C17    1        3         7
                                          C18    1        3         7

As Table 3.2 indicates, five broad categories are used to describe each individual item, and under each broad category several sub-categories are subsumed. For example, there are two sub-categories under Item Format and three sub-categories under Errors. A close examination was undertaken of the descriptions of these item attributes and the distributions of the operational items' attributes. In all, the examination revealed the following: 1) all items possess only one property under the categories Item Format, Errors, and Keys; 2) for Content Area, 271 items come from only one content area while the remaining 43 items span two content areas (these 43 items span content areas C6 and C10); 3) for Content Diversity, 138 items possess none of the properties under this category, 107 items possess only one property, and 69 items possess two properties. Note that an item can possess at most two properties under this category, since C11 and C12 are exclusive of each other, as are C13 and C14. Based on this table, along with the information on the distributions of the operational items' attributes, the following method was used to identify the attributes of each individual optimal item. For an item, a zero row vector of size 1-by-18 was generated, with each cell indicating a specific property. An item was assumed to possess only one property under all categories except Content Diversity. For the category Item Format, one of the two numbers 1 and 2 was generated; if 1 was selected, then a 1 was marked under C1, indicating that the item possessed this attribute. Next, for Errors, a number among 3, 4, and 5 was randomly drawn; if, for example, 3 was selected, a 1 was marked under C3. For Content Area, the item attribute was simulated by referring to the distribution from the real data.
Specifically, a number between 1 and 100 was randomly selected and divided by 100. Based on which of five ranges this number fell into, the content area from C6 to C10 was marked correspondingly. The five ranges were 1) less than .213376, 2) between .213376 and .509554, 3) between .509554 and .665605, 4) between .665605 and .818471, and 5) greater than .818471. The procedure used to identify the category under Content Diversity was analogous to what was done for Content Area in that the distribution from the real data was followed. A number was first generated from the uniform distribution between 0 and 1. If this number was less than .43949, then the item possessed none of the properties under Content Diversity. If this number was between .43949 and .780255, then a number between 11 and 14 was randomly selected and a 1 was marked in the corresponding cell of the row vector. If this number was greater than .780255, then the item was allowed to possess two properties: one from either C11 or C12 and the other from either C13 or C14. For Keys, a number between 15 and 18 was randomly drawn and a 1 was marked in the cell corresponding to the selected number.

3.1.3.2 MODELING THE WDM ITEM SELECTION PROCEDURE

Recall that when the WDM is used for item selection, the item that has the smallest sum of deviations is selected for administration. The deviation sum consists of two components: one comes from the weighted deviation from the target item information, and the other comes from the weighted sum of deviations from the lower and upper bounds. Since the item parameters generated by the three different methods are expected to be optimal, minimizing the total weighted sum of deviations is equivalent to minimizing the weighted sum of deviations from the lower and upper bounds. Based on this logic, the WDM item selection procedure was modeled through the following steps:

1) Generated an item for administration with the R, MRP, or MTI procedure.
2) For this item, generated 322 different combinations of item attributes based on the description in 3.1.3.1. The number 322 was used because it equals the number of all possible combinations of the attributes simulated in this study. Recall that the WDM works in such a way that, for every item not already in the test, the weighted sum of deviations has to be computed as if the item were added to the test. Therefore, from the second item forward, the item to be included in the exam should automatically take into account the attribute(s) of the previous item(s) by not possessing the attributes that have already been satisfied by the items administered. That is to say, each of the 322 combinations of item attributes generated for each individual item had to satisfy this requirement.
3) Computed the weighted sum based on [1]. Since 322 different combinations of item attributes were generated for each item, each item was expected to have 322 weighted sums. The combination with the smallest weighted sum of deviations identified the optimal item for administration. The lower bounds, upper bounds, and weights followed those used in the operational exam.
4) Calculated the probability of a correct response for this optimal item using the target IRT model. The probability was compared with a random number generated from the uniform distribution between 0 and 1. If the probability of a correct response was greater than the randomly generated number, the item was scored 1 for a correct response, and 0 otherwise.
5) Updated the estimate of θ.
6) Repeated the above steps until the target test length was reached.
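The attribute draw in 3.1.3.1, which feeds Step 2 above, is summarized in the sketch below; it returns one 1-by-18 indicator vector using the empirical ranges just listed, and the function name is illustrative.

```python
import random

def simulate_attributes():
    """Draw a 1-by-18 attribute vector for one optimal item (per 3.1.3.1)."""
    v = [0] * 18
    v[random.choice([0, 1])] = 1            # Item Format: C1 or C2
    v[random.choice([2, 3, 4])] = 1         # Errors: C3, C4, or C5
    u = random.random()                     # Content Area: empirical ranges
    cuts = [.213376, .509554, .665605, .818471, 1.0]
    v[5 + next(i for i, cut in enumerate(cuts) if u < cut)] = 1  # C6..C10
    u = random.random()                     # Content Diversity
    if u > .780255:                         # two properties
        v[random.choice([10, 11])] = 1      # C11 or C12
        v[random.choice([12, 13])] = 1      # C13 or C14
    elif u >= .43949:                       # exactly one property
        v[random.choice([10, 11, 12, 13])] = 1
    v[random.choice([14, 15, 16, 17])] = 1  # Keys: C15..C18
    return v
```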
3.1.4 POST-ADJUSTING ITEM POOL SIZE

Recall that each ab-block is mutually exclusive when used to collect and tally items. However, when the CAT algorithm searches for an item to administer, the search takes place in the whole item pool. Furthermore, an item provides its maximum information at an ability level slightly higher than its difficulty in the case of the three-parameter IRT model, as Equation [2] indicates. This implies that an item collected in an individual ab-block may not necessarily provide the most information within the range of abilities equivalent to the b-range covered by that ab-block, as expected. For example, Gu (2007) graphically showed that an item from an ab-block A with an a-parameter between 1.26 and 1.55 and a b-parameter between -.84 and -.56 can provide more information at θ between -1.12 and -.84 than items from an ab-block B with an a-parameter between 0 and .89 and a b-parameter between -1.12 and -.84. As a result, an item from A rather than B may be more likely to be selected when the interim ability estimate is between -1.12 and -.84. With regard to item pool size, this implies that the item pool size identified by summing the numbers in each ab-block may carry more items than needed. Consequently, an adjustment procedure is needed to trim these redundant items so that items can be used more effectively. To achieve this goal, the following procedure was adopted. The first step involved determining the number and the location of the conditional θ points used to calculate maximum item information. In general, this number can be set as the number of b-bins used to collect and tally items. As to location, the conditional θ points can be any point within each b-bin. In this study, the midpoints of the b-bins were used to downsize the item pool; in total, 14 conditional θ points were used (θ = -2.6, -2.2, ..., 2.6). In the second step, the maximum information of every item at each conditional θ point was calculated, and the items were then rank-ordered in descending order at each conditional θ point. For each conditional θ point, the number of items needed was equivalent to the number identified in the simulation for the corresponding b-bin. For example, if the simulation indicated that 11 items were needed for all ab-bins conditional on b-values ranging from -2.8 to -2.4, then the 11 items that provided the highest information at θ = -2.6 were selected. In this way, the items that provided the highest information at each conditional θ point were identified. Finally, all items identified in the second step were collected together, and the unique items were kept as the optimal items needed for the exam. Thus, the optimal item pool size was equivalent to the total number of these unique items.
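The post-adjustment can be written as a rank-and-union over the conditional θ points. The sketch below computes 3PL item information at each midpoint, keeps the n most informative items there, and returns the unique survivors; the array names and vectorized layout are assumptions.

```python
import numpy as np

def trim_pool(a, b, c, theta_points, needed, D=1.7):
    """Post-adjust pool size (Section 3.1.4 sketch).

    a, b, c      -- parameter arrays for the untrimmed pool
    theta_points -- conditional theta points (b-bin midpoints)
    needed       -- items required at each point, from the simulation
    """
    keep = set()
    for theta, n in zip(theta_points, needed):
        p = c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))          # 3PL, per [9]
        info = D**2 * a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2   # item information
        keep.update(np.argsort(info)[::-1][:n].tolist())  # n most informative items
    return sorted(keep)  # the unique optimal items define the trimmed pool
```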
3.2 RESEARCH DESIGN

This section describes the CAT model, the simulation design, the research procedure, and the series of criteria used to evaluate the performance of the candidate optimal item pools.

3.2.1 CAT MODEL

An operational large-scale CAT program served as the template. The IRT response model was the three-parameter logistic model. All items, which were stand-alone independent items, were selected for administration based on the WDM. The initial ability was set as 0 for each individual simulee. To obtain the current ability estimate before both correct and incorrect responses were available, the expected a posteriori (EAP) method was used with a N(0,1) prior. Once both correct and incorrect responses were obtained, the maximum likelihood estimation method was used. The test length was set as 20. No item exposure control procedure was implemented. The target examinee population followed a normal distribution with a mean of -.3813 and a standard deviation of .9318. The 3PL IRT model is given below:

$$P_i(\theta) = c_i + (1-c_i)\,\frac{\exp[Da_i(\theta-b_i)]}{1+\exp[Da_i(\theta-b_i)]} \qquad [9]$$

where D = 1.7, $P_i$ indicates the probability of responding to an item correctly for an examinee with latent ability θ, $a_i$ indicates the item discrimination parameter, $b_i$ indicates the item difficulty parameter, and $c_i$ indicates the item guessing parameter.

One thing that needs to be noted is that, to be able to use the WDM, information such as the weights, the lower and upper bounds, the target item information, and the item information weight has to be known beforehand. The simulation in this study employed the weight and the lower and upper bounds for each constraint as described in Table 3.2. The target item information was set as 83.25 by reference to research conducted by Cheng and Chang (2009). To determine the weight of item information, a series of item information weights including .1, .5, 1, 5, and 10 was tested using the operational item pool. The results suggested that an item information weight of 1 yielded the most stable results. As a result, 83.25 and 1 were used consistently throughout the entire study as the target item information and the item information weight respectively.

3.2.2 SIMULATION DESIGN

Three factors were manipulated in this study: item generation method, expected amount of item information change (i.e., ΔM), and b-bin width. The simulation design, described in Table 3.3, involved 24 (6 × 2 × 2) conditions. In other words, 24 candidate ROPs were developed.

Table 3.3 Simulation design

Condition   Item pool   Item generation method   b-bin width   Expected amount of item information change
1           ROP_1       R1                       0.4           0.4
2           ROP_2       R1                       0.4           0.2
3           ROP_3       R2                       0.4           0.4
4           ROP_4       R2                       0.4           0.2
5           ROP_5       MRP1                     0.4           0.4
6           ROP_6       MRP1                     0.4           0.2
7           ROP_7       MRP2                     0.4           0.4
8           ROP_8       MRP2                     0.4           0.2
9           ROP_9       MTI1                     0.4           0.4
10          ROP_10      MTI1                     0.4           0.2
11          ROP_11      MTI2                     0.4           0.4
12          ROP_12      MTI2                     0.4           0.2
13          ROP_13      R1                       0.8           0.4
14          ROP_14      R1                       0.8           0.2
15          ROP_15      R2                       0.8           0.4
16          ROP_16      R2                       0.8           0.2
17          ROP_17      MRP1                     0.8           0.4
18          ROP_18      MRP1                     0.8           0.2
19          ROP_19      MRP2                     0.8           0.4
20          ROP_20      MRP2                     0.8           0.2
21          ROP_21      MTI1                     0.8           0.4
22          ROP_22      MTI1                     0.8           0.2
23          ROP_23      MTI2                     0.8           0.4
24          ROP_24      MTI2                     0.8           0.2

The only difference between the methods marked R1 and R2 was in how an optimal item was determined. Specifically, for an item generated with R1, its $a_i$ and $c_i$ were generated first; once $a_i$ and $c_i$ were known, its $b_i$ parameter was calculated by [2]. For an item generated with R2, however, 20 a-parameters and 20 c-parameters were first generated from their respective target distributions; all possible pairings of these 20 a-values and 20 c-values were then taken, resulting in 400 combinations. With [2], the corresponding $b_i$ was calculated for each combination, and one item was randomly selected for calculating the weighted sum of deviations. The motivation behind R2 is that, in theory, more than one item can provide maximum information at a given $\theta_{max}$. The same distinction holds for MRP1 and MRP2.
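The R and MTI generation steps both end by inverting [2] to place an item's difficulty so that its information peaks at the current ability estimate. The sketch below shows one hypothetical rendering of each; the fitted distributions are those reported in 3.1.2, and negative a-draws, which the normal distribution permits, would simply be redrawn in practice.

```python
import math
import random

D = 1.7

def b_from_theta_max(theta_max, a, c):
    """Invert Birnbaum's [2]: difficulty whose information peaks at theta_max."""
    return theta_max - math.log(0.5 * (1 + math.sqrt(1 + 8 * c))) / (D * a)

def generate_item_R(theta_hat):
    """R procedure: draw a and c from their target distributions, solve for b."""
    a = random.gauss(0.97938, 0.394903)    # fitted normal for the a-parameter
    c = random.betavariate(2.734, 15.839)  # fitted beta for the c-parameter
    return a, b_from_theta_max(theta_hat, a, c), c

def generate_item_MTI(theta_hat, target_info):
    """MTI procedure: draw c, back a out of [8], then solve for b."""
    c = random.betavariate(2.734, 15.839)
    a2 = 8 * (1 - c)**2 * target_info / (
        D**2 * (1 - 20 * c - 8 * c**2 + (1 + 8 * c)**1.5))
    a = math.sqrt(a2)
    return a, b_from_theta_max(theta_hat, a, c), c
```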
For MTI1, the target test information was set differently for examinees with different true abilities, in light of the fact that the operational CAT program is used for placement purposes. The additional feature of MTI2 was that the target test information for the first ten items for all examinees was expected to be only 1/3 of the target test information set for the whole test, with the remainder expected to be reached by the last 10 items. As a result, the target item information was different for the first and last 10 items.

3.2.3 RESEARCH PROCEDURE

To develop each candidate ROP, the following procedures were carried out.

Step I. Identified the item pool size, the distribution of item numbers, and the item attributes by using 10,000 examinees drawn from the target examinee population.

Step II. Generated item pools based on the results from Step I. Their sizes were then adjusted, and the resulting pools became the ROPs. Item attributes for each item were determined by randomly sampling from the set of item attributes collected within each ab-bin in the trimmed item pool.

Step III. Evaluated the performance of the candidate ROPs against a series of criteria, assuming the items contain no estimation error. Evaluation was conducted using 3,000 examinees at each θ point from -3 to 3 in increments of .5 and a random sample of 20,000 examinees from the target examinee population. Note that the same 20,000 examinees were used consistently across all ROPs so that the results would be comparable.

3.2.4 EVALUATION CRITERIA

The performance of each candidate ROP was compared with that of the operational item pool against the series of criteria listed below. In all comparisons, the evaluation results from the operational item pool served as the baseline. Since both conditional samples and a random sample were used in the simulation, two broad types of indices, conditional and overall, were used, with the conditional indices including bias, the standard error (SE) defined in [10], the mean square error (MSE), and constraint violation (CV). The procedures used to calculate the conditional bias and MSE are very similar to [11] and [12] respectively.

$$SE(\hat\theta) = \sqrt{\frac{1}{N-1}\sum_{j=1}^{N}\left(\hat\theta_j - \bar{\hat\theta}\right)^2}, \qquad \bar{\hat\theta} = \frac{1}{N}\sum_{j=1}^{N}\hat\theta_j \qquad [10]$$

The overall indices include the following:
Precision of proficiency estimation: bias; mean square error (MSE); correlation coefficient between estimated and true abilities.
Item pool utilization: number of underexposed items; skewness of the item exposure rate distribution, χ² (Chang & Ying, 1999).
Test security: number of overexposed items; item overlap rate.
Constraint violation (CV).
Classification accuracy rate.
Pool size.

What follows are the definitions of some of the evaluation criteria.

Bias

Bias is defined as:

$$Bias = \frac{1}{N}\sum_{i=1}^{N}\left(\hat\theta_i - \theta_i\right) \qquad [11]$$

where $\hat\theta_i$ and $\theta_i$ are the estimated and true ability of the ith examinee.

MSE

MSE is defined as:

$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(\hat\theta_i - \theta_i\right)^2 \qquad [12]$$

where $\hat\theta_i$ and $\theta_i$ are the estimated and true ability of the ith examinee.

Number of overexposed and underexposed items

The exposure rate for an item is calculated by dividing the total number of times the item is administered by the total number of examinees. A commonly used cutoff value for evaluating whether an item is overexposed is 0.2 (e.g., see Eignor, Stocking, Way, & Steffen, 1993; Hau & Chang, 2001). An item with an exposure rate lower than 0.02 is considered underexposed.
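The exposure-rate bookkeeping behind these two counts is a one-liner per item, as in the sketch below; the 0.2 and 0.02 cutoffs are the ones just cited, and the function name is illustrative.

```python
from collections import Counter

def exposure_summary(administrations, n_examinees, over=0.20, under=0.02):
    """Exposure rates plus counts of overexposed and underexposed items.

    administrations -- iterable of item ids, one entry per administration
    n_examinees     -- number of simulated examinees
    """
    counts = Counter(administrations)
    rates = {item: n / n_examinees for item, n in counts.items()}
    n_over = sum(r > over for r in rates.values())
    n_under = sum(r < under for r in rates.values())
    return rates, n_over, n_under
```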
Constraint violation

Constraint violation is captured by 1) the total number of tests with constraint violations and 2) the number of different levels of constraint violation.

Skewness of the item exposure rate distribution, χ²

χ² can be calculated by the following equation:

$$\chi^2 = \sum_{i=1}^{n}\frac{(r_i - L/n)^2}{L/n} \qquad [13]$$

where $r_i$ is the observed exposure rate for the ith item, L is the test length, and n is the total number of items in the pool. According to Chang and Ying (1999) and personal communication with Hua-Hua Chang (January 15, 2010), this χ² index measures the departure of the items' actual exposure from uniform item exposure and thus quantifies the efficiency of item pool usage. A useful generalization of this index is the F ratio of the χ² values from two methods, i.e., $F_{method1,method2} = \chi^2_{method1} / \chi^2_{method2}$, which can be used to compare the exposure rates of the two methods. If $F_{method1,method2} < 1$, then method 1 is regarded as superior to method 2 in terms of producing a better overall balance of exposure rates. This index is used only as a descriptive statistic.

Item overlap rate

The item overlap rate (sometimes called the test overlap rate) is defined as the percentage of common items shared by two randomly selected examinees. The following equation describes how to calculate the average item overlap rate:

$$R = \frac{T}{C_2^N} \qquad [14]$$

where T is the total number of items shared by all pairs of the N examinees in the test, $C_2^N$ gives the number of pairs among the N examinees, and $\sum_{i=1}^{N} L_i$ is the total number of items administered to the N examinees.

Classification accuracy

Classification accuracy is defined as the percentage of examinees who are correctly classified into the different proficiency levels described in the operational testing program's technical manual. To determine the cut scores, an equation depicting the relationship between scaled scores and θ estimates was derived. Based on this equation, three cut-offs on the θ scale were found for the equivalent scaled scores. They were -1.6245 (53)¹, -.05397 (86), and 1.088235 (110). As a result, the examinees were classified into four different proficiency levels: 1) Not Proficient (NP), 2) Partially Proficient (PP), 3) Proficient (P), and 4) Advanced (A). ¹The number inside the parentheses indicates the equivalent scaled score.

CHAPTER IV
RESULTS

This section consists of three major parts. The first part presents the characteristics of all 24 candidate ROPs, including their sizes and their statistical and non-statistical attributes. The second part summarizes the performance of all 24 candidate ROPs and compares their performance with that of the operational item pool against the criteria described above. The last part presents an illustrative example in which the identified pool features were used as a template to guide operational item pool assembly; the performance of this new item pool was evaluated and compared with that of the OP.

4.1 CHARACTERISTICS OF CANDIDATE ROPS

Table 4.1 presents descriptive statistics for all 24 candidate ROPs. Clearly, all candidate ROPs contained fewer items than the OP, and the magnitude of the difference varied by item generation method (i.e., R, MRP, and MTI) and expected amount of change in item information (i.e., .2 and .4). In general, the R method tended to produce slightly larger item pools than the other two methods, other things being equal, followed by the MRP and the MTI methods respectively.
For example, using a .4 b-bin width and a .4 expected amount of item information change, the R1, MRP1, and MTI1 item generation methods respectively produced 192, 189, and 168 items in ROP_1, ROP_5, and ROP_9. Using a .8 b-bin width and a .2 expected amount of item information change, the R1, MRP1, and MTI1 item generation methods respectively produced 240, 229, and 209 items in ROP_14, ROP_18, and ROP_22. In addition, the item pools produced with the .2 expected amount of item information change contained roughly one third more items than those produced with the .4 expected amount of change, other things being equal. This result was anticipated, as a .2 expected amount of item information change implies using more ab-blocks to collect and tally items than a .4 expected amount of change. All item pools produced with the .2 expected amount of item information change contained more than 200 items, whereas those produced with the .4 expected amount of change contained fewer than 200 items.

With regard to the item discrimination parameter, the average a-values in all candidate ROPs, almost all of them above 1.2, were higher than that in the OP (i.e., .979). Specifically, both the R and MRP item generation methods tended to produce item pools with higher average a-values than the MTI method, regardless of the expected amount of change in item information and the b-bin width. For example, the average a-values for pools ROP_1 to ROP_8 were around 1.6, whereas the average a-values for pools ROP_9 to ROP_12 dropped to a range between 1.3 and 1.45. Likewise, the average a-values for pools ROP_13 to ROP_20 were around 1.45, whereas the average a-values for pools ROP_21 to ROP_24 dropped to a range between 1.2 and 1.32. In addition, the item pools developed with the .8 b-bin width (i.e., ROP_13 to ROP_24) tended to have smaller average a-values than those developed with the .4 b-bin width (i.e., ROP_1 to ROP_12). Meanwhile, the minimum a-values in the item pools using the .4 b-bin width (i.e., ROP_1 to ROP_12) tended to exceed .9, at least .1 higher than those in the remaining item pools (i.e., ROP_13 to ROP_24) developed using the .8 b-bin width. With regard to the item difficulty parameter, the average b-values in all candidate ROPs were much higher than not only that of the OP but also the average ability (i.e.,
.emmw..---w.e.m.~..-.me.m._..--.ewmd.“ ........ ewes-.-..e.~a..m--.-o.e.wo ..... We...“ ..... enmmm: e: 83 $2 $3 23 :2. sea mt: $2. 33 mm; 33 23 niece EN :3 $3 EB om; $2- 82 82 Rad- m2: $5 Ed 83 Niece .9: Rod 82 meg :3 82- 32 82 $3. :3 £3 :2. SE 38m .-.me ..... Sena ...... we. ...... wm.m-----.e....._..m ......... m E... -.immense..-$312.83.. ........ guesswwwm lemme ..... m sweetieadm- ram 52 as: am :32 £2 as: am :82 £2 .32 em :32 .8; .8.— m m m as: £62 03338.0 VNB 8.3239. 8332.6ch befiszw :4 05m... SN 23 ES N86 use News NZ ea: 28- ea; acre. 88 82 em ace 3: was we; 83 we; we; Sea 22 £2. mews $3 $8 «a: niece e8 33 82 $3 £3 £2- Sea 82 83. eta Ea 98¢ a: mace e2 23 :3 Sea was 82- $2 $2 38. em; :3 $2 22 niece ..-mmm. ..... N mono ..... _. emdlwmoso ..... mead ........ gem:..owww-..owwm--.-me.m..q.. .llama.---mwww-.-.~.w.m.d ..... N 9.2 ..... owlmmm- a: «:3 was E; and 82- 82 E: 33. ea; $2 $2 5: 218m ea 23 5.2 83 mm; need. :3 we: 83- a; £3 8.3 $2 2.8m m2 :3 See at; was $2- 82 $2 23- see e:.~ 22 ea: trace ..-MmN. ..... w._.ouo.-.-wow.d-.-.wmoeo..-.-w.e._..m ........ «swam.-..Nmmm-..wmmw-.-fimd...-.-.-..wem.m.-.-wwwwasw~muo ..... 0 ME ..... Steam- o: :3 82 Sea is :2- :2 $3 News News a; :2 GE Brace ea :3 8% ES 33 32- £3 £2 £2. 33 eta 88 ea: Steam m: was 32 £3 £3 Need- 82 a: was- 23 a; 32 an: Menace :13...“ ..... N a“... ...... w a... ....... w as. ..... www.mssismmm. ..... N Edisemmnm.idem? ----.-.mm.e..m.--.wmmw.-.-w.mm.d ..... m NW... ........... mm- ram 52 as am as: a: :2 em :82 a: :2 em :82 Bed .8.— m m m as: u. see 3. ezee The graphical representation of item number distribution in each ab-block for the operational item pool and several selected candidate ROPs (e.g., ROP_I, ROP_2, ROP_13, and ROP_14; and ROP_9, ROP_IO, ROP_21, and ROP_22) are provided in Figures 4.1, 4.2, and 4.3. The item number distributions for the rest of the candidate ROP3—very similar to what are presented here—are provided in the APPENDIX. Obviously, the items in the OP were distributed quite differently from those in the candidate ROPs in three major aspects. First, the OP carried quite a few items with a,- values between 0 and .8944 while the candidate ROPs carried none or very few items with a,- values within this range. Second, regardless that the total number of items in each b-bin was different in item pools developed with different b—bin width and expected amount of item information change, the item number distributions of all candidate ROPs tended to be more uniform than that of the OP. For example, ROP_l, ROP_2, ROP_13, and ROP_14 had roughly 15, 20, 30, and 37 items respectively in most of their b-bins. The difference in deve10ping ROP_l and ROP_2 was in using different expected amount of change in item information: .4 for ROP_l and .2 for ROP_2; and the same was for ROP_13 and ROP_14. The difference in developing ROP_l and ROP_13 was in using different b-bin width: .4 for ROP_l and .8 for ROP_13; and the same was for ROP_2 and ROP_14. The findings about ROP_9, ROP_IO, ROP_21, and ROP_22 were also very similar to those about ROP_I, ROP_2, ROP_13, and ROP_14 except that the total number of items in each b-bin was slightly smaller. Third, for the OP, the largest proportion of items was those having a,- values between 0 and .8944 and then followed by items having a,- values between .8944 and 1.2649. However, for most of the candidate ROPs, a large proportion of items were those having a,- values between .8944 and 1.5492. 
Figure 4.1 Item number distribution in each ab-block for the operational item pool. [Figure: frequency of items in each ab-block, plotted by b-bin and a-range.]

Figure 4.2 Item number distribution in each ab-block for ROP_1, ROP_2, ROP_13, and ROP_14. [Figure: four panels showing item frequencies by ab-block for the four ROPs.]

Figure 4.3 Item number distribution in each ab-block for ROP_9, ROP_10, ROP_21, and ROP_22. [Figure: four panels showing item frequencies by ab-block for the four ROPs.]

Figures 4.4 and 4.5 further illustrate the differences in the distributions of the item discrimination and difficulty parameters for the OP and the aforementioned eight candidate ROPs. The a- and b-parameter distributions for the rest of the candidate ROPs can be found in the APPENDIX. To summarize, both the item a- and b-parameters of the candidate ROPs were distributed quite differently from those of the OP. Compared with the OP, the distributions of the item a-parameter in the candidate item pools appear truncated at a value of approximately .8, whereas the distributions of the item b-parameter in the candidate item pools tended to be uniform along the proficiency scale. Across the candidate ROPs themselves, however, both the item a- and b-parameter distributions were quite similar in shape, except that the numbers of items in the candidate ROPs developed with the .2 expected change in item information were larger than in those developed with the .4 expected change.
Figure 4.4 Item discrimination and difficulty parameter distributions for the OP, ROP1, ROP2, ROP13, and ROP14. [Figure: paired histograms of the a-parameter and b-parameter distributions for each pool.]

Figure 4.5 Item discrimination and difficulty parameter distributions for the OP, ROP9, ROP10, ROP21, and ROP22. [Figure: paired histograms of the a-parameter and b-parameter distributions for each pool.]

Figure 4.6 presents the item pool information for the OP and the aforementioned eight candidate ROPs. The pool information for the rest of the candidate ROPs can be found in the APPENDIX. Note that the operational item pool was the largest, containing 314 items in total. Clearly, the pool information curve of the OP was quite different from that of each candidate ROP, mainly in that the curve for each candidate ROP was much flatter than that for the OP and remained flat over a wide range of the proficiency scale. This suggests that such an item pool can support equally good measurement for examinees across a wide range of abilities. For the OP, the pool information reached its peak in the proficiency range between -1.2 and -.8 and then decreased along both sides of the proficiency scale. By the time the pool information for the OP dropped to approximately 60, that is, the largest pool information that could be provided by ROP_13, the range of the ability scale covered by the OP's curve was between -2 and .8. For ROP_13, however, this range was between -2 and 1.6, suggesting that the ROP can provide equally good measurement outcomes for examinees with latent abilities within this range. Figure 4.6 also indicates that the pool information differs across the individual ROPs at different proficiency points. This difference was the result of different pool characteristics, such as different average a-values, as discussed before, as well as different item pool sizes.
As the next section will indicate, all candidate ROPs can perform better than the OP. The better performance can be partly attributed to the optimality of the non-statistical attributes, which are presented below.

Figure 4.6 Item pool information for the OP and 8 candidate ROPs. [Figure: pool information I(θ) plotted against θ from -4 to 4 in two panels; Panel A: OP (314), ROP1 (192), ROP2, ROP13 (173), and ROP14 (240); Panel B: OP (314), ROP9 (168), ROP10 (229), ROP21 (139), and ROP22 (209). Note. ( ) indicates item pool size.]

Examples of the distributions of item attributes for the aforementioned eight candidate ROPs and the OP are presented in Figures 4.7 and 4.8. Again, the distributions of item attributes for the rest of the candidate ROPs can be found in the APPENDIX. In these figures, the x-axis represents the item attributes as described in Table 3.2. Overall, all candidate ROPs shared very similar distributions of item attributes, apart from slight differences. The largest differences in percentage between the OP and the candidate ROPs tended to occur in Attributes 3, 11, 12, and 14.

Figure 4.7 Distributions of item attributes for the candidate ROP_1, ROP_2, ROP_13, and ROP_14. [Figure: proportion of items possessing each attribute C1-C18 for the OP and the four ROPs.]
Figure 4.8 Distributions of item attributes for the candidate ROP_9, ROP_10, ROP_21, and ROP_22
(Line plot of the proportion of items possessing each attribute C1 to C18 for the OP, ROP9, ROP10, ROP21, and ROP22.)

4.2 PERFORMANCE OF CANDIDATE OPTIMAL ITEM POOLS

4.2.1 EVALUATION RESULTS FROM USING CONDITIONAL ABILITY POINTS

Figures 4.9 to 4.11 portray the conditional biases, MSEs, and SEs given by all candidate ROPs and the OP. Tables A1 to A3 in the Appendix also summarize these conditional statistics. Obviously, all candidate ROPs yielded much better measurement accuracy and precision, as indicated by much lower biases, MSEs, and SEs. Compared with the OP, the following findings can be observed with regard to bias. First, all candidate ROPs yielded smaller biases across the entire θ scale than the OP, and the differences were more conspicuous at the two tails of the θ scale (i.e., θ = -3, -2.5, 2.5, 3) than in the middle range. It is interesting to notice that the OP yielded negative bias at both tails of the θ scale, whereas the candidate ROPs yielded negative bias only at the low-ability levels. The reason can be attributed to the considerable shortage of operational items with b-values at the high-ability end of the proficiency scale, as Figure 4.4 indicates. When an item pool contained a sufficient number of items with b-values at the high-ability end, as was the case for the candidate ROPs, the conditional bias at those ability points was close to zero. In addition, the magnitudes of bias at θ = -3 were larger in the candidate ROPs developed with a .4 expected change in item information than in those developed with a .2 change, other things being equal. Second, the OP tended to underestimate examinees at both tails of the θ scale. Although the candidate ROPs also tended to substantially underestimate examinees in the low-ability range, the underestimation was much less severe than in the OP and occurred within a narrower ability range. For example, the biases at θ = -3 and θ = -2.5 were .28 and .19 in absolute value for the OP; for the candidate ROPs, the largest absolute values were .24 and .08 at θ = -3 and -2.5, respectively. Likewise, the bias at θ = 3 was .47 in absolute value for the OP, whereas the largest value for the candidate ROPs was only .05. A similar pattern can be observed for the conditional MSEs. In general, the MSE curve for the OP stayed consistently above those for the candidate ROPs. The MSEs at the low-ability levels, i.e., below -2.5, were much larger than those at the other ability levels, followed by those at the high ability points, i.e., 2.5 and 3.
The reason can be attributed to the finding that a certain proportion of examinees with true abilities below -2.5 were significantly underestimated. The differences in MSE between the OP and the candidate ROPs were much smaller at θ points between -1.5 and 0.5 than at the rest of the θ points. As for the MSEs given by the candidate ROPs, very similar values can be observed at most of the θ points, i.e., -2 to 3; the largest difference was observed at θ = -3. For example, the MSEs for ROP17 to ROP20 were quite different at θ = -3. In terms of conditional standard error, Figure 4.11 indicates that, at all conditional θ points, the ability estimates given by the OP are consistently more dispersed than those given by the candidate ROPs, suggesting more precise ability estimates from the candidate ROPs. However, low-ability examinees, i.e., those with true abilities below -2.5, tended to be measured less precisely than others no matter which item pool was used, the OP or a candidate ROP.

Table 4.2, which reports the number of tests with constraint violations for all candidate ROPs and the OP, indicates that all candidate ROPs controlled constraint violation substantially better than the OP. For example, the OP produced a total of 9016 individual tests with a constraint violation across the whole θ scale, whereas for the candidate ROPs even the highest number, given by ROP_3, was only 171. This number was considered tolerable. For all candidate ROPs and the OP, no individual test had more than one constraint violation.

Figure 4.9 Graphical representation of conditional bias
(Six panels of conditional bias curves over θ for the OP and ROP1 through ROP24.)

Figure 4.10 Graphical representation of conditional mean square error
(Six panels of conditional MSE curves over θ for the OP and ROP1 through ROP24.)

Figure 4.11 Graphical representation of conditional standard error
(Six panels of conditional SE curves over θ for the OP and ROP1 through ROP24.)
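The conditional indices plotted in Figures 4.9 through 4.11 can be computed directly from simulated ability estimates. A minimal sketch follows; the replication count and the noise model standing in for the CAT estimates are hypothetical.

import numpy as np

def conditional_stats(theta_true, theta_hat):
    # theta_hat holds the ability estimates from replications at one true theta
    bias = theta_hat.mean() - theta_true               # conditional bias
    mse = np.mean((theta_hat - theta_true) ** 2)       # conditional mean square error
    se = theta_hat.std(ddof=1)                         # conditional standard error
    return bias, mse, se

# Hypothetical usage: 500 replications at each conditional theta point; the
# normal noise below stands in for the CAT's actual estimation error
rng = np.random.default_rng(1)
for theta in (-3.0, -2.5, -2.0, -1.0, 0.0, 1.0, 2.0, 2.5, 3.0):
    theta_hat = theta + rng.normal(0.0, 0.3, 500)
    print(theta, [round(v, 3) for v in conditional_stats(theta, theta_hat)])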
Table 4.2 Number of tests having constraint violation given by the OP and the candidate ROPs, by conditional θ point
(Two panels, unrecoverable from the scan; columns are the conditional θ points from -3 to 3, rows are the OP and ROP_1 through ROP_24. See the summary in the text above.)

4.2.2 EVALUATION RESULTS FROM USING 20,000 EXAMINEES RANDOMLY SAMPLED FROM THE TARGET EXAMINEE POPULATION

Table 4.3 compares the overall summary performance statistics for the OP and all candidate ROPs. To summarize, all candidate ROPs yielded much better performance on almost all criteria. All candidate ROPs unanimously gave better measurement accuracy and precision than the OP, as witnessed by lower bias and MSE. In general, all candidate ROPs and the OP tended to overestimate examinees' abilities. However, the average biases given by the candidate ROPs tended to be around .1, approximately .1 lower than that given by the OP. With regard to MSE, the results suggest that the candidate ROPs tended to yield remarkably lower average MSEs (i.e., .04 to .07) than the OP (i.e., .13). The magnitude of the average MSE given by each candidate ROP appears to be related to its average a-value. For example, ROP_21 and ROP_22, the two candidate ROPs with the lowest a-values, yielded the highest MSEs, i.e., .07. The correlation coefficient between the true and the estimated abilities given by each candidate ROP was at least .02 higher than that given by the OP. Similar to the findings about MSE, the correlation coefficients given by the candidate ROPs also tended to be related to their average a-values; those with lower a-values tended to yield slightly lower correlations. For example, ROP_21 and ROP_22 reported correlation coefficients of .96, whereas ROP_1 and ROP_2 reported .98. With regard to item use, all candidate ROPs made more efficient use of items than the OP. This better use of items is supported by evidence in two areas: the number of underexposed items and the ratio of the χ² statistic for the OP over that for each candidate ROP, where χ² summarizes how far the observed item exposure rates depart from perfectly balanced use. First, compared with the OP, the candidate ROPs yielded at least 18% fewer underexposed items. In general, the candidate ROPs developed with a .4 expected amount of change in item information tended to make better use of items, having smaller percentages of underexposed items than those developed with a .2 expected amount of change; the difference was about 15% in most cases. For example, ROP_21 yielded 25% underexposed items whereas ROP_22 yielded 44%. Second, the ratio of χ² for the OP over χ² for each candidate ROP was unanimously above 2, suggesting that all candidate ROPs made more balanced use of items than the OP.
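The exposure-balance statistic is only partially legible in the scan; reading it as the familiar χ² index of item exposure rates, which is an assumption rather than a confirmed detail of the study, the computation would look like the following.

import numpy as np

def exposure_chi_square(exposure_counts, n_examinees, test_length):
    # Observed exposure rate of each item in the pool
    r = exposure_counts / n_examinees
    # Under perfectly balanced use, every item is exposed at rate L / pool size
    r_ideal = test_length / len(exposure_counts)
    return np.sum((r - r_ideal) ** 2 / r_ideal)

# Hypothetical usage: 20-item tests, a 200-item pool, 20,000 examinees
rng = np.random.default_rng(2)
counts = rng.multinomial(20 * 20_000, rng.dirichlet(np.ones(200)))
print(exposure_chi_square(counts, 20_000, 20))
# The ratio chi2_OP / chi2_ROP then compares two pools: values above 1
# indicate that the ROP uses its items more evenly than the OP.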
The candidate ROPs developed with a .4 expected amount of change in item information tended to yield a higher ratio than those developed with a .2 expected amount of change, suggesting that the former made better use of items than the latter. With regard to the item overlap rate, it is somewhat surprising that the overlap rate given by the OP was at least .1 higher than that given by each candidate ROP, despite the fact that the OP contained more items than any candidate ROP (a computational sketch of this index follows Table 4.3 below). Among the candidate ROPs, meanwhile, differences in size did not make a significant difference in the item overlap rate. In terms of constraint violation, significantly fewer tests with constraint violations were observed for the candidate ROPs than for the OP. In comparison with the OP, in which at least 8% of tests had a constraint violation, no constraint violation was observed in the tests using candidate ROPs when rounded to two decimal places. None of the tests had more than one constraint violation for either the OP or the candidate ROPs. In terms of overexposed items, interestingly, very similar numbers of overexposed items (i.e., approximately 35 items when the overexposure rate was set at .2) were observed for both the OP and the candidate ROPs, regardless of the differences in size between the OP and the candidate ROPs and among the candidate ROPs themselves. When the overexposure rate was set at .3, the difference in the percentage of overexposed items between the OP and the candidate ROPs was smaller than when the rate was set at .2. Overall, this finding seems to be related to the underlying nature of the maximum item information selection rule, which tends to be highly sensitive to even slight differences in item information. Because no item exposure control procedure was implemented in this study, items with high a-values were always selected over the others.
Table 4.3 Overall summary performance statistics for the OP and the 24 candidate ROPs, using 20,000 examinees randomly sampled from the target population
(Three panels, unrecoverable from the scan, covering the OP and ROP_1 through ROP_24 and reporting statistics including bias, MSE, the correlation between true and estimated abilities, counts and proportions of overexposed and underexposed items at the .2 and .3 thresholds, the item overlap rate, χ², the number of tests having constraint violation, and item pool size. See the summary in the text.)
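A sketch of the item overlap rate computation referenced above; the simulated tests are hypothetical placeholders, and the statistic is estimated by sampling random examinee pairs rather than enumerating all of them.

import numpy as np

def average_overlap_rate(tests, test_length, n_pairs=5000, seed=3):
    # tests: one set of administered item IDs per examinee
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.choice(len(tests), size=2, replace=False)
        total += len(tests[i] & tests[j]) / test_length
    return total / n_pairs

# Hypothetical usage: 1,000 simulated 20-item tests from a 200-item pool
rng = np.random.default_rng(4)
tests = [set(rng.choice(200, size=20, replace=False).tolist()) for _ in range(1000)]
print(average_overlap_rate(tests, 20))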
Table 4.4 presents the number and percentage of examinees correctly classified into each of four performance levels, as well as the differences in classification accuracy rates at each performance level between the OP and each candidate ROP, when three cut scores (i.e., -1.6245, -.05397, and 1.088235) are used. As Table 4.4 indicates, at each performance level, the classification accuracy rate given by any candidate ROP is higher than that given by the OP. The classification accuracy rates given by the candidate ROPs developed with the .2 expected amount of item information change tended to be slightly higher at all performance levels than those given by the candidate ROPs developed with the .4 expected amount of change, other things being equal. In comparison, the candidate ROPs developed by the random procedure tended to classify examinees more accurately than those developed by the MTI procedure. In general, this higher classification accuracy rate seems to be related to the magnitude of the average a-values: item pools with higher a-values tended to yield more accurate classifications.

Table 4.4 Classification accuracy rates for the OP and the candidate ROPs at each of the four performance levels (Not Proficient, Partially Proficient, Proficient, Advanced)
(Table unrecoverable from the scan; see the summary in the text above.)
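The classification step itself is a simple binning of true and estimated abilities against the three cut scores. The sketch below shows the bookkeeping; the ability arrays are hypothetical stand-ins for the simulated data.

import numpy as np

CUTS = np.array([-1.6245, -0.05397, 1.088235])  # the three cut scores used above
LEVELS = ["Not Proficient", "Partially Proficient", "Proficient", "Advanced"]

def classify(theta):
    # np.searchsorted maps each theta to one of the four performance levels
    return np.searchsorted(CUTS, theta)

def accuracy_by_level(theta_true, theta_hat):
    true_lvl, est_lvl = classify(theta_true), classify(theta_hat)
    for k, name in enumerate(LEVELS):
        mask = true_lvl == k
        rate = np.mean(est_lvl[mask] == k) if mask.any() else float("nan")
        print(f"{name}: n={mask.sum()}, accuracy={rate:.3f}")

# Hypothetical usage with simulated truths and noisy estimates
rng = np.random.default_rng(5)
theta_true = rng.normal(-0.3813, 1.0, 20_000)
theta_hat = theta_true + rng.normal(0.0, 0.3, 20_000)
accuracy_by_level(theta_true, theta_hat)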
4.3 A DEMONSTRATIVE EXAMPLE

This section demonstrates how the identified optimal features can be used to guide item pool assembly by serving as a model item pool. The target ROP features were taken from ROP_21, which has average a-, b-, and c-values of 1.21, -.237, and .135, respectively. This ROP was chosen because its average a-value was the smallest and it contained the smallest number of items among all candidate ROPs. The item pool assembled based on the ROP_21 features is called the Demo_Pool; its characteristics are described in Table 4.5 along with those of the OP and ROP_21. In total, 84 out of 314 operational items strictly met the optimal psychometric configurations identified in ROP_21. To equalize the sizes of the Demo_Pool and ROP_21 so that the results would be comparable, another 55 operational items were added to the Demo_Pool. The selection of these 55 items was based on two criteria: 1) the selected items, all with a-values below 0.89443, should have a-values as close to 0.89443 as possible; and 2) the selected items should have b-values above -0.8. The second criterion was adopted in order to move the average b-value closer to that of ROP_21 (i.e., -0.237) as well as to the average ability of the target examinee population (i.e., -0.3813). Table 4.5 indicates that both the average a- and b-values of the Demo_Pool are lower than those of the OP. In addition, the a-values were more dispersed in the Demo_Pool than in the OP, but the b-values were less dispersed in the Demo_Pool than in the OP. Tables 4.6 and 4.7 compare the performance of the OP, the Demo_Pool, and ROP_21 produced by both conditional and random examinee samples. The random examinee sample was the same as that used in the studies described in the previous chapter. Table 4.8 compares the classification accuracy rates given by these three item pools. In general, these tables reveal the following findings. First of all, the results in Table 4.6 from using conditional θ points indicate that, across almost the entire θ scale, the Demo_Pool yields similar or even better measurement accuracy and precision than the OP except at the θ points -3, -2.5, 2.5, and 3.
This result can be largely attributed to the lack of items in the Demo_Pool with b-values close to those θ points. Second, the overall performance statistics in Table 4.7 indicate that the Demo_Pool performs better than the OP on almost every criterion, including better measurement accuracy and precision, higher classification accuracy rates, more efficient item use, and a higher correlation coefficient between the true and the estimated abilities. To some degree, it seems that the items removed in the course of assembling the Demo_Pool did not contribute much to the measurement outcomes. The pool configuration of ROP_21, identified as the end product of this optimal item pool design effort, can successfully serve as a model item pool in practical applications.

Table 4.5 Item pool statistics for the Demo_Pool, the OP, and the ROP_21

Item parameters  Item pool   Mean     SD     Max    Min     Pool Size
a-parameter      OP           0.979   0.395  2.125   0.022  314
                 Demo_Pool    1.175   0.337  2.125   0.470  139
                 ROP_21       1.210   0.249  2.111   0.736  139
b-parameter      OP          -0.746   0.896  3.617  -2.730  314
                 Demo_Pool   -0.501   0.929  2.082  -2.345  139
                 ROP_21      -0.237   1.553  2.753  -2.792  139
c-parameter      OP           0.147   0.080  0.500   0.007  314
                 Demo_Pool    0.162   0.083  0.500   0.009  139
                 ROP_21       0.135   0.067  0.337   0.019  139

Table 4.6 Conditional statistics given by the OP, the Demo_Pool, and the ROP_21
(Table unrecoverable from the scan: conditional bias, MSE, and SE at θ points from -3 to 3 for the three pools.)

Table 4.7 Overall performance statistics for the OP, the Demo_Pool, and the ROP_21 using a random examinee sample
(Table largely unrecoverable from the scan; see the summary in the text.)

Table 4.8 Classification accuracy rates at each performance level

Proficiency level             OP      ROP_21  Demo_Pool
Not Proficient (#)            1521    1559    1501
Not Proficient (Prop)         0.828   0.848   0.817
Partially Proficient (#)      9542    9665    9592
Partially Proficient (Prop)   0.863   0.875   0.868
Proficient (#)                4596    4799    4673
Proficient (Prop)             0.773   0.807   0.786
Advanced (#)                  983     961     922
Advanced (Prop)               0.798   0.827   0.793
Total (#)                     16586   16984   16688
Total (Prop)                  0.829   0.849   0.834

Note. # indicates number; Prop = proportion.
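As a computational companion to this demonstrative example, the sketch below reconstructs the two-step selection logic described in Section 4.3: first take operational items that fall inside the model pool's ab-blocks, then top up with items whose a-values sit just below the block boundary of 0.89443 and whose b-values exceed -0.8. The block list and parameter arrays are hypothetical, and the code is a reconstruction of the logic in the text, not the study's actual program.

import numpy as np

def assemble_demo_pool(a, b, target_blocks, pool_size, a_threshold=0.89443):
    # a, b: parameter arrays for the operational items
    # target_blocks: (a_lo, a_hi, b_lo, b_hi, count) tuples from the model pool
    chosen = []
    for a_lo, a_hi, b_lo, b_hi, count in target_blocks:
        idx = np.where((a >= a_lo) & (a < a_hi) & (b >= b_lo) & (b < b_hi))[0]
        chosen.extend(idx[:count].tolist())      # take up to the blueprint count
    # Top-up step mirroring the two criteria in the text: among the remaining
    # items with a below the threshold, prefer a close to it and b above -0.8
    taken = set(chosen)
    rest = [i for i in range(len(a))
            if i not in taken and a[i] < a_threshold and b[i] > -0.8]
    rest.sort(key=lambda i: a_threshold - a[i])  # closest to the threshold first
    chosen.extend(rest[:max(0, pool_size - len(chosen))])
    return chosen

# Hypothetical usage with simulated operational parameters
rng = np.random.default_rng(8)
a_op = rng.normal(0.98, 0.4, 314).clip(0.1, 2.2)
b_op = rng.normal(-0.75, 0.9, 314)
blocks = [(0.89443, 1.0954, lo, lo + 0.4, 3) for lo in np.arange(-2.8, 2.8, 0.4)]
print(len(assemble_demo_pool(a_op, b_op, blocks, 139)))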
CHAPTER V SUMMARY, DISCUSSION, AND IMPLICATION

5.1 SUMMARY OF RESEARCH

This study introduced a heuristic approach to developing an optimal item pool for a CAT using the WDM item selection method. An item blueprint was produced as the end product of this approach, describing items' statistical and non-statistical attributes, the item number distribution, and the optimal item pool size. Specifically, three different methods were used to identify the items' optimal statistical attributes; the prototype of two of these methods can be traced back to McBride and Weiss (1976), in which "ideal" or "perfect" item pools were simulated with item features identified by using a formula created by Birnbaum (1968). To identify the optimal non-statistical features required by the WDM, a sampling approach was developed that was based primarily on the test specifications and on the distribution of the operational items' non-statistical attributes. To determine the optimal item pool size, the bin-and-union method was used, which partitions the space constituted by the item a- and b-values into smaller regions called ab-bins or ab-blocks in this study. In other words, an individual ab-bin/ab-block is bounded by a specific range of a- and b-values. Blocks are mutually exclusive, but items within a block are treated as equivalent, meaning that they can be used interchangeably. The items administered to each simulated examinee were tallied by block, items administered in common to more than one examinee were counted once, and the item pool size was determined as the union of the item sets required across a large number of examinees drawn from the target examinee population. By manipulating three factors (b-bin width, item generation method, and expected amount of item information change), 24 candidate ROPs were generated, which varied in item pool size and in average a- and b-values. To summarize, the item pool sizes varied from 139 (ROP_21) to 270 (ROP_2); the average a-values varied from 1.182 (ROP_22) to 1.62 (ROP_2); and the average b-values varied from -.096 (ROP_5) to -.358 (ROP_3), all higher than the average ability of -.3813. Compared with the OP, almost 90% of the candidate ROPs contained items with minimum a-values exceeding .8, similar to what Urry (1977) recommended as the preferred characteristics of a CAT item pool, whereas the b-parameters of the candidate ROPs, rather than following a normal distribution as the examinee abilities do, tended to be uniformly distributed along the proficiency scale. Among the three manipulated factors, the b-bin width and the expected amount of item information change tended to affect item pool size, whereas the item generation method tended to affect the item a-values. For example, with other factors being equal, using a .2 expected amount of change in item information tended to require about 1/3 more items than using a .4 expected amount of change. When the MTI item generation method was used, setting the maximum individual test information at the average test information tended to produce an item pool with less discriminating power than the other two item generation methods. The distributions of the non-statistical attributes of the different candidate ROPs were also comparable to each other. Despite the differences in the optimal features of the different candidate ROPs, the evaluation results indicated that all candidate ROPs unanimously performed better than the OP: they achieved better measurement accuracy and precision, made more efficient and balanced use of items, and violated test constraints at a negligible level.
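Because the bin-and-union bookkeeping is central to the summary above, a minimal sketch may help, under one plausible reading of the union step (each block keeps the largest count that any single simulated examinee required). The bin edges shown follow the equally spaced a² boundaries visible in the appendix figures, while the b-bin width was one of the manipulated factors; both are assumptions for illustration.

import numpy as np
from collections import Counter

def bin_and_union(administered, a_edges, b_edges):
    # administered: for each simulated examinee, a list of (a, b) pairs of the
    # items that the CAT actually delivered to that examinee
    required = Counter()
    for test in administered:
        counts = Counter(
            (int(np.searchsorted(a_edges, a)), int(np.searchsorted(b_edges, b)))
            for a, b in test
        )
        for block, n in counts.items():
            # the union across examinees keeps the largest count that any one
            # examinee needed in this block
            required[block] = max(required[block], n)
    return required, sum(required.values())

# Hypothetical edges and simulated administrations
a_edges = np.sqrt(np.arange(0.4, 5.2, 0.4))   # .63246, .89443, 1.0954, ...
b_edges = np.arange(-2.8, 3.2, 0.4)
rng = np.random.default_rng(9)
sims = [[(rng.uniform(0.5, 2.2), rng.normal(0, 1)) for _ in range(20)]
        for _ in range(1000)]
blocks, pool_size = bin_and_union(sims, a_edges, b_edges)
print(pool_size)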
It is interesting to observe that, despite the fact that the OP contained more items than any candidate ROP, the total number of overexposed items in each candidate ROP differed little from that in the OP, and the item overlap rate given by the OP was higher than that given by any candidate ROP. In addition, slight differences in the performance of the different candidate ROPs were also observed. In general, the candidate ROPs with higher average a-values (e.g., ROP_1 and ROP_2) tended to yield better measurement precision and more accurate classification than those with lower average a-values (e.g., ROP_21 and ROP_22). The candidate ROPs developed with a .2 expected amount of item information change also tended to yield more underexposed items than those developed with a .4 expected amount of change. In summary, the resulting statistics suggest that the methodology introduced in this study can generate item pools with characteristics capable of supporting the good functioning of the WDM item selection algorithm: examinees are administered a content-balanced exam and, at the same time, are estimated with decent accuracy and precision.

5.2 DISCUSSION

Stocking (1994) addressed several issues with regard to item pool size in the context of high-stakes admission programs administered in the form of a CAT. Examining five operational item pools for five fixed-length CAT tests, Stocking recommended as a rule of thumb that a CAT item pool should be approximately 12 times the length of the CAT exam for high-stakes exams. Way (1998) commented on this rule of thumb as "prudent advice" (p. 23) and "a valuable guideline" (p. 24). Stocking also concluded that roughly the item content of a small number of conventional paper-and-pencil forms should be sufficient to construct a single CAT pool. As Table 4.1 indicates, the 24 candidate ROPs developed by using the bin-and-union method in this study vary in size from 139 to 272, with four item pools containing more than 240 items. Calculating the ratio of item pool size to test length (i.e., 20), the ratios were between 6 and 12 for 20 candidate ROPs; for the remaining four item pools, they were between 12 and 14. In a study by Reckase and He (2009), in which the bin-and-union method was also used to develop an optimal ROP for a variable-length CAT, the results likewise indicated a ratio of item pool size to test length of approximately 13. In other words, the bin-and-union method, to a large degree, corroborates Stocking's recommendation as to what constitutes an adequate item pool size for a CAT program. In addition to the item pool size, the bin-and-union method used in the current study also depicts the characteristics of the a- and b-parameters. The a-parameters in almost all ROPs tended to peak somewhere around 1.1, with minimum values generally exceeding .8, whereas the b-parameters tended to be uniformly distributed along the whole proficiency scale, not exactly matching the distribution of θ used for the simulation in this study. In general, the item parameter characteristics suggested in this study are in line with those recommended by Urry (1977) and Jensema (1977), both of whom recommend a uniform distribution of item difficulty as one of the important item bank requirements for a CAT.
This requirement appears to be reasonable in that a maximum information-based item selection method, such as the WDM method, attempts to select for administration the item that will yield the most information. If there are not sufficient numbers of items available at a particular difficulty level, the item selection algorithm will have to select an item that is not appropriate and that will yield a less than optimal amount of item information. As a result, the improved measurement quality expected of tailored testing may not be achieved. An example of this appeared in a previous section of this study: the operational item pool yielded substantial bias and error for examinees located at the high-ability end of the proficiency scale due to the shortage of items with high b-values, whereas the candidate ROPs did not experience the same problem. Recall that when employing either the bin-and-union or the linear programming method to design an item pool, the CAT simulation is conducted with the operational CAT algorithm and with examinees drawn from the target examinee population. It is therefore expected that these two characteristics built into the simulation jointly affect the optimal item pool features. In other words, an optimal item pool should carry features that meet the unique needs of each individual CAT program. In this sense, it might be appropriate to argue that the optimal item pool features for different CAT programs may be different. This explains the different findings on the optimal features of CAT item pools in the literature. For example, Dodd, Koch, and De Ayala (1993) indicated that trait estimates in CAT are more accurate when the item pool characteristics and the latent trait estimates match each other. In another study, by Reckase and He (2004), on a variable-length CAT with a termination rule that evaluates whether the final ability estimate falls within or outside the bound set by a certain confidence interval around a cut score, the optimal distribution of b-values was found to be uniform along the proficiency scale but with a peak around the cut score, barely related to the expected examinee ability distribution. Therefore, the optimal item parameter characteristics yielded by the current study may not necessarily work for other, different CAT programs. As Table 4.3 indicates, most of the candidate ROPs tended to yield numbers of overexposed items very similar to that of the OP, although they contain fewer items. If evaluated by the proportion of overexposed items relative to item pool size, all candidate ROPs had a higher percentage of overexposed items than the OP, suggesting that the candidate ROPs could potentially pose a more severe test security concern than the OP if test security were really an issue. In fact, this result can be ascribed to factors including the use of the maximum information-based item selection rule and the absence of an item exposure control procedure in the targeted operational CAT program. Using the operational CAT pool configurations, such as the distributions of the a- and c-parameters and the correlation between the a- and b-parameters, as a template to generate optimal item features also plays a role. As well documented in the literature (e.g., Wainer, 2000; Way, 1998), the maximum information item selection rule is very sensitive to even very slight differences in item information.
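This sensitivity can be made concrete with a stripped-down simulation: with no exposure control, a pure maximum-information rule concentrates administrations on the few most discriminating items near each ability estimate. The sketch below uses hypothetical parameters and, for simplicity, holds the ability estimate fixed within a test, so it illustrates the mechanism rather than the operational WDM algorithm.

import numpy as np

D = 1.7  # assumed scaling constant, as in the earlier sketch

def item_information(theta, a, b, c):
    # Fisher information of a 3PL item at ability theta
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

rng = np.random.default_rng(6)
a = rng.normal(1.0, 0.4, 200).clip(0.3, 2.2)   # hypothetical 200-item pool
b = rng.normal(-0.5, 1.0, 200)
c = np.full(200, 0.15)
exposure = np.zeros(200, dtype=int)

for _ in range(2000):                          # 2,000 simulated examinees
    theta_hat = rng.normal(-0.38, 1.0)         # stand-in for a provisional estimate
    info = item_information(theta_hat, a, b, c)
    for _ in range(20):                        # 20-item test, no exposure control
        best = int(np.argmax(info))            # always the single most informative item
        exposure[best] += 1
        info[best] = -np.inf                   # an administered item cannot repeat

print("share of pool never used:", np.mean(exposure == 0))
print("maximum item exposure rate:", exposure.max() / 2000)

Even in this simplified setting, a handful of high-a items absorbs most administrations while a sizable share of the pool is never used, the same pattern of unbalanced exposure discussed in this section.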
If a maximum information-based criterion is used for item selection in a CAT free of any item exposure control procedure, items with high discrimination are very likely to be overexposed, while many low or even moderately discriminating items are never selected. A solution to the unbalanced item exposure rates caused by using the maximum information-based item selection criterion is to implement an item exposure control procedure. Because no item exposure control procedure was implemented in our target CAT program, overexposure may arise as a natural consequence in this study. It is anticipated that using an item exposure control procedure would help ease this concern. The procedures needed to develop an optimal item pool for a CAT implementing item exposure control can be found in Gu (2007). On a side note, the target CAT is used to place examinees into different levels of courses. Therefore, in cases where overexposure causes a test security issue and examinees gain higher scores due to security breaches, they might simply throw themselves into a kind of self-cheating condition in which they are likely to end up learning little if placed in a class beyond their ability level. In addition to the number of overexposed items, the item overlap rate was also used in this study as another index of test security because it indicates the percentage of common items administered to two randomly selected examinees. The higher the proportion of identical questions that different examinees receive, the greater the risk to test security. Way (1998) recommended using the item overlap rate as a "global picture of how often items are used" (p. 22). The results in this study indicated that all candidate ROPs yielded lower item overlap rates than the OP despite the fact that the OP contained more items. This result, from another perspective, suggests that the test security issue may not be as severe as the number of overexposed items implies. A study by Chen, Ankenmann, and Spray (2003) found that equalizing the item exposure rates in a pool can reduce the average test overlap rates.
Table 4.3 indicates that all candidate ROPs have more homogeneous item exposure rates than the OP, as witnessed by the χ² values for each candidate ROP. This result explains why the candidate ROPs produced lower test overlap rates than the OP. As discussed in the previous chapter, the R and the MRP item generation methods tended to yield candidate ROPs with higher average a-values than the MTI procedure, other factors being equal. The lower a-values under the MTI procedure can be attributed to the magnitudes of the minimum test information set for examinees with different true abilities. This study adopted the average test information calculated from the historical data as the target test information for examinees whose true abilities fell within the range of the three cut scores used to classify examinees into four performance levels; for examinees whose true abilities fell beyond the range of the cut scores, the target test information was set at a lower level. It is anticipated that raising the magnitudes of the minimum test information would yield candidate ROPs with higher average a-values. In light of the characteristics of the operational item pool used in this study, i.e., an average a-value of .979, it seems that the way the target test information was set works well in that it yields candidate ROPs with reasonably realistic item parameter values.

5.3 IMPLICATION

The results of this study indicate that the optimal item pools performed much better than the operational item pool in almost every respect. This finding was not surprising in some sense, given that the item characteristics, both statistical and non-statistical, were designed to be optimal. Rather, what is more interesting is how to make good use of these optimal features so that a CAT program can achieve the best attainable measurement outcome. Therefore, this section discusses how to apply the optimal item pool configurations to item pool assembly, item pool management, and item writing.

5.3.1 IMPLICATION FOR ITEM POOL MANAGEMENT AND ASSEMBLY FOR CAT

In practice, two types of item pools are often used for an operational CAT program: master and operational item pools. A master item pool, also called a "vat" by some researchers (Way, 1998; Way, Steffen, & Anderson, 1998), stores a large quantity of items from which an operational item pool is assembled. The operational item pool is the one that provides the resources from which tests are selected. Vat management can be seen as a dynamic process, for its role is to maintain up-to-date information on each existing or newly created item with respect to its psychometrics, usage history, and availability status. As discussed before, test security is a concern for CAT because of its nature as a type of continuous testing. The concern becomes more severe if only one single static pool is used. Wang and Kolen (2001) summarized three approaches that have been proposed to ease the concern potentially caused by using one single static pool. In the first approach, the operational item pool is updated and refreshed with new items periodically. In the second approach, the old operational item pool is replaced by an alternate new item pool. In the third approach, multiple item pools are used simultaneously and rotated among different testing sites. No matter which approach is used, however, a key issue is to ensure the comparability of the different item pools so that the exams are fair to examinees who are administered the CAT from different item pools or different versions of the same item pool. Both Wainer (2000) and Wang and Kolen (2001) identified creating parallel item pools as one basic approach to achieving comparability for a CAT based on alternate pools. To create parallel item pools, the critical issue is to understand the characteristics of item pools that most affect the comparability of CAT scores. Wang and Kolen listed several factors, among which are item pool size and the item pool assembly procedures and constraints used to assemble the pools. Wang and Kolen also called for more research to identify the most influential characteristics of item pools that operationally define the concept of parallelism of pools and to study what level of parallelism is needed in order to achieve a desired level of comparability of CAT scores. Wainer (2000) implicitly pointed out that parallelism between new and previous CAT pools should consider both content and statistical characteristics.
Obviously, if a candidate ROP developed in this study is used as a model item pool to guide operational item pool assembly out of a vat, parallelism can be ensured automatically. An additional advantage of building on the item pool configuration of a candidate ROP is that an item pool assembled to match the desired features can provide the best attainable measurement quality for a given CAT algorithm. In fact, a similar idea has already been discussed by Way and his colleagues (2001), who proposed analyzing the characteristics, including the psychometric and item properties, of operational item pools that had functioned well in the past and then, based on those characteristics, generating a model pool to ensure the satisfactory functioning of a CAT. The item pool features identified by the method discussed in this study can conveniently serve the role of the model item pool proposed by Way and his colleagues (2001). When constructing multiple parallel pools, the characteristics of this model pool can serve as a template for the other item pools to mimic. What is more, because it is considered "optimal," using a candidate ROP developed in this study as the model pool is also expected to result automatically in reasonably decent measurement outcomes. In addition to serving as a model item pool, the characteristics of the candidate ROP can also help to manage a vat or simplify vat management. To achieve this goal, a vat can first be structured by bin. For example, a bin may take a certain width on the proficiency scale only, if the Rasch model is used, or take the shape of an ab-block, as discussed in this study, if the 3PL IRT model is used. Based on their features, items can be stored in the different bins or ab-blocks. The ideal number of items in each b-bin or ab-block can be worked out by multiplying the number of items per bin in the model pool by the number of operational item pools that the vat intends to support. Alternatively, if the operational item pools are intended to overlap to a certain degree, then the number of items in each b-bin or ab-block can easily be calculated from the intended overlap rates. When an operational item pool is assembled, the required number of items can be sampled from the corresponding b-bin or ab-bin in the vat. If the operational item pool needs to be refreshed, the items that are needed can be sought directly from the corresponding b-bins or ab-blocks, which store items that are equivalent in use. In summary, the optimal item pool configurations can not only guide the assembly of high-quality item pools but also provide useful insights into how to manage and maintain a vat. This is a research area worth further exploration.

5.3.2 IMPLICATION FOR ITEM WRITING AND DEVELOPMENT

One of the elements that the optimal blueprint includes is the distribution of items over each non-statistical attribute, for example, the attributes that an individual item should possess and the number of items that should possess a certain attribute or a combination of multiple attributes at the same time. These features can be used to guide the item writing process, in which item writers can be instructed to write items with the desired attributes based on the blueprint. For example, item writers can be instructed to write an item with a certain combination of attributes described in the test blueprint, as sketched below.
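One way to operationalize this is to sample item-writing assignments so that the commissioned items reproduce the blueprint's attribute distribution. In the sketch below, the attribute combinations and their target proportions are hypothetical placeholders for the actual blueprint entries.

import numpy as np

# Hypothetical blueprint: target proportions for attribute combinations
blueprint = {
    ("C3", "algebra", "multiple-choice"): 0.20,
    ("C11", "geometry", "multiple-choice"): 0.35,
    ("C12", "data", "gridded"): 0.25,
    ("C14", "number", "multiple-choice"): 0.20,
}

def writing_assignments(blueprint, n_items, seed=7):
    # Sample attribute combinations in proportion to the blueprint targets
    rng = np.random.default_rng(seed)
    combos = list(blueprint)
    probs = np.array([blueprint[k] for k in combos])
    picks = rng.choice(len(combos), size=n_items, p=probs / probs.sum())
    return [combos[i] for i in picks]

for combo in writing_assignments(blueprint, 5):
    print("write one item with attributes:", combo)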
Once a certain proportion of items have been written according to the specifications in the blueprint, they can be tested in a sequential manner so as to gauge how far they are from the intended goal. As van der Linden and his colleagues (1999) suggested, the best way to view these optimal item blueprints is to treat them as tools for continuous pool management rather than as a one-shot item pool design.

5.4 LIMITATION AND FUTURE RESEARCH DIRECTION

Like most studies, this study is not free from limitations. First, when the item pool characteristics of the candidate ROPs were evaluated, the item parameters were treated as known true values containing no estimation error. For the operational items, however, item parameter estimates were used, implying that the values contain estimation error. In CAT, the item selection criterion always involves optimization, for example, choosing the item that provides maximum item information. When the item estimates contain errors, a process known as "capitalization on chance" may occur during CAT item selection, which generally takes place over items calibrated with estimation error (van der Linden & Glas, 2000). Capitalization on chance exploits the fact that the optimal values of a function of the item parameters result from the true values of the parameters as well as from large estimation errors (van der Linden & Glas, 2000). It has been reported that capitalization on chance can result in worse ability estimates than expected (Hambleton & Jones, 1994). Since the items in the candidate ROPs were assumed to contain no estimation error, future studies should examine how estimation errors may affect the performance of these candidate ROPs. Second, the CAT algorithm considered in this study selects only independent, stand-alone items. In practice, a popular testing format is the item set, which refers to a group of items related to each other through a common stimulus, for example, a cluster of questions referring to a common reading passage or graph. Future studies should consider how to identify optimal features for set-based items. Third, as discussed in the literature review chapter, employing the idea of bins to determine item pool characteristics can be traced back to the early 1990s, to Boekkooi-Timminga (1990). In both Boekkooi-Timminga's study and the current study, the b-bins or ab-blocks were treated as mutually exclusive when collecting and tallying items. When a CAT algorithm searches for an item to administer, however, the search takes place over the whole item pool rather than within a single b-bin or ab-block. Therefore, the item pool size calculated by adding up all the items in each b-bin or ab-block may include more items than needed; this explains the need for a post-adjustment procedure to trim out the redundant items. The trimming procedure used in this study, though compatible with the use of maximum information-based item selection criteria, tended to remove items with a-values lower than .8. As a result, the candidate ROPs tended to have items with a-values that were too high, which may make the identified optimal features less useful in reality, because items with high a-values are hard to produce in practice. Thus, future research should work out a trimming procedure that can do this job while items are being sorted into the ab-bins.
It is expected that such a new trimming strategy would result in candidate ROPs with lower, more practically realistic a-values. Fourth, when using b-bins or ab-blocks to collect items, the items in each b-bin or ab-block were treated as equivalent because they shared very similar item information. This idea of treating items in the same bin or ab-block as interchangeable can be applied to other areas, for example, linear paper-and-pencil test assembly. To apply this idea, items can first be allocated to different b-bins or ab-blocks based on their information. Then the expected number of items can be selected out of each group based on the target test specification. Cheng and Chang (2008) have conducted some preliminary research in this area, and there are more research opportunities along this line. Finally, one of the major purposes of designing an optimal item pool for a CAT is to provide a reference establishing the best attainable measurement results that a CAT algorithm can produce. Despite the fact that historical data from the operational test, for example, the operational item pool characteristics and the target examinee ability distribution, are recommended as an indispensable part of the simulation, it is still very likely that the identified optimal features may appear too optimal. Future studies may want to consider developing a tolerance region around the identified optimum so that the results can be more useful in practice.

APPENDIX

(Figures A.1 through A.16 are histograms of item counts per ab-block: bars grouped by a-bin, with a-bin boundaries at .63246, .89443, 1.0954, 1.2649, 1.4142, 1.5492, 1.6733, 1.7889, 1.8974, 2, 2.0976, and 2.1909, plotted over b-bins from -2.8 to 2.8.)

Figure A.1 Item Number Distribution in each ab-block for ROP3
Figure A.2 Item Number Distribution in each ab-block for ROP4
Figure A.3 Item Number Distribution in each ab-block for ROP5
Figure A.4 Item Number Distribution in each ab-block for ROP6
Figure A.5 Item Number Distribution in each ab-block for ROP7
Figure A.6 Item Number Distribution in each ab-block for ROP8
Figure A.7 Item Number Distribution in each ab-block for ROP11
Figure A.8 Item Number Distribution in each ab-block for ROP12
Figure A.9 Item Number Distribution in each ab-block for ROP15
Figure A.10 Item Number Distribution in each ab-block for ROP16
Figure A.11 Item Number Distribution in each ab-block for ROP17
Figure A.12 Item Number Distribution in each ab-block for ROP18
Figure A.13 Item Number Distribution in each ab-block for ROP19
Figure A.14 Item Number Distribution in each ab-block for ROP20
[Figure A.15. Item Number Distribution in each ab-block for ROP23]
[Figure A.16. Item Number Distribution in each ab-block for ROP24]

[Figures A.17 through A.32 show, for each candidate ROP, paired frequency histograms of the item discrimination (a) and difficulty (b) parameter distributions. Only the captions are reproduced here.]

[Figure A.17. Item Discrimination and Difficulty Parameter Distributions for ROP3]
[Figure A.18. Item Discrimination and Difficulty Parameter Distributions for ROP4]
[Figure A.19. Item Discrimination and Difficulty Parameter Distributions for ROP5]
[Figure A.20. Item Discrimination and Difficulty Parameter Distributions for ROP6]
[Figure A.21. Item Discrimination and Difficulty Parameter Distributions for ROP7]
[Figure A.22. Item Discrimination and Difficulty Parameter Distributions for ROP8]
[Figure A.23. Item Discrimination and Difficulty Parameter Distributions for ROP11]
[Figure A.24. Item Discrimination and Difficulty Parameter Distributions for ROP12]
[Figure A.25. Item Discrimination and Difficulty Parameter Distributions for ROP15]
[Figure A.26. Item Discrimination and Difficulty Parameter Distributions for ROP16]
[Figure A.27. Item Discrimination and Difficulty Parameter Distributions for ROP17]
[Figure A.28. Item Discrimination and Difficulty Parameter Distributions for ROP18]
[Figure A.29. Item Discrimination and Difficulty Parameter Distributions for ROP19]
[Figure A.30. Item Discrimination and Difficulty Parameter Distributions for ROP20]
[Figure A.31. Item Discrimination and Difficulty Parameter Distributions for ROP23]
[Figure A.32. Item Discrimination and Difficulty Parameter Distributions for ROP24]

[Figures A.33 through A.38 plot the proportion of items carrying each non-statistical attribute (C1 through C18), comparing the operational pool (OP) with four candidate ROPs per panel; proportions range from 0 to about 0.7. Only the captions are reproduced here.]

[Figure A.33. Distributions of item attributes for the candidate ROP_1 to ROP_4]
[Figure A.34. Distributions of item attributes for the candidate ROP_5 to ROP_8]
[Figure A.35. Distributions of item attributes for the candidate ROP_9 to ROP_12]
[Figure A.36. Distributions of item attributes for the candidate ROP_13 to ROP_16]
[Figure A.37. Distributions of item attributes for the candidate ROP_17 to ROP_20]
'2" ' .-"‘ \ " ' ---:"‘E. ' o 2 r- . \ O‘ V“ \‘y -N- 0.1 — y \ _ o L Ll L l l l 1 1 l L A - 1 I ‘.~. 1 l l 1 C1 C2 C3 C4 CS C6 C7 C8 C9 C10 C11 C12C13C14 C15 C16 C17 C18 Attribute Figure A.38 Distributions of item attributes for the candidate ROP_21 to ROP_24 0.7 If T I I I I f I I I I I f I I I I + 0P 0.6 *- F. "'0" ROPZ‘I I ’t ..... 9 ..... ROP22 0.5 * ... _ °' ‘.\ "'V'“ ROP23 End ‘ t --+-- ROP24 r CI '\ 3 §O.3~ _ IL . -. i A I: V‘s?»- 0.2 '- ’I’ \\ I. ‘8.” “- 9 ' I .‘E‘: \ I “b”... «<7- “:fi" 0 1 l l I l l l l L I ‘D~'-'=Zr:fi.'r I“?! l 1 1 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12C13C14 C15C16C17C18 Attribute 127 8o :3 So So So So 86 So So :3- 3o cod 8o. 2 mom :3 :3 so So So So So cod :3 cod So cod 3o- :Imom 3o 85 8o 85 go 8.0 So So So 3o 85 So- 85. 218m 85 3d 8d 85 8o 86 So cod 8d :3 So- 85 :.o- 38m .--auo---..m.§ ..... e do..-Masada.-.-a..m---._.oqo..-..e.a..o----._.q.os-.a..o----om..o-.sadiaomd. ........ wing... ..... - 8.? :3 so So So So So So So 85 85 8.? :.o- sumom 3o .86 8d 86 So So So So So 85 :3 86. Ed. 35m 85 So 23 So So So 85 So 8d 85 So 8.? 3o- 35m .--.Nmno:--mqme.---.~.o.o.--.mosm---adssad-.smorgasmad----adde.-.-awassadidmum. ........ «...QO ..... - So. 85 So So So :3 cod :3 So So So cod 26. 35m So 8d :3 so So So So So So So So 85 N3 Nunez 3o 86 8o cod :3 So 85 5o cod 2; :3 Se 25. 35m N3..--.E.m..-.-m°uo..-.mo..m Emma.-.«$530..-.w.o..o.-.-.~.q.os-.3.¢.-..&..m.eamuse”.-.figureseesmw ........ - m 3 n 3 _ m... a me. _- 3- N- 2- m- 8.. as. be? mnNQx 233328 2% 33 KO 8% 3 53m 8de Ngofihzeb _.< 033. 128 mod :3 Nod So 86 So so 86 :3 Se 86 No? 2.? .NN mom mod :8 Ned Nod 85 So So Nod 8o Nod So :3 z? N.NImom mod :3 Nod 85 Nod 86 So So Noe :3 So 8.? N_.? NNImom 3o :3 Nod 85 8d :3 So Nod so Nod So 8.? 2.? Ndom ..-wd..o-.-.w3 ..... _. one.--modsmmd.-.-a.m-.-mo.os--.ad.-.-mq.o-.-dmd---.&.d---.mo..¢..-..$..m. ........ a ~49.” ...... Nod 86 Nod Nod Nod So Nod So 86 So So 8.? 3.? 213m 85 86 So So So 8d :3 So 8o 23 Nod 5.? 2.? Edam 5o Nod Nod So :3 So :3 So :3 Nod 56 3.? ON? :Imom .--moqo ..... _ .3 ..... _. crammed.Ian?-..dessauce--qu.---.a.o-::&d.---.Ndd-.--wmd.-..o._..m. ........ e. 3.0m ...... So Nod 3o 86 So :3 3o Nod Noo So Nod 8.? 8.? flumom Nod Nod Nod 86 86 So so So so Nod 23 5.? 2.? idem 8d Nod Nod Nod Nod So cod :3 Noe Nod 55 No? ON? fldom ...Nwsmm.-.Nm..?-.-moqo..-warm.-.wm..o.-..$..m-.-mono?.3..o.-.-~¢.o-.ammo...-..eou?..-m.m?.s.wwuow..-i-i..._d ........ - N mN N w. _ m... a m? _- m..- N- m.N- m- 8.. so: mam—Ede. u. :5 _.< 2.3 129 :.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 0N0 N_ 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 :.0 0N0 :Imom S0 :.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 0N0 0350 00.0 :.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 0:090 200.0. -20.030500”?-0.0..0..-.0¢.0-..0m..0-..00..0.--.00..0.--.00..0.-.-.0.0...0---gas-0.3 ..................... ...- ........... 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 :.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 3.0 00.0 00.0 N00 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 :.0 -.00..0. --0040..-.0.0q0.--00..0---0.0..0-..00..0.-..0.0..0.--wads-00.0.-.-.0.0..0-.-.wm..0.--.0.0..0 ..................... ..- ........... 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 00.0 2.0 10000. .-Nm0..-....Nu0..-0.N...0...0440-.4.00.0.-..00..0.-..00..0;;00..0.-.-an0..-.0.N...0--.mw..0 ................................... 0 0.N N 0.0 _ 0.0 0 0.0- _- 0.? 
[Table A.2. Conditional results for the OP and the candidate ROPs at ability levels from -3 to 3 in steps of 0.5. The rotated table is not legible in the source, so its values are not reproduced here.]

[Table A.3. Conditional results for the OP and the candidate ROPs at ability levels from -3 to 3 in steps of 0.5. The rotated table is not legible in the source, so its values are not reproduced here.]

REFERENCES

Ariel, A., Veldkamp, B. P., & van der Linden, W. J. (2004). Constructing rotating item pools for constrained adaptive testing. Journal of Educational Measurement, 41, 345-359.
Belov, D. I., & Armstrong, R. D. (2009). Direct and inverse problems of item pool design for computerized adaptive testing. Educational and Psychological Measurement, 69(4), 544-547.

Binet, A., & Simon, Th. A. (1905). Méthode nouvelle pour le diagnostic du niveau intellectuel des anormaux. L'Année Psychologique, 11, 191-244.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (Chapters 17-20). Reading, MA: Addison-Wesley.

Bock, R. D., & Mislevy, R. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431-444.

Boekkooi-Timminga, E. (1990). A method for designing IRT-based item banks (Research Rep. No. 90-7). The Netherlands: University of Twente, Department of Education.

Buyske, S. (2005). Optimal design in educational testing. In M. P. F. Berger & W. K. Wong (Eds.), Applied optimal designs (pp. 1-19). West Sussex, UK: Wiley.

Chang, H. (2007). Book review: Linear models for optimal test design. Psychometrika, 72, 279-281.

Chang, H., & Ying, Z. (1999). a-Stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23, 211-222.

Chang, H., & Ying, Z. (2009). Nonlinear sequential designs for logistic item response theory models with applications to computerized adaptive tests. The Annals of Statistics, 37, 1466-1488.

Chang, H., Qian, J., & Ying, Z. (2001). a-Stratified multistage computerized adaptive testing with b blocking. Applied Psychological Measurement, 25, 333-341.

Chang, S., & Ansley, T. N. (2003). A comparative study of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 40(1), 71-103.

Chang, S., & Twu, B. Y. (1998, September). A comparative study of item exposure control methods in computerized adaptive testing (ACT Research Report Series, ACT-RR-98-3). Iowa City, IA: ACT.

Chen, S., & Ankenmann, R. D. (2004). Effects of practical constraints on item selection rules at the early stages of computerized adaptive testing. Journal of Educational Measurement, 41(2), 149-174.

Chen, S. Y., Ankenmann, R. D., & Spray, J. A. (2003). The relationship between item exposure and test overlap in computerized adaptive testing. Journal of Educational Measurement, 40, 129-145.

Cheng, Y., & Chang, H. (2008). A new heuristic for parallel form assembly based on information curve matching. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Cheng, Y., & Chang, H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62, 369-383.

Cheng, Y., Chang, H., & Yi, Q. (2007). Two-phase item selection procedure for flexible content balancing in CAT. Applied Psychological Measurement, 31, 467-482.

Davey, T., & Nering, M. (2002). Controlling item exposure and maintaining item security. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing: Building the foundation for future assessments (pp. 165-191). Mahwah, NJ: Erlbaum.

Davey, T., & Parshall, C. G. (1995). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Dodd, B. G., De Ayala, R. J., & Koch, W. R. (1995). Computerized adaptive testing with polytomous items. Applied Psychological Measurement, 19, 5-22.

Eggen, T. J. H. M., & Straetmans, G. J. J. M. (2000). Computerized adaptive testing for classifying examinees into three categories. Educational and Psychological Measurement, 60, 713-734.

Eignor, D. R., Way, W. D., Stocking, M. L., & Steffen, M. (1993). Case studies in computer adaptive test design through simulation (Research Rep. No. 93-56). Princeton, NJ: Educational Testing Service.

Embretson, S. E. (2001). Generating abstract reasoning items with cognitive theory. In S. H. Irvine & P. C. Kyllonen (Eds.), Item generation for test development. Mahwah, NJ: Erlbaum.

Flaugher, R. (2000). Item pools. In H. Wainer (Ed.), Computerized adaptive testing: A primer (pp. 37-59). Mahwah, NJ: Lawrence Erlbaum.

French, B., & Thompson, T. (2003, April). The evaluation of exposure control procedures for an operational CAT. Poster presented at the annual meeting of the American Educational Research Association (AERA), Chicago, IL.

Georgiadou, E., Triantafillou, E., & Economides, A. A. (2007). A review of item exposure control strategies for computerized adaptive testing from 1983 to 2005. The Journal of Technology, Learning, and Assessment, 5(8). Retrieved from http://escholarship.bc.edu/cgi/viewcontent.cgi?article=1093&context=jtla

Gorin, J. S., Dodd, B. G., Fitzpatrick, S. J., & Shieh, Y. (2005). Computerized adaptive testing with the partial credit model: Estimation procedure, population distributions, and item pool characteristics. Applied Psychological Measurement, 29(6), 433-456.

Gu, L. (2007). Designing optimal item pools for computerized adaptive tests with exposure controls. Unpublished doctoral dissertation, Michigan State University.

Hambleton, R. K., & Jones, R. W. (1994). Item parameter estimation errors and their influence on test information functions. Applied Measurement in Education, 7, 171-186.

Hau, K. T., & Chang, H. (2001). Item selection in computerized adaptive testing: Should more discriminating items be used first? Journal of Educational Measurement, 38(3), 249-266.

Hetter, R., & Sympson, B. (1997). Item exposure control in CAT-ASVAB. In W. Sands, B. Waters, & J. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington, DC: American Psychological Association.

Jensema, C. J. (1972). An application of latent trait mental test theory to the Washington Pre-College Testing Program. Unpublished doctoral dissertation, University of Washington.

Jensema, C. J. (1977). Bayesian tailored testing and the influence of item bank characteristics. Applied Psychological Measurement, 1, 111-120.

Kingsbury, G. G., & Zara, A. R. (1991). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359-375.

Leung, C. K., Chang, H., & Hau, K. T. (2003a). Incorporation of content balancing requirements in stratification designs for computerized adaptive testing. Educational and Psychological Measurement, 63, 257-270.

Leung, C. K., Chang, H., & Hau, K. T. (2003b). Computerized adaptive testing: A comparison of three content balancing methods. The Journal of Technology, Learning, and Assessment, 2(5).

Lord, F. M. (1970). Some test theory for tailored testing. In W. H. Holtzman (Ed.), Computer-assisted instruction, testing, and guidance (pp. 139-183). New York: Harper & Row.

Lord, F. M. (1971a). Robbins-Monro procedures for tailored testing. Educational and Psychological Measurement, 31, 3-31.
Lord, F. M. (1971b). The self-scoring flexilevel test. Journal of Educational Measurement, 8, 147-151.

Lord, F. M. (1977). A broad-range tailored test of verbal ability. Applied Psychological Measurement, 1, 95-100.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Luecht, R. M. (1998). A framework for exploring and controlling risks associated with test item exposure over time. Paper presented at the annual meeting of the National Council on Measurement in Education.

McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New horizons in testing (pp. 223-236). New York, NY: Academic Press.

McBride, J. R., & Weiss, D. J. (1976). Some properties of a Bayesian adaptive ability testing strategy (Research Rep. No. 76-1). Minneapolis, MN: University of Minnesota, Psychometric Methods Program, Department of Psychology.

Nering, M. L., Davey, T., & Thompson, T. (1998, June). A hybrid method for controlling item exposure in computerized adaptive testing. Paper presented at the annual meeting of the Psychometric Society, Urbana, IL.

Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351-356.

Parshall, C., Davey, T., & Nering, M. (1998). Test development exposure control for adaptive tests. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Patsula, L. N., & Steffen, M. (1997). Maintaining item and test security in a CAT environment: A simulation study. Paper presented at the annual meeting of the National Council on Measurement in Education.

Reckase, M. D. (1976). The effect of item pool characteristics on the operation of a tailored testing procedure. Paper presented at the spring meeting of the Psychometric Society, Murray Hill, NJ.

Reckase, M. D. (2003). Item pool design for computerized adaptive tests. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Reckase, M. D., & He, W. (2004). The ideal item pool for the NCLEX-RN examination: Report to NCSBN. East Lansing, MI: Michigan State University.

Reckase, M. D., & He, W. (2005). Ideal item pool design for the NCLEX-RN exam. East Lansing, MI: Michigan State University.

Reckase, M. D., & He, W. (2009a). Optimal item pool design for the 2009 NCLEX exam: Report to the National Council of State Boards of Nursing (NCSBN). East Lansing, MI: Michigan State University.

Reckase, M. D., & He, W. (2009b). The influence of item pool quality on the functioning of computerized adaptive tests. Paper presented at the annual meeting of the Psychometric Society, Cambridge, UK.

Revuelta, J., & Ponsoda, V. (1998). A comparison of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 34, 311-327.

Robin, F., van der Linden, W. J., Eignor, D. R., Steffen, M., & Stocking, M. L. (2005). A comparison of two procedures for constrained adaptive test construction (ETS Research Rep. No. RR-04-39). Princeton, NJ: Educational Testing Service.

Shin, C., Chien, Y., Way, W. D., & Swanson, L. (2009). Weighted penalty model for content balancing in CATs. Pearson. Retrieved from http://www.pearsonedmeasurement.com/downloads/research/Weighted%20Penalty%20Model.pdf

Stocking, M. L. (1994). Three practical issues for modern adaptive testing item pools (ETS Research Rep. No. 94-05). Princeton, NJ: Educational Testing Service.
Stocking, M. L., & Lewis, C. (1995). A new method of controlling item exposure in computerized adaptive testing (Research Rep. No. 95-25). Princeton, NJ: Educational Testing Service.

Stocking, M. L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23(1), 57-75.

Stocking, M. L., & Lewis, C. (2000). Methods of controlling the exposure of items in CAT. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 163-182). The Netherlands: Kluwer Academic Publishers.

Stocking, M. L., & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17, 277-292.

Stocking, M. L., & Swanson, L. (1998). Optimal design of item banks for computerized adaptive tests. Applied Psychological Measurement, 22, 271-279.

Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. Paper presented at the 27th Annual Meeting of the Military Testing Association, San Diego, CA.

Urry, V. W. (1977). Tailored testing: A successful application of latent trait theory. Journal of Educational Measurement, 14(2), 181-196.

van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22, 195-211.

van der Linden, W. J. (2000). Constrained adaptive testing with shadow tests. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 27-52). Boston: Kluwer.

van der Linden, W. J. (2005). Linear models for optimal test design. New York: Springer.

van der Linden, W. J. (2005a). A comparison of item-selection methods for adaptive tests with content constraints. Journal of Educational Measurement, 42, 283-302.

van der Linden, W. J., & Chang, H. (2005b). Implementing content constraints in alpha-stratified adaptive testing using a shadow test approach (Computerized Testing Report 01-09). Law School Admission Council. Retrieved from http://www.lsacnet.org/Research/ct/implementing-content-constraints-in-alpha-stratified-adaptive-testing-using-shadow-test-approach.pdf

van der Linden, W. J., & Glas, C. A. W. (2000). Capitalization on item calibration error in adaptive testing. Applied Measurement in Education, 13(1), 35-53.

van der Linden, W. J., & Pashley, P. J. (2000). Item selection and ability estimation in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 1-25). Boston: Kluwer.

van der Linden, W. J., & Reese, L. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22(3), 259-270.

van der Linden, W. J., Ariel, A., & Veldkamp, B. P. (2006). Assembling a CAT item pool as a set of linear tests. Journal of Educational and Behavioral Statistics, 31, 81-99.

van der Linden, W. J., Veldkamp, B. P., & Reese, L. M. (2000). An integer programming approach to item bank design. Applied Psychological Measurement, 24(2), 139-150.

Veldkamp, B. P., & van der Linden, W. J. (2000). Designing item pools for computerized adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 149-162). The Netherlands: Kluwer Academic Publishers.

Wainer, H. (2000). Rescuing computerized adaptive testing by breaking Zipf's law. Journal of Educational and Behavioral Statistics, 25, 203-224.
Wang, T., & Kolen, M. J. (2001). Evaluating comparability in computerized adaptive testing: Issues, criteria and example. Journal of Educational Measurement, 38(1), 19-49.

Wang, T., & Vispoel, W. P. (1998). Properties of ability estimation methods in computerized adaptive testing. Journal of Educational Measurement, 35(2), 109-135.

Way, W. D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17, 17-27.

Way, W. D., Steffen, M., & Anderson, G. S. (1998). Developing, maintaining, and renewing the item inventory to support computer-based testing. Paper presented at the colloquium Computer-Based Testing: Building the Foundation for Future Assessments, Philadelphia, PA.

Way, W. D., Swanson, L., Steffen, M., & Stocking, M. L. (2001). Refining a system for computerized adaptive testing pool creation (Research Rep. No. 01-18). Princeton, NJ: Educational Testing Service.

Weiss, D. J. (1973). The stratified adaptive computerized ability test (Research Rep. No. 73-3). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.

Weiss, D. J. (1976). Adaptive testing research at Minnesota: Overview, recent results, and future directions. In C. L. Clark (Ed.), Proceedings of the First Conference on Computerized Adaptive Testing (pp. 24-35). Washington, DC: United States Civil Service Commission.

Weiss, D. J. (1978). Proceedings of the 1977 Computerized Adaptive Testing Conference. Minneapolis: University of Minnesota, Department of Psychology, Computerized Adaptive Testing Laboratory.

Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.

Wise, S. L., & Kingsbury, G. G. (2000). Practical issues in developing and maintaining a computerized adaptive testing program. Psicologica, 21, 135-155. Retrieved from http://www.uv.es/psicologica/articulosdfiyz.OO/wise.pdf

Xing, D., & Hambleton, R. K. (2004). Impacts of test design, item quality, and item bank size on the psychometric properties of computer-based credentialing examinations. Educational and Psychological Measurement, 64(1), 5-21.