DIAGNOSTIC TOOLS FOR IMPROVING THE AMOUNT OF ADAPTATION IN ADAPTIVE TESTS USING OVERALL AND CONDITIONAL INDICES OF ADAPTATION

By Unhee Ju

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods Doctor of Philosophy 2019

ABSTRACT

DIAGNOSTIC TOOLS FOR IMPROVING THE AMOUNT OF ADAPTATION IN ADAPTIVE TESTS USING OVERALL AND CONDITIONAL INDICES OF ADAPTATION

By Unhee Ju

In recent years, computerized adaptive testing (CAT) has been widely used in educational and clinical settings. The basic idea of CAT is relatively straightforward: a computer is used to administer items tailored to individuals to maximize the measurement precision of their proficiency estimates. However, the administration of CAT is not so simple. Those who administer CATs must, while trying to optimize an item selection criterion, consider a variety of practical issues such as test security, content balancing, the purpose of testing, and other test specifications. Such extraneous factors make it possible that a CAT might have so many constraints that in practice it is barely adaptive at all. This concern is at the forefront of the current study, which poses two key questions: How adaptive is a highly adaptive test really? How can the level of adaptation be improved?

This study aims to develop three new statistical indicators to measure the amount of adaptation conditional on examinees' proficiency levels, and to evaluate the feasibility and utility of these adaptation measures in helping to diagnose and improve the adaptivity that occurs during the CAT administration. Extending work done by Reckase, Ju, and Kim (2018), the proposed measures are based on three components: the differences between the locations of the selected items and examinees' proficiency levels, the variation in the locations of the items administered to each examinee, and the magnitude of information that the test presents to each examinee. Hence, they can be used to assess adaptivity during the CAT process, as well as to identify differences in the level of adaptation for individuals or subgroups of examinees.

To demonstrate the performance of the proposed adaptation indices, this study conducted analyses of real operational testing data from a healthcare licensure examination, as well as comprehensive simulation studies under various conditions that affect adaptivity in a CAT. The key findings of the study suggest that the proposed adaptation indices are likely to function as intended to sensitively detect the magnitude of adaptivity for a CAT over the proficiency continuum. These new measures shed light on how much adaptation of a given test occurs across individual proficiency levels or subpopulations. With some guidelines for the interpretation of these measures recommended in this study, the adaptation indices can also readily serve as diagnostic tools in practice for helping test practitioners design item pools and adaptive tests that support high adaptivity.

Copyright by UNHEE JU 2019

ACKNOWLEDGEMENTS

I would like to express deep gratitude to my advisor and committee chair, Dr. Mark D. Reckase. The support, guidance, and encouragement that he has provided throughout my doctoral training years have been priceless. He has given me numerous opportunities to conduct research with him, showing me that I could enjoy doing research with self-motivation. A passionate scholar and wise educator, he has been a great role model to me. I would not have come this far without his tremendous academic support and emotional encouragement.
For their support and invaluable comments, I also sincerely appreciate my committee members, Dr. Kimberly Kelly, Dr. Richard Houang, and Dr. Christopher Nye. I thank the National Council of State Boards of Nursing (NCSBN), especially Dr. Qian Hong, for allowing me access to the operational data used for this dissertation study. Also, I am deeply thankful to Dr. Carl F. Falk for sharing his knowledge and research experience, as well as to Dr. Eunsoo Cho for providing financial support over the last two years and research opportunities in applied research areas. I also want to thank my advisor in South Korea, Dr. Eunlim Chi, who first sparked my interest in Educational Measurement (Psychometrics) and took me under her wing until the end of my PhD journey. My special thanks go out to Nancy Duchesneau and William Sullivan for reading over my dissertation, and to my close colleagues and friends at Michigan State University who have helped me stay steady and who stood by me throughout the six years, especially Jiahui Zhang, Jihyun Park, Ajin Lee, and Susie: thank you for your friendship, for cheering me on, and for being with me through every important stage of this journey.

Most importantly, I dedicate this dissertation to my family. No words can fully express my heartfelt gratitude and appreciation to my mom and dad, Taesook Kang and Yeonghwan Ju, and my brother, Bongseop, for their unconditional love, patience, confidence, and belief in me. None of this would have been possible without your love, support, and encouragement.

TABLE OF CONTENTS

LIST OF TABLES ..... ix
LIST OF FIGURES ..... xi
CHAPTER 1. INTRODUCTION ..... 1
1.1 Background ..... 1
1.2 Research Questions ..... 4
CHAPTER 2. LITERATURE REVIEW ..... 6
2.1 Item Response Theory ..... 6
2.1.1 Rasch (1PL) model ..... 7
2.1.2 2PL model ..... 7
2.1.3 3PL model ..... 7
2.1.4 Information function for dichotomous IRT models ..... 8
2.2 Computerized Adaptive Testing ..... 9
2.2.1 Item pool ..... 10
2.2.2 Item selection procedure ..... 12
2.2.3 Scoring procedure ..... 18
2.2.4 Stopping rules ..... 20
2.2.5 Adaptive test designs ..... 20
2.3 Factors Affecting Adaptation ..... 21
CHAPTER 3. INDICES FOR THE AMOUNT OF ADAPTATION ..... 25
3.1 Existing Measures of the Amount of Adaptation ..... 27
3.1.1 Correlation index ..... 27
3.1.2 Ratio of standard deviations index ..... 28
3.1.3 Proportion of reduction in variance index ..... 29
3.1.4 Percent of optimal information index ..... 30
3.2 New Conditional Measures of the Amount of Adaptation ..... 31
3.2.1 Deviation of difficulty index ..... 31
3.2.2 Conditional proportion of reduction in variance index ..... 33
3.2.3 Ratio of information index ..... 33
CHAPTER 4. METHODS ..... 37
4.1 Common CAT Specifications ..... 37
4.2 Research Question 1 ..... 38
4.2.1 Item pool ..... 38
4.2.2 Simulation design ..... 39
4.2.3 Evaluation criteria ..... 44
4.3 Research Question 2 ..... 45
4.3.1 Item pool ..... 46
4.3.2 Simulation procedure ..... 47
4.3.3 Evaluation criteria ..... 48
4.4 Research Question 3 ..... 48
4.4.1 Simulation design ..... 48
4.4.2 Evaluation criteria ..... 54
4.5 Research Question 4 ..... 55
4.5.1 Item pool ..... 55
4.5.2 Test design ..... 55
4.5.3 Evaluation criteria ..... 58
4.6 Research Question 5 ..... 59
4.6.1 CAT specifications for the NCLEX-RN exam ..... 59
4.6.2 Item pool ..... 61
4.6.3 Evaluation criteria ..... 63
CHAPTER 5. RESULTS ..... 64
5.1 Research Question 1 ..... 64
5.1.1 Variation in item pool size ..... 64
5.1.2 Variation in item pool spread ..... 86
5.2 Research Question 2 ..... 107
5.2.1 Baseline for the CATs ..... 107
5.2.2 Region 1 ..... 109
5.2.3 Region 2 ..... 111
5.3 Research Question 3 ..... 114
5.3.1 Measurement accuracy and precision ..... 114
5.3.2 Amount of adaptation ..... 117
5.3.3 Test security ..... 121
5.4 Research Question 4 ..... 123
5.4.1 Measurement accuracy and precision ..... 123
5.4.2 Amount of adaptation ..... 125
5.5 Research Question 5 ..... 128
5.5.1 Conditional adaptivity ..... 129
5.5.2 Overall adaptivity ..... 130
CHAPTER 6. CONCLUSION AND DISCUSSION ..... 132
6.1 Summary of Findings ..... 132
6.2 Practical Utility of Conditional Adaptation Indices ..... 138
6.2.1 Diagnostic tools for improving adaptivity ..... 138
6.2.2 Use of conditional adaptation indices in automated test assembly ..... 140
6.3 Alternative Ways to Define Conditional Adaptation Indices ..... 141
6.4 Implications ..... 143
6.5 Limitation and Future Research ..... 145
APPENDIX ..... 149
REFERENCES ..... 158

LIST OF TABLES

Table 4.1 Descriptive Statistics and Zero-Order Correlations of Item Parameters for the Item Pool from Minnesota Comprehensive Assessment (MCA) Grade 6 Mathematics Adaptive Test (n = 635) ..... 39
Table 4.2 Descriptive Statistics of Generated Item Pools by Item Pool Size ..... 42
Table 4.3 Descriptive Statistics of Generated Item Pools by Item Pool Spread (n = 400) ..... 43
Table 4.4 Item Distributions for Item Pools Considered in Research Question 3 ..... 52
Table 4.5 Descriptive Statistics of b-Parameters by Stage for Each MST Design ..... 57
Table 4.6 Content Distribution of the First 60 Items for the NCLEX-RN in 2016 ..... 61
Table 4.7 Descriptive Statistics of b-Parameters for the NCLEX-RN Item Pool ..... 62
Table 5.1 Overall Statistics of Measurement Precision of Proficiency Estimates for a Rasch-based CAT by Item Pool Size and Proficiency Estimator ..... 66
Table 5.2 Overall Adaptation Statistics for a Rasch-based CAT by Item Pool Size and Proficiency Estimator ..... 75
Table 5.3 Overall Statistics of Measurement Precision of Proficiency Estimates for a 3PL-based CAT by Item Pool Size and Proficiency Estimator ..... 78
Table 5.4 Overall Adaptation Statistics for a 3PL-based CAT by Item Pool Size and Proficiency Estimator ..... 85
Table 5.5 Overall Statistics of Measurement Precision of Proficiency Estimates for a Rasch-based CAT by Item Pool Spread and Proficiency Estimator ..... 89
Table 5.6 Overall Adaptation Statistics for a Rasch-based CAT by Item Pool Spread and Proficiency Estimator ..... 96
Table 5.7 Overall Statistics of Measurement Precision of Proficiency Estimates for a 3PL-based CAT by Item Pool Spread and Proficiency Estimator ..... 99
Table 5.8 Overall Adaptation Statistics for a 3PL-based CAT by Item Pool Spread and Proficiency Estimator ..... 106
Table 5.9 Overall Statistics of Measurement Precision of Proficiency Estimates for the 3PL-based 40-item CAT by Exposure Control Procedure and Item Pool Distribution ..... 117
Table 5.10 Overall Adaptation Statistics for a 3PL-based 40-item CAT by Exposure Control Procedure and Item Pool Distribution ..... 121
Table 5.11 Overall Statistics of Measurement Precision of Proficiency Estimates for the 3PL-based 40-Item Adaptive Test by Test Design and Item Pool Distribution ..... 125
Table 5.12 Overall Adaptation Statistics for a 3PL-based 40-item CAT by Exposure Control Procedure and Item Pool Distribution ..... 128
Table 5.13 Overall Adaptation Statistics for a Rasch-Based Variable-Length CAT for an Operational NCLEX-RN Test ..... 131
Table 6.1 Benchmark Values of Conditional and Overall Adaptivity Indices by IRT Models and Proficiency Estimators ..... 133

LIST OF FIGURES

Figure 4.1. Item distribution for the master pool (N = 3,000). ..... 47
Figure 4.2. Number of items needed in the ideal item pool for a 3PL-based CAT of 40 items. ..... 50
Figure 4.3. Distribution of b-parameters for the regular and optimal item pools. ..... 51
Figure 4.4. Distribution of exposure control parameters for the Sympson-Hetter procedure for the regular item pool (left) and the optimal item pool (right) of 300 items. ..... 53
Figure 4.5. A 1-2-3 three-stage MST design used in the study. ..... 56
Figure 4.6. Information function by each path for the 10-10-20 MST using the regular item pool and the optimal item pool. ..... 58
Figure 4.7. Information function by content strand for the NCLEX-RN item pool. ..... 63
Figure 5.1. Conditional bias, TSEM, and RMSE of proficiency estimates for a Rasch-based CAT by item pool size and proficiency estimator. ..... 67
Figure 5.2. Conditional adaptivity statistics (DOD, CPRV, and ROI) for a Rasch-based CAT by item pool size and proficiency estimator. ..... 72
Figure 5.3. Plot of a POI index for a Rasch-based CAT by item pool size and proficiency estimator. ..... 73
Figure 5.4. Relationship of TSEM with conditional adaptivity indices (DOD, CPRV, and ROI) for a Rasch-based CAT by item pool size and proficiency estimator. ..... 73
Figure 5.5. Conditional bias, TSEM, and RMSE of proficiency estimates for a 3PL-based CAT by item pool size and proficiency estimator. ..... 77
Figure 5.6. Conditional adaptivity statistics (DOD, CPRV, and ROI) for a 3PL-based CAT by item pool size and proficiency estimator. ..... 82
Figure 5.7. Plot of a POI index for a 3PL-based CAT by item pool size and proficiency estimator. ..... 83
Figure 5.8. Relationship of TSEM with conditional adaptivity indices (DOD, CPRV, and ROI) for a 3PL-based CAT by item pool size and proficiency estimator. ..... 83
Figure 5.9. Conditional bias, TSEM, and RMSE of proficiency estimates for a Rasch-based CAT by item pool spread and proficiency estimator. ..... 88
Figure 5.10. Conditional adaptivity statistics (DOD, CPRV, and ROI) for a Rasch-based CAT by item pool spread and proficiency estimator. ..... 93
Figure 5.11. Plot of a POI index for a Rasch-based CAT by item pool spread and proficiency estimator. ..... 94
Figure 5.12. Relationship of TSEM with conditional adaptivity indices for a Rasch-based CAT by item pool spread and proficiency estimator. ..... 94
Figure 5.13. Conditional bias, TSEM, and RMSE of proficiency estimates for a 3PL-based CAT by item pool spread and proficiency estimator. ..... 98
Figure 5.14. Conditional adaptivity statistics (DOD, CPRV, and ROI) for a 3PL-based CAT by item pool spread and proficiency estimator. ..... 103
Figure 5.15. Plot of a POI index for a 3PL-based CAT by item pool spread and proficiency estimator. ..... 104
Figure 5.16. Relationship of TSEM with conditional adaptivity indices (DOD, CPRV, and ROI) for a 3PL-based CAT by item pool spread and proficiency estimator. ..... 104
Figure 5.17. A plot of conditional adaptivity indices over the proficiency continuum for the CAT using the 300-item pool (baseline). ..... 108
Figure 5.18. A plot of bias, TSEM, and RMSE over the proficiency continuum for the CAT using the 300-item pool (baseline). ..... 109
Figure 5.19. Distributions of conditional adaptivity indices by number of items added at Region 1. ..... 110
Figure 5.20. Distributions of statistics for measurement accuracy and precision by number of items added at Region 1. ..... 111
Figure 5.21. Distributions of conditional adaptivity indices by number of items added at Region 2. ..... 112
Figure 5.22. Distributions of statistics for measurement accuracy and precision by number of items added at Region 2. ..... 113
Figure 5.23. Conditional bias, TSEM, and RMSE of proficiency estimates for the 3PL-based 40-item CAT by exposure control procedure and item pool distribution. ..... 116
Figure 5.24. Conditional adaptivity statistics (DOD, CPRV, and ROI) for a 3PL-based 40-item CAT by exposure control procedure and item pool distribution. ..... 120
Figure 5.25. Exposure rate distribution of 300 items ordered by b-parameter (top) and exposure rate (bottom) for a 3PL-based 40-item CAT by exposure control procedure and item pool distribution. ..... 122
Figure 5.26. Conditional bias, TSEM, and RMSE of proficiency estimates for a 3PL-based adaptive test by test design and item pool distribution. ..... 124
Figure 5.27. Conditional adaptivity statistics (DOD, CPRV, and ROI) for a 3PL-based 40-item adaptive test by exposure control procedure and item pool distribution. ..... 127
Figure 5.28. Conditional adaptivity statistics (DOD, CPRV, and ROI) for a Rasch-based variable-length CAT for an operational NCLEX-RN test. ..... 130
Figure A.1. Relationship of RMSE with conditional adaptivity indices (DOD, CPRV, and ROI) for a Rasch-based CAT by item pool size and proficiency estimator ..... 150
Figure A.2. Relationship of RMSE with conditional adaptivity indices (DOD, CPRV, and ROI) for a 3PL-based CAT by item pool size and proficiency estimator ..... 152
Figure A.3. Relationship of RMSE with conditional adaptivity indices (DOD, CPRV, and ROI) for a Rasch-based CAT by item pool spread and proficiency estimator ..... 154
Figure A.4. Relationship of RMSE with conditional adaptivity indices (DOD, CPRV, and ROI) for a 3PL-based CAT by item pool spread and proficiency estimator ..... 156

CHAPTER 1. INTRODUCTION

1.1 Background

Computerized adaptive testing (CAT) has been used in a wide range of settings. These include licensure and certification examinations (e.g., the National Council Licensure Examination [NCSBN, 2016]), admissions tests (e.g., the Graduate Record Examinations®, the Graduate Management Admission Test), achievement assessments within statewide educational systems (e.g., Minnesota [Minnesota Department of Education, 2017]), clinical settings that assess psychological or health-related outcomes (e.g., anxiety and depression CATs [Walter, 2010]), and still others.
The popularity of CAT is attributed to its merits of efficient testing and high measurement precision of proficiency estimates. As a CAT is implemented, it selects, administers, and scores items tailored to each individual, based on optimizing criteria such as maximizing the Fisher information at the current proficiency estimate. The basic idea of CAT is relatively straightforward. However, numerous practical challenges to the deployment of CAT have persisted. These concern the design, implementation, and maintenance of a CAT program with respect to development and maintenance of the item pool, test administration (e.g., item selection, scoring, and termination procedures), test security, and examinee issues. The success of a CAT program depends on how well these practical concerns are addressed (see Wise & Kingsbury, 2000, for details).

Measurement professionals have resolved a number of these issues in CAT using alternative options. For instance, a variety of appropriate constraints are imposed on item selection to conform to test specifications (e.g., Kingsbury & Zara, 1989; van der Linden & Reese, 1998) and item exposure control requirements (e.g., Chang, Qian, & Ying, 2001; Sympson & Hetter, 1985). Some CATs adapt at the testlet level to incorporate grouped items associated with a common stimulus (e.g., Wainer & Kiely, 1987). Multistage testing (MST), a special version of CAT, adapts at the stage level using pre-constructed modules, allowing review of psychometric and content properties and more efficient handling of complex test constraints (e.g., Yan, von Davier, & Lewis, 2016). In addition, CATs differ in their item pool design, their stopping rule, and their estimation procedures. All of these features have an influence on an operational CAT program.

Although many of these designs and variations in the implementation of CATs are available, it does not follow that every administered CAT would be equally adaptive for every examinee. An administered CAT may not be very adaptive if it imposes too many constraints on item selection for the purpose of strong exposure control and strict content balancing while using a small item pool with limited spread in item difficulty. A severely constrained CAT may lead to all examinees getting almost the same test, making it nearly the same as a paper-and-pencil test. If a CAT has a relatively large item pool, however, without any constraints on item selection, examinees may receive an optimal set of items customized to their proficiency levels during the test, showing a high level of adaptation. Another issue for consideration is test fairness for examinees. Sometimes, a testing program uses multiple item pools for test administration. In this situation, some examinees receive items that match their proficiency levels well because they come from a pool of high-quality items, while others could take items less well adapted to their abilities because of differences in the characteristics of the item pools assembled from the master item pool.

Spurred by such concerns, Reckase and colleagues (Reckase, Ju, & Kim, 2018) proposed three adaptation indices to quantify the amount of adaptation that occurs in CAT based on the variance of the difficulty parameters for the items administered to the examinees. While these measures are useful and work fairly well (e.g., Reckase, Ju, & Kim, 2017, 2019), they are limited to the evaluation of the adaptivity of tests over the entire group of examinees, rather than for individuals or subgroups.
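For concreteness, the following Python sketch computes group-level summaries in the spirit of the indices named in Section 3.1 (the correlation, ratio of standard deviations, and proportion of reduction in variance indices). The exact definitions used by Reckase, Ju, and Kim (2018) are given in Chapter 3; the formulas below are only one plausible operationalization for illustration, and the array names (theta_hat, admin_b, pool_b) are hypothetical.

```python
import numpy as np

def overall_adaptation_summaries(theta_hat, admin_b, pool_b):
    """Group-level adaptation summaries (illustrative forms only).

    theta_hat : (n_examinees,) final proficiency estimates
    admin_b   : (n_examinees, test_length) difficulties of the items each examinee received
    pool_b    : (n_pool,) difficulties of all items in the pool
    """
    mean_b = admin_b.mean(axis=1)                            # mean difficulty seen by each examinee
    corr = np.corrcoef(theta_hat, mean_b)[0, 1]              # correlation-type index
    sd_ratio = mean_b.std(ddof=1) / theta_hat.std(ddof=1)    # ratio-of-SDs-type index
    # PRV-type index: reduction of the within-examinee difficulty variance
    # relative to the variance of the difficulties in the whole pool.
    prv = 1.0 - admin_b.var(axis=1, ddof=1).mean() / pool_b.var(ddof=1)
    return {"correlation": corr, "sd_ratio": sd_ratio, "prv": prv}
```

In a highly adaptive test the administered difficulties track the proficiency estimates closely, so the correlation and ratio summaries approach 1 and the within-examinee difficulty variance is small relative to the pool variance; a heavily constrained test pushes all three values toward 0.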
The measures give us an overall diagnosis of the adaptivity of administered tests but provide no specific information about the degree of adaptivity at each examinee's proficiency level. The latter information would be useful for modifying test designs or the quality of the item pool to reach the optimal adaptation desired for the testing program.

In addition, the overall indices introduced by Reckase et al. (2018) focus on how appropriately the items administered to examinees are customized to their final proficiency estimates, while item selection is driven by interim proficiency estimates. The adaptation indices thus cannot reflect the quality of adaptation during the intermediate stages of CAT, because the final proficiency estimate is not known until the end of the test administration. In other words, these adaptation measures conduct a post hoc evaluation of whether the items presented in the test match examinees' final proficiency levels, but they are blind to whether the test provides items that match their momentary proficiency estimates during the CAT process. In response to this perceived necessity, Kingsbury and Wise (2018) suggested a new measure of adaptation based on item response theory (IRT) test information. Although this index is informative, it fails to take into account the alignment of, and variation in, difficulty for the items administered to each examinee, and it also focuses on the actual test information evaluated at the final proficiency estimates. Moreover, as found in Reckase et al. (2018, 2019), a single index may be insufficient to capture all the relevant information about the magnitude of adaptation that happens during a CAT, because adaptivity is intertwined with item pools, item selection algorithms, proficiency estimators, and other test specifications.

1.2 Research Questions

To address practical needs and the gap in the CAT literature, this study proposes new statistical indicators to examine the level of adaptation using (1) the locations of the items (item difficulty or the location of maximum information) administered to each examinee, (2) their variances, and (3) their IRT information. The study then explores the capabilities of these three new adaptation measures as tools for understanding how well adaptivity occurs in applications of CATs, and it demonstrates the practical utility of these indices using real operational data from a licensure and certification examination. Consequently, the new class of adaptivity indices introduced here can help measurement professionals and test developers understand the adaptation costs associated with item pool designs, test designs, constraints on item selection, and so forth. Overall, this study is guided by five research questions:

1. How sensitive are the conditional adaptation indices to changing characteristics of the item pool, proficiency estimators, and IRT models?
2. For a given population of examinees, test specifications, and an item pool, how can the conditional adaptation measures be used to revise the item pool in such a way that a CAT works better?
3. Do the conditional adaptation indices capture the varying degree of adaptivity resulting from constraints imposed on item selection for exposure control?
4. Can the conditional adaptation indices be used to gauge the amount of adaptation incurred by adaptive test designs (fully adaptive tests vs. multistage adaptive tests)?
The first four research questions are answered through comprehensive simulation studies, and the last research question is demonstrated using operational variable - length CAT data. The next chapter reviews the features of the item response theory (IRT) mode ls, the components of CAT , and factors that possibly affect the amount of adaptation. Chapter 3 describes indices to measure the amount of adaptation, followed by a chapter that gives details about simulation designs and real operational data of the CA T pr ogram. Finally, the last two chapters ( Chapter 4 and Chapter 5 ) present the findings for the performance o f the proposed indices and discuss how the new set of adaptation measures can be efficiently and directly utilized for comparing the quality of adapti vity at individuals or subpopulations and for improving the amount of adaptation by revising the item poo ls or the test designs and specifications . 6 CHAPTER 2. LITERATURE REVIEW This literature review chapter consists of three main sections . The first section explores the characteristics of item response theory (IRT), which is the fundamental basis of computerized adaptive testing (CAT) in terms of scoring and item selection . The second section summarizes the components of CAT . The last section discusses plausible factors that have influences on the amount of adaptation for CAT . 2.1 Item Response Theory Item response theory (IRT; Lord, 19 80 ) describes the interaction between test items and examinees through a mathematical model, called an item response function (IRF) that specifies the prob ability of a correct response on a given item, with item parameters , as a function of an ) . Item parameter s, in general, include (1) an item difficulty parameter that indicates the relative difficulty or easiness of an item (i.e., location parameter) to examinees, (2) an item discrimination parameter that describes how well an item distinguishes between examin ees of varying proficiency levels, and (3) an item pseudo - guessing parameter that indicates the possibility of giving a correct answer by chance. IRT models a re usually classified into two types based on how the item responses are scored: dichotomous IRT models and polytomous IRT models. Since the study focuses on the tests of dichotomously scored items, this section only describes dichotomous IRT models, which are commonly applied to binary scored multiple - choice (MC) items or true/false items (e.g., corr ect/incorrect). Three frequently - noted dichotomous IRT models include the one - parameter 7 logistic (1PL) or Rasch model (Rasch, 196 1 ), the two - parameter logistic (2PL) model (Birnbaum, 1968), and the three - parameter logistic (3PL) model (Birnbaum, 1968). 2.1.1 Ra sch (1PL) model The 1PL model, also known as the Rasch model, is the most parsimonious model among the commonly considered IRT models. It assumes unit discrimination for all items and no guessing. The IRF of the Rasch model specifies the probability of a correct response on item i for examinee j by: ( 2. 1 ) where represents the proficiency level of examinee j , and denotes the item difficulty parameter of item i . 2.1.2 2PL model The 2PL model considers not only the item difficulty ( ) but also the item discrimination parameter ( response on an item. 
In this model, the probability of a correct response on item i administered to examinee j can be defined as:

$$P_{ij}(\theta_j) = \frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]}. \qquad (2.2)$$

2.1.3 3PL model

Unlike the 2PL model, the 3PL model allows for the possibility that an examinee with very low proficiency correctly answers an item by chance, through a pseudo-guessing parameter ($c_i$), which is especially relevant for MC items. The probability of examinee j having a correct response on item i is:

$$P_{ij}(\theta_j) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]}. \qquad (2.3)$$

2.1.4 Information function for dichotomous IRT models

IRT provides a measure of precision for the items over the proficiency continuum through the use of an item information function. The IRT information function for dichotomously scored items can be expressed as follows (Lord, 1980):

$$I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)\,[1 - P_i(\theta)]}, \qquad (2.4)$$

where $P_i(\theta)$ is the probability of correctly answering item i given $\theta$, and $P_i'(\theta)$ is the first derivative of that probability function. For the 3PL model, Equation 2.4 can be represented as:

$$I_i(\theta) = a_i^2\,\frac{Q_i(\theta)}{P_i(\theta)}\left[\frac{P_i(\theta) - c_i}{1 - c_i}\right]^2, \qquad (2.5)$$

where $Q_i(\theta) = 1 - P_i(\theta)$. Based on Equation 2.5, the information function for the 2PL model can be obtained by setting $c_i = 0$, and the information function for the Rasch model is obtained by setting $c_i = 0$ and $a_i = 1$. High information for an item at a particular proficiency level indicates that the item is very informative for measuring examinees at that proficiency level. The amount of information is greatly affected by the a-parameter. Note that for the 1PL model, all items have the same maximum information (i.e., 0.25), which occurs at the location where the proficiency equals the b-parameter of the item. In addition, the test information function can be calculated simply by summing the information functions of the test items at a given proficiency level.

2.2 Computerized Adaptive Testing

In recent years, with discussion about visions for next-generation assessment, CAT has received renewed attention in educational systems for personalized assessment (e.g., Conely, 2018; Embretson, 2001). CAT delivers an individualized test tailored to a test-taker, and thus it can shorten the test length without sacrificing measurement precision. Compared to paper-and-pencil (P&P) linear tests, the advantages of CAT reported in the literature (e.g., Chang, 2004; Gibbons et al., 2008; Meijer & Nering, 1999; van der Linden, 2010) include shorter tests, improved test reliability, and immediate test scoring and reporting. Also, CAT allows one to obtain information that is not available in P&P tests, including response times (e.g., Wise, Bhola, & Yang, 2006), graphical entries, mouse/eye movements, and so forth, which may open new avenues for understanding examinees' test-taking activities. Furthermore, CAT enables the use of a variety of innovative items and technology-enhanced items, which leads to improvement in the validity evidence of tests that cannot be obtained in P&P tests (Luecht & Clauser, 2002).

Basically, the CAT algorithm starts with the selection of a first item whose b-parameter is matched to a pre-determined initial proficiency estimate. After the item is scored and the proficiency estimate is updated, the next item is then selected at the current proficiency estimate from the given item pool based on the item selection criterion. This procedure continues until a stopping rule is satisfied. Reckase (1989) reported four core components for an operational CAT: the item pool, the item selection procedure, the scoring procedure, and the stopping rules. Constraints for content balancing and exposure control are considered in the item selection procedure. In what follows, these four components are briefly illustrated.
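To make Equations 2.3 through 2.5 concrete before turning to the four CAT components, the following minimal Python sketch computes the 3PL response probability and item information; the 2PL and Rasch cases follow by setting c = 0 (and additionally a = 1). It uses the purely logistic metric shown above (no D = 1.7 scaling constant), and the function and variable names are illustrative only.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (Equation 2.3)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def item_info(theta, a, b, c):
    """Item information under the 3PL model (Equation 2.5).

    With c = 0 this reduces to the 2PL information; with c = 0 and a = 1
    it reduces to the Rasch information, whose maximum value of 0.25
    occurs at theta = b.
    """
    p = p_3pl(theta, a, b, c)
    return a**2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c))**2

def test_info(theta, a, b, c):
    """Test information: the sum of item informations at a given theta."""
    return np.sum(item_info(theta, np.asarray(a), np.asarray(b), np.asarray(c)))
```

For example, item_info(0.0, a=1.0, b=0.0, c=0.0) returns 0.25, the Rasch maximum noted above.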
2.2.1 Item pool

A paramount element that affects the performance of a CAT in numerous ways is the item pool. For instance, the item pool affects the proficiency estimates, which in turn influence the subsequent items to be administered. In real operational settings, there are two types of item pools for CAT. One is a master pool, which contains the full collection of items available to supply the testing program. The other is an operational item pool, which is used during a testing implementation period to provide items to examinees. A testing company typically assembles the operational item pools from the master pool so as to renew the item pools after a certain period of use or after a certain number of students have taken the test using the same item pool.

Without a well-designed item pool, a CAT cannot be successfully implemented. Thus, the size and the quality of the item pool are essential. The desired item pool for CAT has been recommended to include an adequate number of good-quality items to provide informative tests to the sample of examinees (Flaugher, 2000; McBride, 1977). Here, good-quality items (i.e., optimal items) generally have high item discriminations (e.g., a > 0.8) and low guessing parameters (e.g., c < 0.3). At the same time, the range of item difficulty in the item pool should be wide enough that every examinee can take items well tailored to his or her proficiency level (Mills & Stocking, 1996; Urry, 1977). In addition to the statistical requirements of the optimal item pool, the pool should contain items that measure the intended construct for the testing purpose and the use of the test scores (Kane, 2013).

Some research (e.g., Gu & Reckase, 2007; He & Reckase, 2013; Reckase, 2010; Veldkamp & van der Linden, 2010) has introduced approaches to designing the item pool for CAT, but with different definitions of an optimal item pool. Veldkamp and van der Linden (2010) proposed a method for designing an optimal blueprint for a CAT item pool with an integer programming model that minimizes an estimate of item-writing costs, using a classification table defined by the item attributes figuring in the test specifications (e.g., content, format, word count, item difficulty). The goal of this item pool design process is to determine the number of items required for each cell of the classification table, which then guides the item-writing process. However, this method uses the characteristics of a previous or existing item pool as a starting point to define item-writing costs.

Another line of research on item pool design (e.g., Gu & Reckase, 2007; He & Reckase, 2013; Mao, 2014) has been based on the bin-and-union method (Reckase, 2010), with more emphasis on the psychometric properties of an optimal item pool in lieu of the item-writing costs. It also does not require pre-existing information about the item pool. An optimal item pool defined by this method should include a desired item available for every stage of item selection that matches the current proficiency estimate for each examinee. The optimal item pool is determined by tallying the locations of the sequential proficiency estimates for each examinee, with the expectation that there would be an item in the pool whose information peaks at that location on the proficiency scale. As the items used for a single examinee can be used for other examinees, the full item pool is determined from the union of the items required for the entire set of examinees of interest (see Reckase (2010) and He and Reckase (2013) for more details of the process). Exposure control, content balancing, and other specifications can be incorporated into the CAT simulations used to identify the design for the optimal item pool. I employed this bin-and-union method to design the ideal item pool in Section 4.4.1.1.

The item pool size, another important aspect of the item pool, depends on the testing purpose, the CAT specifications (e.g., exposure control and content balancing), and the proficiency distribution (Parshall, 2002; Reckase, 2010). Prior research (e.g., Chen, Ankenmann, & Spray, 2003; Gönülates, 2015) has generally supported the recommendation that the size of the item pool should be 10 to 12 times the test length (Stocking, 1994). To investigate the effect of item pool size on adaptivity for CAT, I manipulate the item pool size as a factor in the simulation studies (see Section 4.2.2.1).

2.2.2 Item selection procedure

Another key component of CAT is the item selection algorithm. The most frequently used item selection algorithm is maximum Fisher information (MFI; Lord, 1977), owing to its easy implementation. MFI selects as the next item the available item in the pool with the maximum information, as given in Equation (2.4), at the current proficiency estimate. In other words, it selects the item that most precisely measures the current proficiency estimate. Thus, as this algorithm selects the most informative item, the efficiency of CAT also increases. However, MFI has some disadvantages. At the beginning stages of a CAT, there is not enough information about an examinee's proficiency to guide the MFI item selection procedure, so the selected items might not be the best ones. The items administered early in the test can also bring about big jumps in the proficiency estimates. This is the reason why it is recommended that examinees be extra careful while answering the first few questions of the test. Such issues could be mitigated by using prior information to select items or by using different item selection rules, such as the Kullback-Leibler measure (Chang & Ying, 1996), at least at the early stages of CAT, especially for a short test (Chen, Ankenmann, & Chang, 2000).
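As a minimal sketch of the MFI criterion just described, the function below picks the most informative remaining item at the current proficiency estimate, reusing the hypothetical item_info helper from the earlier sketch; exposure control and content constraints, discussed next, are deliberately omitted.

```python
import numpy as np

def select_mfi(theta_hat, pool, administered):
    """Maximum Fisher information (MFI) item selection.

    theta_hat    : current (interim) proficiency estimate
    pool         : dict of item parameter arrays with keys "a", "b", "c"
    administered : set of indices of items already given to this examinee
    Returns the index of the most informative item not yet administered.
    """
    info = np.asarray(item_info(theta_hat, pool["a"], pool["b"], pool["c"]), dtype=float)
    info[list(administered)] = -np.inf     # exclude items already administered
    return int(np.argmax(info))
```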
Exposure control, content balancing, and other specifications can be incorporated into the CAT simulations to identify the design for the optimal item pool. I employed th is bin - and - un ion method to design the ideal item pool in Se ction 4.4.1.1. 12 T he item pool size , another important aspect of the item pool, is dependent on the testing purpose, the CAT specifications (e.g., exposure control and content balancing) , and the proficiency distribution ( Parshall, 2002; Reckase, 2010 ). P rior research (e.g., Chen, Ankenmann, & Spray, 2003; Gönülates, 2015) has generally supported that the siz e of the item pool should be 10 to 12 times larger than the test length (Stocking, 1994). To investigate the effect of item pool size on adaptivity for CAT, I manipulate the item pool size as a factor in the simulation s tudies (see S ection 4.2.2.1 ). 2.2.2 Item s election procedure Another key component of CAT is the item selection algorithm. The most frequently used item selection algorithm is maximum Fisher information (MFI; Lord, 1977) due to its easy implementation . The MFI select the next item that has the maximum information in Equation (2.4) at the current proficiency estimate from the available item pool. In other words, it selects the item that most precisely measures the current proficiency estimate. Thus, a s this algorithm selects the most informative item, the efficiency of CAT also increases. However, MFI has some disadvantages. At the beginning stages of CAT, there is not enough information of an to guide the MFI item selecti on procedure , resulting in selected items that might not be the best ones. The items administered earlier in the test could also bring about big jumps in the proficiency estimates. This is the reason why it is recommended for examinees to be extra careful while answering the first few questions of the test . S uch issue could be m itigated by using prior information to select items or using different item selection rules such as the Kullback - Leibler measure (Chang & Ying, 1996) at least at the early stages of CAT , especially for a short test (Chen, Ankenmann, & Chang, 2000). 13 Bayesian item selection approach (Owen, 1975) is also commonly used in CAT program s . This approach selects the items that minimize the expected p osterior variances of the proficiency estimates. To calculate t he posterior distribution of the proficiency for item selection in CAT , Owen used a normal approximation with closed - form expressions, instead of the true posterior, in order t o minimize the computational complexity . He proved that as the number of the ad m inistered items become infinite , the expected value of the posterior distribution will converge to the true value of proficiency . In general, an examinee receives the first item that matches well with the initial proficiency estimate that is equal to the expected value of the prior distribution. The algorithm then searches for a next item that will reduce the posterior variance the most. After each item is administered, a new posterior distribution is computed using the response string s and the prior distribution (usually, normal distribution) , and the n this updated posterior becomes the prior distribution for selecting the next ite m . As is an approximate empirical Bayes procedure for CAT which requires simpler computation, this method is faster than other Bayesian item selection approaches . In addition to these two item selection approaches , there are other Bayesian item selection procedures. 
For instance, van der Linden (1998 a ) proposed several Bayesian item selection criteria based on the full posterior, including maximum posterior - weighted information (MPWI), maximum expected information (MEI), minimum expected posterior variance (MEPV), maximum expected posterior weighted - information (MEPWI). Penfield (2006) compared the performance of MEI and MPWI to MFI, reporting that the Bayesian proce dures yielded slightly more precise estimates than MFI. Prior studies (e.g., Choi & Swartz, 2009) also found that these Bayesian item selection procedures are computationally intensive but produce 14 comparable results to the simpler MFI procedure. Therefore, the MFI procedure is the most widely used in item selection of CAT and used in this dissertation, as well. Other practical considerations are made in item selection to address the issues of over - or under - exposed items , content validity for CAT. To handle these practical issues, constraints are generally imposed on the item selection procedures . The constraints on item selection include but are not limited to exposure control, content balanci ng, and item enemies ( Eignor, Stocking, Way, & Ste ff en , 1993 ; Weiss, 2011). Among these numerous constraints, t he following selection s will briefly discuss some constraints on item selection for exposure control and content balancing in CAT . 2.2.2.1 Exposure control In CAT, selecting items without considerations other than the objective selection criterion usually leads to a disproportionate use of particular items in the pool. That is, some items are much more frequently administered to examinees, a nd other items are rarely or never administered. Test developers do not want examinees to have pre - knowledge of the items and do not want to waste the cost of developing the unused items. To limit the exposure of items in CAT, exposure control procedures h ave been introduced by putting some constraints on item selection during the CAT administration (e.g., Chang & Ying, 1999; Davey & Parshall, 1995; Kingsbury & Zara, 1989; McBride & Martin, 1983; Stocking, 1993; Revuelta & Ponsoda , 1998 ; Sympson & Hetter, 1985) . These exposure control procedures can be divided into four main types: randomized, conditional, stratified, and combi ned procedures (Georgiadou, Triantafillou, & Economides, 2007). R andomized procedures include several variations on randomization of items in item selection for exposure control (e.g., Bergstrom, Lunz, & Gershon, 1992; Eignor et al. , 1993 ; 15 Way, Zara, & Le ahy, 1996) - 4 - 3 - 2 - 1 procedure randomly selects the first item from a group of the most informative five at the beginning of the test. After the current proficiency is updated, a group of the four most optimal ite ms are selected and the second item is chosen at random from th is subset. This procedure continues until the subset is defined as the best single available item. Kingsbury and Zara (1989) proposed the randomesque procedure , which is the most commonly used in operational settings due to its simplicity. This procedure randomly selects one item from the most informative n items (e.g., process. Conditional procedures control the exposure of items based on a given criteria (e.g., the frequency of item usage for a target sample of examinees). The most representative example of conditional procedures is the Sympson - Hetter me thod ( Sympson & Hetter, 1985 ) . 
This procedure requires an item exposure parameter , say k , ranging from 0 to 1 (i.e., the conditional probability that the item will be administered given the item has been selected) obtained iteratively from simulations for a target sample of simulated examinees prior to the administration of CAT. The value of k is high for a certain item, indicating this item has been rarely administered and thus has a higher probability of being administered if selected. The value of k is low for a particular item, implying the item has been frequently administered and has a lower probability of being administered if selected. During the CAT administration, after selecting an optimal item to be administered, a random number from a unifor m distribution between 0 and 1 is generated and compared to the exposure parameter k of the selected optimal item. This item is administered if this random value is smaller than the value k of the selected item. Otherwise, the next optimal item is selected , and the same procedure is applied to this item 16 until an item is administered to the examinee. This procedure successfully controls the over - exposed items, but it is very time - consuming because the iterative simulations must be done a prior i (Georgiadou et al., 2007) . Stratified procedures stratify the item pool according to statistical properties such as item discrimination and difficulty, and then administer an item from a given stratum. The a - stratified method (Chang & Ying , 1999 ) is an example of the stratified methods . This procedure is motivated by the situation where items are solely chosen based on their information, resulting in disproportionate usage of some highly informative items. As informative items are unnecessarily used earlier in the test, in which the interim proficiency es timates contain too much error to be considered accurate , the final proficiency estimates are more likely to be over or under estimated. To regulate the use of highly informative items, t his method first administers items with lower a - parameters at the ear lier stages of the test and administers items with higher a - parameters at the later of the test to improve the efficacy of the items . Following this solution, many variations ha ve been proposed, including the a - stratified with b - blocking (Chang , Qian, & Ying, 2001 ) and the 0 - 1 stratification strategy (Chang & van der Linden, 2003), etc. Lastly, combined procedures attempt to combine two or more exposure control methods. - restricted combined procedure is a notable example. This combined procedure , derived from the maximum information method and the restricted maximum information method, is intended to prevent the overexposure of items and to increase the usage of rarely or unused items while maintaining precision of proficiency estimates. The modified version of this method, the progressive - restricted standard error method was also devel oped (see McClarty , Sperling, & Dodd, 2006 for details). 17 Among these exposure control procedures, I cho ose the randomesque method , the Sympson - Hetter method, and the a - stratified with b - blocking method to see how the different procedures affect the level of customization for CAT using the proposed adaptivity statistics (see Section 4.4.1.2). 2.2.2.2 Content balancing Like the P&P test, a CAT should conform to a test blueprint, especially to cover multiple content areas, which is closely associated with the interpretation and validity of the test scores. 
This can be realized through content balancing procedures. Although a variety of strategies for content balanc ing exist, the most commonly used procedure in research and operational settings In this procedure, the target proportions of each content area are first prespecified. After the administration of each item, t he current proportions of each content area are calculated and compared to the pre - specified target proportions. The content area with the largest discrepancy between the target and current proportions is selected, while items from other content areas are filtered out from the item pool, and the next item with the highest information will be selected from the available items from that content area. Previous research ( e.g., McClarty et al., 2006) has provided evidence to support that this procedure successfu lly administers specified proportions of items per content area . In addition to this simple procedure, more complex strategies for content balancing are also available. These content balancing methods include the weighted deviations model (Swanson & Stocking, 1993), the shadow test approach ( van der Linden & Reese , 1998 ), the weighted penalty model (Shin, Chien, Way, & Swanson, 2009), the maximum priority index method (Cheng & Chang, 2009) , and the bin - structured method (Davey, 2005), among others . 18 2.2.3 S coring procedure In the beginning of the test, an initial proficiency value is arbitrarily determined because there is no available information about an examinee. The initial proficiency value is typically set to 0.0, which is the mean of the proficiency proficiency estimate is the n updated after each item is administered based on the item responses. Proficiency estimation methods are essential because the methods could affect not only the reporting score of the test, but also the selection of items to be administered and the decisi on of terminating the test (e.g., standard error of proficiency estimates). Previous studies have proposed proficiency estimation approaches (e.g., Bock & Mislevy, 1982; Lord, 1986; Owen, 1975), provided ways to overcome some challenges that a particular e stimation method has for CAT (e.g., Han, 2016) and compared their performance, as well (e.g., Wang & Vispoel, 1998) . Among the existing proficiency estimation methods, m aximum likelihood estimation (MLE) and Bayesian estimation method s such as expected a p osteriori (EAP ; Bock & Mislevy, 1982 ) , maximum a posteriori (MAP; Samajima, 1969), empirical Bayesian method (Owen, 1975) are the most widely used in CAT program s . multiplying the probabilities of a response string with the location independence assumption ( Hambleton & Swaminathan, 1985 ). To find the most likely value of proficiency estimates that maximizes the likelihood, the Newton - Raphson method can be used. The M LE approach provides proficiency estimates which are consistent , efficient , and asymptotically normally distributed . The normality property is a very practical advantage of MLE because it allows the standard error of the proficiency estimate to be calculated using the information function shown in Equation (2.4) . However, the MLE provides an infinite proficiency estimate if the item responses are either all 19 correct or incorrect so that at the beginning of CAT, the estimates cannot be computed until both correct and incorrect responses exist. 
To tackle this problem, in practice, either a step parameter (e.g., 0.7; Reckase, 1976) or arbitrary lower and upper bounds on the proficiency estimates (e.g., -4 and +4) are used early in the CAT. Another way to solve this issue is to start with a Bayesian estimation procedure and switch to MLE after both correct and incorrect responses are obtained (e.g., NCSBN, 2016).

Bayesian estimation methods are alternatives to MLE for handling this infinity problem. EAP takes the most likely location of proficiency to be the mean of the posterior distribution, and MAP takes it to be the mode of the posterior distribution. These Bayesian approaches can estimate the proficiency level even after only the first response is obtained, with the help of the prior distribution. Although the Bayesian estimation methods have this advantage, a well-known weakness is that their estimates are generally biased toward the mean of the prior distribution, resulting in a shrunken score scale (e.g., Ho & Dodd, 2012; Kim & Nicewander, 1993; Wang & Vispoel, 1998; Weiss, 1982). Another example of a Bayesian estimation method is Owen's (1975) empirical Bayesian procedure, in which, at every update of the proficiency estimate in a CAT, the posterior proficiency distribution from the previous step is used as the prior distribution. Owen's Bayesian method is also very popular because it is straightforward to compute the proficiency estimates and is faster than other Bayesian methods. However, this method has the major downside that the proficiency estimates are affected by the sequence of the item presentation. This problem might be alleviated by re-estimating the response strings at the end of the CAT using an alternative proficiency estimation method such as maximum likelihood estimation (Wang & Vispoel, 1998).

Taken together, I focus on MLE and EAP (Section 4.2.2.3). These two proficiency estimation methods are notably used in CAT (Hambleton, Swaminathan, & Rogers, 1991, p. 148; Weiss, 1982).

2.2.4 Stopping rules

Stopping rules are closely tied to the purpose of the test. In general, there are two ways to decide when the test terminates: fixed-length tests and variable-length tests. A fixed-length test requires all examinees to receive an equal number of items. However, giving every student the same number of items might cause the measurement precision of the final proficiency estimates to vary, depending on the distribution of items in the pool. A variable-length test provides a different number of items to each student until a pre-specified standard error (i.e., measurement precision) of the proficiency estimate is reached. A target measurement precision (e.g., a standard error below 0.3 or 0.2) is used as the test termination criterion so that each examinee attains the same magnitude of measurement precision. One problem in a variable-length CAT is that examinees with very high or low proficiency levels will have a longer test than others because the item pool can run out of suitable items to administer. One suggested approach to deal with this issue is to combine the measurement precision rule with a maximum and minimum number of items in practice (Thissen & Mislevy, 2000). In this dissertation, all simulation studies of CAT were based on a fixed-length test, and the empirical illustration using an operational adaptive test was a variable-length CAT.
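A compact sketch of the variable-length termination logic described above is given below (Python, illustrative). The standard-error computation assumes MLE, for which the SE is the reciprocal square root of the test information; the criterion value and length limits are example settings, not the dissertation's.

```python
import numpy as np

def se_from_information(item_infos):
    """SE of an MLE proficiency estimate: 1 / sqrt(sum of item informations)."""
    return 1.0 / np.sqrt(np.sum(item_infos))

def should_stop(se, n_items, se_target=0.30, min_items=10, max_items=50):
    """Variable-length stopping rule: stop once the SE criterion is met,
    subject to minimum and maximum test lengths (cf. Thissen & Mislevy, 2000)."""
    if n_items < min_items:
        return False
    return se <= se_target or n_items >= max_items
```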
2.2.5 Adaptive test designs

Because of the benefits of CAT, its applications have increased, often with modifications to the test design that compensate for its weaknesses and encourage practical use in real educational and operational settings. For example, with a full item-level CAT, the test items to be administered to each examinee cannot be reviewed in advance, implying a potential lack of quality control (Luecht & Nungester, 1998). The full CAT may also require more funding for its development and implementation. In contrast, multistage adaptive testing (MST), a special form of CAT, adapts at the stage/module level, and it has some practical advantages over item-level CAT in operational settings (e.g., Stark & Chernyshenko, 2006). With MST, examinees can not only skip items but also review and revise their responses to items within a stage during testing, which is not possible in item-level CAT. Modules (i.e., groups of items) are also pre-assembled before test administration, so MST allows test developers to control the quality of tests and content balancing while maintaining measurement precision comparable to the full CAT when the test is well designed (Xing & Hambleton, 2004). However, MST may reduce adaptivity compared to the item-level adaptation of CAT (e.g., Reckase et al., 2019). Recently, another form of CAT, called hybrid CAT (Wang, Lin, Chang, & Douglas, 2016), has been introduced that combines characteristics of item-level CAT and MST. Administering an MST at the beginning of the test not only contributes to improving the initial proficiency estimate for the later implementation of the item-level CAT but also achieves content balancing more systematically. In this dissertation, how much adaptation occurs across proficiency levels is examined under the different adaptive test designs of item-level CAT and MST (see Section 4.5.2).

2.3 Factors Affecting Adaptation

The amount of adaptation can be affected by numerous factors associated with the characteristics of an item pool and the CAT specifications. First of all, the item pool is a fundamental and vital element for the development and deployment of a CAT. Even the best, most sophisticated CAT program cannot function well if an item pool consists of poor-quality items or items suitable for only a limited range of proficiency (Flaugher, 2000; van der Linden, Ariel, & Veldkamp, 2006). The higher the quality of the item pool, the more likely the adaptive algorithm will work well. Accordingly, it is necessary to understand the extent to which the amount of adaptation during the CAT is affected by characteristics of the item pool. To do this, previous studies examined the effects of the item pool size and the spread of difficulty of the items in the pool at the entire-group level (Ju & Lee, 2018; Kim, Ju, & Reckase, 2018; Reckase et al., 2018). The results of these studies suggested that the item-pool composition would, in predictable ways, influence the amount of adaptation. That is, the item pool should contain more than about ten times the test length, with more spread in the difficulty of items, for adequate adaptivity of the CAT. In addition, the shape of the item pool could affect the results of the CAT and, plausibly, the performance of the adaptive algorithm, taking into account the shape of the proficiency distribution (e.g., Gönülates, 2015; Reckase, 2010).
Intertwined with the item-pool characteristics, a variety of components of the CAT would also impact the consequences of the CAT, including adaptivity. For example, the choice of IRT model affects the shape of the item information functions, which may eventually affect the selection of items at the momentary proficiency estimate. Kim et al. (2018) compared the overall adaptation measures under the 3PL model to those under the Rasch model. Because of the effects of the discrimination and guessing parameters, the suggested benchmark values were slightly different for the two models, though their conclusions appeared to be the same.

Meanwhile, proficiency estimation plays a pivotal role in a successful CAT implementation because it is closely related to the item-selection procedure. The MFI item selection method is the most frequently used in CAT because of its measurement precision and efficiency. This method assumes a perfect correspondence between the current proficiency estimate and the true proficiency level of an examinee. If the assumption is violated due to poor accuracy of the proficiency estimates, the item-selection algorithm may select items that are not well associated with the target true proficiency, resulting in the selection of less optimal items that increase errors in subsequent proficiency estimates (Ho & Dodd, 2012). This issue might be more severe in the early stages of a CAT because, generally speaking, little adaptation occurs before the proficiency is well estimated. Given this fact, in previous research (Reckase et al., 2018), adaptation statistics were computed using all of the items administered to examinees during a CAT and also for the items used in the last half of their tests. Higher values of the adaptivity statistics were reported for the last half of the test, though the extent of the increment was small. Most previous research has used MLE in the CAT. Ju and Lee (2018) explored the performance of the overall adaptivity measures across different proficiency estimators. They found that the correlation index and the ratio of standard deviations index were robust to different estimation methods. Yet the proportion of reduction in variance (PRV) index appeared to be affected by the proficiency estimator, with the suggested PRV benchmark value increasing under the EAP method because its estimates regress toward the mean or mode of the prior distribution (e.g., Kim & Nicewander, 1993). Test length also might matter; after all, with a longer test, there might be more chances for a test to be customized for a test taker. However, the adaptivity measures appeared to be robust to the test length (Ju & Lee, 2018).

Constraints on the item-selection algorithm can negatively influence the selection of optimal items at the interim proficiency estimate during a CAT, resulting in a degraded amount of adaptation. Constraints include content constraints, exposure-control constraints, and item-type constraints. Previous research (Reckase et al., 2017, 2018) has demonstrated the effects of exposure control on adaptation. For example, the Sympson-Hetter exposure-control procedure seemed to limit the amount of adaptation with a relatively small item pool; no such limit, however, was observed with the randomesque procedure or the a-stratified with b-blocking procedure. More recently, many content constraints can be readily handled through, among others, the constrained CAT procedure, the shadow test approach, and the weighted deviations model.
Observations of operational CATs have shown that not all test designs are equally adaptive, because of differences in the unit of customization and in the design of the test specifications. For instance, a full regular CAT adapts at the item level, while MST adapts at the stage/module level. In recent years, researchers have introduced a new hybrid design that incorporates both item-level and stage-level CAT (Wang et al., 2016). Reckase and colleagues (2017, 2019) compared the adaptivity across item-level CAT and different designs of MST, identifying that the MST designs appeared to be less adaptive than the others.

Taken together, this dissertation explores, among the various factors that affect the amount of adaptation during a CAT, the interaction effects of item-pool characteristics and proficiency estimators on a new class of conditional adaptation indices. The effects of constraints for exposure control and of adaptive testing designs on the amount of adaptation are additionally examined through simulations.

CHAPTER 3. INDICES FOR THE AMOUNT OF ADAPTATION

Before discussing the indices to measure the amount of adaptation, it is necessary to define operationally what test adaptation (i.e., adaptivity) is. Reckase and his colleagues (Reckase et al., 2018, 2019) defined adaptation as the extent to which a CAT gives items that properly match the final proficiency estimate for the examinee. Kingsbury and Wise (2018) defined test adaptation similarly but with more focus on test information, given the available item pool and test specifications. In this dissertation, test adaptation or adaptivity is defined as the extent to which a CAT provides informative items that properly match the current proficiency estimates at each stage of the CAT process. Thus, a test can be viewed as highly adaptive when the items administered to each examinee match well with the provisional proficiency estimates at the start of each item during the CAT.

To quantify the amount of adaptation of a CAT, it is assumed that test taker $j$ has a known location on a latent continuum ($\theta_j$) and that the goal of the CAT is to select the optimal set of items that will produce an accurate estimate of that location ($\hat{\theta}_j$) given the available item pool and CAT specifications (Reckase et al., 2018). In the hypothetical case in which the location of the test taker on the continuum is known and the item pool is infinite, an optimal set of test items for each test taker $j$ would have maximum information at $\theta_j$ when the maximum Fisher information (MFI) item selection method is used. Consider the simple case of the Rasch model. In that hypothetical case, all the selected items would have item-difficulty parameters (i.e., b-parameters) equal to $\theta_j$, resulting in a set of items whose average b-parameter equals $\theta_j$ and whose standard deviation equals zero. Extending this case to a sample of test takers with true locations on the proficiency scale, the mean b-parameter for each examinee would be perfectly correlated with $\theta_j$, and the standard deviation of the mean b-parameters would be equal to that of the true proficiencies ($\theta$).
Alternatively, when the 3PL IRT model is used for scaling and scoring, the location of maximum information for each item $i$ ($\theta_i^{\max}$; Birnbaum, 1968) can be substituted for the b-parameter:

$\theta_i^{\max} = b_i + \dfrac{1}{D a_i} \ln\!\left(\dfrac{1 + \sqrt{1 + 8 c_i}}{2}\right)$   (3.1)

where $a_i$ is the item-discrimination parameter, $b_i$ is the item-difficulty parameter, $c_i$ is the item pseudo-guessing parameter, and $D$ is a scaling constant that makes the logistic function similar to the normal ogive function. Since the location of maximum information ($\theta_i^{\max}$) is slightly higher than the b-parameter, the selection of items might be a little different from the selection based on the difficulty parameter, but the concept of adaptation is still the same when the location of maximum information is used (Kim, Ju, & Reckase, 2018).

This ideal type of CAT never exists in real operational settings because we never know the true location of a test taker on the proficiency continuum. Nevertheless, the hypothetical case does give some direction toward the possible types of measures that can be used to quantify the amount of adaptation that occurs in operational adaptive testing. This conceptualization of adaptation and the ideal features of a CAT thus lead to the existing adaptivity measures (Reckase et al., 2018, 2019) and to a new class of conditional statistics of the amount of adaptation proposed in this dissertation. This chapter reviews the current measures of the amount of adaptation and then introduces three new conditional adaptation indices.

3.1 Existing Measures of the Amount of Adaptation

Under the conceptualization and assumptions of a desired CAT, Reckase and his colleagues (Reckase et al., 2018) proposed three overall statistics: the correlation index, the ratio of standard deviations index, and the proportion of reduction in variance index. These measures are mostly based on the variance of the b-parameters (or the locations of maximum information) for the items administered to test takers. Note that the location of maximum information in Equation (3.1) can be substituted for the b-parameters when computing the three statistics under the 3PL model. They performed well in assessing the overall adaptivity of a CAT over examinees; they could not, however, be applied to evaluate the adaptation contingent on proficiency level. Motivated by this perceived concern, Kingsbury and Wise (2018) introduced a new index of the amount of adaptation using the IRT test information that can be used to diagnose adaptivity for both the entire group and individual test events.

3.1.1 Correlation index

The first adaptivity measure that Reckase and his colleagues proposed is the correlation between the mean b-parameter for the items administered to examinees and the final estimate of their proficiency:

$r_{\bar{b}\hat{\theta}} = \operatorname{corr}\!\left(\bar{b}_j, \hat{\theta}_j\right)$   (3.2)

where $\bar{b}_j$ is the mean b-parameter for the items administered to test taker $j$, and $\hat{\theta}_j$ is the final proficiency estimate on the $\theta$-scale for test taker $j$. This index indicates whether examinees with various levels of proficiency receive tests that differ in difficulty and whether the difficulty levels match the estimated proficiency levels well. As shown in Reckase et al. (2018), higher values of the index imply better adaptivity of a CAT.
Suggested benchmark values for interpreting this statistic under the Rasch and the 3PL models are reported in previous research (Kim et al., 2018; Reckase et al., 2018).

3.1.2 Ratio of standard deviations index

Even if the correlation index shows a high value close to 1.0, it is possible that the adaptivity of the CAT might not be good because of poor quality of the item pool or problems with the item selection algorithm. The second index helps assess such aspects of adaptivity. It is the ratio of the standard deviation of the averages of the b-parameters for the items administered to examinees, $s_{\bar{b}_j}$, to the standard deviation of the final proficiency estimates for those examinees, $s_{\hat{\theta}_j}$:

$\text{Ratio} = \dfrac{s_{\bar{b}_j}}{s_{\hat{\theta}_j}}$   (3.3)

where the subscript $j$ indicates the particular examinee. This index indicates whether the spread of the mean b-parameters of the items selected for examinees matches the spread of their proficiency estimates. If the item selection algorithm is working properly but the item pool has a limited range of difficulty, the correlation index may yield a high value, but this ratio index may report a lower value because $s_{\bar{b}_j}$ is small relative to $s_{\hat{\theta}_j}$ (Reckase et al., 2018). For this statistic, unlike the other adaptation indices, the value of 1.0 is optimal, as values higher than 1.0 can also be obtained. For instance, values larger than 1.0 can occur when the item pool has an unusual distribution of b-parameters with many extremely easy and difficult items but insufficient items in the middle range of difficulty. In this case, $s_{\bar{b}_j}$ could be large relative to $s_{\hat{\theta}_j}$, ending up with an index value greater than 1.0. Therefore, the distance from 1.0 is what matters when interpreting this statistic for evaluating adaptivity. Since this unusual type of item pool is rarely found in the real world, previous studies suggested benchmark values for this index for the Rasch and the 3PL models (Kim et al., 2018; Reckase et al., 2018).

3.1.3 Proportion of reduction in variance index

The last index that Reckase et al. (2018) introduced is the proportion of reduction of the variance (PRV) of the b-parameters for the items selected for the examinee, on average, from the amount of variance of the b-parameters for all of the items in the pool:

$\text{PRV} = 1 - \dfrac{\overline{s^{2}_{b_j}}}{s^{2}_{b,\text{pool}}}$   (3.4)

where $\overline{s^{2}_{b_j}}$ is the average of the within-examinee variances of the b-parameters for the items selected for each examinee, and $s^{2}_{b,\text{pool}}$ is the variance of the b-parameters for all the items in the pool. This index focuses more on the adaptivity within the examinee with respect to the item pool, especially in a situation where the item pool has insufficient items in the area in which the final proficiency estimate is located. If such a situation is encountered, the adaptation of the CAT may also be poor because the item selection algorithm may have to select items whose b-parameters poorly match the current proficiency estimates. Hence, the variation of the b-parameters for that test taker might be large, even though the mean b-parameter might be close to the final proficiency estimate. The index is constructed to reflect this situation.

Regarding the interpretation of the PRV indicator, a value less than 1.0 represents the average amount of within-examinee variation in the difficulty of the items administered, over a sample of examinees, relative to the amount of variation in difficulty for all the items in the pool. That is, if the within-examinee variation of b-parameters is zero, meaning that the difficulty is constant for each examinee as in the aforementioned hypothetical ideal case, but the item pool has variation in b-parameters, then the PRV value is 1.0. The suggested benchmark value is .80 regardless of the IRT model (Kim et al., 2018; Reckase et al., 2018).
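For concreteness, a short sketch of how the three overall indices might be computed from simulation output is given below (Python, illustrative; the dissertation's code was written in MATLAB). The array names are hypothetical, and the theta_max helper implements Equation (3.1) for the 3PL substitution mentioned above.

```python
import numpy as np

def theta_max(a, b, c, D=1.7):
    """Location of maximum information for a 3PL item (Equation 3.1)."""
    return b + np.log((1.0 + np.sqrt(1.0 + 8.0 * c)) / 2.0) / (D * a)

def overall_adaptation_indices(b_admin, theta_hat, b_pool):
    """Overall adaptation indices of Reckase et al. (2018).

    b_admin   : (N, L) b-parameters (or theta_max values) of the items given to each examinee
    theta_hat : (N,) final proficiency estimates
    b_pool    : (P,) b-parameters (or theta_max values) of all items in the pool
    """
    mean_b = b_admin.mean(axis=1)                               # mean difficulty per examinee
    corr_index = np.corrcoef(mean_b, theta_hat)[0, 1]           # Equation (3.2)
    ratio_index = mean_b.std(ddof=1) / theta_hat.std(ddof=1)    # Equation (3.3)
    mean_within_var = b_admin.var(axis=1, ddof=1).mean()        # average within-examinee variance
    prv_index = 1.0 - mean_within_var / b_pool.var(ddof=1)      # Equation (3.4)
    return corr_index, ratio_index, prv_index
```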
3.1.4 Percent of optimal information index

Kingsbury and Wise (2018) introduced a new statistical indicator to measure the amount of adaptation using the IRT test information that can be applied not only to a group of examinees but also to individual or subgroup test events. Their index, called the percent of optimal information (POI), is based on the ratio of the observed test information to the maximum information possible given the item pool and the IRT model, and is defined as follows:

$\text{POI}_j = \dfrac{TI_j^{\text{obs}}}{TI_j^{\text{opt}}} \times 100$   (3.5)

where $TI_j^{\text{obs}}$ is the actual test information observed for examinee $j$ at his or her final estimated proficiency, and $TI_j^{\text{opt}}$ is the optimal amount of test information that can be obtained by administering a 40-item test at the true proficiency level of the examinee from the given item pool and IRT model. An alternative way to compute the optimal information is to calculate the test information available in the pool from the most informative test items at the final estimated proficiency. By summation over the examinees in the group of interest (where $J$ refers to the group size), the POI index can also be used as an overall measure of adaptation. This index is easily interpretable, but it has not yet been thoroughly examined across numerous item pool conditions or constraints on item selection. In addition, it is still blind to the extent to which information is obtained during the CAT process based on the interim proficiency estimates.

3.2 New Conditional Measures of the Amount of Adaptation

This dissertation proposes three new indices to investigate the amount of adaptation based on (1) the deviations of the b-parameters of the items administered to each examinee from the interim proficiency estimates (deviation of difficulty; DOD), (2) their within-examinee variances (conditional proportion of reduction in variance; CPRV), and (3) the IRT information (ratio of information; ROI). These statistics rest on the same assumptions and the same goal of the CAT as the overall adaptation measures (Reckase et al., 2018). The only difference is that these new measures can evaluate the various aspects of adaptivity that result from the implementation of the CAT conditional on the proficiency level or by subgroups of test events. Also, they focus more on the characteristics of the items based on the interim proficiency estimates during the CAT process, instead of the final proficiency estimates at the end of the test. Note that, as with the overall adaptation statistics described above, the b-parameters can be replaced with the locations of maximum information (Birnbaum, 1968) when the 3PL model is used for scaling and scoring.

3.2.1 Deviation of difficulty index

The first index that the study proposes is the deviation of difficulty (DOD) index, which focuses on the observed difference between the b-parameter of the administered item and the proficiency estimate at which that item was selected (i.e., the desired item difficulty). The DOD index can assess how well a CAT uses the available item pool to match item characteristics to the examinee's provisional proficiency estimate. It can also allude to how well the potential efficiency of an item is realized, given that the expected efficiency of an item is highest when the examinee's proficiency is close to the location of that item. The DOD index for each examinee $j$ is, over the test items administered, the average proportionate reduction of the observed deviation between the item location and the interim proficiency estimate relative to the average deviation of all the eligible items in the pool from the current proficiency estimate.
The index is represented by:

$\text{DOD}_j = 1 - \dfrac{1}{L_j - L_0} \sum_{i=L_0+1}^{L_j} \dfrac{\left| b_{ij} - \hat{\theta}_{j(i-1)} \right|}{\frac{1}{n_{\hat{\theta}}} \sum_{k=1}^{n_{\hat{\theta}}} \left| b_k - \hat{\theta}_{j(i-1)} \right|}$   (3.6)

where $\hat{\theta}_{j(i-1)}$ is examinee $j$'s interim proficiency estimate prior to the $i$th item, $b_{ij}$ is the difficulty parameter of the $i$th item for examinee $j$, $L_j$ is the test length for examinee $j$, and $n_{\hat{\theta}}$ is the number of available items in the pool at the interim proficiency estimate $\hat{\theta}$. Note that $L_0$ is the number of initial items before the first update of the proficiency estimate occurs. For instance, for fully adaptive testing, a single initial item is generally administered, while for multistage adaptive testing, a group of items in the routing module may be administered. Since there is no interim proficiency estimate other than the arbitrary starting value prior to selecting the first item(s), the initial item set is not taken into account in the index calculation.

The DOD index is similar in concept to a difficulty match (DM) statistic for examinee $j$ that scales the deviation between the administered item's difficulty and the provisional proficiency estimate by the standard deviation of the proficiency level estimates. While useful, with an interpretation similar to that of z-scores, such a DM index does not have an upper limit on its possible values, and it considers only the distance of the difficulty from the provisional proficiency estimate. It thus requires some criterion set in advance to serve as the upper limit indicating a highly informative test. The proposed DOD index, however, is readily interpretable. The value of 1.0, the highest attainable, indicates a test event in which items were perfectly matched to the momentary proficiency estimate at each item selection. The distance from 1.0 represents the average deviation of the difficulty of the administered items from the interim proficiency estimates relative to the average deviation of the difficulty of all the eligible items in the pool from those estimates. Thus, a higher value of $\text{DOD}_j$ indicates a higher level of match, implying better adaptation in that the CAT is providing, throughout the test administration, close to the maximum information available at the interim proficiency estimate.

3.2.2 Conditional proportion of reduction in variance index

The second proposed index is the conditional proportion of reduction in variance (CPRV) index, which is a modified version of the PRV index. It determines whether the item pool has sufficient items in the region of the final proficiency estimate of each examinee. This index would be particularly useful in a situation where the item selection algorithm may have to select some items whose difficulty parameters poorly match the current proficiency estimate. In that case, the variation of the difficulty parameters for that examinee might be large, even though the mean difficulty might be close to the final proficiency estimate. The CPRV is expressed as:

$\text{CPRV}_j = 1 - \dfrac{s^{2}_{b_j}}{s^{2}_{b,\text{pool}}}$   (3.7)

where $s^{2}_{b,\text{pool}}$ is the variance of the b-parameters for all the items in the item pool and $s^{2}_{b_j}$ is the within-examinee variance of the b-parameters for the items administered to examinee $j$. Like the PRV index, a value deviating from 1.0 indicates the amount of variation in difficulty for the items selected for examinee $j$ compared to the amount of variation in difficulty for the items in the full item pool.

3.2.3 Ratio of information index

The last index, called the ratio of information (ROI) index, is equal to, over the administered items for examinee $j$, the average ratio of the information of each item at the interim proficiency estimate prior to selecting the $k$th item $i$ for examinee $j$, $I_i(\hat{\theta}_{j(k-1)})$, to the maximum potential information that item $i$ can have, $I_i(\theta_i^{\max})$:

$\text{ROI}_j = \dfrac{1}{L_j} \sum_{k=1}^{L_j} \dfrac{I_i\!\left(\hat{\theta}_{j(k-1)}\right)}{I_i\!\left(\theta_i^{\max}\right)}$   (3.8)

where $\theta_i^{\max}$ is the point at which item $i$ reaches its maximum information (Birnbaum, 1968; e.g., $\theta_i^{\max} = b_i$ for the 1PL/2PL model).
Alternatively, the observed information of each of the administered items can be computed at the final proficiency estimate, $\hat{\theta}_j$, rather than at the interim estimates. While readily useful in practice, this approach could be blind to the appropriateness of the items customized to the examinee in the middle of the CAT administration. For instance, if a high-proficiency student happens to miss the first item, the student will get a low proficiency estimate after that item, leading to the student receiving a couple of relatively easy items over the next few selections; if the student then improves the proficiency estimate continuously by answering all the remaining items correctly, the ROI value computed at the final proficiency estimate will suggest that the test was relatively uninformative, because those early items provide little information near the final estimate. The original ROI index in Equation (3.8), in contrast, indicates that informative items were presented to the student during the CAT process. Overall, this index can assess how informative a test is compared to the maximum potential information that the administered items could have. The ROI index can range from 0.0 to 1.0. A value of 1.0 indicates that a test was appropriately constructed and administered to the examinee using items whose information peaks at the examinee's provisional proficiency estimates. The higher the values, the more informative the test is for the examinee.

In addition to the use of the ROI conditional on the proficiency level in Equation (3.8), it can also be used for the overall diagnosis of adaptivity by simply averaging the ROI values over the entire group of examinees:

$\overline{\text{ROI}} = \dfrac{1}{N} \sum_{j=1}^{N} \text{ROI}_j$   (3.9)

where $N$ is the total number of examinees who took the adaptive test. It would help test developers or practitioners understand the overall picture of the adaptivity of CATs at the entire-group level or for target groups of interest.

The ROI index is originally derived from the concept of relative efficiency (Lord & Novick, 1968), which compares Fisher information functions. This may also be in line with the POI index (Kingsbury & Wise, 2018). However, the ROI index is conceptualized differently from the POI index with respect to the definition of the optimal information (i.e., the denominator of the index). Kingsbury and Wise (2018) identified the optimal test information through the administration of the entire test at the true or final proficiency level using the actual item pool. They also stated that the optimal information can be defined using the theoretical limit from the known value of the maximum information given the Rasch model. This is actually a similar concept to the item-pool utilization index (Gönülates, 2015) used for evaluating the efficiency of item pool performance. For the proposed ROI index, however, the optimal information focuses more on the maximum potential that the administered item has; this is similar to the expected item efficiency that Han (2012) used as a step in item selection. Thus, the ROI index is expected to evaluate the adaptivity of the CAT from whether an administered item fulfills the maximum level of attainable information at the interim proficiency estimate. Moreover, the ROI index reflects how informatively the items are presented to the examinees during the whole process of the CAT, while the POI index cares more about whether informative items are provided to the examinee around the final proficiency estimate.
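To summarize the three proposed conditional indices in computational form, a small sketch is given below (Python, illustrative; the dissertation used MATLAB). The function and argument names are hypothetical, and the eligible-item set in the DOD denominator is simplified to a fixed pool array.

```python
import numpy as np

def dod_index(b_admin, theta_interim, b_pool, n_initial=1):
    """Deviation of difficulty (Equation 3.6) for one examinee.

    b_admin       : b-parameters of the administered items, in administration order
    theta_interim : interim proficiency estimate in place before each item was selected
    b_pool        : b-parameters of the items eligible at each selection (simplified here
                    to the full pool)
    n_initial     : number of initial items given before the first proficiency update
    """
    ratios = []
    for b_i, th in zip(b_admin[n_initial:], theta_interim[n_initial:]):
        pool_dev = np.mean(np.abs(b_pool - th))       # average deviation of eligible pool items
        ratios.append(abs(b_i - th) / pool_dev)       # relative deviation of the administered item
    return 1.0 - float(np.mean(ratios))

def cprv_index(b_admin, b_pool):
    """Conditional proportion of reduction in variance (Equation 3.7) for one examinee."""
    return 1.0 - np.var(b_admin, ddof=1) / np.var(b_pool, ddof=1)

def roi_index(info_at_interim, info_at_max):
    """Ratio of information (Equation 3.8): average ratio of the information each administered
    item delivered at the interim estimate to the item's maximum potential information."""
    return float(np.mean(np.asarray(info_at_interim) / np.asarray(info_at_max)))
```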
CHAPTER 4. METHODS

This dissertation proceeds with five main studies to evaluate the feasibility and utility of three new conditional indices for measuring the amount of adaptation in the implementation of computerized adaptive testing (CAT) with dichotomously scored items, using simulated data and real operational CAT data. All the analyses were conducted using MATLAB R2015b (The MathWorks, Inc., 1984-2015), and the visualization of the results was completed using R software (R Core Team, 2018). This chapter describes the research designs for answering the five research questions (see Section 1.2).

4.1 Common CAT Specifications

All item-level CATs in the first four studies share some common CAT specifications. The CAT was a fixed-length test of 40 items. An initial proficiency estimate of 0.0 was used for all examinees. Items were selected using the maximum Fisher information (MFI) algorithm, which chooses the item to be administered that has the maximum information at the current proficiency estimate. Other than in Research Question 1, maximum likelihood estimation (MLE) was used to estimate the interim and final proficiency once both correct and incorrect responses existed in the response string. When only correct or only incorrect responses are present, the maximum likelihood estimates are infinite. To deal with this problem, prior to MLE, the last proficiency estimate was incremented by a step size of 0.7 after a correct response and decremented by 0.7 after an incorrect response (Patience & Reckase, 1980; Reckase, 1975). Also, maximum likelihood estimates were confined between -4 and 4 to restrict extreme proficiency estimates to a practical interval.

For each study condition, item-level CATs were administered to 2,000 examinees randomly sampled from a standard normal distribution, N(0, 1). This sample size is reflective of large-scale operational testing settings and provides a representative sample of the proficiency distribution. To generate item responses, a random number was drawn from a uniform distribution ranging from 0.0 to 1.0. The random uniform number was then compared to the model-based probability of a correct response to the item. If the probability of a correct response was greater than the random number, a score of 1 was assigned as a response; otherwise, a score of 0 was recorded. In all cases, the results were replicated 50 times for computing the stability of the adaptivity statistics.
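The specifications above can be condensed into a single simulation loop. The sketch below (Python, illustrative; the dissertation's simulations were written in MATLAB) implements MFI selection, uniform-number response generation, the plus/minus 0.7 step rule, and bounded MLE for the 3PL model. All names are hypothetical.

```python
import numpy as np

D = 1.7
rng = np.random.default_rng(2019)

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def info3pl(theta, a, b, c):
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) / (1 - c)) ** 2 * (1 - p) / p

def mle(u, a, b, c, start=0.0, n_iter=25):
    """Fisher-scoring MLE bounded to [-4, 4]."""
    th = start
    for _ in range(n_iter):
        p = p3pl(th, a, b, c)
        th += np.sum(D * a * (u - p) * (p - c) / (p * (1 - c))) / np.sum(info3pl(th, a, b, c))
        th = float(np.clip(th, -4.0, 4.0))
    return th

def simulate_cat(theta_true, a, b, c, test_length=40, step=0.7):
    """Fixed-length MFI CAT for one simulated examinee under the common specifications."""
    theta_hat, used, items, scores = 0.0, np.zeros(len(b), bool), [], []
    for _ in range(test_length):
        info = np.where(used, -np.inf, info3pl(theta_hat, a, b, c))
        i = int(np.argmax(info))                                     # MFI item selection
        used[i] = True
        u = int(rng.uniform() < p3pl(theta_true, a[i], b[i], c[i]))  # response generation
        items.append(i)
        scores.append(u)
        if len(set(scores)) < 2:                                     # all correct or all incorrect so far
            theta_hat += step if u == 1 else -step
        else:
            idx = np.array(items)
            theta_hat = mle(np.array(scores), a[idx], b[idx], c[idx], start=theta_hat)
    return theta_hat, items, scores
```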
4.2 Research Question 1

The first set of simulations was intended to evaluate the sensitivity of the three conditional adaptation indices to various item-pool quality conditions, proficiency estimators, and IRT models, with the goal of providing some guidelines for interpreting these indices. It was hypothesized that the values of the conditional adaptivity indices would increase as the item pool size and item pool spread increase, but that the indices would rarely be affected by proficiency-estimation methods and IRT models.

4.2.1 Item pool

To simulate an item pool for the 3PL model as realistically as possible, the simulated item pool was modeled after the multivariate distribution of the a-, b-, and c-parameters, but with different marginal distributions, using only the multiple-choice items in the item pool from the Minnesota Comprehensive Assessment (MCA) Grade 6 Mathematics adaptive assessment. The descriptive statistics and zero-order Pearson correlations of the item parameters are presented in Table 4.1. Specifically, while taking into account the correlations among the item parameters, a-parameters were drawn from a lognormal distribution, b-parameters from a normal distribution, and c-parameters from a beta distribution. Based on this multivariate distribution, sets of item pools that act like the empirical pool were generated according to the simulation conditions of interest. Note that an item pool based on the Rasch model was generated taking into account only the distribution of b-parameters shown in Table 4.1.

Table 4.1
Descriptive Statistics and Zero-Order Correlations of Item Parameters for the Item Pool from the Minnesota Comprehensive Assessment (MCA) Grade 6 Mathematics Adaptive Test (n = 635)

                      Descriptive Statistics                  Correlations
                 M       SD      Min.     Max.      a-par.    b-par.    c-par.
a-parameter     1.03    0.30     0.20     1.99        --
b-parameter     0.27    0.95    -2.53     3.14       .24        --
c-parameter     0.16    0.10     0.00     0.60       .06       .00        --

4.2.2 Simulation design

The first simulation study was conducted to examine the sensitivity of the three conditional indices to item pool characteristics, proficiency estimator, and IRT model. Here, the item-pool quality was operationalized in two ways: (1) item-pool size and (2) item-pool spread in b-parameters. These item-pool characteristics were fully crossed with proficiency estimators and IRT models, forming a total of 72 conditions (10 pool sizes × 2 proficiency estimators × 2 IRT models + 8 pool spreads × 2 proficiency estimators × 2 IRT models).

4.2.2.1 Item pool size

Using each IRT model, 10 item pools were generated that varied in item pool size from 50 to 500 in increments of 50. First, using the observed multivariate distribution described above, 500 sets of item parameters were generated that had descriptive statistics and correlations as similar as possible to the empirical set in Table 4.1. These full sets of item parameters were then randomly divided into 10 sets of 50 items each, in a way that item characteristics were similar across the 10 sets. The first set of 50 items was used for the simulation of a 50-item pool. The first set was then combined with the second set of 50 items to construct the 100-item pool. This process was repeated, adding a set of 50 items each time, until the simulation was conducted using the full set of 500 items in the pool. This way of creating item pools of different sizes allows a researcher to explore solely the relationships between pool size and the values of the adaptivity statistics. Otherwise, it is possible that a small item pool with high-quality items (i.e., items with high discrimination and small guessing parameters) could perform better than a larger pool with low-quality items, given no other constraints imposed on item selection. Table 4.2 summarizes the descriptive statistics and correlations among item parameters for the 10 generated item pools.

4.2.2.2 Item pool spread

Another aspect of item-pool characteristics that could affect the amount of adaptation is the degree of spread in the difficulty of the items in the pool. That is, if the difficulty of items is in a limited range, even with a large item pool, the adaptive test cannot be suitably customized for examinees who are located outside that range. To quantify this situation, a CAT was simulated using eight 400-item pools that differed in the standard deviation of b-parameters from 0.2 to 1.6 in increments of 0.2, with the mean of the b-parameters set to 0.0.
Other a - and c - parameters were controlled to be the same as the 400 - item pool generated above and were also fixed across all the item pools. Table 4. 3 displays the summary of the simulated item pools by the level of spread in b - para meter. In all the conditions manipulated by the pool spread, the item 41 pool size was 400, which is at least 10 times larger than the test length of 40 items as recommended by Stocking (1994). 4.2.2.3 Proficiency estimation methods Two proficiency estimation metho ds were considered. One was MLE, most frequently used in operational settings, and the other was expected a posteriori (EAP; Bock & Mislevy, 1982) using the standard normal distribution as the prior distribution and using 81 evenly spaced quadrature points to determine the posterior distribution. 4.2.2.4 IRT models The performance of the three conditional indices were further inspected using two IRT models: (1) Rasch model (i.e., one - parameter logistic model) and (2) three - parameter logistic (3PL) model. Comparing the performance between the two IRT models can inform us of how the c - parameter affect the stability of the indices over the proficiency continuum. 42 Table 4 . 2 Descriptive Statistics of Generated Item Pools by Item Pool Size Pool Size a - parameter b - parameter c - parameter Correlation M SD Min. Max. M SD Min. Max. M SD Min. Max. ( a , b ) ( a , c ) ( b , c ) 50 0.98 0.27 0.57 1.91 0.10 1.00 - 1.85 1.89 0.17 0.09 0.01 0.54 .24 .00 .06 100 1.00 0.26 0.57 1.91 0.28 0.94 - 1.85 2.29 0.16 0.10 0.01 0.59 .27 .00 - .03 150 1.01 0.27 0.54 1.95 0.33 1.00 - 2.43 2.96 0.16 0.11 0.01 0.59 .24 .05 - .01 200 0.99 0.27 0.54 1.95 0.31 0.98 - 2.43 2.96 0.16 0.10 0.01 0.59 .24 .08 - .00 250 1.01 0.29 0.54 2.07 0.28 0.95 - 2.43 2.96 0.16 0.10 0.01 0.59 .23 .01 .02 300 1.01 0.29 0.52 2.23 0.27 0.96 - 2.43 2.96 0.16 0.10 0.01 0.59 .26 .04 .00 350 1.02 0.30 0.52 2.23 0.28 0.96 - 2.43 2.96 0.16 0.10 0.01 0.59 .28 .05 .04 400 1.02 0.30 0.52 2.23 0.27 0.96 - 2.43 2.96 0.16 0.10 0.01 0.59 .24 .07 .03 450 1.02 0.29 0.46 2.23 0.27 0.96 - 2.43 3.19 0.16 0.10 0.01 0.59 .26 .06 .00 500 1.03 0.30 0.46 2.35 0.27 0.95 - 2.43 3.19 0.16 0.10 0.01 0.60 .24 .03 - .01 Note. The simulated item pool based on the Rasch model had the same distribution of b - parameters. 43 Table 4 . 3 Descriptive Statistics of Generated Item Pools by Item Pool Spread (n = 400) SD of b - parameter a - parameter b - parameter c - parameter Correlation M SD Min. Max. M SD Min. Max. M SD Min. Max. ( a , b ) ( a , c ) ( b , c ) 0.2 1.02 0.30 0.52 2.23 0.00 0.20 - 0.55 0.54 0.16 0.10 0.01 0.59 .03 .07 .05 0.4 1.02 0.30 0.52 2.23 0.02 0.41 - 1.11 1.36 0.16 0.10 0.01 0.59 .03 .07 .05 0.6 1.02 0.30 0.52 2.23 0.01 0.60 - 1.58 1.83 0.16 0.10 0.01 0.59 - .04 .07 - .01 0.8 1.02 0.30 0.52 2.23 0.02 0.81 - 1.84 2.60 0.16 0.10 0.01 0.59 .13 .07 .01 1.0 1.02 0.30 0.52 2.23 0.00 1.00 - 2.65 3.11 0.16 0.10 0.01 0.59 .08 .07 - .04 1.2 1.02 0.30 0.52 2.23 0.02 1.21 - 2.84 3.30 0.16 0.10 0.01 0.59 - .04 .07 - .08 1.4 1.02 0.30 0.52 2.23 0.01 1.41 - 3.62 3.83 0.16 0.10 0.01 0.59 .01 .07 .04 1.6 1.02 0.30 0.52 2.23 0.01 1.59 - 4.20 4.86 0.16 0.10 0.01 0.59 .02 .07 .07 Note. The simulated item pool based on the Rasch model had the same distribution of b - parameters. 44 4.2.3 Evaluation criteria For each condition, the recovery of proficiency estimates and the amount of adaptation were evaluated. 
For the precision and accuracy of the final proficiency estimates, conditional statistics including bias (CB), the standard error of measurement based on test information (CTSEM), and the root mean square error (CRMSE) were computed at each proficiency level:

$\text{CB}(\theta_j) = \dfrac{1}{R} \sum_{r=1}^{R} \left( \hat{\theta}_{jr} - \theta_j \right)$   (4.1)

$\text{CTSEM}(\theta_j) = \dfrac{1}{R} \sum_{r=1}^{R} \left[ \sum_{i=1}^{L} I_i\!\left( \hat{\theta}_{jr} \right) \right]^{-1/2}$   (4.2)

$\text{CRMSE}(\theta_j) = \sqrt{ \dfrac{1}{R} \sum_{r=1}^{R} \left( \hat{\theta}_{jr} - \theta_j \right)^{2} }$   (4.3)

where $\hat{\theta}_{jr}$ is the final proficiency estimate for examinee $j$ in replication $r$, $\theta_j$ is the true proficiency of examinee $j$, $I_i(\hat{\theta})$ is the Fisher information of the $i$th item at the current estimate, $L$ is the test length, and $R$ is the total number of replications (i.e., R = 50).

Overall statistics were also considered to provide summary information about the recovery aggregated over the proficiency levels. The overall statistics, including bias, TSEM, RMSE, and the Pearson correlation between the true and final estimates of proficiency (i.e., the fidelity coefficient, $r_{\theta\hat{\theta}}$; McBride, 1977), were computed across all examinees within a single replication, where $N$ is the total number of examinees, $\bar{\theta}$ is the mean of the true proficiency values over the $N$ examinees, and $\bar{\hat{\theta}}$ is the mean of the final proficiency estimates over the $N$ examinees:

$\text{Bias} = \dfrac{1}{N} \sum_{j=1}^{N} \left( \hat{\theta}_j - \theta_j \right)$   (4.4)

$\text{TSEM} = \dfrac{1}{N} \sum_{j=1}^{N} \left[ \sum_{i=1}^{L} I_i\!\left( \hat{\theta}_j \right) \right]^{-1/2}$   (4.5)

$\text{RMSE} = \sqrt{ \dfrac{1}{N} \sum_{j=1}^{N} \left( \hat{\theta}_j - \theta_j \right)^{2} }$   (4.6)

$r_{\theta\hat{\theta}} = \dfrac{ \sum_{j=1}^{N} \left( \theta_j - \bar{\theta} \right)\left( \hat{\theta}_j - \bar{\hat{\theta}} \right) }{ \sqrt{ \sum_{j=1}^{N} \left( \theta_j - \bar{\theta} \right)^{2} \sum_{j=1}^{N} \left( \hat{\theta}_j - \bar{\hat{\theta}} \right)^{2} } }$   (4.7)

More importantly, to evaluate the performance of the statistical indicators of the amount of adaptation, the existing adaptation measures (Kingsbury & Wise, 2018; Reckase et al., 2018) and the conditional adaptation measures I proposed were calculated using the equations listed in (3.2) through (3.8). Furthermore, the relationship between the proposed conditional adaptation indices and the conditional measurement statistics was visually inspected via scatter plots.

4.3 Research Question 2

To demonstrate the practical utility of the proposed conditional measures of the amount of adaptation as diagnostic tools for improving the adaptivity of a CAT, this research question investigated how many items need to be added to attain an acceptable level of adaptation over the proficiency continuum under a hypothesized scenario. Suppose a state-wide achievement testing program is planning to improve the adaptivity of its adaptive test across the entire range of proficiency levels. To do this, the first basic step that the program wants to take is to revise the item pool so that the CAT is well matched to each student's proficiency estimate. The students' proficiency follows an approximately normal distribution, locating many students in the middle proficiency levels. However, for this proficiency population and the currently available item pool, some ranges of proficiency levels are adequately served by the customization of the CAT, whereas other areas may not be. To improve adaptivity in a proficiency area that is currently below the criterion, items whose information peaks over that area need to be added to the existing item pool. The items to be added are selected from the master pool, which is usually available in real-world operational settings.

4.3.1 Item pool

The 300-item pool for the 3PL model developed for the item-pool-size study in the first research question was employed as the existing item pool to be revised. The reason for choosing that pool size is that the item-pool-size study would suggest that a pool size of 300 presents acceptable adaptation while, at the same time, leaving room to approach a better level of adaptation for a fixed-length CAT of 40 items. Next, the master pool typically contains a much larger number of items than required by a CAT. I generated 3,000 sets of item parameters for the master pool that mimic the target distribution of item parameters described in Table 4.1, taking into account the correlations among the a-, b-, and c-parameters.
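One way to generate such correlated item parameters with lognormal, normal, and beta marginals is a Gaussian copula, sketched below (Python with NumPy/SciPy, illustrative; the dissertation's generation procedure in MATLAB may differ). The marginal parameters are chosen to approximate the moments in Table 4.1, and the copula only approximately preserves the target Pearson correlations.

```python
import numpy as np
from scipy import stats

def generate_item_parameters(n_items, rng, r_ab=0.24, r_ac=0.06, r_bc=0.00):
    """Draw correlated (a, b, c) item parameters via a Gaussian copula."""
    corr = np.array([[1.0, r_ab, r_ac],
                     [r_ab, 1.0, r_bc],
                     [r_ac, r_bc, 1.0]])
    z = rng.multivariate_normal(np.zeros(3), corr, size=n_items)   # correlated normals
    u = stats.norm.cdf(z)                                          # map to uniforms
    # Marginals chosen to roughly match Table 4.1 (M, SD): a ~ (1.03, 0.30),
    # b ~ (0.27, 0.95), c ~ (0.16, 0.10)
    a = stats.lognorm.ppf(u[:, 0], s=0.28, scale=0.99)
    b = stats.norm.ppf(u[:, 1], loc=0.27, scale=0.95)
    c = stats.beta.ppf(u[:, 2], a=2.0, b=10.5)
    return a, b, c

# Example: a, b, c = generate_item_parameters(3000, np.random.default_rng(7))
```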
The resulting distributions of item parameters for the master pool were very similar to the target of Table 4.1. The distributions of b - parameters for the master pool is presented in Figure 4.1 . 47 Figure 4 . 1 . Item distribution for the master pool (N = 3,000). 4.3.2 Simulation p rocedure First, 2,000 examinees were randomly sampled from a standard normal distribution, N (0, 1). A 40 - item CAT was then administered to examinees using the 300 - item pool with the 3PL model. I then identified the values of the conditional adaptation measures, w hich were considered as a baseline. Looking at those values over the entire proficiency level range , I determined which region on the proficiency scale needs to improve adaptivity of the CAT giving more appropriate items which are proficiency level. Then, better targeting items were added that ought to be sufficient to gain the desired level of adaptivity. At - region under the benchmark values of the three statistics, the fixed numbers of items to be added to the item pool are 5, 10, 15, 20, 30, 40, 50, and 100. 48 4.3.3 Evaluation c riteria To answer the second research question, three conditional measures for the amount of adaptation were computed using equations in ( 3.6 ) through ( 3.8 ) to see whether the test can be labeled as existing item pool. Conditional statistics listed in ( 4.1 ) to ( 4.3 ) were also calculated for checking to what extent the measurement precision of the proficiency estima tes was improved after the item pool was revised. 4.4 Research Question 3 Imposing constraints on item selection for exposure control may contribute to reducing the amount of adaptation during a CAT. In Research Question 3, I examined whether the proposed i ndices can properly capture the changes of adaptivity resulting from the exposure - control procedure over the proficiency continuum. 4.4.1 Simulation d esign I designed the effects of exposure - control procedures moderated by item pool design s to emphasize the cap ability of the indices that identify the differences in adaptivity given by item pool quality and constraints on item selection . In total, there are eight conditions (2 item pools 4 exposure control procedures). For each condition, CATs were administered to 2,000 examinees randomly sampled from the standard normal distribution, N (0, 1) over 50 replications . 4.4.1.1 Item pool I created two item pools: (1) an optimal operational item pool and (2) a regular operational item pool. First, the optimal item pool was designed using the bin - and - union method (see Reckase, 2010 for details on this procedure). The optimal pool has a sufficient number of items with a distribution that satisfies the desired features of a CAT administered to the target 49 population of examinees (e.g., Veldkamp & van der Linden, 2000). Using the bin - and - union method, the blueprint of the ideal item pool design can be identified in terms of the distribution of items, item characteristics, an d item pool size for the predetermined CAT specifications of interest. More specifically, the ideal item pool was first determined by tallying the number of selected items needed in each range of the proficiency estimates, which are spe cifi ed on the proficiency scale, producing a target distribution for items over bins. The bin size is the median range of near maximum information, which was determined based on having information within 90% of the maximum for an item. In this case, the bin wi dth was 0.7. 
To design the ideal pool, through simulations, a 40 - item CAT selected from the master pool was administered to 2,000 examinees, and as each CAT is administered and the union of the required items is taken, the ideal item pool grows in size when simulated examinees that are different than those previously selected are chosen. Here, the master pool was the same as one that was already created in Research Question 2. As seen in Figure 4.2, the size of the ideal item pool quickly grows earl y proficiency range is covered. The number of items at the asymptote is an estimate of the number of items needed for the ideal item pool. Since the simulation is a random process, it was replicated 10 times and then the median pool size and the median value in each of the bins were determined for the ideal item pool. The median of the sizes for the ideal item pool was 400 items. 50 Figure 4 . 2 . Number of items needed in the ideal item pool for a 3PL - based CAT of 40 items. However, the ideal item pool is sometimes not realistic because it requires items for extremely high or low proficiency levels that are not enco untered very often in practice. Therefore, after identifying the distribution of items for the ideal item pool, items from a master pool then filled in the requirements of the frequency distribution over bins in the ideal pool design. Items were selected t hat had maximum information for the proficiency range defined by the bin boundaries. In some bins, no items were available in the master item pool. The union of the selected items is viewed as the optimal operational item po ol because t his does not contain extremely easy or difficult items, it can be considered reasonably an optimal pool, in practice. The size determined for this optimal pool was 300. To make the two pools of similar size, the regular operational item pool consisted of 300 items. The 300 - it em pool developed in Research Question 1 was used as a typical operational pool because that pool was generated in a way that mimick ed the real item pool from the state - wide testing program. 51 Table 4.4 presents the distributions of items over bins for the ideal item pool, the optimal item pool, and the regular item pool. Compared to the other two item pools, the ideal item pool had 32 items with maximum information in the - 3.85 to - 3.15 bin, 35 items with the maximum in the - 3.15 to - 2.45 bin, and 34 items with the maximum in the 3.15 to 3.85 bin. These items were relatively extreme in difficulty given the distribution of items. Since the master pool did not have sufficient items that had maximu m information at such extreme proficiency regions, the optimal operation item pool had 1, 6, and 5 items, respectively in those ranges on the proficiency scale. Other than the extremes, the optimal item pool had almost identical distribution of items to th e ideal item pool that included items fairly uniformly distributed from - 2.45 to 3.15. Meanwhile, the real item pool had a visibly narrow er distribution of items with the largest frequency in the 0.35 to 1.05 range of the proficiency scale. Given the purpo se of the test is not to classify students into mastery vs. non - mastery using the single cut - off score but to attain equal measurement precision over the proficiency continuum, at the least the test would need more easy items for the low proficiency studen ts. Despite the same pool size, the optimal pool and the regular pool apparently had a different distribution, which is visualized in Figure 4.3. 
(a) Regular Item Pool (b) Optimal Item Pool Figure 4 . 3 . Distribution of b - parameters for the regular and optimal item pools. 52 Table 4 . 4 Item Distributions for Item Pools Considered in Research Question 3 Bin Boundaries - Scale) Ideal Item Pool Optimal Item Pool Regular Item Pool Lower bound Upper bound - 3.85 - 3.15 32 1 0 - 3.15 - 2.45 35 6 0 - 2.45 - 1.75 37 24 3 - 1.75 - 1.05 38 38 13 - 1.05 - 0.35 39 39 53 - 0.3 5 0.3 5 39 39 73 0.3 5 1.05 38 38 88 1.05 1.75 38 38 45 1.75 2.45 37 37 23 2.45 3.15 35 35 1 3.15 3.85 34 5 1 Total 400 300 300 4.4.1.2 Exposure control methods Along with a no - exposure control as a reference, the study considered three commonly used exposure - control methods. The first exposure - control approach is the randomesque procedure (Kingsbury & Zara, 1989), in which an item to be administered is randomly s elected from the N items that have the best information at the current proficiency estimate. In this study, one item was selected out of the 10 most informative items at the current proficiency estimate. The second procedure is the Sympson - Hetter method (Sympson & Hetter, 1985) with a target rate of maximum item exposure, which was 0.20 in this study. This method is a probabilistic item exposure control in CAT by separati ng the item selection process from the item administration proc ess. Specifically, this approach employs a simulation of the CAT procedure using the actual item pool to determine how often items will be selected for administration given an expected distribution of examinees. In this process, an exposure control 53 paramet er is estimated for each item in the item pool, which is the conditional probability that the item will be administered if that item is selected. The control parameters have to be determined through an iterative process of the CAT until the exposure contro l parameters are stabilized. In this study, the stable values of the exposure control parameters were obtained after 15 iterations of the CAT process with each of the regular and optimal item pools. Figure 4.4 presents the distribution of the estimated exp osure control parameters for the two item pools. For these pool s , over 125 items had exposure control parameter values of 1.0, which means that no exposure control was needed for these items. These items might be unused or underexposed in the CAT process. For the regular item pool, over 50 items had the control parameter values of around 0.4, while for the optimal item pool, about 100 items had the control parameters of around 0.3 and 0.4. Figure 4 . 4 . Distribution of exposure control parameters for the Sympson - Hetter procedure for the regular item pool (left) and the optimal item pool (right) of 300 items. 54 Lastly, the a - stratified with b - blocking procedure (BAS; Chang et al., 2001) was considered for exposure control. For the implementation of BAS, the item pool was partitioned into four levels (strata) based on the magnitude of the a - parameters, but the strata were blocked on the b - parameter to ensure that the mean and standard deviations for the b - parameters were abo ut identical across the four strata. That is, the item pool was first sorted according to the magnitude of the b - parameters and divided into 75 groups with each group consisting of four items that were homogeneous in the b - parameter. 
Then, starting with th e first block of four items, the item with the lowest a - parameter was located in the lowest stratum, the item with the next lowest a - parameter in the second stratum, and so on. This procedure continued for each block of items to create four strata of item pools that differed in the magnitude of a - parameters but spanned the similar range of b - parameters. Note that for BAS, an item was s elected with its b - parameter closest to the interim estimate of proficiency instead of the MFI item selection. 4.4.2 Evaluation c riteria Similar to previous research questions, conditional statistics in Equations (3.6) through (3.8) and overall statistics for the amount of adaptation in Equations (3.2) through (3.4) were compared across eight conditions along with statistics for evaluating measurement precision and accuracy in the proficiency estimates. In addition, so as to further examine test security, I rep orted the distribution of observed item e xposure rates, computed using Equation ( 4.8 ). ( 4.8 ) where t is how many times an item i was administered and N is the total number of examinees. 55 4.5 Research Question 4 The fourth simulation study investigated the utility of these conditional adaptivity measures to identify the difference in the amount of adaptation that occurs when a MST is used instead of a fully item - level CAT, moderated b y different item pool designs. T wo study factors were manipulated: (a) item pool design and (b) adaptive test design. Since all factors studied were fully crossed with each other, four conditions (2 item pool × 2 test design) were examined. For each condition, each adaptive test was adminis tered to a simulated sample of 2,000 examinees over 50 replications . 4.5.1 Item p ool A s with Research Question 3, two types of item pools were used . One is an optimal item pool that had more uniform counts of items across the proficiency levels. The other is a regular item pool, which is a bell - shaped distribution of items usually found, in practice. In this study, I used the same item pools that were created in the study for Research Question 3. The regular item pool contained more items whose information pe aked in the range of - - Scale, whereas the optimal item pool included items of which difficulty were broadly distributed. Again, the size of the two pool s was 300. 4.5.2 Test d esign The test length for both CAT (i.e., item - level adaptive tes t design) and 1 - 2 - 3 three - stage MST (i.e., module/stage - level adaptive design) was 40 items. A fixed length CAT design with the same specifications as above was employed. Regarding the 1 - 2 - 3 MST desi gn (as presented in Figure 4.5 ), I utilized a single panel with increasing module length, staring with the short module in the routing test and ending with a longer module in Stage 3 (i.e., 10 - 10 - 20 design). Prior research (e.g., Kim & Kim, 2018; Reckase et al., 2017; Svetina, Liaw, Rutkow ski, & 56 Rutkowski, 2019) found that administering few items in the beginning stage and more items in the last stage tended to produce more accurate final proficiency estimates. Figure 4 . 5 . A 1 - 2 - 3 three - stage MST design used in the study. For the stage and module configurations, two MST designs were formed from the different item pools. From each of the item pools, a routing module in Stage 1 was constructed TIF as closely as possible based on a single decision point of 0.0 to route examinees to one of two second stage modules. 
4.5 Research Question 4

The fourth simulation study investigated the utility of the conditional adaptivity measures for identifying differences in the amount of adaptation that occur when an MST is used instead of a fully item-level CAT, moderated by different item pool designs. Two study factors were manipulated: (a) item pool design and (b) adaptive test design. Because the factors were fully crossed, four conditions (2 item pools × 2 test designs) were examined. For each condition, the adaptive test was administered to a simulated sample of 2,000 examinees over 50 replications.

4.5.1 Item pool

As with Research Question 3, two types of item pools were used. One is an optimal item pool with more uniform counts of items across the proficiency levels. The other is a regular item pool, with the bell-shaped distribution of items usually found in practice. In this study, I used the same item pools created for Research Question 3. The regular item pool contained more items whose information peaked in the middle range of the θ-scale, whereas the optimal item pool included items whose difficulties were broadly distributed. Again, the size of both pools was 300 items.

4.5.2 Test design

The test length for both the CAT (i.e., the item-level adaptive design) and the 1-2-3 three-stage MST (i.e., the module/stage-level adaptive design) was 40 items. A fixed-length CAT design with the same specifications as above was employed. For the 1-2-3 MST design (presented in Figure 4.5), I used a single panel with increasing module length, starting with a short module in the routing stage and ending with a longer module in Stage 3 (i.e., a 10-10-20 design). Prior research (e.g., Kim & Kim, 2018; Reckase et al., 2017; Svetina, Liaw, Rutkowski, & Rutkowski, 2019) found that administering fewer items in the beginning stage and more items in the last stage tended to produce more accurate final proficiency estimates.

Figure 4.5. A 1-2-3 three-stage MST design used in the study.

For the stage and module configurations, two MST designs were formed, one from each item pool. From each pool, a routing module in Stage 1 was constructed so that its test information function (TIF) peaked as closely as possible to the single decision point of 0.0 used to route examinees to one of the two second-stage modules. The second-stage modules were designed to make accurate classifications of examinees into the three modules in Stage 3, so items were selected whose TIFs peaked at the cut-off points of -1 and +1 for the easy and difficult second-stage modules, respectively. Lastly, items for the third-stage modules were selected to provide approximately uniform information for the final estimates across the proficiency levels, taking into account the amount of information obtained from the previous stages. Thus, in Stage 3, the easy module was designed for the proficiency range from -2 to -1, the medium module for the θ-range from -1 to 1, and the difficult module for the θ-range from 1 to 2. Table 4.5 displays descriptive statistics of the item difficulty parameters by stage and module for each MST design. The medium module in Stage 3 consisted of relatively easy items, as more informative items were needed to make the information curve flat over the θ-range of -1 to 1.

Among the numerous possible routing strategies, each examinee was routed through the modules that matched his or her proficiency level as closely as possible. Examinees were routed based on the IRT MLE proficiency estimate and were not allowed to take non-adjacent paths, based on the findings of previous research (Kim & Kim, 2018; Svetina et al., 2019). The TIFs for the four possible paths of each MST design (Figure 4.6) showed that the height of the TIF over the middle range of proficiency was higher for the MST built from the regular item pool than for the MST built from the optimal item pool.

Table 4.5
Descriptive Statistics of b-Parameters by Stage for Each MST Design

                                             b-parameters
Stage     Module       Number of Items    M        SD       Min      Max
Regular Item Pool
Stage 1   Routing            10          -0.02     0.33    -0.62     0.40
Stage 2   Easy               10          -0.93     0.18    -1.21    -0.65
          Difficult          10           0.86     0.27     0.48     1.30
Stage 3   Easy               20          -1.35     0.65    -2.43    -0.15
          Medium             20          -0.29     0.18    -0.60    -0.07
          Difficult          20           1.50     0.14     1.26     1.81
Optimal Item Pool
Stage 1   Routing            10           0.07     0.22    -0.25     0.33
Stage 2   Easy               10          -0.86    -1.35    -1.18     0.15
          Difficult          10           0.99     0.28     0.62     1.43
Stage 3   Easy               10          -1.66     0.80    -2.58    -0.14
          Medium             10          -0.45     0.33    -1.15    -0.09
          Difficult          20           1.53     0.18     1.19     1.88
Note. Routing points were 0.0 for the first-stage module, -1 for the easy module in Stage 2, and +1 for the difficult module in Stage 2.

Figure 4.6. Information function for each path of the 10-10-20 MST using the regular item pool and the optimal item pool.
Note. Path 1 = Stage 1 → Easy in Stage 2 → Easy in Stage 3; Path 2 = Stage 1 → Easy in Stage 2 → Medium in Stage 3; Path 3 = Stage 1 → Difficult in Stage 2 → Medium in Stage 3; Path 4 = Stage 1 → Difficult in Stage 2 → Difficult in Stage 3.

4.5.3 Evaluation criteria

The performance of the two test designs using the different item pools was evaluated in terms of measurement precision and the amount of adaptation. First, I examined how accurately and precisely proficiency was estimated using bias, TSEM, and RMSE over the sample of examinees, as given in Equations (4.4) to (4.6), and conditional on the proficiency levels, as given in Equations (4.1) to (4.3). The correlation between true and estimated proficiencies in Equation (4.7) was also calculated to gauge the relation between the two. More importantly, the conditional adaptation measures proposed in Equations (3.6) to (3.8) were calculated to investigate adaptivity at each proficiency level. The existing overall adaptation indices in Equations (3.2) through (3.4) were also computed to summarize adaptivity over the entire group of examinees.
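The routing logic of the 1-2-3 design can be summarized in a short sketch. The code below routes on the MLE proficiency estimate using the decision points of 0.0, -1, and +1 given in the note to Table 4.5 and blocks the non-adjacent paths; the function names are hypothetical and the module labels are simplified.

def route_stage2(theta_hat, cut=0.0):
    # Stage 1 -> Stage 2: a single decision point of 0.0
    return "easy" if theta_hat < cut else "difficult"

def route_stage3(theta_hat, stage2_module, low_cut=-1.0, high_cut=1.0):
    # Stage 2 -> Stage 3: decision points of -1 and +1, with the non-adjacent
    # paths (easy -> difficult, difficult -> easy) disallowed
    if theta_hat < low_cut:
        stage3_module = "easy"
    elif theta_hat < high_cut:
        stage3_module = "medium"
    else:
        stage3_module = "difficult"
    if stage2_module == "easy" and stage3_module == "difficult":
        stage3_module = "medium"
    if stage2_module == "difficult" and stage3_module == "easy":
        stage3_module = "medium"
    return stage3_module

Under this rule, the admissible paths are exactly the four listed in the note to Figure 4.6.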
4.6 Research Question 5

For the last research question, the performance of the proposed adaptivity measures was examined using real operational CAT data from the National Council Licensure Examination for Registered Nurses (NCLEX-RN). The NCLEX-RN (National Council of State Boards of Nursing [NCSBN], 2017) is a nursing licensure examination, delivered as a computerized adaptive test, that measures the abilities that are essential for the entry-level nurse to use in order to meet the needs of clients (NCSBN, 2016). The full sample for this quarterly administration period was about 70,000 examinees. Rather than computing the adaptation statistics on the entire sample, 35 samples of 2,000 examinees each were randomly drawn from the total sample. The adaptation measures were therefore computed over 35 samples of 2,000 examinees, which also allowed the stability of the adaptation values to be evaluated. In what follows, the NCLEX-RN exam is described in terms of its CAT specifications and the item pool used in this study.

4.6.1 CAT specifications for the NCLEX-RN exam

The NCLEX-RN examination employs a Rasch-based, variable-length CAT. On an operational examination, proficiency is first estimated using Owen's Bayesian estimation (Vale & Weiss, 1977) with a prior mean of -1.0 logits and a prior standard deviation of 2.0, until both correct and incorrect responses exist for an examinee. The proficiency estimate is then updated using MLE with Newton-Raphson iterations. An examinee starts with an item whose difficulty is 1.0 logit below the cut-off score. Under the current NCLEX-RN decision rule, an examinee passes when the proficiency estimate is 1.65 times the standard error (a one-tailed 95% confidence interval) above the cut score. By the same logic, an examinee fails when the proficiency estimate is 1.65 times the standard error below the cut score. The standard error is recalculated after each item. Under this standard-error-based stopping rule, which produces a variable-length CAT, the minimum test length is 60 operational items and the maximum is 250. Each examinee also takes an additional 15 pretest items, administered between the 10th and 60th operational items; these items are not included in proficiency estimation or in the analyses of this study. If the confidence-interval rule cannot clearly determine whether an examinee passes or fails, the decision is made on the basis of the proficiency estimate after the final item has been taken, by examining whether that final estimate exceeds the cut score.
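A minimal sketch of the confidence-interval stopping rule just described is given below: pass when the estimate is 1.65 standard errors above the cut score, fail when it is 1.65 standard errors below, and otherwise continue, subject to the 60-item minimum and 250-item maximum. The function name and the default cut score of 0.0 are placeholders, and the fallback at the maximum length follows the final-estimate rule described above.

def nclex_decision(theta_hat, se, n_operational, cut=0.0, z=1.65,
                   min_len=60, max_len=250):
    # Variable-length stopping rule: a one-tailed 95% confidence interval
    # around the proficiency estimate is compared with the cut score.
    if n_operational < min_len:
        return "continue"
    if theta_hat - z * se > cut:
        return "pass"
    if theta_hat + z * se < cut:
        return "fail"
    if n_operational >= max_len:
        # at the maximum length, decide from the final estimate itself
        return "pass" if theta_hat > cut else "fail"
    return "continue"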
More importantly, the NCLEX-RN exam has three parts to its content and item selection procedure. First, the computer system determines the number of items for each of the eight content strands for the minimum-length exam; every examinee receives the same number of items per content area for the first 60 items, as shown in Table 4.6. Second, the order of items is determined by randomly selecting a content area with equal probability. Once a content area has been exhausted (i.e., an examinee has taken the maximum number of items from that category), items from that content area are no longer administered during the minimum-length test. Third, after the minimum-length test, the content strand presenting the greatest divergence from the desired testing percentage is selected (Kingsbury & Zara, 1989). The divergence from the desired percentage is computed using the following formula:

divergence = TPC - (N / T) × 100,    (4.9)

where TPC is the target percentage for the content strand, N is the number of items previously presented from that content area, and T is the total number of items previously presented. After the content area is determined, the 15 items with maximum information at the current proficiency estimate are identified, and one of these 15 items is randomly selected and administered to the examinee.

Table 4.6
Content Distribution of the First 60 Items for the NCLEX-RN in 2016

Content Strand                                         Number of Items   Target %   Lowest %-Highest %
Content 1  Management of Care                                12              20           17-23
Content 2  Safety and Infection Control                       7              12            9-15
Content 3  Health Promotion and Maintenance                   6               9            6-12
Content 4  Psychosocial Integrity                             5               9            6-12
Content 5  Basic Care and Comfort                             5               9            6-12
Content 6  Pharmacological and Parenteral Therapies           9              15           12-18
Content 7  Reduction of Risk Potential                        7              12            9-15
Content 8  Physiological Adaptation                           9              14           11-17
Total                                                        60             100

4.6.2 Item pool

For this quarterly period in 2016, the NCLEX-RN exam used an operational item pool of 1,244 items across eight content areas. Table 4.7 summarizes the descriptive statistics of the b-parameters for the item pool used in this study. The distribution of b-parameters was similar across all eight content strands, with the mean of the b-parameters close to the cut-off score of 0.0. As shown in Figure 4.7, the information for all content areas peaked around 0.0, indicating that the item pool includes adequately informative items near the cut score for the NCLEX-RN exam. As expected, the amount of information was greater for the content areas containing more items.

Table 4.7
Descriptive Statistics of b-Parameters for the NCLEX-RN Item Pool

                                 b-parameters
Content Strand      n        M        SD       Min.      Max.
Content 1          248      0.02     0.85     -2.35      2.22
Content 2          150      0.03     0.83     -2.27      2.19
Content 3          112      0.06     0.83     -2.22      2.30
Content 4          112      0.00     0.84     -2.13      2.23
Content 5          112      0.00     0.79     -2.24      2.07
Content 6          186      0.04     0.81     -2.24      2.17
Content 7          150      0.01     0.84     -2.32      2.22
Content 8          174      0.06     0.83     -2.24      2.19
Total            1,244      0.03     0.83     -2.35      2.30

Figure 4.7. Information function by content strand for the NCLEX-RN item pool.

4.6.3 Evaluation criteria

To investigate the amount of adaptation for the NCLEX-RN exam during this quarter of 2016 (April to June), the three conditional adaptivity indices in Equations (3.6) to (3.8) and the three overall adaptivity indices were computed to evaluate adaptivity at individual proficiency levels and over the entire sample of examinees.

CHAPTER 5. RESULTS

This chapter summarizes the results of the analyses, organized into five sections corresponding to the five research questions described in Chapter 1. The first four sections present the results of the comprehensive simulation studies that investigated the feasibility and utility of the proposed adaptation indices across proficiency levels under numerous conditions. The last section presents the empirical demonstration of the conditional adaptivity indices using the real data from the licensure and certification exam.

5.1 Research Question 1

In the first study, comprehensive simulations were conducted to examine the sensitivity of the proposed adaptivity statistics to item pool characteristics, IRT models, and proficiency estimation methods.
In particular, item pool characteristics related to the quality of an item pool, were manip ulated by (1) varying item pool size and (2) varying i tem pool spread. These manipulations help understan d how well CATs use the available item pool to match item proficiency continuum . The following two major sections summarize the impacts of these two aspects across different IRT models and proficiency estimation methods in terms of the amount of adaptation as well as measurement accuracy and precision. For each of the studied conditions, all results of the 40 - item CATs were averaged across 50 replications. 5.1.1 Variation in i t em p ool s ize First, item pool size was investigated to see whether three new conditional adaptation statistics sensitively identify the differences in the amount of adaptation for the CATs using 10 65 item pools that differed in size from 50 to 500 in increments of 50. This section descr ibes the effect of item pool size across two proficiency estimators (MLE and EAP) when each of IRT models (Rasch model and 3PL model) was used in the CATs, respectively. To better understand the new indices, measurement accuracy and precision were first in spected, followed by the amount of adaptation. 5.1.1.1 Rasch m odel Measurement accuracy and precision . proficiency estimates were evaluated using conditional and overall statistics for measurement accuracy and precision. The smaller bias, test - information - based standard error of measurement (TSEM), and root mean square error (RMSE) are, the better the recovery of prof iciency estimates is. Also, higher correlation between true and final proficiency estimates ( ) is associated with better recovery. Conditional statistics . Figure 5.1 shows the mean bias, TSEM, and RMSE across evenly - spaced bins on the proficiency ) continuum. The MLE approach presented little bias in proficiency estimates, while the EAP approach reported bias regressing the proficiency estimates - continuum. With the EAP, in other words, the pro ficiency - scale but overestimated at the negative extremes. The TSEM and RMSE values displayed the slight U - - continuum than at the moderate proficiency region. The degree of the U - shape pattern was obviously greater for the EAP compared to the MLE. Furthermore, with the bigger item pool, the proficiency estimates appeared to be better recovered, especially for the extreme ends of the scale regardless of their estimation approaches. 66 Overall statistics. Table 5.1 presents the summary information about overall accuracy and precision of proficiency estimate s over the entire sample of examinees. As the pool size increases, the values for bias, TSEM, and RMSE were small, and the correlation coefficients were high. Although the correlations were almost identical between MLE and EAP, EAP produced slightly higher overall bias but lower overall standard errors compared to MLE across the 10 item pools. The differences were getting negligible with larger item pool s , though. Table 5 . 
1 Overall Statistics of Measurement Pr ecision of Proficiency Estimates for a Rasch - based CAT by Item Pool Size and Proficiency Estimato r Pool Size MLE EAP Bias TSEM RMSE Bias TSEM RMSE 50 - 0.002 0.366 0.371 0.941 - 0.006 0.356 0.338 0.942 100 - 0.004 0.338 0.342 0.948 - 0.002 0.331 0.317 0.949 150 0.000 0.332 0.336 0.949 - 0.004 0.327 0.314 0.950 200 - 0.001 0.330 0.333 0.950 - 0.001 0.325 0.313 0.951 250 - 0.001 0.329 0.332 0.950 - 0.003 0.325 0.312 0.951 300 0.000 0.328 0.332 0.950 - 0.003 0.324 0.311 0.951 350 - 0.001 0.328 0.331 0.950 - 0.002 0.324 0.311 0.951 400 0.001 0.328 0.330 0.950 - 0.003 0.324 0.312 0.951 450 0.000 0.327 0.329 0.951 - 0.002 0.323 0.310 0.951 500 - 0.003 0.327 0.329 0.951 - 0.001 0.323 0.311 0.951 67 Figure 5 . 1 . Conditional bias, TSEM, and RMSE of proficiency estimates for a Rasch - based CAT by item pool size and proficiency estimator . 68 Amount of adaptation . Adaptivity for CATs were evaluated over the entire sample of examinees usi ng the existing overall statistics, as well as at the individual proficiency levels using the conditional statistical indicators proposed in this dissertation. Conditional adaptivity . To evaluate the amount of adaptation for CAT contingent on the proficie ncy levels, three adaptation statistics were proposed in this dissertation, including deviation of difficulty (DOD), conditional proportion of reduction in variance (CPRV), and ratio of information (ROI) indices. As shown in Figure 5.2, all three adaptivit y m easures did sensitively detect differences in the amount of customization across the proficiency levels for the CATs using the varying sizes of the item pools. That is, the increase in the item pool size led to higher values of the DOD, CPRV, and ROI indices, implying better adaptation. However, for a given 40 - item CAT and its administration procedur e, there was not much improvement in adaptivity for pool sizes greater than 300 . T he POI index (Kingsbury & Wise, 2018) exhibited little sensitiv ity to item pool size s and proficiency estimation methods (see Figure 5.2) . Additionally, based on the ribbon r epresenting the empirical standard error in Figure 5.2, the values of three adaptation indices appeared to be stable over the 50 replications, though the CPRV index was relatively less precise. Focusing on the individual proficiency levels, moderate profi ciency rang ing from - 0 . 5 to 1. 5 - scale produced better adaptivity compared to other proficiency levels across the 10 item pools. This implies that the CAT appropriately selected relatively good items adapted to the proficiency estimates from an item pool, and also the item pool contained enough informative items for the students whose proficiency were around the middle level. Interestingly, using the smallest item pool of 50 items reported a slightly different pattern of the th ree measures compared to other item pools. Although the ROI and DOD indices showed approximately a 69 = - 1) took the items whose difficulty varied more than Student item pool contained relatively more informative items for Student B compared to those for Student A. Regarding the comparison of results between the MLE and EAP approaches, the DOD and ROI measures looked more sensitive t o the MLE, as the range of their values over the proficiency continuum was greater for the MLE. 
Also, the values of these two indices for the EAP were generally larger than those for the MLE especially at the extreme regions of the proficiency scale, which is associated with the features of the EAP estimator in terms of the accuracy of proficiency estimates as well as the distribution of the item pool. That is, EAP proficiency estimates were more under - or overestimated at the extreme regions (see Figure 5. 3), and the given bell - shaped item pool included fewer good items for the very high or low proficiency levels. Accordingly, their biased proficiency estimates provided more chances of administering informative items to students during the CAT administratio n. This is evidence t hat the adaptivity of CAT is closely related to the performance of the proficiency estimation. Meanwhile, CPRV presented a similar pattern between the two estimators, but MLE reported generally lower values with greater empirical stand ard errors compared to EAP. The latter can be also explained by the property of the proficiency estimation. The MLE used a step size of 0.7 until the examinee had both correct and incorrect responses in the beginning of the CAT, resulting in selecting more heterogeneous items for determining the approximate location of proficiency earlier in the test. The EAP, however, had the benefit of using a prior for the estimation earlier in the test but the biased estimates regressed toward the mean of the prior 70 dist ribution, leading to CPRV values that were more stable over the replications and higher in the broader proficiency regions. In sum mary , these results suggest that a value in the mid 0.90s for the DOD index, a value in the high 0.70s for the CPRV, and a value in the high 0.90s for the ROI index indicate good adaptation when the Rasch model is used for the CATs with MLE. For the CATs using the Rasch model and EAP, a value in the high 0.90s for the DOD, and a value in high .80s for the CPRV, and a value in the high 0.90s indicate good adaptation. Overall adaptivity . Table 5.2 reported the results of overall adaptation measures. As expected , the increase in the item pool size was more likely due to increasing the values of the correlation, ratio of SDs, and PRV indices. Similar to conditional adaptivity indices, adaptivity was not much improved for item pool sizes larger than 300 items. Howe ver, the POI index yielded an unexpected pattern; as the pool size increases, the POI index decreases. It may be due to the fact that with the smaller item pool, there may be little variation in the administered test items selected between using the provis ional proficiency estimate and the final/true proficiency estimate, leading to the POI value of 1.0. Regarding the proficiency estimation methods, the ratio of SDs index was smaller for EAP than for MLE due to the property of the EAP estimates shrinking to the mean of a prior distribution, allowing the CAT to select relatively more homogeneous items for each examinee. On the contrary, other indices including the correlation, PRV, and POI measures were slightly higher for EAP. MLE results for the 40 - item te st provided evidence to support the benchmark values from a previous study with the 30 - item test (Reckase et al., 2018): low 0.90s for the correlation index, mid 0.80s for the ratio of SDs index, and about 0.80 for the PRV index. 
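The overall indices referred to here are defined in Equations (3.2) through (3.4), which are not reproduced in this chapter. As a rough illustration only, the sketch below computes a correlation-type index, a ratio-of-SDs index, and a PRV-type index from simulated CAT output, following the verbal descriptions in Reckase et al. (2018) rather than the exact formulas; the function name and the precise operationalizations should be read as assumptions.

import numpy as np

def overall_adaptation(theta_hat, b_administered, b_pool):
    # theta_hat      : (n_examinees,) final proficiency estimates
    # b_administered : (n_examinees, test_length) difficulties of administered items
    # b_pool         : (pool_size,) difficulties of all items in the pool
    mean_b = b_administered.mean(axis=1)            # mean difficulty per examinee
    within_var = b_administered.var(axis=1)         # within-examinee difficulty variance
    r = np.corrcoef(theta_hat, mean_b)[0, 1]        # correlation-type index
    ratio_sd = mean_b.std() / theta_hat.std()       # ratio-of-SDs-type index
    prv = 1.0 - within_var.mean() / b_pool.var()    # PRV-type index
    return r, ratio_sd, prv

Values near the benchmarks discussed here would then indicate that the administered difficulties track the proficiency estimates closely and vary little within examinees relative to the pool.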
EAP results for the 40 - ite m test additionally suggest some benchmark values for interpreting these overall indices for 71 CATs; a value in the mid 0.90s for the correlation index, a value in the high 0.70s for the ratio of SDs index, and a value in the high 0.80s for the PRV can be co nsidered good adaptation. Relationship between c onditional a daptivity and m easurement p recision . To visualize their relationships, the TSEM values were plotted against the DOD, CPRV, and ROI measures, respectively by item pool size and proficiency estimat or (see Figure 5.4). Within each item pool size condition, the DOD and the ROI measures showed similarly a negative and curvilinear association with the standard errors (TSEM). Despite their nonlinearity, the Pearson correlation coefficients were computed for information purposes only, yielding greater than .90 across all conditions. However, the CPRV measure did not d isplay a n obvious linear or curvilinear relation with the standard errors . Instead, there were two lines identified for MLE and EAP, which may be due to the larger TSEM values at the positive and negative extremes of proficiency (remember, the U - shaped distribution of TSEM) but relatively constant CPRV values at these regions. A clear finding is that the standard errors were apparently small when CPRV is greater than or equal to its benchmark value, or vice versa . In addition, plots for the relation of RMSE and the adaptation indices are provided in Appendix A . Because RMSE cons iders both bias and standard errors, their relationships were not as apparent as those with TSEM. 72 Figure 5 . 2 . Conditional adaptivity statistics (DOD, CPRV, and ROI) for a Rasch - based CAT by item pool size and proficiency estimator. 73 Figure 5 . 3 . Plot of a POI index for a Rasch - based CAT by item pool size and proficiency estimator. F igure 5 . 4 . Relationship of TSEM with conditional adaptivity indices (DOD, CPRV, and ROI) for a Rasch - based CAT by item pool size and proficiency estimator. 74 Figure 5.4 . 75 Table 5 . 2 Overall Adaptation Statistics for a Rasch - based CAT by Item Pool Size and Proficiency Estimator Pool Size MLE EAP PRV POI PRV POI M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) 50 0.86 ( .003 ) 0.27 ( .003 ) 0.36 ( .001 ) 99.79 ( .016 ) 0.89 ( .003 ) 0.31 ( .003 ) 0.37 ( .001 ) 99.97 ( .004 ) 100 0.90 ( .003 ) 0.61 ( .006 ) 0.73 ( .002 ) 97.91 ( .060 ) 0.95 ( .002 ) 0.62 ( .004 ) 0.79 ( .001 ) 99.03 ( .033 ) 150 0.92 ( .002 ) 0.74 ( .006 ) 0.80 ( .002 ) 96.88 ( .063 ) 0.96 ( .002 ) 0.71 ( .005 ) 0.87 ( .001 ) 98.25 ( .049 ) 200 0.93 ( .002 ) 0.78 ( .006 ) 0.81 ( .003 ) 96.32 ( .075 ) 0.96 ( .001 ) 0.74 ( .005 ) 0.89 ( .001 ) 97.70 ( .052 ) 250 0.93 ( .002 ) 0.81 ( .005 ) 0.80 ( .003 ) 95.99 ( .071 ) 0.96 ( .001 ) 0.76 ( .005 ) 0.88 ( .001 ) 97.44 ( .049 ) 300 0.94 ( . 002 ) 0.83 ( .007 ) 0.80 ( .003 ) 95.77 ( .079 ) 0.96 ( .001 ) 0.78 ( .004 ) 0.89 ( .001 ) 97.24 ( .051 ) 350 0.94 ( .002 ) 0.85 ( .007 ) 0.80 ( .003 ) 95.60 ( .065 ) 0.96 ( .001 ) 0.78 ( .005 ) 0.89 ( .002 ) 97.11 ( .052 ) 400 0.94 ( .002 ) 0.86 ( .007 ) 0.80 ( .003 ) 95.51 ( .085 ) 0.96 ( .001 ) 0.79 ( .005 ) 0.89 ( .002 ) 97.01 ( .051 ) 450 0.94 ( .002 ) 0.88 ( .005 ) 0.80 ( .003 ) 95.41 ( .070 ) 0.96 ( .001 ) 0.80 ( .005 ) 0.89 ( .001 ) 96.95 ( .054 ) 500 0.94 ( .002 ) 0.88 ( .006 ) 0.80 ( .003 ) 95.35 ( .086 ) 0.96 ( .001 ) 0.80 ( .004 ) 0.89 ( .001 ) 96.89 ( .041 ) Note. 76 5.1.1.2 3PL m odel Measurement accuracy and precision . 
Like the results for the Rasch model, proficiency estimates were evaluated using conditional and overall statistics. Conditional statistics . Figure 5.5 shows the mean bias, TSEM, and RMSE across evenly - spaced bin s on the proficiency ) continuum. Unlike the findings for the Rasch model, both the MLE and EAP approaches presented more bias and large standard errors of proficiency estimates at the extreme ends of the proficiency continuum due to the adverse effect of c - parameters on the estimations (Thissen & Wainer, 1982). Specifically, the EAP approach for the 3PL model reported greater regress - toward - the - mean bias, whereas the MLE yielded much greater standard errors (TSEM) at the proficiency extremes. While there was more bias of prof iciency estimates, implying less accurate estimates for the EAP than for the MLE model, there were large standard errors (TSEM) in the estimates, implying less precision for MLE than for EAP. Considering both bias and standard errors, the RMSE values were higher for EAP at the extreme proficiency regions but they were similar to each other at the middle ranges of the proficiency scale. Moreover, wit h the smaller item pool, the proficiency estimates appeared to be less accurate and less stable regardless of their estimation approaches. Overall statistics. Overall accuracy and precision of proficiency estimates were summarized in Table 5.2. As the poo l size increased, the values for bias, TSEM, and RMSE decreased, and the correlation coefficients increased. Although the correlations and mean bias were similar to one another, the MLE yielded higher standard errors (TSEM), which results in larger RMSE va lues across the 10 item pools compared to the EAP. 77 Figure 5 . 5 . Conditional bias, TSEM, and RMSE of proficiency estimates for a 3PL - based CAT by item pool size and proficiency estimator. 78 Table 5 . 3 Overall Statistics of Measurement Precision of Proficiency Estimates for a 3PL - based CAT by Item Pool Size and Proficiency Estimator Pool Size MLE EAP Bias TSEM RMSE Bias TSEM RMSE 50 - 0.003 0.429 0.345 0.950 - 0.003 0.307 0.295 0.956 100 - 0.007 0.324 0.280 0.966 - 0.002 0.253 0.245 0.970 150 - 0.005 0.255 0.245 0.973 - 0.002 0.228 0.224 0.975 200 - 0.002 0.237 0.231 0.976 - 0.001 0.216 0.212 0.978 250 - 0.002 0.219 0.213 0.979 - 0.001 0.202 0.199 0.980 300 - 0.003 0.209 0.207 0.980 - 0.002 0.197 0.194 0.981 350 - 0.001 0.203 0.203 0.981 - 0.001 0.192 0.189 0.982 400 - 0.001 0.196 0.196 0.982 - 0.001 0.186 0.185 0.983 450 0.000 0.193 0.193 0.982 - 0.001 0.184 0.183 0.983 500 - 0.001 0.187 0.188 0.983 - 0.001 0.179 0.178 0.984 Amount of adaptation . Adaptivity for CATs were evaluated using the existing overall statistics as well as conditional statistical indicators proposed in this dissertation. Conditional adaptivity . R esults indicated that the three measures appeared to be sensitive to variation in item pool size across the proficiency levels shown in Figure 5.6 . As the pool size increased, all three measures showed higher values, indicating better adaptation. For the given 40 - item test and CAT administration procedure, there was not muc h improvement in th ree adaptivity indices for pool sizes greater than 300. Note that, compared to the patterns for the Rasch model, all the values of three adaptation measures were smaller across the proficiency levels. The middle proficiency area produced better adaptivity than other proficiency areas. 
Not only that, but with the smallest item pool of 50 items, the ROI and DOD indices s howed approximately a symmetric pattern centered on 0.0, whereas the CPRV values were asymmetrically distributed. 79 In addi tion, according to the shading ribbon representing the empirical standard error of the adaptation measures in Figure 5.6, the values of three adaptation indices appeared to be stable over the 50 replications, but there were relatively higher empirical standard errors in the measures at the very ends of the proficiency continuum. The latter might be due to the limited items av ailable for the extreme or due to the larger standard errors of the proficiency estimates at the very high or low proficiency levels, which is more likely affected by c - parameters. However, the POI index was neither sensitive to variation in item pool siz e nor to proficiency estimators in Figure 5.7 . In comparison to the EAP approach, three adaptation measures appeared to be more sensitive to the MLE across the 10 item pools , and t he empirical standard errors of these measures were larger for MLE . These mig ht be related to the measurement properties of the abilities to handle the unstable estimation issues caused by the c - parameters when the 3PL model is used. In particular, t he CPRV presented a similar pattern between the two estimators, but MLE reported slightly lower values with the greater empirical standard errors compared to EAP. The latter can also be explained by the p roperties of the proficiency estimation. Along with the c - parameter iss ue, MLE used a step size of 0.7 until MLE can be computed earlier in the CAT, resulting in selecting more heterogeneous items for determining the approximate location of the proficiency level. EAP, however, had the benefit of using a prior for the estimati on earlier in the test but presented biased estimates regressed toward the mean of the prior distribution, leading to the CPRV values that were more stable over the replications and higher in the broader proficiency regions. Taken all together, these r esults suggest some guidelines for interpreting the adaptation measures; a value in the mid 0 . 7 0 is good for D OD, a value in the low 0 .80 is considered a 80 good adaptation f or the CPRV, and a value in the mid 0 . is good for ROI when the 3PL model is used for scaling and scoring with MLE. Results of the 3PL CATs using EAP support the guidelines found using Rasch model, but they suggest a slightly higher benchmark value for the CPRV, which is Overal l adaptivity . Table 5.4 presents the findings of the overall adaptation measures. As item pool size increased, all three measures increased, but these values were not much improved for item pool sizes larger than 300. However, as identified in the pattern for the Rasch model, the POI index decreased, as item pool size increased. Unlike the Rasch model, the differences in the overall adaptivity measures between MLE and EAP were relatively small. While EAP presented slightly higher PRV values across the pool size conditions, other overall measures performed similarly regardless of the proficiency estimators, which is consistent with the findings from a previous study (Ju & Lee, 2018). Furthermore, these results confirmed again the benchmark values for interpr eting the overall adaptation statistics suggested from previous studies (Ju & Lee, 2018; Kim et al., 2018). 
A value in the high 0.90s for the correlation index and a value in the high 0.70s for the ratio of SDs index can be considered good adaptation for t he CAT regardless of the proficiency estimation methods. However, the benchmark value for the PRV index is a value in the low 0.80s for the CATs with combination of 3PL/MLE and a value in t he mid 0.80s for 3PL/EAP. Again, as with the CPRV index , which is a modified version of PRV, the PRV index might be affected by the properties of proficiency estimation approaches. Relationship between c onditional a daptivity and m easurement p recision . For brevity, plots of relations of the standard errors (TSEM) with the DOD, CPRV, and ROI indices are displayed in Figure 5.8, which were similar to the relations found using the Rasch model. For each item pool, the DOD and ROI measures were negatively r elated with the standard errors, 81 showing a slightly nonlinear pattern for both MLE and EAP. In spite of their nonlinearity, the Pearson correlation coefficients were computed for information purposes only, reporting very strong, negative relationships betw een the TSEM and either DOD or ROI. However, again, the CPRV measure did not present an apparent pattern for the relation of standard errors. The relations between RMSE and adaptivity indices are presented in Appendix A. 82 Figure 5 . 6 . Conditional adaptivity statistics (DOD, CPRV, and ROI) for a 3PL - based CAT by item pool size and proficiency estimator. 83 Figure 5 . 7 . Plot of a POI index for a 3PL - based CAT by item pool size and proficiency estimator. Figure 5 . 8 . Relationship of TSEM with conditional adaptivity indices (DOD, CPRV, and ROI) for a 3PL - based CAT by item pool size and profici ency estimator. 84 Figure 5.8 . 85 Table 5 . 4 Overall Adaptation Statistics for a 3PL - based CAT by Item Pool Size and Proficiency Estimator Pool Size MLE EAP PRV POI PRV POI M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) 50 0.83 ( .004 ) 0.25 (.002) 0.30 ( .001 ) 99.92 (.007) 0.86 ( .003 ) 0.28 ( .002 ) 0.30 ( .001 ) 99.95 ( .005 ) 100 0.93 ( .004 ) 0.55 (.005) 0.70 ( .001 ) 99.34 (.027) 0.96 ( .001 ) 0.59 ( .003 ) 0.71 ( .001 ) 99.48 ( .021 ) 150 0.95 ( .003 ) 0.66 (.006) 0.76 ( .001 ) 98.98 (.033) 0.97 ( .001 ) 0.69 ( .003 ) 0.77 ( .001 ) 99.16 ( .021 ) 200 0.96 ( .003 ) 0.7 0 (.004) 0.79 ( .001 ) 98.54 (.041) 0.98 ( .001 ) 0.73 ( .003 ) 0.81 ( .001 ) 98.82 ( .034 ) 250 0.97 (.003) 0.72 ( .005 ) 0.80 ( .001 ) 98.12 (.051) 0.98 ( .001 ) 0.74 ( .003 ) 0.82 ( .001 ) 98.51 ( .033 ) 300 0.97 ( .002 ) 0.74 ( .004 ) 0.81 ( .002 ) 97.68 (.060) 0.98 ( .001 ) 0.75 ( .002 ) 0.83 ( .001 ) 98.23 ( .042 ) 350 0.97 ( .002 ) 0.75 ( .004 ) 0.82 ( .002 ) 97.44 (.062) 0.99 ( .001 ) 0.76 ( .003 ) 0.84 ( .001 ) 98.07 ( .041 ) 400 0.98 ( .002 ) 0.77 (.003) 0.82 (.002) 97.29 (.059) 0.99 ( .001 ) 0.78 ( .003 ) 0.84 ( .001 ) 97.90 ( .038 ) 450 0.98 ( .002 ) 0.78 (.004) 0.82 ( .002 ) 97.16 (.070) 0.99 ( .001 ) 0.79 ( .003 ) 0.84 ( .001 ) 97.81 ( .045 ) 500 0.98 ( .00 1) 0.78 (.004) 0.83 (.002) 97.49 (.062) 0.99 ( .001 ) 0.79 ( .002 ) 0.85 ( .001 ) 97.94 ( .040 ) Note. 86 5.1.2 Variation in i tem p ool s pread Another characteristic of item pools that could affect the amount of adaptation is the magnitude of the spread of item characteristics ( b - parameter or location of maximum information) in the item pool. The second section of Research Question 1 aimed to exa mine the sensitivity of three conditional adaptation indices to variations in item pool spread. 
It was hypothesized that if the difficulty of the items in the pool is in a limited range, even though the item pool is large, the CAT cannot be suitably custom ized for students whose proficiency levels are outside that range covered by the item pool. To test the hypothesis, eight item pools were simulated that differed in the SD s of b - parameters in an item pool from 0.2 to 1.6 at 0.2 intervals. The size of all item pools considered here was 400. In the following, th e impact of item pool s pread was investigated moderated by IRT models (Rasch and 3PL) and proficiency estimators (MLE and EAP) in terms of measurement accuracy and precision as well as the amount of adaptation. 5.1.2.1 Rasch m odel Measurement accuracy and precision . The final proficiency estimates were evaluated using three conditional - and four overall statistics for measurement accuracy and precision. A smaller bias value indicates a more accurate proficiency estimate, and a smaller TSEM value indicates a more precise and stable proficiency estimate. The RMSE considers both bias and standard errors together so that a smaller RMSE is associated with better recovery of proficiency estimates. Also, the higher cor relation between true and final proficiency estimates ( ) is associated with better recovery. Conditional statistics . As shown in Figure 5.9, MLE presented little bias scattering the mean proficiency estimates around 0.0, while EAP showed biased est imate s regressed toward the 87 mean of a prior distribution (i.e., 0.0) on the proficiency scale. These findings were consistent across all eight item pools that had different SDs of b - parameters, though the limited item pool with small SD s of b - parameters sh owed slightly more bias in the estimates. However, MLE provided larger standard errors (TSEM) for the proficiency estimates than EAP, especially at the extreme regions of the proficiency continuum. As t he item pool spread was restricted, TSEM increase d, implying less precision in the proficiency estimates. Overall, the RMSE values of the two estimators were similar to each other at the moderate proficiency levels, whereas they were obviously greater for EAP at the extreme positive and negative ends of the proficiency scale. Overall statistics . Table 5.5 presents the summary of four overall statistics by variation in item pool spread and proficiency estimator. In general, slightly more bias was found in the estimates using EAP, while larger standard error s (TSEM) were identified using MLE over the entire sets of data. The differences either in bias or in standard errors between the two estimators were small, becoming negligible with larger item pool s . The correlation coefficients, , were almost ident ical between the two estimators, and the degree of the correlation improved as the SD s of b - parameters increased in the item pool. More interestingly, regardless of variations in item pool spread, the RMSE was smaller for EAP than for MLE, implying that th e final proficiency estimates were more accura tely, precisely measured using EAP , on average , over the entire sample of students. 88 Figure 5 . 9 . Conditional bias, TSEM, and RMSE of proficiency estimates for a Rasch - based CAT by item pool spread and proficiency estimator. 89 Table 5 . 
5 Overall Statistics of Measurement Precision of Proficiency Estimates for a Rasch - based CAT by Item Pool Spread and Pro ficiency Estimator Pool SD ( b s) MLE EAP Bias TSEM RMSE Bias TSEM RMSE 0.2 0.002 0.356 0.361 0.945 - 0.002 0.341 0.325 0.947 0.4 0.001 0.340 0.343 0.948 - 0.003 0.331 0.318 0.949 0.6 0.001 0.332 0.336 0.950 - 0.002 0.326 0.314 0.950 0.8 0.000 0.329 0.331 0.950 - 0.002 0.324 0.311 0.951 1.0 - 0.001 0.327 0.330 0.951 - 0.003 0.323 0.311 0.951 1.2 0.000 0.326 0.330 0.950 - 0.002 0.323 0.311 0.951 1.4 0.001 0.326 0.330 0.950 - 0.002 0.323 0.311 0.951 1.6 0.000 0.326 0.329 0.951 - 0.002 0.323 0.310 0.951 Note. b s = b - parameters. Amount of adaptation . The proposed conditional adaptivity indices, along with the overall indices, were used to assess the difference in adaptivity for the CATs. Conditional adaptivity . Figure 5.10 presents the distributions of three conditional adaptation indices over the proficiency continuum using eight item pools that varied in the SD s of b - parameters by two proficiency estimators. Overall, the three adaptivity measures sensitively detected each corresponding aspect of the amount of adaptation for the CATs depending on the extent of the item po ol spread . The proposed statistics generally increased as the b - parameters were more broadly spread out in the pool. In particular, at the extreme regions of the proficiency continuum, it was clearly observed that the values of the three indices gradually improved as the item pool contained more difficult or easy items . For the item pool with 1.6 SD of the b - parameters, the three measures indicated that the CATs were almost equally well adapted across all proficiency levels. Looking at the performance of each index, the DOD and ROI indices functioned as expected dependi ng on variation in item pool spread over the proficiency 90 continuum ; however, for the CPRV index, an unexpected pattern was observed when the SD of b - parameters in the item pool was very small (i.e., 0.2 or 0.4) . The CPRV values using those item pools were exceptionally small, closer to 0.0 or even below 0.0 on the moderate proficiency levels. Since the CPRV index compares the variation of the b - parameters of the items selected for each examinee relative to the variation of b - parameters in the entire item po ol, if the item pool includes most items whose difficulty were in a very restricted range, say 0.04 or 0.16 variances, even if the within - examinee variance is per se small, that variance could be larger relative to the item pool variance. Note that as with the findings for the item - po ol - size study, the POI index showed little sensitivity across the proficiency continuum and variation in the spread of b - parameters in the item pool (see Figure 5.11). The three adaptation measures showed similar patterns between the MLE and the EAP using the eight item pools, but their observed ranges of the values over the proficiency continuum were different with broader ranges for the MLE. That is, compared to the ML E, all three adaptation measures presented less variations in the corresponding values across the proficiency levels for the CATs using EAP, which was consistent across the eight item pools. Again, this might due to the features of the EAP estimator yieldi ng the regress - toward - the - mean bias in the estimates. Regarding the stability of the indices, the DOD and ROI measures reported small empirical standard errors over the proficiency scales regardless of variations in item pool spread. 
However, the CPRV inde x showed poor stability when the b - parameters were in a very restricted range in the pool, although the index was stable with the item pools that had the SD of b - parameters larger than 0.8. This instability was even greater when the MLE was used for the CA Ts because a fixed - step size of 0.7 was used earlier in the test until the MLE can be 91 distribution, N (0, 1), the variation of b - parameters for an item pool should be about equal to or greater than the variation in the final proficiency estimates in order for the CPRV index to be precise. Finally , these findings for the item - pool - spread study support the benchmark values of the proposed conditional indices, suggested in the previous item - pool - size study in Section 5.1.1, when the Rasch model is used for the CATs. For the 40 - item test, mid 0.90s for DOD, high 0.70s for CPRV, and high 0.90s for ROI indicate good adaption usi ng the MLE, while using the EAP, high 0.90s for DOD, high 0.80s for CPRV, and high 0.90s for ROI considered good adaptation. Overall adaptivity . The results of overall adaptivity for the item pool spread showed that the correlation, ratio of SDs, and P RV indices gradually improved as the spread of the item pool difficulty increased (see Table 5.6). Compared to the MLE, using the EAP reported larger values of the correlation and PRV measures but smaller values for the ratio of SDs index. As mentioned ear lier, the latter is attributed to the property of the EAP estimator. However, as with the results for the item pool size study, the POI index was rarely sensitive to the spread of the item pool difficulty, and its value decreased as the spread of the item pool increased. Overall, these results were in line with the item - pool - size study when selecting benchmark values for the measures. At the same time, it can be confirmed that based on the benchmark values, the variation of b - parameters for an item pool sho uld be larger than the variation in proficiency estimates for the CATs to be well adapted to students whose proficiency is at the extremes of the proficiency scale. Relationship between c onditional a daptivity and m easurement p recision . Figure 5.12 display s the relations of the standard errors (TSEM) with the DOD, CPRV, and ROI 92 indices. The DOD and ROI measures were negatively associated with the standard errors with a linear relationship but with a slightly nonlinear relation between DOD and TSEM for MLE w ith the limited spread of the item pool difficulty. The Pearson correlation coefficients w ere also computed, showing high, strong correlation with one another. The DOD and TSEM correlation s were in the range of - 0.98 to - 0.90 for EAP as well as in - 0.98 to - 0.74 for MLE. The ROI and TSEM correlation s were in the range of - 0.98 to - 0.81 for EAP as well as - 0.97 to - 0.81 for MLE. However, for the relation with CPRV, it is interesting to note that o n the one hand, when the SD of b - parameters in the pool was smaller than 0.6, the CPRV values were in general positively correlated with the TSEM values; on the other hand, when the spread of b - parameters was equal or greater than 0.8, their relationship a ppeared to be negative. Again, the former can be explained by the unusual pattern identified in Figure 5.10 with the restricted spread of the item pool. Also, when the spread of difficulty in the pool was large, the relation between CPRV and TSEM looked li near. 93 Figure 5 . 10 . 
Conditional adaptivity statistics (DOD, CPRV, and ROI) for a Rasch - based CAT by item pool spread and proficiency estimator. 94 Figure 5 . 11 . Plot of a POI index for a Rasch - based CAT by item pool spread and proficiency estimator. Figure 5 . 12 . Relationship of TSEM with conditional adaptivity indices for a Rasch - based CAT by item pool spread and proficiency estimat or. 95 Figure 5.12 . 96 Table 5 . 6 Overall Adaptation Statistics for a Rasch - based CAT by Item Pool Spread and Proficiency Estimator Pool SD ( b s) MLE EAP PRV POI PRV POI M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) 0.2 0.84 ( .003 ) 0.26 ( .003 ) 0.15 ( .016 ) 98.61 ( .034 ) 0.87 ( .003 ) 0.30 ( .002 ) 0.30 ( .012 ) 98.82 ( .027 ) 0.4 0.88 ( .003 ) 0.50 ( .004 ) 0.43 ( .011 ) 97.30 ( .064 ) 0.91 ( .002 ) 0.53 ( .004 ) 0.61 ( .006 ) 97.97 ( .053 ) 0.6 0.91 ( .003 ) 0.68 ( .004 ) 0.65 ( .005 ) 96.45 ( .066 ) 0.94 ( .002 ) 0.68 ( .004 ) 0.78 ( .003 ) 97.41 ( .052 ) 0.8 0.93 ( .002 ) 0.81 ( .004 ) 0.74 ( .004 ) 95.78 ( .087 ) 0.96 ( .001 ) 0.77 ( .004 ) 0.85 ( .002 ) 97.13 ( .056 ) 1.0 0.94 ( .002 ) 0.88 ( .006 ) 0.82 ( .002 ) 95.42 ( .076 ) 0.97 ( .001 ) 0.80 ( .004 ) 0.90 ( .001 ) 96.94 ( .048 ) 1.2 0.95 ( .002 ) 0.93 ( .006 ) 0.87 ( .002 ) 95.18 ( .082 ) 0.97 ( .001 ) 0.83 ( .005 ) 0.93 ( .001 ) 96.82 ( .056 ) 1.4 0.96 ( .001 ) 0.96 ( .006 ) 0.90 ( .001 ) 95.10 ( .078 ) 0.97 ( .001 ) 0.83 ( .004 ) 0.95 ( .001 ) 96.82 ( .052 ) 1.6 0.96 ( .001 ) 0.97 ( .005 ) 0.92 ( .001 ) 95.13 ( .073 ) 0.97 ( .001 ) 0.83 ( .004 ) 0.96 ( .001 ) 96.88 ( .049 ) Note. 97 5.1.2.2 3PL m odel Measurement accuracy and precision . The final proficiency estimates were evaluated using three conditional - and four overall statistics for measurement accuracy and precision. Conditional statistics . As shown in Figure 5.9, using the 3PL model, the MLE yielded small bias when the restricted spread of the item pool was used, while the EAP reported bias in the estimates regressed toward the mean of a prior distribution over the proficiency continuum regardless of the spread of the item pools. The extent of bias became smaller with the bigger spread of the item po ols. The estimates for the MLE were underestimated while those for the EAP were overestimated at the negative extreme region of the proficiency scale, and vice versa at the positive extreme proficiency region. Meanwhile, as with the results for the Rasch m odel, the MLE produced larger standard errors (TSEM) in the proficiency estimates compared to the EAP especially at the extremes of the proficiency scale. As the SD of b - parameters in the item pool increased, the standard errors reduced, indicating more precision in the estimates. Considering both bias and standard errors, the RMSE values were small to a similar extent for the two estimators at the middle proficiency le ve ls around 0.0, whereas they were greater at the extremes of the proficiency scale and even the EAP presented higher values at the very ends of the scale than MLE did . Overall statistics . In general, as the spread of the item pool increased, the standard errors (TSEM and RMSE) decreased and the correlation coefficients, , improved. Although the corr elations were similar between MLE and EAP, as similar to the results for the Rasch model, the grand means of stan dard errors were greater for MLE than for EAP. The differences became small as the SD of b - parameters increased, though. 
Again, it suggests that the final proficiency 98 esti m ates were more precisely measured using EAP on average over the entire group of students compared to MLE . Figure 5 . 13 . Conditional bias, TSEM, and RMSE of proficiency estimates for a 3PL - based CAT by item pool spread and proficiency estimator. 99 Table 5 . 7 Overall Statistics of Measurement Precision of Proficiency Estimates for a 3PL - based CAT by Item Pool Spread and Proficiency Estimator Pool SD ( b s) MLE EAP Bias TSEM RMSE Bias TSEM RMSE 0.2 - 0.003 0.376 0.320 0.960 - 0.001 0.253 0.234 0.973 0.4 - 0.001 0.295 0.262 0.971 - 0.001 0.218 0.208 0.978 0.6 0.000 0.244 0.225 0.978 - 0.001 0.195 0.190 0.982 0.8 0.002 0.196 0.195 0.982 0.000 0.183 0.179 0.984 1.0 0.003 0.182 0.184 0.984 - 0.001 0.177 0.175 0.985 1.2 0.004 0.178 0.180 0.985 - 0.001 0.176 0.176 0.985 1.4 0.005 0.177 0.180 0.985 - 0.001 0.176 0.176 0.985 1.6 0.005 0.175 0.178 0.985 - 0.001 0.174 0.174 0.985 Note. b s = b - parameters. Amount of a daptation. Adaptivity of the CATs was evaluated using the conditional measures at the individual proficiency level and the overall indices over the entire sample of students. Conditional adaptivity . Results for the effects of variation in item pool spread using the 3PL model indicated that the three adaptation measures (DOD, CPRV, and ROI) appea red to be sensitive to the proficiency levels (see Figure 5.14 ). The proposed statistics mostly increased as the diffic ulty parameters were more broadly spread out in the pool. In particular, at the extreme regions of the proficiency continuum, it was observed that the values of the three indices gradually improved as the item pool contained more difficult or easy items. G proficiency and the 40 - item test, these results suggested that the variation of b - parameters in an proficiency in order to achieve good adaptivity over the examinees, especially for those at the extremes of proficiency . However, as 100 found in the previous investigations, the POI index appeared to be in sensitive to the spread of items in the pool using the 3PL model, though the POI values wer e slightly lower at th e extremes of the proficiency scale regardless of the proficiency estimators (see Figure 5.15). Considering the concept of the index, it might be expected that given the available item pool, the optimal test information would be simil ar to the observed test information unless the interim estimate deviated far from the final estimate or the CAT included many constraints on the item selection. Compared to the results for the Rasch model in Section 5.1.2.1, even though the conditional ad aptation statistics were computed using the location of maximum information (Birnbaum, 1968) instead of b - parameters, the values of the three indices were relatively lower over the proficiency continuum when the 3PL model was used for the CATs. Also, the u nusual pattern was not identified in the plot of CPRV for the 0.2 SD of b - parameters for the item pool condition. These might be due to the effect of a - and c - parameters on proficiency estimation as well as on the information function, affecting the item selection procedure for the CATs using 3PL model. the Rasch model because of the characteristics of the r estricted item pool interacted with the effect of a - and c - parameters . 
With respect to the comparison between MLE and EAP, the three statistics showed similar patterns between the two proficiency estimators across the eight item pools, but their observed v alues were generally greater with the smaller ranges over the proficiency continuum for the EAP. Regarding the stability of the three measures, as with the results of the variation - in - pool - size study using the 3PL model, the three statistics had greater e mpirical standard errors at the very top and bottom ends of the proficiency continuum rather than those at the moderate 101 proficiency levels. As mentioned before, this may be due to the effect of c - parameters on the proficiency estimates for the 3PL model. In sum, these r esults were consistent with the findings for the item pool size study when the 3PL model was used for the CATs, supporting the benchmark values for the three statistics to a value in the low 0. for CPRV, and a value in the mid 0 . a value . Overall adaptivity . Similar to prior studies using the 3PL model (Reckase et al. , 2018; Ju & Lee, 2018), as the spread of the b - parameters for the item pool increased, the overall adaptivity indices gradually improved with the most sensitive of the ratio of standard deviations index (see Table 5.8). These results gave additional evid ence to support the statistics selected for indicating good adaptation over the entire group of students using the overall measures: For a value SDs, and a value in th e low 0 . correlation, a value . Relationship between c onditional a daptivity and m easurement p recision . As seen in Figure 5.16, except for the very large spread of the item pool whose SD of b - parameters was greater than 1.2, the DOD measure was negatively, strongly correlated with TSEM ( - .93 < r s < - 0.80 for MLE; - 0.98 < r s < - 0.73 for EAP). While the relationship between ROI and TSEM was slightly nonlinear for the very restricted spread of t he item pool, they were negatively and very closely associated with one another for other pool spread conditions ( - 0.99 < r s < - .83 for MLE; - 0.99 < r s < - .81 for EAP). This was expected because both TSEM and ROI were computed using the information. Lastly , for the relation of CPRV, it was hard to find a systematic pattern of the 102 relation across the item pool spread conditions, but with the high spread of the item pool whose b - parameters were greater than 1.0, regardless of the proficiency estimators, TSEM had a negative, strong, and linear relation with CPRV ( - 0.93 < r s < - .74 for MLE; - 0.97 < r s < - 0.68 for EAP). 103 Figure 5 . 14 . Conditional adaptivity statistics (DOD, CPRV, and ROI) for a 3PL - based CAT by item pool spread and proficiency estimator. 104 Figure 5 . 15 . Plot of a POI index for a 3PL - based CAT by item pool spread and proficiency estimator. Figure 5 . 16 . Relationship of TSEM with conditional adaptivity indices (DOD, CRPV, and ROI) for a 3PL - based CAT by item pool spread and proficiency estimator. 105 Figure 5.16 . 106 Table 5 . 
8 Overall Adaptation Statistics for a 3PL - based CAT by Item Pool Spread and Proficiency Estimator Pool SD ( b s) MLE EAP PRV POI PRV POI M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) M ( SE ) 0.2 0.87 ( .006 ) 0.18 ( .002 ) 0.45 ( .003 ) 97.43 (.065) 0.96 ( .002 ) 0.21 ( .001 ) 0.49 ( .002 ) 98.55 ( .025 ) 0.4 0.92 (.006) 0.38 ( .004 ) 0.57 ( .002 ) 98.24 (.040) 0.96 ( .001 ) 0.42 ( .002 ) 0.59 ( .002 ) 98.59 ( .027 ) 0.6 0.95 ( .005 ) 0.56 ( .00 5) 0.73 (.001) 98.04 (.038) 0.98 ( .001 ) 0.59 ( .002 ) 0.75 ( .001 ) 98.41 ( .027 ) 0.8 0.97 (.003) 0.71 (.005) 0.79 ( .002 ) 98.03 (.051) 0.98 ( .001 ) 0.72 ( .002 ) 0.81 ( .001 ) 98.47 ( .033 ) 1.0 0.98 ( .001 ) 0.79 (.004) 0.84 ( .002 ) 97.91 (.060) 0.99 ( .001 ) 0.79 ( .002 ) 0.85 ( .001 ) 98.31 ( .036 ) 1.2 0.99 ( .001 ) 0.88 (.002) 0.88 ( .001 ) 97.81 (.056) 0.99 ( .000 ) 0.88 ( .002 ) 0.89 ( .001 ) 98.15 ( .040 ) 1.4 0.99 ( .000 ) 0.91 ( .002 ) 0.91 ( .001 ) 97.96 (.043) 1.00 ( .000 ) 0.90 ( .002 ) 0.91 ( .000 ) 98.43 ( .031 ) 1.6 0.99 ( .000 ) 0.90 ( .002 ) 0.93 ( .001 ) 98.27 (.043) 1.00 ( .000 ) 0.89 ( .002 ) 0.93 ( .000 ) 98.53 ( .028 ) Note. 107 5.2 Research Question 2 The second research question demonstrates the practical utility of the proposed conditional statistics for the amount of adaptation as diagnostic tools for improving the adaptivity of a CAT. To do this, a hypothetical scenario of a state - wide testing progr am was introduced in Section 4.3. It would be expected that findings from this demonstration can inform us of some insights about h ow many items need to be added to achieve an acceptable level of adaptation in a particular proficiency region of interest. 5.2.1 Baseline for the CATs As a first step, results for the current 40 - item CATs administered to 2,000 examinees using the 300 - item pool were evaluated in terms of three perspectives of conditional adaptivity and measurement precision. The statistics were computed to be served as a baseline to determine the proficiency levels where the amount of adaptation was not adequate during the CAT administration. Figure 5.17 presents the distributions of three conditional adaptation measures over the proficiency continuum. Based on the benchm ark values for the 3PL model with MLE, two proficiency regions, colored by red in the plot, were selected where either DOD, CPRV, or ROI was below the suggested criteria for good adaptation: (1) - (2) < 2.25. Out of 2,000 students, the former includes about 400 students, and the latter includes about 60 students. For the first proficiency region ( - criterion value of mid .70s, while CRPV and ROI were acce ptable. It means that students in that region received items whose characteristics on average deviated to some extent from the proficiency estimate at which those items were selected, relative to the average distance between all the eligible pool items fro m that current estimate. As variation in the characteristics of the 108 administered items was small, it is plausible that the item pool did not include items whose information peaks at that proficiency region. For the other region (1.7 the DOD and ROI value s were acceptable, the CPRV value was lower than the criterion value of low 0.80s. It indicates that while the students took items whose characteristics were generally well matched, the item pool did not include sufficiently good items for the students in that region so that the item selection algorithm may have to select some items whose characteristics poorly match their interim proficiency estimates. 
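Operationally, the diagnostic step illustrated in this section amounts to comparing each proficiency bin's conditional index values against the benchmark values and flagging the bins that fall short. A minimal sketch of that screening, assuming the per-bin DOD, CPRV, and ROI values have already been computed and using the 3PL/MLE guideline values from Research Question 1 as defaults (the function and variable names are hypothetical):

def flag_regions(bin_centers, dod, cprv, roi,
                 dod_cut=0.75, cprv_cut=0.80, roi_cut=0.85):
    # Return the proficiency bins where any conditional adaptation index falls
    # below its benchmark, together with the names of the indices that fall short.
    flagged = []
    for center, d, c, r in zip(bin_centers, dod, cprv, roi):
        short = [name for name, value, cut in (("DOD", d, dod_cut),
                                               ("CPRV", c, cprv_cut),
                                               ("ROI", r, roi_cut))
                 if value < cut]
        if short:
            flagged.append((center, short))
    return flagged

Flagged regions are then candidates for targeted item writing, as in the item-adding exercise reported in this section.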
In addition, given the bias, TSEM, and RMSE values (see Figure 5.18), the measurement accuracy and precision of the proficiency estimates in these two regions were similar. Therefore, it is of interest how many items need to be added to the item pool to reach acceptable levels of adaptivity in these two regions.

Figure 5.17. A plot of conditional adaptivity indices over the proficiency continuum for the CAT using the 300-item pool (baseline).

Figure 5.18. A plot of bias, TSEM, and RMSE over the proficiency continuum for the CAT using the 300-item pool (baseline).

Once the proficiency regions of interest were determined, a series of fixed numbers of items was added to attain an acceptable level of adaptation in each region. The items added were those in the master pool whose information was high in that proficiency region. Note that, to mimic a real-world situation, the item difficulties in the master pool were normally distributed rather than constructed to yield uniform information across the scale. Hence, the quality of the items to be added was not fully controlled, so item quality may not be equal across the items selected from the master pool. For each region, the fixed numbers of items added to the item pool were 5, 10, 15, 20, 30, 40, 50, and 100, and the results were replicated 50 times.

5.2.2 Region 1

As a starting point, the mean value of each adaptivity index in the first proficiency region was 0.69 (SD = .01) for DOD, 0.84 (SD = .02) for CPRV, and 0.85 (SD = .01) for ROI. While the CPRV and ROI were acceptable, the DOD index was below the criterion of the mid 0.70s suggested for Research Question 1 (see Section 5.1). To improve adaptivity to acceptable levels, 5, 10, 15, 20, 30, 40, 50, and 100 items from the master pool were sequentially added to the existing operational item pool. Compared to the baseline for the three conditional adaptation statistics, as the items that were the most informative in the master pool in that proficiency region were added to the operational item pool, the three aspects of adaptivity gradually improved (see Figure 5.19). In particular, when 30 informative items were added to the operational pool, the DOD values for students in Region 1 showed clear, visible improvement, exceeding the benchmark value of the mid 0.70s, and the other two statistics also increased markedly. The three statistics improved little once more than 40 items had been added to the item pool.

Figure 5.19. Distributions of conditional adaptivity indices by number of items added at Region 1.

Regarding measurement accuracy and precision, a similar pattern was identified. As more items were added to the item pool, the bias and standard errors decreased. In particular, as with the distributions of the adaptivity indices, the standard errors (TSEM, RMSE) were reduced when 30 items were added to the operational pool. This suggests that improving the adaptivity of a CAT can enhance the measurement precision of the proficiency estimates.

Figure 5.20. Distributions of statistics for measurement accuracy and precision by number of items added at Region 1.
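A minimal sketch of the augmentation step used in this section, selecting from a master pool the items that are most informative in the proficiency region being strengthened, is given below. The 3PL information function is the standard one; the pool size, parameter distributions, target theta, and variable names are hypothetical stand-ins for the master pool and Region 1 used in the study.

```python
import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at proficiency theta."""
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

rng = np.random.default_rng(1)
# Hypothetical master pool with normally distributed item locations.
a = rng.lognormal(mean=0.0, sigma=0.3, size=1000)
b = rng.normal(loc=0.0, scale=1.0, size=1000)
c = rng.uniform(0.05, 0.25, size=1000)

theta_target = -1.0          # center of the proficiency region to be strengthened
k = 30                       # number of items to add

info_at_target = info_3pl(theta_target, a, b, c)
added = np.argsort(info_at_target)[::-1][:k]      # indices of the k most informative items
print("mean information of added items:", info_at_target[added].mean().round(3))
```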
5.2.3 Region 2

The other proficiency region that needed improvement comprises the proficiency levels ranging from 1.75 to 2.25, called Region 2. For the baseline at Region 2, the mean value of each adaptivity index was 0.78 (SD = .004) for DOD, 0.67 (SD = .007) for CPRV, and 0.81 (SD = .012) for ROI. In contrast with the results for Region 1, the DOD values were acceptable, whereas the CPRV values were lower than the guideline of the low 0.80s and the ROI was also slightly below the benchmark of the mid 0.80s. To bring all adaptivity indices to acceptable levels, again 5, 10, 15, 20, 30, 40, 50, and 100 items from the master pool were sequentially added to the existing operational item pool. Compared to the baseline for the three conditional adaptation statistics, as more of the items that were the most informative in the master pool in that proficiency region were included in the operational item pool, the three adaptivity indices clearly increased (see Figure 5.21). In particular, when 30 items that were the most informative at Region 2 among the eligible items in the master pool were added to the operational pool, the CPRV values increased but were still below the acceptable level; after 50 items were added, the CPRV reached an acceptable level of adaptivity. As with the findings for the CPRV, the DOD and ROI measures showed visible improvement after 30 items were added to the operational item pool, and the ROI index exceeded the benchmark value once at least 40 items were added. One result that should be noted is that the DOD value decreased when 100 items were added. Although the variances in the eligible items for the pool were similar, the quality of the items in the master pool was not fully controlled; some added items were relatively informative for students at Region 2, but the locations at which their information peaked deviated from the current proficiency estimates. This is plausible because of the impact of the a-parameter on the information function, and it resulted in a reduced DOD value.

Figure 5.21. Distributions of conditional adaptivity indices by number of items added at Region 2.

The resulting distributions of bias, TSEM, and RMSE for Region 2, shown in Figure 5.22, were consistent with the results of the Region 1 study. The more informative items there are in the operational item pool, the smaller the bias and standard errors of the proficiency estimates. In particular, the TSEM values were clearly reduced when 30 informative items were added to the operational item pool; after more than 40 items were added, the measurement precision improved little. Comparing the adaptivity with the measurement precision provides some evidence that the amount of adaptation for CATs does not always move in lockstep with the measurement precision.

Figure 5.22. Distributions of statistics for measurement accuracy and precision by number of items added at Region 2.

To sum up, the results of these simulation studies can answer how many items need to be added to improve the adaptivity of the CATs. Although item quality was not fully controlled in the study, the results suggest that, for the given CAT specifications and item pool distribution, adding about 30 items that are informative at the particular proficiency levels of interest generally produces a visible improvement in the amount of adaptation. However, the number of items needed can change depending on item quality.

5.3 Research Question 3

Exposure control is a vital aspect of CAT for test security purposes in large-scale assessments.
Because of the limited availability of computers, test sessions are usually scheduled multiple times per day or across days of a testing week. This means that examinees can share information about the test items they have taken before and after their tests, creating threats to test security. To address this concern, exposure control procedures have been proposed as a way of constraining item selection to limit the number of items that students can share in common. Such constraints might affect the amount of adaptation of a CAT by preventing it from selecting the best items matching the current proficiency estimate. This section primarily explores the effect of exposure control on the level of adaptivity for a CAT and then investigates whether these effects can be moderated by item pool characteristics. Three exposure control procedures that are commonly used in practice were considered in this study: (1) the randomesque procedure, (2) the a-stratified method with b-blocking (BAS), and (3) the Sympson-Hetter method. A CAT with no exposure control procedure was administered for comparison purposes. Also, two 300-item pools with different shapes of the item-difficulty distribution were employed: (1) a bell-shaped regular item pool, and (2) a rectangular-shaped optimal item pool created using the bin-and-union method (see Section 4.4.1.1 for technical details). The results are summarized in the following sections in terms of the relation between true and estimated proficiency (i.e., measurement accuracy and precision), the amount of adaptation, and test security.

5.3.1 Measurement accuracy and precision

Conditional statistics. Figure 5.23 displays the measurement accuracy and precision of the proficiency estimates contingent on proficiency level in terms of bias, TSEM, and RMSE. Regardless of the exposure control procedure, there was little bias in the proficiency estimates over the proficiency scale except at the extreme ends. Compared to the CAT performance with no exposure control, the BAS and Sympson-Hetter methods produced slightly larger standard errors of the proficiency estimates than the randomesque procedure did, and this held for both item pools. With the well-designed optimal item pool, however, the standard errors were noticeably reduced, especially in the extreme proficiency regions, suggesting that the proficiency estimates were nearly equally precise across proficiency levels.

Overall statistics. Without exposure control, the CAT using the optimal item pool reported slightly smaller standard errors (TSEM and RMSE), implying that proficiency was more precisely estimated over the entire sample of examinees. Compared to the CAT with no exposure control, all three exposure control procedures produced larger standard errors, implying less precise proficiency estimates. With the regular item pool, the BAS design yielded larger RMSE and smaller fidelity (the correlation between true and estimated proficiency), whereas with the optimal item pool the Sympson-Hetter approach yielded larger RMSE and smaller fidelity. Unlike the other two exposure control procedures, the Sympson-Hetter method did not produce a difference in overall measurement precision between the regular item pool and the optimal item pool.

Figure 5.23. Conditional bias, TSEM, and RMSE of proficiency estimates for the 3PL-based 40-item CAT by exposure control procedure and item pool distribution.
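Of the three exposure-control procedures examined in this section, the randomesque procedure is the simplest to express in code: rather than always administering the single most informative eligible item, it selects at random from the k most informative ones. The sketch below is a generic illustration, not the study's implementation; the candidate-set size of 5 is a common but arbitrary choice, and the item parameters are simulated.

```python
import numpy as np

def randomesque_pick(theta_hat, a, b, c, administered, k=5, rng=None, D=1.7):
    """Randomly select one of the k most informative unadministered 3PL items."""
    rng = rng or np.random.default_rng()
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta_hat - b)))
    info = (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2
    info[list(administered)] = -np.inf              # exclude items already given
    top_k = np.argsort(info)[::-1][:k]              # k most informative candidates
    return rng.choice(top_k)

# Minimal usage example with a hypothetical 300-item pool.
rng = np.random.default_rng(7)
a, b, c = rng.lognormal(0, 0.3, 300), rng.normal(0, 1, 300), rng.uniform(0.05, 0.25, 300)
item = randomesque_pick(theta_hat=0.0, a=a, b=b, c=c, administered=set(), rng=rng)
print("selected item:", item)
```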
Table 5.9
Overall Statistics of Measurement Precision of Proficiency Estimates for the 3PL-based 40-item CAT by Exposure Control Procedure and Item Pool Distribution

Item Pool           Exposure Control Procedure          Bias      TSEM     RMSE     Fidelity
Regular item pool   No exposure control                 -0.002    0.209    0.207    0.980
                    Randomesque procedure               -0.002    0.213    0.211    0.979
                    a-stratification with b-blocking     0.004    0.245    0.244    0.972
                    Sympson-Hetter method               -0.002    0.245    0.242    0.973
Optimal item pool   No exposure control                  0.003    0.197    0.199    0.981
                    Randomesque procedure                0.003    0.200    0.203    0.980
                    a-stratification with b-blocking     0.010    0.228    0.236    0.974
                    Sympson-Hetter method                0.006    0.240    0.244    0.972

5.3.2 Amount of adaptation

Conditional adaptivity. As shown in Figure 5.24, the well-designed optimal item pool generally yielded higher adaptivity over a broad range of the proficiency continuum across the exposure control conditions. Regardless of item pool characteristics, the BAS design for exposure control led to an improvement in adaptivity over the proficiency continuum, even compared to the CAT procedure with no exposure control. The DOD and ROI indices noticeably increased over the proficiency levels ranging from -1.0 to 2.0 for the regular item pool and from -3 to 3 for the optimal item pool. This is expected to some degree, in that under BAS the items selected are those whose b-parameters match the current proficiency estimate well, which is closely associated with the concept underlying the DOD index. The BAS also forces items with high a-parameters, which provide more information than those with low a-parameters, to be used in later stages of a test. Since the efficiency of an item with a high a-parameter might not be fully realized if the (true) proficiency is not close to the difficulty of that item (Hambleton & Swaminathan, 1985, pp. 108-115), such items should be used later in the test, when more accurate proficiency estimates are available. Accordingly, the BAS yielded higher ROI values, suggesting that the test was efficiently adapted to each examinee. The Sympson-Hetter method, however, reduced the level of adaptation regardless of item pool characteristics, and the DOD values were dramatically low. A plausible explanation is that the Sympson-Hetter method controls overexposed items by separating the item-selection and item-administration processes, but it does nothing to promote the use of underexposed or never-used items that are rarely selected under the maximum-information item selection criterion. For the 3PL model, an item with a high a-parameter and a small c-parameter usually has high information, so that item is more likely to be selected and administered even when its b-parameter (or the location of its maximum information) is not closely matched to the current proficiency level. This poor usage of informative items (i.e., a mismatch between the item location and the current proficiency estimate) can be made even worse when the highly exposed items are limited by the Sympson-Hetter procedure, so the resulting DOD values were very low. The randomesque procedure did not have much effect on the level of adaptation over the proficiency continuum.
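The Sympson-Hetter behavior described above can also be sketched in a few lines. Each item carries an exposure-control parameter K_i; an item that wins the information-based selection is actually administered only with probability K_i, otherwise it is passed over for that examinee and the next-best item is tried. The sketch omits the iterative calibration of the K values against a target exposure rate and uses hypothetical inputs, so it is a simplified illustration of the logic rather than the study's implementation.

```python
import numpy as np

def sympson_hetter_select(info, K, administered, rng):
    """Walk down the information-ordered list, administering item i with probability K[i]."""
    order = np.argsort(info)[::-1]
    for i in order:
        if i in administered:
            continue
        if rng.random() <= K[i]:           # exposure-control experiment for this item
            return i
    return order[0]                        # simplified fallback if every experiment fails

rng = np.random.default_rng(3)
info = rng.uniform(0.1, 2.0, 300)          # hypothetical item information at the current theta
K = rng.uniform(0.5, 1.0, 300)             # pre-calibrated exposure-control parameters
print("administered item:", sympson_hetter_select(info, K, administered=set(), rng=rng))
```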
Overall adaptivity. The overall adaptivity results are summarized in Table 5.10. With no exposure control, the CAT clearly performed better with the designed optimal item pool than with the regular item pool. The latter reported a ratio of SDs index of 0.74, slightly below the suggested benchmark value of the high 0.70s. With the optimal item pool, all of the overall adaptivity measures were very high even when exposure control was imposed on item selection for the CAT. Regardless of item pool characteristics, the BAS led to an increase in the ratio of SDs index but a small decrease in the correlation and PRV indices. The former may be due to the fact that items were selected from only one quarter of the full item pool at each stage of a test. The latter may be because selecting items within each stratum leads to the selection of more extreme items, as b-blocking makes extreme items available for item selection. In addition, the Sympson-Hetter method reduced the PRV values. As previously mentioned, the Sympson-Hetter method limits overexposed items to keep exposure under a pre-specified value, so the items administered to each examinee were, on average, more widely spread around their final proficiency estimates than under the other procedures. However, the PRV value was still above the criterion for good adaptation with the optimal item pool, suggesting that the Sympson-Hetter method is sensitive to item pool characteristics. Again, as with the conditional adaptivity measures, the randomesque procedure barely affected the amount of adaptation of the CAT for the entire group of examinees.

Figure 5.24. Conditional adaptivity statistics (DOD, CPRV, and ROI) for a 3PL-based 40-item CAT by exposure control procedure and item pool distribution.

Table 5.10
Overall Adaptation Statistics for a 3PL-based 40-item CAT by Exposure Control Procedure and Item Pool Distribution

Item Pool           Exposure Control Procedure          Correlation    Ratio of SDs    PRV
Regular item pool   No exposure control                 0.97 (.002)    0.74 (.004)     0.81 (.002)
                    Randomesque procedure               0.97 (.002)    0.73 (.005)     0.80 (.002)
                    a-stratification with b-blocking    0.94 (.002)    0.80 (.006)     0.80 (.003)
                    Sympson-Hetter method               0.96 (.004)    0.74 (.005)     0.74 (.002)
Optimal item pool   No exposure control                 0.99 (.001)    0.91 (.003)     0.91 (.001)
                    Randomesque procedure               0.99 (.001)    0.91 (.003)     0.90 (.001)
                    a-stratification with b-blocking    0.95 (.002)    0.95 (.005)     0.90 (.001)
                    Sympson-Hetter method               0.98 (.002)    0.99 (.004)     0.85 (.001)

5.3.3 Test security

As shown in Figure 5.25, the Sympson-Hetter method controlled the exposure of highly exposed items better than the other procedures but did not successfully control the underexposed or unused items. The BAS approach, in contrast, produced well-balanced item exposure and better utilization of the pool, showing a decrease in exposure rates for items that had been highly exposed and an increase in exposure rates for items that had rarely or never been used, compared to the results with no exposure control. It is noted that the BAS approach had more items whose exposure rates exceeded 0.20 with the optimal item pool than with the regular item pool. The randomesque procedure reduced the overexposure of items whose difficulty was around 0.0. Without an exposure control procedure, all examinees had the same initial proficiency estimate of 0.0 and therefore took the same first item, the one with the b-parameter closest to 0.0; that first item showed the perfect exposure rate visible in Figure 5.25.
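Exposure rates such as those plotted in Figure 5.25 are straightforward to tabulate from administration records: the exposure rate of an item is the proportion of examinees who received it. The sketch below assumes the records are stored as one set of administered item indices per examinee, which is an assumed data layout rather than the format used in the study.

```python
import numpy as np

def exposure_rates(administered_sets, n_items):
    """Proportion of examinees to whom each item was administered."""
    counts = np.zeros(n_items)
    for items in administered_sets:
        counts[list(items)] += 1
    return counts / len(administered_sets)

# Hypothetical records: 2,000 examinees, 40 items each, 300-item pool.
rng = np.random.default_rng(11)
records = [set(rng.choice(300, size=40, replace=False)) for _ in range(2000)]
rates = exposure_rates(records, n_items=300)
print("max exposure rate:", rates.max().round(3),
      "| items never used:", int((rates == 0).sum()))
```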
Figure 5.25. Exposure rate distributions of the 300 items ordered by b-parameter (top) and by exposure rate (bottom) for a 3PL-based 40-item CAT by exposure control procedure and item pool distribution.

5.4 Research Question 4

To further demonstrate the utility of the proposed measures of conditional adaptivity, with the benchmark values obtained for the first research question, this section evaluates how different adaptive test designs, moderated by item pool characteristics, function in terms of the properties of the proficiency estimates and the amount of adaptation. Two adaptive testing designs were considered: (1) an item-level CAT, in which individual items are fully adapted to an examinee's proficiency estimate during the CAT procedure; and (2) a multistage adaptive test (MST), which adapts at the stage or module level (i.e., a set of items) for an examinee. As previously mentioned, a 1-2-3 three-stage MST with module length increasing across stages (i.e., 10-10-20 items) was constructed by selecting 90 items from an item pool. It was hypothesized that the MST would show less adaptivity than the item-level CAT, although the MST can achieve a similar level of measurement precision. In addition, as in the study for Research Question 3, the same two item pools (i.e., the regular item pool and the optimal item pool) were used.

5.4.1 Measurement accuracy and precision

Conditional statistics. Figure 5.26 presents the distributions of the bias, TSEM, and RMSE values across the proficiency continuum by test design and item pool characteristics. In general, both test designs had comparable measurement accuracy and precision in the moderate proficiency range; however, the MST design clearly had more bias and larger standard errors at the extreme regions of the proficiency scale than the fully adaptive CAT. With the optimal item pool, the CAT reported smaller values of all three statistics than with the regular item pool and had even measurement precision (i.e., TSEM) over the entire proficiency continuum. Meanwhile, the MST did not differ much in the recovery of proficiency estimates between the two item pools, reporting fairly high standard errors and bias at the extreme ends of the proficiency scale.

Figure 5.26. Conditional bias, TSEM, and RMSE of proficiency estimates for a 3PL-based adaptive test by test design and item pool distribution.

Overall statistics. Regardless of item pool characteristics, the fully adaptive test (CAT) showed better recovery of proficiency estimates, giving lower TSEM and RMSE values and a higher fidelity correlation than the MST. For the item-level CAT, the optimal item pool contributed to slightly improved measurement precision of proficiency estimates relative to the typical operational item pool. Although the MST created from the regular item pool showed a pattern of conditional measurement statistics similar to that of the MST using the optimal item pool, the overall statistics, except for the bias, indicated slightly better measurement precision for the latter.

Table 5.11
Overall Statistics of Measurement Precision of Proficiency Estimates for the 3PL-based 40-Item Adaptive Test by Test Design and Item Pool Distribution

Item Pool           Test Design    Bias      TSEM     RMSE     Fidelity
Regular item pool   Full CAT       -0.003    0.209    0.207    0.980
                    MST            -0.001    0.420    0.259    0.970
Optimal item pool   Full CAT        0.003    0.197    0.200    0.981
                    MST             0.005    0.402    0.243    0.973
5.4.2 Amount of adaptation

Conditional adaptivity. Which examinees were not administered items of appropriate quality? Put differently, how can we improve the test designs or the quality of an item pool for better adaptivity across the proficiency levels of interest? The conditional adaptation measures proposed in this study can help address these concerns. As shown in Figure 5.27, the item-level CATs showed better adaptation than the MSTs on all three aspects of adaptivity across the proficiency levels, though the two designs had similar levels of adaptation in some proficiency regions. In fact, the MSTs did not satisfactorily meet the guidelines for the DOD (i.e., mid 0.70s), CPRV (low 0.80s), and ROI (mid 0.80s) measures, and their adaptivity differed across individual proficiency levels. For instance, adaptivity around the middle of the proficiency scale was better than in other proficiency regions, in that informative items were properly administered and adapted for those students, according to CPRV and ROI values close to the guidelines. In contrast, the CAT using the optimal item pool yielded equally high values of the three proposed indices, exceeding the guidelines over a broader range of proficiency levels than the CAT using the regular item pool; the latter met the suggested benchmark values for good adaptation in the moderate proficiency range but not in other proficiency areas.

Overall adaptivity. Table 5.12 summarizes the overall measures of adaptation by test design and item pool. As with the results for conditional adaptivity, with either item pool the values of the correlation, ratio of SDs, and PRV indices for the MST design were notably lower than those for the CAT. This is because MSTs adapt at the module/stage level while CATs adapt at the item level. Also, all students took the same 10-item routing module, and students who took the same path through the stages received identical test items, which limits the amount of adaptation to some extent. For the item-level CAT, using the optimal item pool clearly improved the values of the ratio of SDs and PRV indices, implying that the CAT presented items well customized to the final proficiency estimates. This is because the optimal item pool spreads item difficulties more evenly over the proficiency scale, so the items each examinee received could follow that examinee's proficiency more closely, producing a larger SD of the administered difficulty parameters. Interestingly, for the MST, the value of the ratio of SDs for the optimal pool was greater than that for the regular pool for the same reason, whereas the other two indices were comparable to one another. This implies that the optimal pool allowed students, on average, to take items that more closely matched their proficiency level but showed a degree of the other aspects of adaptivity similar to that of the regular item pool. It should be noted that the overall adaptation indices for either MST were clearly lower than the guidelines, implying adaptation not as good as that of the CAT.

Figure 5.27. Conditional adaptivity statistics (DOD, CPRV, and ROI) for a 3PL-based 40-item adaptive test by test design and item pool distribution.
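The overall indices reported in Tables 5.10 and 5.12 follow Reckase et al. (2018). The sketch below is one plausible operationalization, offered only to fix ideas: the correlation between final proficiency estimates and the mean difficulty of the items each examinee received, the ratio of the standard deviation of those mean difficulties to the standard deviation of the proficiency estimates, and a PRV based on the average within-examinee variance of administered difficulties relative to the pool variance. The exact formulas in the method chapter may differ in detail, and the data here are simulated.

```python
import numpy as np

def overall_adaptation(theta_hat, admin_b, pool_b):
    """Overall adaptation summaries for a set of CAT administrations.

    theta_hat : (n_examinees,) final proficiency estimates
    admin_b   : (n_examinees, test_length) difficulties of administered items
    pool_b    : (n_pool,) difficulties of all items in the pool
    """
    mean_b = admin_b.mean(axis=1)                            # mean difficulty per examinee
    corr = np.corrcoef(theta_hat, mean_b)[0, 1]              # correlation index
    ratio_sd = mean_b.std(ddof=1) / theta_hat.std(ddof=1)    # ratio of SDs index
    prv = 1.0 - admin_b.var(axis=1, ddof=1).mean() / pool_b.var(ddof=1)  # PRV index
    return corr, ratio_sd, prv

# Hypothetical data: 2,000 examinees, 40-item tests, 300-item pool.
rng = np.random.default_rng(5)
theta_hat = rng.normal(0, 1, 2000)
admin_b = theta_hat[:, None] + rng.normal(0, 0.4, (2000, 40))
pool_b = rng.normal(0, 1, 300)
print([round(x, 2) for x in overall_adaptation(theta_hat, admin_b, pool_b)])
```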
Table 5.12
Overall Adaptation Statistics for a 3PL-based 40-item Adaptive Test by Test Design and Item Pool Distribution

Item Pool           Test Design    Correlation     Ratio of SDs    PRV
Regular item pool   Full CAT       0.97 (0.002)    0.74 (0.004)    0.81 (0.002)
                    MST            0.88 (0.005)    0.55 (0.008)    0.75 (0.002)
Optimal item pool   Full CAT       0.99 (0.001)    0.91 (0.003)    0.91 (0.001)
                    MST            0.88 (0.006)    0.65 (0.007)    0.74 (0.001)

5.5 Research Question 5

Lastly, this section demonstrates whether both the conditional and the overall adaptivity statistics function as expected using a real operational dataset, examining the amount of adaptation over the entire sample of examinees and at individual proficiency levels. Computing the conditional and overall statistics requires the final proficiency estimates of examinees, the item parameters of the items administered to each examinee, the item scores or interim proficiency estimates associated with those items, and the item parameters for the item pool. These data were available for the NCLEX-RN licensure examination in 2017, which employed a variable-length CAT with a minimum test length of 60 and a maximum length of 250 operational items. Pretest items were removed from this analysis. The total sample for this administration period included about 70,000 examinees. As in a previous study (Reckase et al., 2018), 35 subsamples of 2,000 examinees were randomly drawn from the full sample without replacement, which also allowed the stability of the adaptation measures to be evaluated.

5.5.1 Conditional adaptivity

The conditional adaptivity indices clearly provided evidence about how the NCLEX exam was designed for its classification (pass/fail) purpose based on the cut-off score of 0.0. Around the cut-score, the three conditional indices were at or near the benchmark values suggested by the study for Research Question 1, implying that the test provided informative items efficiently customized to classify students whose proficiency was above or below the criterion of 0.0. To be specific, the DOD value was slightly below the guideline of the mid 0.90s, but the CPRV and ROI measures were well above the guidelines at the cut-score. In addition, the three measures were stable across the 35 samples of 2,000 examinees randomly drawn from the full dataset: their empirical standard deviations ranged from 0.04 to 0.12 for the CPRV index, from 0.02 to 0.06 for the DOD index, and from 0.01 to 0.07 for the ROI index. Compared to the other two statistics, the CPRV showed slightly more variation across samples, which might be due to the proficiency estimation procedure used in the NCLEX-RN test. That is, Owen's Bayesian estimation (Vale & Weiss, 1977), with a prior mean of -1.0 and a prior standard deviation of 2.0, was used at the beginning of the test, and MLE was employed once both correct and incorrect responses existed for an examinee. In this procedure, the selected items were affected by the current proficiency estimate, causing more within-examinee variation in the b-parameters entering the CPRV. At any rate, there was no doubt that the NCLEX-RN test showed outstanding adaptation at its cut point, which is well aligned with the purpose of the test.

Figure 5.28. Conditional adaptivity statistics (DOD, CPRV, and ROI) for a Rasch-based variable-length CAT for an operational NCLEX-RN test.
5.5.2 Overall adaptivity

Table 5.13 summarizes the overall values used to gauge the level of adaptation of the NCLEX test over the sample of examinees during this administration period. For all three overall adaptation measures, the test complied with the guidelines suggested by the simulation studies for Research Question 1 in Section 5.1. Specifically, the correlation index was 0.91, in the low 0.90s; the ratio of SDs index was 0.90, which exceeded the benchmark value in the mid 0.80s; and the PRV value met the guideline of about 0.80. Taken together, these indices indicate that this NCLEX test is worthy of being labeled a highly adaptive test. Despite the constraints imposed on the item selection algorithm for content balancing and exposure control, the test showed good adaptivity, resulting from a well-designed item pool and a strong item selection algorithm.

Table 5.13
Overall Adaptation Statistics for a Rasch-Based Variable-Length CAT for an Operational NCLEX-RN Test

                    Correlation          Ratio of SDs         PRV
                    M        SD          M        SD          M        SD
NCLEX               0.91     0.004       0.90     0.010       0.80     0.003
Benchmark values    Low 0.90s            Mid 0.80s            0.80

Note. SD = empirical standard deviation over 35 samples of data.

CHAPTER 6. CONCLUSION AND DISCUSSION

6.1 Summary of Findings

The first purpose of this study was to develop new statistical measures of the amount of adaptation conditional on proficiency levels for computerized adaptive testing (CAT). The second was to evaluate their feasibility and utility for detecting how much a test adapts under varying conditions of the item pool, proficiency estimator, constraints on item selection, and test design, through simulations and an empirical demonstration using real operational data. The three conditional adaptation measures were (1) the deviation of difficulty (DOD) index, (2) the conditional proportion of reduction in variance (CPRV) index, and (3) the ratio of information (ROI) index. The proposed measures provide information slightly different from that of the existing overall adaptation indices, which assess adaptation over the entire sample of examinees; they can help us understand the adaptivity of a test for an individual examinee or for subgroups of particular interest. For a particular subgroup of students, for example, a CAT might report a high CPRV value but low DOD and ROI values. In such test events, it is plausible that the students received items of similar difficulty, but that the administered items deviated, on average, widely from their provisional proficiency estimates. This might be because the item pool did not contain informative items for those students or because there were problems with the item-selection procedure. Taken together, these conditional statistics serve the goal of gauging differences in the amount of adaptation from these three viewpoints. From the simulation studies and the real data analysis, five key findings were drawn.

First, the results of the comprehensive simulations for Research Question 1 suggest guidelines for interpreting the proposed conditional adaptation indices for adaptive tests scaled and scored with different IRT models and proficiency estimators. When the Rasch model is used for scaling and scoring with maximum likelihood estimation (MLE) of proficiency, the benchmark values summarized in Table 6.1 indicate good adaptation: a DOD value in the mid 0.90s, a CPRV value in the high 0.70s, and an ROI value in the high 0.90s.
Similarly, a DOD value in the mid 0.70s, a CPRV value in the low 0.80s, and an ROI value in the mid 0.80s indicate good adaptation when the 3PL IRT model is used, meaning that the adaptive test administers items that are well customized to the individual student. With the expected a posteriori (EAP) estimation method, the guidelines for the CPRV index were slightly higher than those for the MLE method, but the guidelines for the DOD and ROI indices were the same as for MLE.

Table 6.1
Benchmark Values of Conditional and Overall Adaptivity Indices by IRT Model and Proficiency Estimator (= indicates the same benchmark as for MLE)

                        Rasch                          3PL
                        MLE           EAP              MLE           EAP
Conditional indices
  DOD                   Mid 0.90s     =                Mid 0.70s     =
  CPRV                  High 0.70s    High 0.80s       Low 0.80s     Mid 0.80s
  ROI                   High 0.90s    =                Mid 0.80s     =
Overall indices
  Correlation           Low 0.90s     =                High 0.90s    =
  Ratio of SDs          Mid 0.80s     =                High 0.70s    =
  PRV                   0.80          High 0.80s       Low 0.80s     Mid 0.80s

One thing to note is that, for the simulated 40-item test, these conditional statistical descriptors were stable in the middle of the proficiency range, from -2.0 to 2.0, based on the small empirical standard errors of the statistics. With the 3PL model, however, the statistics generally had fairly large standard errors at the extremes of the proficiency range, particularly with a small or restricted item pool. As the pool size increased and more items were located at the extremes, the stability of these measures improved, but the standard errors remained large compared to those in the middle proficiency range. This instability at the extremes may be due in part to the effect of c-parameters on the proficiency estimation; with the Rasch model, this concern did not arise.

The findings also provide evidence that, regardless of IRT model and proficiency estimator, the larger the item pool and the greater the spread of difficulty within it, the better the adaptation and the more accurate the proficiency estimates, which is consistent with the results of previous studies (Reckase et al., 2018; Ju & Lee, 2018). For a high level of adaptivity in the 40-item CAT, the recommended pool is at least a 300-item pool with item difficulties well spread over the proficiency distribution. The required item-pool size for a CAT depends on the distribution of ability in the intended examinee population (Reckase, 2010); however, it has typically been recommended that an item pool be at least 10 to 12 times larger than the length of the CAT (Stocking, 1994).

Along with the results for the overall adaptivity indices and the measurement precision of the proficiency estimates, the proposed adaptivity indices performed as intended. The amount of adaptation was closely associated with the standard errors of the proficiency estimates. Inspection of the relation between the conditional adaptivity measures and the standard errors of the proficiency estimates showed that the DOD and ROI measures had strong, negative relations with the TSEM values. This suggests that good adaptation in a CAT can improve the measurement precision of proficiency estimates, and vice versa, perhaps because more precise proficiency estimates feed more appropriate information to the item-selection algorithm, leading to suitable customization of the administered items. With TSEM and RMSE, however, no systematic pattern was found for the CPRV. In sum, the adaptivity of a CAT should be closely associated with the measurement precision of the proficiency estimates, but the two were shown to be distinct.
Second, the study demonstrated the practical utility of the proposed conditional adaptation statistics as diagnostic tools for improving the amount of adaptation in a CAT (Research Question 2). The findings of the second simulation study indicated that the conditional adaptivity indices can provide insight into how to revise an existing item pool so as to improve the level of adaptation. Although the measurement accuracy and precision of the proficiency estimates were similar, the amount of adaptation could differ; a good example is the two proficiency regions identified from the initial computation of the adaptivity indices shown in Figure 5.17. As a result of adding to the existing item pool a series of fixed numbers of items whose information was high in each proficiency region, the additional informative items available in the pool improved the level of adaptation in both regions. In particular, given the test length of 40 items and the composition of the item pool, the adaptivity indices improved visibly and met the guidelines suggested in the first study when 30 informative items were added to the operational item pool for each region. To mimic a realistic situation, I did not fully control the quality of the items in the master pool; that is, the master pool contained items with normally distributed difficulties, implying that there might not be sufficiently many high-quality items in each region. Hence, some of the added items were not as good as others for each region, and the number of items to be added may depend on item quality. For instance, an item with a high a-parameter administered at a proficiency level close to its b-parameter can contribute a good deal to improving adaptivity compared with low-quality items. With high-quality items in a particular proficiency region, fewer items may be needed to enhance the amount of adaptation for a CAT.

Third, the study examined how the level of adaptation changed when constraints for exposure control were imposed on the item-selection algorithm. The three exposure-control procedures considered in this study yielded comparable measurement precision of the proficiency estimates; nonetheless, the magnitude of adaptation clearly differed across the procedures. The randomesque procedure did not compromise adaptation using either the regular or the optimal item pool. The Sympson-Hetter method, though, reduced the level of adaptation, especially for the DOD index. Interestingly, the a-stratified with b-blocking (BAS) approach improved adaptivity across the proficiency levels. Although this pattern was consistent across item pools with different properties, the well-designed optimal item pool presented greater adaptivity across a broader range of proficiency levels for all of the exposure-control approaches. This suggests that, when exposure control is employed, good adaptation can still occur with a good-quality item pool. It is additionally noted that the Sympson-Hetter method controlled overexposed items well, whereas the BAS design showed more balanced usage of the items in the pool by controlling for underexposed items.

Fourth, another notable result of this research concerned the amount of adaptation for different adaptive testing designs using the 3PL IRT model and MLE for proficiency estimation. The specific designs considered for a 40-item test were a fully adaptive item-level test (CAT) and a 1-2-3 three-stage multistage adaptive test (MST).
It was shown that the MST reported clearly less adaptation than the CAT regardless of item-pool characteristics, though the two testing designs had comparable accuracy of proficiency estimates in the moderate proficiency range from -1 to 2. The MST did not satisfy the guidelines for the three conditional adaptation indices across the proficiency levels; for the MST, the middle proficiency region showed relatively better adaptivity than the other proficiency regions. An unanticipated result was that the MST formed from the optimal item pool did not present clearly higher adaptation than the MST created from the regular item pool. This differed from the item-level CAT, which showed higher adaptivity across the proficiency continuum when using the optimal item pool than when using the typical operational item pool.

Last but not least, an empirical demonstration was carried out using real operational data from the NCLEX nursing licensure examination. The three conditional adaptivity measures indicated that this variable-length test was well designed, with a high-quality item pool that satisfies the purpose of the test by showing good adaptivity near the cut-score of 0.0. That is, at proficiency levels close to the cut-score, the three adaptivity indices met the benchmark values. This suggests that the test was very adaptive for classifying examinees into mastery and non-mastery of the proficiency required for nursing licensure, even with content balancing and exposure control in place. Overall, the proposed statistics functioned properly as diagnostic tools for understanding the amount of adaptation contingent on the proficiency continuum for an operational CAT.

6.2 Practical Utility of Conditional Adaptation Indices

The findings across the entire study showed that the new conditional adaptivity measures were sensitive to item-pool characteristics, test designs, and test specifications that can affect the amount of adaptation, and the study suggested guidelines for interpreting the statistics. The study also provided evidence supporting the use of these conditional indices to help test developers and measurement professionals revise a test design or an item pool to improve test adaptivity. This section discusses the practical utility of the conditional adaptation indices.

6.2.1 Diagnostic tools for improving adaptivity

The proposed adaptation indices can be used not only as quality-control tools to monitor the adaptivity of an adaptive testing program, but also as diagnostic tools for improving adaptivity by revising an item pool or a test design. Practitioners may want to maintain the level of adaptation of adaptive tests across testing windows, item pools, subgroups of examinees, or occasions. In practice, not all examinees can take the tests at the same time: some may test earlier than others, and some may test in different windows or with a different item pool assembled from the master pool. Sometimes, test developers may be interested in inspecting a particular subgroup of examinees to determine whether the tests function adequately for the test purpose. Examinees, for their part, want a fair test that measures their proficiency as accurately and precisely as possible, with items well adapted to their proficiency levels. Adaptation statistics can help realize this goal, as they can play a role in evaluating and tracking the amount of adaptation for the administered tests.
Beyond quality control, the newly proposed adaptation indices are particularly useful for diagnosing a current test and providing direction for improving its adaptivity by revising an item pool, test specifications, or test design. A pivotal element of a CAT is the item pool, and the quality of the CAT is closely tied to the properties and quality of that pool (Flaugher, 2000). As demonstrated for Research Question 2, the level of adaptation can be improved by adding items that are informative in a proficiency region of interest to the existing item pool. Some test events occur using multiple item pools, even within the same session. To promote fairness for students, the adaptation indices can be used, along with the item pool utilization index (Gönülates, 2015), to assess the performance of the tests with each item pool, so that the multiple item pools include sufficiently good items adapted to students as equally as possible.

The proposed adaptation indices can be used not only to choose an adaptive test design but also to revise the test specifications that optimize adaptivity over the proficiency continuum within a design. As in the study for Research Question 4, measurement professionals and test practitioners can use the adaptation statistics to compare the performance of different test designs (e.g., a linear fixed test, an item-level CAT, an MST, or a hybrid CAT; Wang et al., 2016). This can be done for individuals or subgroups of examinees using the conditional adaptivity indices, as well as for the entire sample of examinees using the overall indices. Furthermore, the conditional adaptivity measures are particularly useful for modifying tests to improve adaptivity within an adaptive test design. A good example is an MST: the amount of adaptation and the measurement quality of an MST depend on how it is designed and structured using the available item pools. With the optimal item pool, for example, the MST did not outperform (as shown in Figure 5.27) the test created using the regular item pool. In this case, the composition of the items in each module across stages can be modified, based on the resulting plot of adaptation indices, so as to enhance the adaptivity of the test for high- or low-proficiency students, particularly if the goal of the test is to achieve equal measurement precision and adaptability for all students. In the case of an item-level CAT, the test specifications can be set by comparing differences in the amount of adaptation arising from constraints in the item-selection algorithm for content balancing, exposure control, and the avoidance of test speededness. Based on the resulting values of the adaptivity statistics, test practitioners can decide whether some constraints should be relaxed or whether items should be added to the item pool. Taken together, the new adaptation metrics suggested here help gauge the potential improvement of adaptivity for a CAT, conditional on an individual's proficiency level or those of examinee subgroups. Of course, practical concerns in the testing situation and the purpose of the tests must be taken into account.
6.2.2 Use of conditional adaptation indices in automated test assembly

The proposed conditional adaptation indices can be used as a constraint or as an objective function in automated (optimal) test assembly. In the usual distinction of test specifications for automated test assembly, constraints are test attributes, or functions of item attributes, that must be met by setting an upper and/or a lower limit, whereas objective functions are the attributes to be optimized by attaining a minimum or maximum value. The test assembly algorithm can be defined in numerous ways using the proposed conditional indices. As a constraint, for instance, a lower limit can be set for each of the three adaptation measures so that all students receive test items yielding at least that value for each adaptation index. In a similar vein, the adaptivity indices can serve as objective functions of automated test assembly, by setting target values for the three adaptation indices, determined through preliminary analyses of field-test data or a small simulation study, and assembling the test to maximize the mean value of each adaptation measure while satisfying all of the test constraints.

6.3 Alternative Ways to Define Conditional Adaptation Indices

The amount of adaptation can be quantified differently depending on how the concept is operationally defined. This study quantifies the amount of adaptation based on the concept of a highly adaptive test as one that presents items well matched to each examinee, and it establishes three quantities for this purpose: (1) the DOD index, (2) the CPRV index, and (3) the ROI index. The overall adaptation indices (Reckase et al., 2018) are based on examinees' final proficiency estimates. The conditional adaptivity quantities proposed here, in contrast, focus more on how well a CAT, during its administration, uses the available item pool and the item-selection algorithm relative to each interim proficiency estimate. The latter approach is beneficial in that the final proficiency estimates cannot be known during the CAT process. Furthermore, indices based on the current proficiency estimates can assess whether the item-selection algorithm is working correctly, providing a well-matched item during the CAT process even when the interim estimates deviate from the final estimates. It was found, via comprehensive simulation and empirical studies, that the proposed three indices functioned as expected. Nonetheless, it is worth discussing alternative ways to define the conditional adaptation indices.

Rather than using interim proficiency estimates, the conditional adaptation indices could be defined with respect to the final proficiency estimates. Indices defined in this way are conceptually more associated with whether the optimal set of items was administered to each examinee, assuming that the final proficiency estimates are unbiased and sufficiently close to the true proficiency levels. Take the DOD index as an example. Defined with respect to the final proficiency estimate, the DOD would evaluate whether the administered items matched their locations (either the item difficulty or the location of maximum information) to that final estimate. Early in a test, however, the distance between an administered item's location and the final proficiency estimate can be large, more likely yielding a DOD value below 0. Another example is the ROI index. The current ROI index compares the information function of the administered item at the provisional proficiency estimate to the potential maximum information that the item can reach. Instead, the observed item information, the numerator of the ROI quantity, could be computed at the final proficiency estimate.
This might allow one to know the amount of information that the test provides to individual examinees around their final proficiency estimates. It would be blind, though, to whether the items were appropriately presented and well utilized for the examinees during the intermediate stages of the CAT process. Moreover, the ROI index can have a differently defined criterion value, which is placed in the denominator of the index. This study identifies the optimal criterion value of the information as the maximum potential information that an item can have. However, the optimal information could be determined in one of two other ways. It could be the maximum information available in the existing item pool (the theoretical limit of the maximum information under the Rasch model). Or it could be the maximum information obtainable from the most informative items in the pool at the true or the final estimated proficiency levels. In the latter case, the index would no longer evaluate whether the best available item was selected at the interim proficiency estimate during the CAT; it would instead assess whether a perfect item, one that matches the true or final estimated proficiency well, was presented to each examinee from the current item pool (e.g., Gönülates, 2015). For instance, the percent of information index (Kingsbury & Wise, 2018) compared the observed test information to the optimal information obtained by administering a test at the true proficiency level. Although the ROI has different conceptualizations depending on how the optimal information is identified, these conceptualizations have something in common: they are all information-based measures.
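The alternative definitions discussed in this section can be made concrete for a single administered 3PL item. The sketch below computes an ROI-style ratio under two numerator choices (interim versus final proficiency estimate) and two denominator choices (the item's own maximum attainable information versus the best information available in the pool at the final estimate). The parameter values are hypothetical, and the formulas are illustrative readings of the verbal definitions above rather than the dissertation's exact equations.

```python
import numpy as np

def info_3pl(theta, a, b, c, D=1.7):
    """Fisher information of a 3PL item at proficiency theta."""
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def max_info_3pl(a, b, c, D=1.7, grid=np.linspace(-4, 4, 801)):
    """Approximate an item's maximum attainable information by a grid search."""
    return info_3pl(grid, a, b, c, D).max()

# One hypothetical administered item and a hypothetical 300-item pool.
a_i, b_i, c_i = 1.2, 0.5, 0.15
theta_interim, theta_final = 0.2, 0.6
rng = np.random.default_rng(9)
pool_a = rng.lognormal(0, 0.3, 300)
pool_b = rng.normal(0, 1, 300)
pool_c = rng.uniform(0.05, 0.25, 300)

num_interim = info_3pl(theta_interim, a_i, b_i, c_i)   # numerator used in this study
num_final   = info_3pl(theta_final, a_i, b_i, c_i)     # alternative numerator
den_item    = max_info_3pl(a_i, b_i, c_i)              # denominator used in this study
den_pool    = info_3pl(theta_final, pool_a, pool_b, pool_c).max()  # alternative denominator

print("ROI (study definition):      ", round(num_interim / den_item, 3))
print("ROI (final-estimate variant):", round(num_final / den_pool, 3))
```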
6.4 Implications

Overall, with the help of the guidelines and their own understanding of the concepts, researchers and practitioners should be able to interpret the three conditional adaptivity indices investigated in this study with little difficulty. These new measures allow us to understand how much adaptation a given test achieves across proficiency levels or subgroups of students, and they help us understand the effects of adaptive test designs, constraints on item selection, and other test specifications. Together with the overall adaptivity measures computed for entire groups of examinees, the newly proposed conditional measures of the amount of adaptation offer unique contributions: they give practitioners a better sense of which adaptive tests may not be very adaptive for individual examinees or for particular subgroups. Another benefit of these indices is the variety of ways in which the distribution of the conditional adaptivity measures can be summarized. The most intuitive way is to visualize the indices, as was done in Chapter 5, over the proficiency continuum or against particular proficiency regions of interest using a scatter plot, histogram, or box plot. As with the overall adaptation statistics, the distribution of the conditional adaptation measures can also be summarized using descriptive statistics such as the mean, median, and standard deviation of the values for a set of tests administered to a group of examinees in the population. If the distribution is skewed, the mean and median of the statistics will be discrepant, and it is advisable to examine the entire distribution of the conditional measures using a graphical method. One thing to note is that the amount of adaptation, as described by the distribution of the adaptivity indices, should be understood in light of the specific goals of a testing program.

Although some examinees received items that were poorly customized for their proficiency levels, this might, depending on the testing purpose, matter little. The concept of the amount of adaptation should be understood as distinct from that of measurement accuracy and precision, despite the two being tightly associated. As noted above, the relations between the standard errors of the proficiency estimates and each of the three adaptivity indices were not perfect, and the CPRV index showed no clear pattern with measurement quality. Although students had similarly acceptable levels of measurement quality for their proficiency estimates, the quality of adaptation could differ, as seen in Research Question 2, depending on numerous factors: the size and characteristics of the item pool, the functioning of the item-selection algorithm, the test design, and other issues that can affect the adaptivity of a CAT. That is, the proficiency may be estimated accurately and precisely to some extent, yet the test may fail to provide the best-adapted items to examinees because of deficiencies in the item pool or problems in the item-selection algorithm.

The converse can also occur: the tests can be very adaptive while the quality of measurement is poor. This is evidenced, for instance, by the comparison of the conditional adaptivity values between MLE and EAP in Research Question 1. In general, the EAP estimator showed greater adaptivity than MLE for the 40-item test over the proficiency continuum, but the proficiency estimates obtained using EAP were biased, owing to the property of EAP that regresses the proficiency estimates toward the mean of the prior distribution (Ho & Dodd, 2012; Kim & Nicewander, 1993). Students took items that were well adapted to their current proficiency estimates, but this does not correspondingly reduce the bias caused by the proficiency estimator. Additionally, with a very short test, the level of adaptation could be high given a well-designed item pool and test specifications, yet proficiency would be poorly estimated because the number of items might not be adequate for precise and accurate estimation. Therefore, high adaptivity of a CAT does not always guarantee high measurement efficiency of the proficiency estimates.

6.5 Limitations and Future Research

This section briefly discusses some limitations of the study and offers directions for future research. First, the reported findings examined differences in the amount of adaptation affected by a limited set of factors. While the concepts and measures for evaluating the level of adaptation have recently received growing attention, more research is needed in this area. Future studies should elaborate on the current adaptation measures by examining their performance under other factors that plausibly affect adaptivity. These factors include not only other adaptive testing designs (e.g., a hybrid CAT design) and test specifications such as latency constraints for preventing test speededness and content constraints for content balancing, but also other real operational adaptive testing data from tests with a different purpose (e.g., equal measurement precision over the proficiency continuum).
Researchers may also come up with further alternatives to the existing adaptivity indices, with different assumptions and definitions.

Second, in Research Question 2, the quality of the items to be added from the master pool was not fully controlled. The current study found that approximately 30 items were needed to visibly improve the level of adaptation on all three aspects. However, the number of items needed can differ, depending on the quality of an item and the location at which that item is most efficiently presented. What are the features of a high-quality item? According to Eignor, Stocking, Way, and Steffen (1993), a high-quality item is one that is highly informative; given the form of the information function, the amount of information is maximized when an item has a high a-parameter and a b-parameter close to the examinee's proficiency. The approach in this study requires multiple iterative procedures to improve the amount of adaptation given the distribution of test takers: items are chosen by checking the statistics after the removal or addition of candidate items from the available item pool. This iterative approach may at times be time-consuming in real operational settings. Future research could explore ways of deriving an equation that computes how many items need to be added, taking into account the amount of information an item has at the current proficiency estimate. With the help of such an equation, the item pool or test designs could be revised more efficiently, by adding the items necessary to improve adaptivity in lieu of the iterative procedure taken in this study.

Third, the findings gleaned from Research Question 4 underscore the paramount importance of optimally designing MST modules. To obtain more precise proficiency estimates, the literature has mostly focused on either MST path structures or the stage and module configurations (e.g., Luo & Kim, 2018; Patsula, 1999; Xiong, 2018; Zenisky, 2004). Relatively little is known, though, about how the ways of assembling the modules and the composition of item characteristics affect the proficiency estimates and the level of customization for an MST, given the proficiency distribution. Future research could address this concern by exploring other ways of optimally designing MSTs beyond the approach of using a target TIF.

Fourth, little research has yet explored measures of the amount of adaptation in the mixed-format adaptive testing context. Mixed-format tests that include multiple-choice (MC) items, constructed-response (CR) items, and testlets have been commonly used in educational large-scale testing (Kim, Walker, & McHale, 2010; Kuechler & Simkin, 2010; Yao & Schwarz, 2006). Due to their enhanced psychometric features, mixed-format tests can be more informative, efficient, and valid, as well as more promising for future applications implementing innovative items (e.g., technology-enhanced items, multiple-response items, hot-spot items) in CAT programs, resulting in enhanced content coverage and measurement accuracy (Jiao, Liu, Haynie, Woo, & Gorham, 2012; Wendt, 2008). It is recommended that such innovative items, CR items, and testlets be polytomously scored and calibrated using polytomous IRT models in CAT programs (e.g., Jiao et al., 2012).
Returning to the mixed-format context, it would be interesting for future research to investigate whether the existing adaptation measures work well with polytomously scored items when the item parameters are calibrated using polytomous IRT models with an overall item difficulty parameterization. Also, existing studies have explored the influences of test designs using only dichotomous items and without consideration of content balancing. Multiple item types and content balancing may produce more variation in test designs, resulting in different levels of adaptation. This raises the question of how much adaptation occurs under mixed-format test designs with content balancing approaches.
Lastly, this dissertation suggests some guidelines for interpreting the indices based on the current simulation results, but the benchmark values have not yet been fully evaluated. Hence, further research is needed to elaborate the benchmark values for the adaptation indices via a Type I error study and a power study. Additionally, the dissertation evaluated the performance of the adaptation indices using the maximum Fisher information (MFI) item selection procedure; in fact, the proposed measures were developed based on the MFI item selection criterion. In the literature and in operational CAT programs, there are numerous other item selection criteria (see Section 2.2.2). It is worth exploring in future research whether the adaptation measures work properly with other item selection procedures such as b-matching and Bayesian criteria.
A promising testing format has emerged in recent years for next-generation assessments: adaptive testing. In fact, numerous testing programs have already employed a variety of adaptive tests. It is now time to reconsider and evaluate how adaptive those tests really are. It is also strongly recommended that adaptive testing be implemented with an understanding of the psychometric impacts that test designs and specifications have on adaptivity. This is particularly true when a testing agency has just begun transitioning its testing format from paper-and-pencil linear testing to adaptive testing. In the area of adaptive testing, therefore, the adaptation indices established in this study can direct test practitioners toward substantive consideration of how to improve item pools and test designs for adaptive tests.
APPENDIX: SUPPLEMENTARY FIGURES FOR RESEARCH QUESTION 1
Figure A.1. Relationship of RMSE with conditional adaptivity indices (DOD, CPRV, and ROI) for a Rasch-based CAT by item pool size and proficiency estimator
Figure A.2. Relationship of RMSE with conditional adaptivity indices (DOD, CPRV, and ROI) for a 3PL-based CAT by item pool size and proficiency estimator
Figure A.3. Relationship of RMSE with conditional adaptivity indices (DOD, CPRV, and ROI) for a Rasch-based CAT by item pool spread and proficiency estimator
Figure A.4. Relationship of RMSE with conditional adaptivity indices (DOD, CPRV, and ROI) for a 3PL-based CAT by item pool spread and proficiency estimator
REFERENCES
Bergstrom, B. A., Lunz, M. E., & Gershon, R. C. (1992). Altering the level of difficulty in computer adaptive testing. Applied Measurement in Education, 5, 137-149.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 374-472). Reading, MA: Addison-Wesley.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431-444.
Chang, H.-H. (2004). Understanding computerized adaptive testing: From Robbins-Monro to Lord and beyond. In D. Kaplan (Ed.), The Sage handbook of quantitative methods for the social sciences (pp. 117-133). Thousand Oaks, CA: Sage.
Chang, H.-H., Qian, J., & Ying, Z. (2001). a-Stratified multistage computerized adaptive testing with b blocking. Applied Psychological Measurement, 25, 333-341.
Chang, H.-H., & van der Linden, W. J. (2003). Optimal stratification of item pools in a-stratified computerized adaptive testing. Applied Psychological Measurement, 27, 262-274.
Chang, H.-H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20, 213-229.
Chang, H.-H., & Ying, Z. (1999). a-Stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23, 211-222.
Chen, S.-Y., Ankenmann, R. D., & Chang, H.-H. (2000). A comparison of item selection rules at the early stages of computerized adaptive testing. Applied Psychological Measurement, 24, 241-255.
Chen, S.-Y., Ankenmann, R. D., & Spray, J. A. (2003). The relationship between item exposure and test overlap in computerized adaptive testing. Journal of Educational Measurement, 40, 129-145.
Cheng, Y., & Chang, H.-H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 62, 369-383.
Choi, S. W., & Swartz, R. J. (2009). Comparison of CAT item selection criteria for polytomous items. Applied Psychological Measurement, 33, 419-440.
Conley, D. T. (2018). The promise and practice of next generation assessment. Cambridge, MA: Harvard Education Press.
Davey, T. (2005, April). An introduction to bin-structured adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.
Davey, T., & Parshall, C. G. (1995, April). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Eignor, D. R., Stocking, M. L., Way, W. D., & Steffen, M. (1993). Case studies in computer adaptive test design through simulation (ETS Research Report No. 93-56). Princeton, NJ: Educational Testing Service.
Embretson, S. E. (2001). The second century of ability testing: Some predictions and speculations. Retrieved from http://www.ets.org/Media/Research/pdf/PICANG7.pdf
Flaugher, R. (2000). Item pools. In H. Wainer, N. J. Dorans, D. Eignor, R. Flaugher, & B. F. Green (Eds.), Computerized adaptive testing: A primer (2nd ed., pp. 37-60). Mahwah, NJ: Lawrence Erlbaum.
Georgiadou, E., Triantafillou, E., & Economides, A. (2007). A review of item exposure control strategies for computerized adaptive testing developed from 1983 to 2005. Journal of Technology, Learning, and Assessment, 5, 4-38.
Gibbons, R. D., Weiss, D. J., Kupfer, D. J., Frank, E., Fagiolini, A., Grochocinski, V. J., . . . Immekus, J. C. (2008). Using computerized adaptive testing to reduce the burden of mental health assessment. Psychiatric Services, 59, 361-368.
Gönülates, E. (2015). A novel approach to evaluate item pools: The item pool utilization index (Doctoral dissertation).
Gu, L., & Reckase, M. D. (2007). Designing optimal item pools for computerized adaptive tests with Sympson-Hetter exposure control. In D. J. Weiss (Ed.), Proceedings of the 2007 GMAC Conference on Computerized Adaptive Testing.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Han, K. T. (2012). An efficiency balanced information criterion for item selection in computerized adaptive testing. Journal of Educational Measurement, 49, 225-246.
Han, K. T. (2016). Maximum likelihood score estimation method with fences for short-length tests and computerized adaptive tests. Applied Psychological Measurement, 40, 289-301.
He, W., & Reckase, M. D. (2014). Item pool design for an operational variable-length computerized adaptive test. Educational and Psychological Measurement, 74, 473-494.
Hembry, I. F. (2014). Operational characteristics of mixed-format multistage tests using the 3PL testlet response theory model (Doctoral dissertation).
Hoyt, C. (1941). Test reliability estimated by analysis of variance. Psychometrika, 6, 153-160.
Ho, T.-H., & Dodd, B. G. (2012). Item selection and ability estimation procedures for a mixed-format adaptive test. Applied Measurement in Education, 25, 305-326.
Jiao, H., Liu, J., Haynie, K., Woo, A., & Gorham, J. (2012). Comparison between dichotomous and polytomous scoring of innovative items in a large-scale computerized adaptive test. Educational and Psychological Measurement, 72, 493-509.
Ju, U., & Lee, Y. (2018, July). Effects of ability estimation methods on the amount of adaptation for computerized adaptive tests. Paper presented at the biannual meeting of the International Test Commission, Montreal, Canada.
Kim, J. K., & Nicewander, W. A. (1993). Ability estimation for conventional tests. Psychometrika, 58, 587-599.
Kim, S., Ju, U., & Reckase, M. D. (2018, April). Evaluating indicators of amount of adaptation to a 3PL computerized adaptive test. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.
Kim, S., Walker, M. E., & McHale, F. (2010). Comparisons among designs for equating mixed-format tests in large-scale assessments. Journal of Educational Measurement, 47, 36-53.
Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359-375.
Kingsbury, G. G., & Wise, S. L. (2018, July). A new measure of adaptation based on test information. Paper presented at the biannual meeting of the International Test Commission, Montreal, Canada.
Kuechler, W. L., & Simkin, M. G. (2010). Why is performance on multiple-choice tests and constructed-response tests not more closely related? Theory and an empirical test. Decision Sciences Journal of Innovative Education, 8, 55-73.
Lord, F. M. (1977). A broad-range tailored test of verbal ability. Applied Psychological Measurement, 1, 95-100.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F. M. (1986). Maximum likelihood and Bayesian parameter estimation in item response theory. Journal of Educational Measurement, 23, 157-162.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lin, H. (2012). Item selection methods in multidimensional computerized adaptive testing adopting polytomously-scored items under multidimensional generalized partial credit model (Doctoral dissertation).
Luo, X. (2015). Incorporating mixed item formats in CAT: A comparison of shadow test and bin-structured approaches (Doctoral dissertation).
Luo, X., & Kim, D. (2018). A top-down approach to designing the computerized adaptive multistage test. Journal of Educational Measurement, 55, 243-263.
Luecht, R. M., & Clauser, B. E. (2002). Test models for complex CBT. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing: Building the foundation for future assessments (pp. 67-88). Mahwah, NJ: Lawrence Erlbaum.
Luecht, R. M., & Nungester, R. J. (1998). Some practical examples of computer-adaptive sequential testing. Journal of Educational Measurement, 35, 229-249.
Mao, L. (2014). Designing p-optimal item pools for multidimensional computerized adaptive testing (Doctoral dissertation).
McBride, J. R. (1977). Some properties of a Bayesian adaptive ability testing strategy. Applied Psychological Measurement, 1, 121-140.
McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing (pp. 224-236). New York, NY: Academic Press.
McClarty, K. L., Sperling, R. A., & Dodd, B. G. (2006, April). A variant of the progressive-restricted item exposure control procedure in computerized adaptive testing systems based on the 3PL and partial credit models. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Meijer, R., & Nering, M. L. (1999). Computerized adaptive testing: Overview and introduction. Applied Psychological Measurement, 23, 187-194.
Mills, C. N., & Stocking, M. L. (1996). Practical issues in large-scale computerized adaptive testing. Applied Measurement in Education, 9, 287-304.
Minnesota Department of Education. (2017). Technical manual for Minnesota standards-based accountability and English language proficiency assessments: For the academic year 2015-2016. Roseville, MN: Minnesota Department of Education.
National Council of State Boards of Nursing. (2016). NCLEX-RN examination: Test plan for the National Council Licensure Examination for Registered Nurses. Chicago, IL: National Council of State Boards of Nursing. Retrieved from https://www.ncsbn.org/RN_Test_Plan_2016_Final.pdf
Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351-356.
Park, R. (2015). Investigating the impact of a mixed-format item pool on optimal test designs for multistage testing (Doctoral dissertation).
Parshall, C. G. (2002). Item development and pretesting in a CBT environment. In C. N. Mills, M. T. Potenza, J. J. Fremer, & W. C. Ward (Eds.), Computer-based testing: Building the foundation for future assessments (pp. 119-141). Mahwah, NJ: Lawrence Erlbaum Associates.
Parshall, C. G., Spray, J. A., Kalohn, J. C., & Davey, T. (2002). Practical considerations in computer-based testing. New York, NY: Springer-Verlag.
Patience, W. M., & Reckase, M. D. (1980, April). Effects of program parameters and item pool characteristics on the bias of a three-parameter tailored testing procedure. Paper presented at the annual meeting of the National Council on Measurement in Education, Boston, MA.
Patsula, L. N. (1999). A comparison of computerized adaptive testing and multistage testing (Doctoral dissertation).
Penfield, R. D. (2006). Applying Bayesian item selection approaches to adaptive tests using polytomous items. Applied Measurement in Education, 19, 1-20.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability (Vol. 4, pp. 321-333). Berkeley, CA: University of California Press.
R Core Team. (2018). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
Reckase, M. D. (1975, April). The effect of item choice on ability estimation when using a simple logistic tailored testing model. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.
Reckase, M. D. (1976, April). The effect of item pool characteristics on the operation of a tailored testing procedure. Paper presented at the spring meeting of the Psychometric Society, Murray Hill, NJ.
Reckase, M. D. (1989). Adaptive testing: The evolution of a good idea. Educational Measurement: Issues and Practice, 8, 11-15.
Reckase, M. D. (2010). Designing item pools to optimize the functioning of a computerized adaptive test. Psychological Test and Assessment Modeling, 52, 127-141.
Reckase, M. D., Ju, U., & Kim, S. (2017, November). Differences in the amount of adaptation exhibited by various computerized adaptive testing designs. Paper presented at the 17th Annual Maryland Assessment Research Center (MARC) Conference, Maryland.
Reckase, M. D., Ju, U., & Kim, S. (2018). Some measures of the amount of adaptation for computerized adaptive tests. In M. Wiberg, S. Culpepper, R. Janssen, J. González, & D. Molenaar (Eds.), Quantitative psychology: IMPS 2017 (Springer Proceedings in Mathematics & Statistics, Vol. 233, pp. 25-40). Cham: Springer.
Reckase, M. D., Ju, U., & Kim, S. (2019). How adaptive is an adaptive test: Are all adaptive tests adaptive? Journal of Computerized Adaptive Testing, 7, 1-14.
Revuelta, J., & Ponsoda, V. (1998). A comparison of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 35, 311-327.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph No. 17). Richmond, VA: Psychometric Society.
Shin, C. D., Chien, Y., Way, W. D., & Swanson, L. (2009). Weighted penalty model for content balancing in CATs (Pearson Research Report). Retrieved from http://www.pearsonassessments.com/NR/rdonlyres/99A4327B-5968-4AB2-A8CD-8D502D22C2DE/0/WeightedPenaltyModel.pdf
Stark, S., & Chernyshenko, O. S. (2006). Multistage testing: Widely or narrowly applicable? Applied Measurement in Education, 19, 257-260.
Stocking, M. L. (1993). Controlling item exposure rates in a realistic adaptive testing paradigm (ETS Research Report No. 93-2). Princeton, NJ: Educational Testing Service.
Stocking, M. L. (1994). Three practical issues for modern adaptive testing item pools (ETS Research Report No. 93-2). Princeton, NJ: Educational Testing Service.
Sympson, J. B., & Hetter, R. D. (1985, October). Controlling item-exposure rates in computerized adaptive testing. Paper presented at the annual meeting of the Military Testing Association. San Diego, CA: Navy Personnel Research and Development Center.
Swanson, L., & Stocking, M. L. (1993). A model and heuristic for solving very large item selection problems. Applied Psychological Measurement, 17, 151-166.
The MathWorks, Inc. (1984-2015). MATLAB version 10.1. Natick, MA: The MathWorks, Inc.
Thissen, D., & Mislevy, R. J. (2000). Testing algorithms. In H. Wainer, N. J. Dorans, & D. Eignor (Eds.), Computerized adaptive testing: A primer (2nd ed., pp. 101-133). Mahwah, NJ: Lawrence Erlbaum.
Thissen, D., & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika, 47, 397-412.
Urry, V. W. (1977). Tailored testing: A successful application of latent trait theory. Journal of Educational Measurement, 14, 181-196.
Vale, C. D., & Weiss, D. J. (1977). A rapid item-search procedure for Bayesian adaptive testing (Research Report 77-4). Minneapolis, MN: University of Minnesota, Psychometric Methods Program, Adaptive Testing Laboratory.
van der Linden, W. J. (1998a). Bayesian item selection criteria for adaptive testing. Psychometrika, 63, 201-216.
van der Linden, W. J. (1998b). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22, 195-211.
van der Linden, W. J. (2010). Constrained adaptive testing with shadow tests. In W. J. van der Linden & C. A. Glas (Eds.), Elements of adaptive testing (pp. 31-55). New York, NY: Springer.
van der Linden, W. J., Ariel, A., & Veldkamp, B. P. (2006). Assembling a computerized adaptive testing item pool as a set of linear tests. Journal of Educational and Behavioral Statistics, 31, 81-99.
van der Linden, W. J., & Reese, L. M. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22, 259-270.
Veldkamp, B. P., & van der Linden, W. J. (2000). Designing item pools for computerized adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice. Dordrecht: Kluwer.
Wang, W., Drasgow, F., & Liu, L. (2016). Classification accuracy of mixed format tests: A bi-factor item response theory approach. Frontiers in Psychology, 7.
Wang, S., Lin, H., Chang, H.-H., & Douglas, J. (2016). Hybrid computerized adaptive testing: From group sequential design to fully sequential design. Journal of Educational Measurement, 53, 45-62.
Wang, T., & Vispoel, W. P. (1998). Properties of ability estimation methods in computerized adaptive testing. Journal of Educational Measurement, 35, 109-135.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.
Way, W. D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17, 17-27.
Way, W., Zara, A., & Leahy, J. (1996, April). Modifying the NCLEX CAT item selection algorithm to improve item exposure. Paper presented at the annual meeting of the American Educational Research Association, New York, NY.
Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.
Weiss, D. J. (2011). Better data from better measurements using computerized adaptive testing. Journal of Methods and Measurement in the Social Sciences, 2, 1-27.
Wendt, A. (2008). Investigation of the item characteristics of innovative item formats. CLEAR Exam Review, 19, 22-28.
Wise, S. L., Bhola, D. S., & Yang, S.-T. (2006). Taking the time to improve the validity of low-stakes tests: The effort-monitoring CBT. Educational Measurement: Issues and Practice, 25, 21-30.
Wise, S. L., & Kingsbury, G. G. (2000). Practical issues in developing and maintaining a computerized adaptive testing program. Psicológica, 21, 135-155.
Xing, D., & Hambleton, R. K. (2004). Impact of test design, item quality, and item bank size on the psychometric properties of computer-based credentialing examinations. Educational and Psychological Measurement, 64, 5-21.
Xiong, X. (2018). A hybrid strategy to construct multistage adaptive tests. Applied Psychological Measurement, 42, 630-643.
Yan, D., von Davier, A. A., & Lewis, C. (Eds.). (2016). Computerized multistage testing: Theory and applications. Boca Raton, FL: CRC Press.
Yao, L., & Schwarz, R. D. (2006). A multidimensional partial credit model with associated item and test statistics: An application to mixed-format tests. Applied Psychological Measurement, 37, 3-23.
Zenisky, A. L. (2004). Evaluating the effects of several multi-stage testing design variables on selected psychometric outcomes for certification and licensure assessment (Doctoral dissertation).
Zheng, Y., Nozawa, Y., Gao, X., & Chang, H.-H. (2012). Multistage adaptive testing for a large-scale classification test: The designs, automated heuristic assembly, and comparison with other testing modes (ACT Research Report 2012-6). Iowa City, IA: ACT, Inc.
Zhou, X., & Reckase, M. D. (2014). Optimal item pool design for computerized adaptive tests with polytomous items using GPCM. Psychological Test and Assessment Modeling, 56, 255-274.