Designing p-Optimal Item Pools for Multidimensional Computerized Adaptive Testing

By

Liyang Mao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Measurement and Quantitative Methods - Doctor of Philosophy

2014

ABSTRACT

DESIGNING P-OPTIMAL ITEM POOLS FOR MULTIDIMENSIONAL COMPUTERIZED ADAPTIVE TESTS

By

Liyang Mao

The interest in multidimensional computerized adaptive testing (MCAT) has grown considerably over the last few years. While a significant amount of research has been conducted on item selection and ability estimation methods for MCAT, few studies have specifically addressed item pool design for MCAT. To ensure the proper functioning of MCAT, a well-designed item pool is imperative. A well-designed item pool should consist of a number of well-balanced items that achieve appropriate test precision and item usage, as well as lower the cost of item creation. One method to develop such an item pool is the p-optimality method, proposed by Reckase (2003, 2007) for unidimensional CAT. This paper aims to develop p-optimal item pools for MCAT by extending Reckase's method to a multidimensional context. The extension includes the generation of a multidimensional optimal item based on the D-Optimality item selection criterion, the definition of the MDIFF-bin to describe multidimensional items succinctly for item pool design, and the interpretation of the p-optimal item pool in a multidimensional context.

In this paper, a total of 24 p-optimal item pools were designed and then developed for different test specifications, with different correlations among dimensions, based on different bin sizes, and under conditions with or without item exposure control. The characteristics of the 24 p-optimal item pools are summarized. A simulation study was conducted to evaluate the performance of the p-optimal item pools against baseline pools from the existing research literature. Results show that p-optimal item pools achieve similar levels of measurement accuracy as the baseline pools, but they consist of fewer items and perform better in terms of item pool usage and test security. The characteristics and the performance of the p-optimal item pools are affected by factors such as test specification, correlation among dimensions, bin size, and item exposure control. The results of this study can provide a general guideline for item pool development for MCAT. More importantly, because the p-optimal item pool is specifically tailored to its MCAT program, the p-optimal item pool design procedure described in this study can be adapted to other MCAT programs with different features and purposes. The end product of the p-optimal item pool design can be used as an instructive guide for item creation, item pool development, and item pool management.

ACKNOWLEDGEMENTS

This dissertation is not just a research study. It reflects how I have grown during my studies at Michigan State University. Without the guidance, support, encouragement, and care from many people, I could not have accomplished this dissertation and graduated with a PhD degree. I want to express my deepest appreciation to my advisor, Mark Reckase. I thank him for giving me the great opportunity to work with him on both coursework and research projects; I thank him for his profound knowledge and enthusiasm, which encourage me to be a better scholar; I thank him for his patience and support while I was exploring my research interests; and I also thank him for his generous help whenever I needed it.
I also want to express my gratitude to Edward Roeber, a member of my dissertation committee and guidance committee, for his support, encouragement, and warm care throughout my PhD study. My gratitude also extends to my dissertation committee members, Joseph Martineau and Richard Houang, who have provided me with insightful suggestions and help for my dissertation. I also would like to thank Bettie Menchik for providing me the opportunity to work on several education policy projects for the Michigan Department of Education. I would attribute most of my knowledge about education policy to her generous help and guidance. I truly enjoyed working with her and we have become very good friends. I also thank Neelam Kher, Michelle William, and Kimberly Maier for their support during my study.

I am sincerely grateful to my friends Tingqiao Chen, Chang Chi, Emre Gonulates, Eun Hye Ham, Xin Luo, Bing Tong, Keyin Wang, Xuechun Zhou, and many others, who have enriched my life in graduate school. I also want to thank my wonderful boyfriend, Jianxun Wang, for making me smile and happy in my life. Finally, I want to express my appreciation to my parents, Yukun Hou and Yingjian Mao, for their unconditional love since I was born.

TABLE OF CONTENTS

LIST OF TABLES ........ viii
LIST OF FIGURES ........ xii
Chapter 1 Introduction ........ 1
Chapter 2 Unidimensional and Multidimensional CAT ........ 6
2.1 Computer Adaptive Testing ........ 6
2.2 Unidimensional IRT and CAT ........ 7
2.2.1 Unidimensional IRT Models ........ 7
2.2.2 Item Selection Methods for UCAT ........ 8
2.2.3 Ability Estimation Methods for UCAT ........ 9
2.2.4 Practical Constraints for UCAT ........ 10
2.3 Multidimensional IRT and CAT ........ 12
2.3.1 Multidimensional IRT Models ........ 12
2.3.2 Generalization of UCAT to MCAT ........ 16
2.3.3 Item Selection Methods for Multidimensional CAT ........ 17
2.3.4 Ability Estimation Methods for Multidimensional CAT ........ 19
2.3.5 Stopping Rules for MCAT ........ 22
2.3.6 Practical Constraints for MCAT ........ 22
Chapter 3 p-Optimality Method and the Extension to MCAT ........ 25
3.1 From Optimal Item Pool to p-Optimal Item Pool ........ 25
3.2 p-Optimal Item Pool Design for UCAT ........ 26
3.3 Extending the p-Optimality Method to MCAT ........ 29
3.3.1 Optimal Item Generation ........ 32
3.3.2 Interpretation for the "p-Optimal" ........ 34
3.3.3 Extending the "bin" concept ........ 35
3.3.4 An example of the p-optimal item pool design for MCAT ........ 36
3.3.5 p-Optimal Item Pool Design for MCAT with Exposure Control ........ 37
3.3.6 p-Optimal Item Pool Design for MCAT with Non-Simple Structure ........ 40
Chapter 4 Study Design and Procedures ........ 44
4.1 MCAT Algorithms ........ 44
4.2 Simulation Procedure ........ 45
Phase I. p-Optimal Item Pool Design ........ 45
Phase II. p-Optimal Item Pool Development ........ 47
Phase III. Baseline Pool Development ........ 48
Phase IV. Simulation Study Conduct ........ 49
4.3 Evaluation Criteria ........ 51
Chapter 5 Simulation Results ........ 54
5.1 Item Pool Characteristics ........ 54
5.1.1 Summary for Item Pool Characteristics ........ 54
5.1.2 Item distribution for p-optimal item pools ........ 58
5.2 Performance of the p-Optimal Item Pools ........ 63
5.2.1 Performance for item pools based on Test Specification 1 (high correlation) ........ 64
5.2.2 Performance for item pools based on Test Specification 1 (moderate correlation) ........ 73
5.2.3 Performance for item pools based on Test Specification 2 (high correlation) ........ 82
5.2.4 Performance for item pools based on Test Specification 2 (moderate correlation) ........ 90
5.2.5 Performance for item pools based on Test Specification 3 (high correlation) ........ 99
5.2.6 Performance for item pools based on Test Specification 3 (moderate correlation) ........ 108
Chapter 6 Discussion and Conclusion ........ 116
6.1 Summary of Results ........ 116
6.2 Discussion of Results ........ 119
6.3 Implications ........ 123
6.4 Limitation and Future Studies ........ 125
APPENDIX ........ 127
REFERENCES ........ 134

LIST OF TABLES

Table 3.1: The p-optimal pool for two examinees ........ 29
Table 4.1: Mean and covariance matrix for the two examinee populations ........ 47
Table 4.2: Bin count for a .96-optimal item pool ........ 47
Table 4.3: Bin count for a .86-optimal item pool ........ 47
Table 4.4: Item Statistics for the Three Baseline pools ........ 50
Table 4.5: The 37 θ Points for the Three Dimensional MCAT ........ 50
Table 5.1: Summary for the .96-optimal item pools and baseline pools ........ 55
Table 5.2: Summary for the .86-optimal item pools and baseline pools ........ 55
Table 5.3: Item distribution for the .96-optimal item pools ........ 59
Table 5.4: Item distribution for the .86-optimal item pools ........ 59
Table 5.5: The performance of the .96- and .86-optimal pool and the baseline pool without exposure control ........ 65
Table 5.6: The performance of the .96- and .86-optimal pool and the baseline pool with exposure control ........ 65
Table 5.7: The performance of the .96- and .86-optimal pool and the baseline pool without exposure control ........ 74
Table 5.8: The performance of the .96- and .86-optimal pool and the baseline pool with exposure control ........ 74
Table 5.9: The performance of the .96- and .86-optimal pool and the baseline pool without exposure control ........ 83
Table 5.10: The performance of the .96- and .86-optimal pool and the baseline pool with exposure control ........ 83
Table 5.11: Conditional Bias for the θ estimates without exposure control ........ 85
Table 5.12: Conditional Bias for the θ estimates with exposure control ........ 86
Table 5.13: Conditional RMSE for the θ estimates without exposure control ........ 87
Table 5.14: Conditional RMSE for the θ estimates with exposure control ........ 88
Table 5.15: The performance of the .96- and .86-optimal pool and the baseline pool without exposure control ........ 91
Table 5.16: The performance of the .96- and .86-optimal pool and the baseline pool with exposure control ........ 91
Table 5.17: Conditional Bias for the θ estimates without exposure control ........ 94
Table 5.18: Conditional Bias for the θ estimates with exposure control ........ 95
Table 5.19: Conditional RMSE for the θ estimates without exposure control ........ 96
Table 5.20: Conditional RMSE for the θ estimates with exposure control ........ 97
Table 5.21: The performance of the .96- and .86-optimal pool and the baseline pool ........ 101
Table 5.22: The performance of the .96- and .86-optimal pool and the baseline pool ........ 101
Table 5.23: Conditional Bias for the θ estimates without exposure control ........ 103
Table 5.24: Conditional Bias for the θ estimates with exposure control ........ 104
Table 5.25: Conditional RMSE for the θ estimates without exposure control ........ 105
Table 5.26: Conditional RMSE for the θ estimates with exposure control ........ 106
Table 5.27: The performance of the .96- and .86-optimal pool and the baseline pool ........ 109
Table 5.28: The performance of the .96- and .86-optimal pool and the baseline pool ........ 109
Table 5.29: Conditional Bias for the θ estimates without exposure control ........ 111
Table 5.30: Conditional Bias for the θ estimates with exposure control ........ 112
Table 5.31: Conditional RMSE for the θ estimates without exposure control ........ 113
Table 5.32: Conditional RMSE for the θ estimates with exposure control ........ 114
Table A.1: Bin count table for the .96-optimal item pool (Test Specification 1, high correlation, without item exposure control) ........ 128
Table A.2: Bin count table for the .86-optimal item pool (Test Specification 1, high correlation, without item exposure control) ........ 128
Table A.3: Bin count table for the .96-optimal item pool (Test Specification 1, moderate correlation, without item exposure control) ........ 128
Table A.4: Bin count table for the .86-optimal item pool (Test Specification 1, moderate correlation, without item exposure control) ........ 128
Table A.5: Bin count table for the .96-optimal item pool (Test Specification 2, high correlation, without item exposure control) ........ 129
Table A.6: Bin count table for the .86-optimal item pool (Test Specification 2, high correlation, without item exposure control) ........ 129
Table A.7: Bin count table for the .96-optimal item pool (Test Specification 2, moderate correlation, without item exposure control) ........ 129
Table A.8: Bin count table for the .86-optimal item pool (Test Specification 2, moderate correlation, without item exposure control) ........ 129
Table A.9: Bin count table for the .96-optimal item pool (Test Specification 3, high correlation, without item exposure control) ........ 130
Table A.10: Bin count table for the .86-optimal item pool (Test Specification 3, high correlation, without item exposure control) ........ 130
Table A.11: Bin count table for the .96-optimal item pool (Test Specification 3, moderate correlation, without item exposure control) ........ 130
Table A.12: Bin count table for the .86-optimal item pool (Test Specification 3, moderate correlation, without item exposure control) ........ 130
Table A.13: Bin count table for the .96-optimal item pool (Test Specification 1, high correlation, with item exposure control) ........ 131
Table A.14: Bin count table for the .86-optimal item pool (Test Specification 1, high correlation, with item exposure control) ........ 131
Table A.15: Bin count table for the .96-optimal item pool (Test Specification 1, moderate correlation, with item exposure control) ........ 131
Table A.16: Bin count table for the .86-optimal item pool (Test Specification 1, moderate correlation, with item exposure control) ........ 131
Table A.17: Bin count table for the .96-optimal item pool (Test Specification 2, high correlation, with item exposure control) ........ 132
Table A.18: Bin count table for the .86-optimal item pool (Test Specification 2, high correlation, with item exposure control) ........ 132
Table A.19: Bin count table for the .96-optimal item pool (Test Specification 2, moderate correlation, with item exposure control) ........ 132
Table A.20: Bin count table for the .86-optimal item pool (Test Specification 2, moderate correlation, with item exposure control) ........ 132
Table A.21: Bin count table for the .96-optimal item pool (Test Specification 3, high correlation, with item exposure control) ........ 133
Table A.22: Bin count table for the .86-optimal item pool (Test Specification 3, high correlation, with item exposure control) ........ 133
Table A.23: Bin count table for the .96-optimal item pool (Test Specification 3, moderate correlation, with item exposure control) ........ 133
Table A.24: Bin count table for the .86-optimal item pool (Test Specification 3, moderate correlation, with item exposure control) ........ 133

LIST OF FIGURES

Figure 3.1: Information Function for a Test Item Fit by the Unidimensional Rasch Model ........ 27
Figure 3.2: Item distributions for examinee with true ability (0.7, 1.5) ........ 38
Figure 3.3: Item distributions for examinee with true ability (-1.1, -1.0) ........ 38
Figure 3.4: Item distributions for the two examinees ........ 39
Figure 3.5: Increase in required pool size as number of examinees increases ........ 39
Figure 3.6: The test information on three directions ........ 43
Figure 4.1: The 29 θ Points for the Two Dimensional MCAT ........ 50
Figure 4.2: The 37 θ Points for the Three Dimensional MCAT ........ 51
Figure 5.1: The direction of the information for items with a = (1,1,1) ........ 60
Figure 5.2: Item distribution for the .96-optimal item pool without exposure control ........ 62
Figure 5.3: Item distribution for the .86-optimal item pool without exposure control ........ 62
Figure 5.4: Item distribution for the .96-optimal item pool with exposure control ........ 63
Figure 5.5: Item distribution for the .86-optimal item pool with exposure control ........ 63
Figure 5.6: Conditional bias for the θ estimates without exposure control ........ 68
Figure 5.7: Conditional bias for the θ estimates with exposure control ........ 69
Figure 5.8: Conditional RMSE for the θ estimates without exposure control ........ 70
Figure 5.9: Conditional RMSE for the θ estimates with exposure control ........ 71
Figure 5.10: Conditional bias for the θ estimates without exposure control ........ 77
Figure 5.11: Conditional bias for the θ estimates with exposure control ........ 78
Figure 5.12: Conditional RMSE for the θ estimates without exposure control ........ 79
Figure 5.13: Conditional RMSE for the θ estimates with exposure control ........ 80

Chapter 1 Introduction

Over the last few decades, computerized adaptive testing (CAT) has achieved great popularity in educational assessments.
Different from a conventional paper-and-pencil test, a CAT uses a computer to deliver test items that are tailored to the ability level of each examinee. Such delivery of tests has several advantages, including increased measurement precision, reduced testing time, faster score reporting, and flexible scheduling of examinees (Wainer, 2000). Starting in the 1990s, CAT has been successfully applied to many operational testing programs, including the Armed Services Vocational Aptitude Battery (ASVAB), the Computerized Adaptive Placement Assessment and Support Services (COMPASS), the Graduate Management Admission Test (GMAT), and the National Council Licensure Examination (NCLEX). Furthermore, in the 2014-15 school year, almost half of the states in the United States will replace their current K-12 assessments with the CAT-based assessment system developed by the Smarter Balanced Assessment Consortium (SBAC, 2013).

Most operational adaptive tests have been developed based on a unidimensional item response theory (UIRT) model. Nevertheless, the interest in CAT based on multidimensional item response theory (MIRT) models (referred to as multidimensional CAT, or MCAT) has grown considerably, as shown by the increasing number of articles in the literature (e.g., Segall, 2010; Seitz & Frey, 2013; Wang & Chang, 2011; Yao, 2013). One reason that MCAT has become very popular is that current educational assessments often cover multiple content standards, so those assessments may not be strictly unidimensional (Reckase, 2009). In a mathematics test for Grade 4, for example, there is a concern about providing an adequate number of algebra, geometry, number operation, data analysis, and measurement items to each examinee, because these content areas are defined as separate components of mathematics proficiency by the Common Core State Standards (CCSS, 2010). In this situation, it would be straightforward to apply MCAT to assessments with multidimensional features.

In addition, MCAT would be preferred when diagnostic information (i.e., subscores) is to be reported. In educational assessments, although the total score is useful for some decision making, subscores complement the total score by providing information about examinees' strengths and weaknesses in each content area. Therefore, test users usually ask for subscores for diagnostic purposes. Teachers also prefer subscores because subscores can help them design specific instruction for each student. In MCAT, subscores for all content areas can be estimated simultaneously using a MIRT model. In unidimensional CAT (UCAT), however, the test needs to be carried out separately, one content area at a time, to estimate each subscore. Therefore, in subscore estimation, MCAT often yields better measurement efficiency than UCAT (Luecht, 1996; Segall, 1996; Wang & Chen, 2004; Yao, 2012; Mao, Luo & Zhou, 2013).

Like a UCAT program, an MCAT program also consists of several components and procedures. It begins with an item pool that contains an adequate number of items calibrated using a MIRT model. The MCAT then usually follows an iterative process: (1) assign an initial ability level to an examinee, (2) select a test item from the item pool using an item selection method, (3) administer the selected test item to the examinee and collect the response, and (4) score the response and update the ability estimates. This process continues until a certain stopping criterion is met.
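To make the four-step loop concrete, the sketch below lays out its structure in Python. It is only a structural outline under toy assumptions: the two-dimensional Rasch-type pool is generated at random, and select_item and update_theta are crude placeholders standing in for the item selection and ability estimation methods reviewed in Chapter 2, not the procedures used in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_item(theta_hat, pool, administered):
    """Placeholder selection rule: pick the unused item whose intercept-based
    difficulty (-d) is closest to the mean of the current ability estimate."""
    available = [i for i in range(len(pool["d"])) if i not in administered]
    target = theta_hat.mean()
    return min(available, key=lambda i: abs(-pool["d"][i] - target))

def update_theta(theta_hat, responses, items, pool):
    """Placeholder scoring rule: nudge the dimensions the last item loads on,
    up or down depending on the last response (a stand-in for MLE/Bayesian updates)."""
    last_item, last_resp = items[-1], responses[-1]
    step = 0.3 if last_resp == 1 else -0.3
    return theta_hat + step * pool["a"][last_item]

# A toy two-dimensional Rasch-type pool: rows of `a` are 0/1 loading vectors, d is the intercept.
pool = {"a": rng.integers(0, 2, size=(50, 2)).astype(float), "d": rng.normal(0, 1, 50)}
pool["a"][pool["a"].sum(axis=1) == 0] = 1.0          # avoid items that load on no dimension
true_theta = np.array([0.5, -0.5])

theta_hat = np.zeros(2)                              # (1) initial ability estimate
items, responses = [], []
while len(items) < 20:                               # stopping rule: fixed test length
    i = select_item(theta_hat, pool, items)          # (2) select an item from the pool
    p = 1 / (1 + np.exp(-(pool["a"][i] @ true_theta + pool["d"][i])))
    u = int(rng.random() < p)                        # (3) administer and record the response
    items.append(i); responses.append(u)
    theta_hat = update_theta(theta_hat, responses, items, pool)   # (4) update the estimate

print("administered:", items, "final estimate:", np.round(theta_hat, 2))
```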
Operational implementation of CAT often includes constraints such as content balancing and item exposure control to address validity and security issues. While a significant amount of research has been conducted on generalizing the item selection and ability estimation methods from UCAT to MCAT (Mulder & van der Linden, 2009; Segall, 1996; Wang & Chang, 2011; Yao, 2013), few studies can be found that specifically address the item pool design issue for MCAT. In all of the existing studies about MCAT, the multidimensional item pools are either built from pure simulation (e.g., van der Linden, 1999) or created from operational UCAT programs or paper-and-pencil tests (e.g., Diao, 2009; Song, 2010; Yao, 2013). The quality of these item pools is unknown. Because a CAT program cannot function well without an item pool that contains a sufficient number of appropriate items for all the examinees, item pool design is critical for MCAT programs. Therefore, in order to design quality item pools for MCAT, current item pool design methods for UCAT need to be generalized to MCAT.

For UCAT programs, there are two methods focusing on item pool design: one is the shadow test approach (Veldkamp & van der Linden, 2000); the other is the p-optimality approach (Reckase, 2003, 2010). According to Veldkamp and van der Linden (2000), before items are selected for administration, a shadow test is first assembled from a large item pool (usually called the "master pool") using a linear integer programming model. A test item is then selected from the shadow test, not directly from the item pool, to administer. The integer programming model guarantees that all constraints (e.g., content balancing and item exposure control) on test administration can be met. However, it is still unclear how to design a master pool and what the desired features of a master pool are. Without a multidimensional master pool, the shadow test approach cannot be implemented for MCAT programs.

Unlike the shadow test approach, the p-optimality approach developed by Reckase (2003, 2010) directly addresses the item pool design issue. Reckase's p-optimal item pool is defined as an item pool "that always has an item available for selection that p% matches the desired characteristics specified by the item selection routine for the CAT" (Reckase, 2007). To design such an item pool, an examinee is first randomly sampled from the target examinee population to take the CAT. Each administered item is simulated to be optimal for this examinee. This procedure is then repeated for the subsequent examinees. Because items created for one examinee can be used for another, the p-optimal item pool is the union of the item sets that are administered to each examinee. After the simulation procedure is repeated for a large number of examinees, the number of items in the item pool eventually approaches an upper bound. Thus, the final product of the simulation is an item pool blueprint that specifies the pool size and the item distribution of the pool. This blueprint can be directly used as the target for item creation and item pool development.

Therefore, this study aims to generalize the p-optimality method (Reckase, 2003, 2007) to a multidimensional context. Although the generalization seems conceptually straightforward (simply implement the simulation procedure based on a MIRT model), a number of technical challenges need to be solved before the procedure can be implemented in practice.
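The design simulation just described can be sketched in code. The fragment below is a rough illustration for the unidimensional Rasch case without exposure control, in the spirit of Reckase's procedure rather than a reproduction of it: the bin width, test length, number of simulees, and grid-based EAP scorer are arbitrary choices, and tallying the shared pool as the largest single-examinee demand per difficulty bin is one simplified reading of the union step.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
BIN_WIDTH, TEST_LENGTH, N_EXAMINEES = 0.4, 30, 500

def rasch_p(theta, b):
    return 1 / (1 + np.exp(-(theta - b)))

def eap(responses, bs, grid=np.linspace(-4, 4, 81)):
    """Simple EAP estimate under a N(0, 1) prior for the unidimensional Rasch model."""
    post = np.exp(-0.5 * grid ** 2)
    for u, b in zip(responses, bs):
        p = rasch_p(grid, b)
        post *= p if u == 1 else (1 - p)
    return float((grid * post).sum() / post.sum())

def to_bin(b):
    return round(b / BIN_WIDTH) * BIN_WIDTH        # label a difficulty by the centre of its bin

pool_bins = Counter()                              # blueprint: items required per difficulty bin
for _ in range(N_EXAMINEES):
    true_theta = rng.normal(0, 1)
    theta_hat, responses, bs, used = 0.0, [], [], Counter()
    for _ in range(TEST_LENGTH):
        b = theta_hat                              # optimal Rasch item: b equal to the current estimate
        u = int(rng.random() < rasch_p(true_theta, b))
        responses.append(u); bs.append(b)
        used[to_bin(b)] += 1                       # this examinee needs one more item in this bin
        theta_hat = eap(responses, bs)
    for k, n in used.items():                      # a shared pool must cover the largest single-examinee demand
        pool_bins[k] = max(pool_bins[k], n)

print("blueprint size:", sum(pool_bins.values()))
print({k: pool_bins[k] for k in sorted(pool_bins)})
```

Its output is a bin-count blueprint, in the spirit of the bin count tables listed in the Appendix.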
One example of these challenges concerns the optimal item itself. When the unidimensional Rasch model (Lord, 1980) is used, the optimal item is unique for each examinee. In the multidimensional context, however, the optimal item is not unique, because one optimal item can be found in each direction of measurement. How to select the most appropriate optimal item is therefore the first challenge.

The MCAT in this study is based on the multidimensional Rasch model. The multidimensional Rasch model was selected because the idea of the p-optimal item pool was first proposed for the unidimensional Rasch model; it is thus straightforward to choose the multidimensional Rasch model when this idea is extended to MCAT for the first time. In this study, a p-optimal item pool will first be generated based on the simplest two-dimensional model with simple structure. Then p-optimal item pools for MCAT with higher dimensions and with non-simple structure will be generated.

Specifically, the research questions of this study are:
1. Can the p-optimality method be generalized to design item pools for MCAT based on the test design and the examinee population characteristics?
2. How does the performance of an MCAT using the p-optimality item pool design method compare with the performance using other item pool designs?
3. How do the characteristics of the p-optimal item pool change with exposure control and different test specifications (i.e., the number of dimensions, the correlation among dimensions, and whether or not the structure is simple)?

Previous work has suggested that MCATs have great potential, but few studies have investigated item pool design for MCATs. By extending the idea of the p-optimal item pool to a multidimensional context, the results from this study could provide a general guideline about the desired characteristics of the p-optimal item pool for certain MCATs. More importantly, because the p-optimal item pools are specifically tailored to the MCAT programs, the simulation procedures described in this study can be adapted to other MCAT programs with different features and different test purposes. The end product of the p-optimal item pool design describes the characteristics of the optimal item pool. If the operational item pool is developed based on the p-optimal item pool design, the item pool is expected to ensure the proper functioning of MCATs and to produce reliable measurement outcomes.

Chapter 2 Unidimensional and Multidimensional CAT

This chapter first introduces computerized adaptive testing (CAT) in Section 2.1. The unidimensional IRT models and unidimensional CAT are briefly discussed in Section 2.2. The multidimensional IRT models and multidimensional CAT are explained in Section 2.3.

2.1 Computer Adaptive Testing

CAT is a special form of computer-delivered test that is adaptive to the examinee's ability level. "Adaptive" means that test items are selected on the basis of the examinee's responses to the items previously administered. One early use of adapting the difficulty of a test to each individual examinee is the Binet-Simon (1905) intelligence test. The items in this test were grouped according to mental age, and the selection of items was determined by the examinee's mental age estimate, which was derived from the responses to the items administered earlier. Beginning in the 1970s, with the development of item response theory and breakthroughs in modern computer technology, the idea of adaptive testing was refined and developed into the current CAT procedures.
For a typical CAT program, the test begins with the first item selected based on an initial estimate of an examinee's ability level. After each item is administered, a new ability level is estimated and the next item with optimal properties at the new estimate is selected for administration. This process is repeated until a certain stopping rule is met, such as the proficiency estimate reaching adequate precision or a fixed number of items having been administered. Therefore, a basic CAT application consists of four major components: an item pool, an item selection procedure, a scoring procedure, and a test stopping rule (Reckase, 1989). In practice, constraints such as content balancing and exposure control are often imposed on the item selection procedure to ensure test validity and test security.

2.2 Unidimensional IRT and CAT

2.2.1 Unidimensional IRT Models

Item response theory (IRT) is a group of mathematical models that describe the relationship between examinee ability and the probability of answering test items correctly. Unidimensional item response theory (UIRT) models assume that examinees' responses to test items depend on one single latent trait (Lord, 1980). The item response function (IRF) for the three-parameter logistic (3PL) model (Birnbaum, 1968) is defined by

P(u_{ij} = 1 \mid \theta_j, a_i, b_i, c_i) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}},    (2.1)

where P(u_{ij} = 1 | θ_j, a_i, b_i, c_i) is the probability of a correct response to item i by person j; u_{ij} is the response to item i by person j (1 is correct and 0 is incorrect); θ_j is person j's continuous latent ability; b_i, the item difficulty parameter of item i, denotes the inflection point of the IRF; a_i, the discrimination parameter for item i, is proportional to the slope of the IRF at its inflection point; and c_i, the lower asymptote of the IRF, is the guessing parameter for item i. If the guessing parameter is set to 0 for all the items, the 3PL model becomes a two-parameter logistic (2PL) model specified by the following IRF:

P(u_{ij} = 1 \mid \theta_j, a_i, b_i) = \frac{e^{a_i(\theta_j - b_i)}}{1 + e^{a_i(\theta_j - b_i)}},    (2.2)

and if the item discrimination parameter is further restricted to be 1 across all the items, the 2PL model reduces to the Rasch model, which is defined by

P(u_{ij} = 1 \mid \theta_j, b_i) = \frac{e^{\theta_j - b_i}}{1 + e^{\theta_j - b_i}}.    (2.3)

In IRT, the term "information," also called Fisher information, plays an important role in parameter estimation, as it is a statistical indicator of the quality of the estimate of a parameter. The formula for item information can be derived in a number of different ways, but the one provided by Lord (1980) is the most well known. Let P_i(θ) denote the IRF for item i, and let Q_i(θ) = 1 - P_i(θ). Then the Fisher information can be obtained by

I_i(\theta) = \frac{\left[\partial P_i(\theta)/\partial\theta\right]^2}{P_i(\theta)\,Q_i(\theta)}.    (2.4)

When the 3PL model is used, (2.4) becomes

I_i(\theta) = a_i^2\,\frac{Q_i(\theta)}{P_i(\theta)}\left[\frac{P_i(\theta) - c_i}{1 - c_i}\right]^2.    (2.5)

When c_i = 0, the information for the 2PL model is

I_i(\theta) = a_i^2\,P_i(\theta)\,Q_i(\theta),    (2.6)

and when c_i = 0 and a_i = 1, the information for the Rasch model can be simplified to

I_i(\theta) = P_i(\theta)\,Q_i(\theta).    (2.7)

IRT models and Fisher information play a central role in CAT, from item calibration to item selection and ability estimation. In Sections 2.2.2 and 2.2.3, item selection methods and ability estimation methods for unidimensional CAT (UCAT) will be briefly introduced. The practical constraints for UCAT will be introduced in Section 2.2.4.
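As a quick numerical check of equations (2.1) and (2.5)-(2.7), the following sketch evaluates the 3PL item information and confirms that it collapses to the 2PL and Rasch special cases when c_i = 0 and a_i = 1; the parameter values are arbitrary.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """IRF for the 3PL model, equation (2.1)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Item information for the 3PL model, equation (2.5)."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

theta = np.linspace(-3, 3, 7)
# With c = 0 the 3PL reduces to the 2PL (2.6); with a = 1 and c = 0 it reduces to the Rasch model (2.7).
print(np.round(info_3pl(theta, a=1.2, b=0.0, c=0.2), 3))      # 3PL item
print(np.round(info_3pl(theta, a=1.2, b=0.0, c=0.0), 3))      # 2PL: a^2 * P * Q
p = p_3pl(theta, 1.0, 0.0, 0.0)
print(np.allclose(info_3pl(theta, 1.0, 0.0, 0.0), p * (1 - p)))   # Rasch: P * Q
```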
2.2.2 Item Selection Methods for UCAT

Items in CAT are selected to be adaptive to the examinee's ability estimate. The most widely used item selection procedures for UCAT are the maximum Fisher information method (Weiss, 1982), the maximum posterior precision method (Owen, 1975), and the maximum global information method (Chang & Ying, 1996).

The maximum Fisher information method selects the item that provides the maximum amount of Fisher information at the examinee's current ability estimate, θ̂. Therefore, the unconstrained Fisher information-based item selection method administers the item that maximizes (2.4) at θ = θ̂.

The maximum posterior precision method is also known as Owen's Bayesian method. It selects the next item that maximizes the expected posterior precision of θ̂. In the early stage of a CAT, Owen's Bayesian method may select different items from the Fisher information method because of the effect of the prior. As the test length increases, the effect of the prior decreases and the results from the two methods become similar (Chang & Stout, 1993).

The maximum global information method selects items based on the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951), which is a non-symmetric measure of the difference between two probability distributions. In the early stage of a CAT, when the estimated ability is away from the examinee's true ability, the global information method performs better than the Fisher information method with respect to the efficiency and precision of ability estimation (Chang & Ying, 1996; Chen, Ankenmann, & Chang, 2000). After several items are administered and the estimated ability becomes close to the true ability, the KL divergence effectively reduces to Fisher information.

2.2.3 Ability Estimation Methods for UCAT

In CAT, after each response, the examinee's ability estimate is updated based on his or her responses to all previous items. The two commonly used estimation procedures are the maximum likelihood method and the Bayesian method (Bejar & Weiss, 1979).

The maximum likelihood estimation (MLE) method finds the estimate that results in the highest likelihood for the observed string of item responses. The likelihood function is defined as

L(u \mid \theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i}\,Q_i(\theta)^{1 - u_i},    (2.8)

where u_i is the response to item i, P_i(θ) is the IRF for item i, and Q_i(θ) = 1 - P_i(θ). The highest point on the likelihood function can be located by taking the derivative of (2.8). Iterative numerical methods such as the Newton-Raphson method (Wainer, 1990) are often used to solve the derivative equation. MLE ability estimates have desirable properties such as asymptotic unbiasedness. However, problems can arise at the early stage of a CAT, since MLE cannot provide finite estimates for responses to single items or for response patterns that are all correct or all incorrect. To solve this problem, we can either constrain θ̂ to a reasonable range (e.g., -4 to 4) or use alternative estimation methods such as a Bayesian procedure.

In Bayesian estimation, the posterior distribution of θ is obtained by combining the likelihood with a prior distribution through Bayes' theorem. The mean of the posterior distribution (referred to as EAP) or the mode of the posterior distribution (referred to as MAP) can be used as the examinee's ability estimate. EAP is more widely used in UCAT because of its stability (Bock & Mislevy, 1982).
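The two building blocks above, maximum-information selection and posterior-based scoring, can be combined into a miniature UCAT. The sketch below is illustrative only: it assumes a randomly generated 2PL pool, a standard normal prior, a fixed 15-item test, and a simple grid-based EAP update rather than a production-quality estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
GRID = np.linspace(-4, 4, 161)

def p_2pl(theta, a, b):
    return 1 / (1 + np.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1 - p)                        # equation (2.6)

def eap(responses, a, b):
    """EAP estimate: mean of the posterior under a standard normal prior."""
    post = np.exp(-0.5 * GRID ** 2)
    for u, ai, bi in zip(responses, a, b):
        p = p_2pl(GRID, ai, bi)
        post *= p if u == 1 else 1 - p
    return float((GRID * post).sum() / post.sum())

# A small 2PL pool and one simulated examinee
a_pool = rng.uniform(0.6, 1.8, 200)
b_pool = rng.normal(0, 1, 200)
true_theta, theta_hat = 1.0, 0.0
used, resp = [], []

for _ in range(15):
    info = info_2pl(theta_hat, a_pool, b_pool)         # information of every item at the current estimate
    info[used] = -np.inf                               # never re-administer an item
    k = int(np.argmax(info))                           # maximum Fisher information selection
    u = int(rng.random() < p_2pl(true_theta, a_pool[k], b_pool[k]))
    used.append(k); resp.append(u)
    theta_hat = eap(resp, a_pool[used], b_pool[used])

print("estimate after 15 items:", round(theta_hat, 2))
```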
2.2.4 Practical Constraints for UCAT

In practice, item selection that relies solely on the item selection methods described above might raise concerns about test validity and security. For instance, if a content area requires more instructional time, more items measuring this content area should be administered. Also, if some test items are overexposed and examinees have seen them before taking the test, the validity of the test will be affected. To address these considerations, operational testing programs often impose constraints on the item selection process. A brief summary of the content balancing constraint and the item exposure control constraint is provided below.

Content balancing procedures ensure that each examinee receives approximately the same proportion of items from each content area. The proportion can be determined based on the test specification. Several approaches have been proposed to ensure content balancing in UCAT, such as the weighted deviations model approach (Swanson & Stocking, 1993), the shadow-test approach (van der Linden & Reese, 1998), the modified multinomial model (Chen & Ankenmann, 2004), and the maximum priority index method (Cheng & Chang, 2009). Several research studies (e.g., Cheng & Chang, 2009; Leung, Chang, & Hau, 2003; van der Linden, 2005) have compared the performance of some of these methods. Generally speaking, the shadow-test approach and the maximum priority index are more flexible in dealing with several constraints, and the weighted deviations model is more widely used in operational testing programs (Buyske, 2005). Detailed descriptions of these content balancing methods can be found in He (2010) and van der Linden (2010).

Item exposure control procedures aim to prevent test items from being overexposed to examinees. Numerous item exposure control procedures have been proposed in the last few decades. The most commonly used procedure is the Sympson-Hetter (SH) procedure (Hetter & Sympson, 1997; Sympson & Hetter, 1985). This procedure assigns an exposure control parameter to each item based on the frequency of item selection during an iterative CAT simulation. During test operation, if the exposure control parameter is larger than a random number, the item is administered; otherwise another item is selected and goes through the SH procedure again.

Another well-known item exposure control procedure is the a-stratified procedure proposed by Chang and Ying (1999). This procedure mainly addresses the issue of overdrawing items with high discrimination from the item pool. The a-stratified procedure first partitions the item pool into several levels according to the a-parameters of the items. Items with small a-parameters have high priority in the early stage of the test, while items with large a-values are saved for the later stages of a CAT administration.

The maximum priority index procedure (Cheng & Chang, 2009) used for content balancing can also be used for item exposure control. This procedure adds a weight to the item selection criterion: items with higher exposure rates are weighted less. This weight index ensures that no item is exposed at more than a predetermined rate. A detailed summary of the item exposure control procedures described above can be found in Georgiadou, Triantafillou, and Economides (2007).
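To illustrate the Sympson-Hetter logic, the sketch below walks down an information-ordered candidate list and administers an item only when a uniform random draw does not exceed its exposure control parameter. The item IDs and exposure parameters are invented, and deriving those parameters in practice requires the iterative pre-operational simulation mentioned above.

```python
import numpy as np

rng = np.random.default_rng(3)

def sympson_hetter_filter(ranked_items, exposure_k, rng):
    """Walk down the item-selection ranking; administer item i only if a uniform
    draw falls below its exposure control parameter k_i (otherwise try the next item)."""
    for i in ranked_items:
        if rng.random() <= exposure_k[i]:
            return i
    return ranked_items[-1]                  # fall back to the last candidate if every draw fails

# Exposure control parameters would normally come from iterative pre-operational simulations;
# here they are simply made up for illustration.
exposure_k = {101: 0.25, 57: 0.80, 12: 1.00}
ranking_by_information = [101, 57, 12]       # item 101 is most informative at the current estimate

picks = [sympson_hetter_filter(ranking_by_information, exposure_k, rng) for _ in range(1000)]
for item in ranking_by_information:
    print(f"item {item}: administered in {picks.count(item) / 10:.1f}% of selections")
```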
All the item selection methods, ability estimation methods, and operational constraints discussed above are designed to select an appropriate item to administer and to pinpoint an examinee's true ability. They are all directly related to item pool design, because the desired item pool should always contain an appropriate item for every item selection and ability estimation step. In Chapter 3, I will explain how item selection methods, ability estimation methods, and operational constraints determine the item pool design. Before doing so, however, I first describe multidimensional IRT models and multidimensional CAT.

2.3 Multidimensional IRT and CAT

2.3.1 Multidimensional IRT Models

Most operational CAT programs use UIRT models. Nevertheless, the test items in educational and psychological assessments usually measure more than one latent trait, and many researchers have found that examinees often need to use multiple skills to answer test items (Childs & Oppler, 2000; Wu & Adams, 2006; Svetina, 2013). Similar to UIRT, multidimensional IRT (MIRT) is also a collection of mathematical models that describe the interaction between persons and test items. The difference is that MIRT models deal with situations in which more than one ability is required for test performance (Reckase, 2009).

There are two major types of MIRT models: compensatory and partially compensatory. The compensatory model is based on a linear combination of ability dimensions, so a high ability on one dimension can compensate for a low ability on another dimension. Sympson (1978), however, argued that the compensatory model is not realistic for certain types of items, because not all skills can compensate for each other. He therefore developed a partially compensatory model to address this issue. Although the partially compensatory model is more theoretically sound than compensatory models, studies have found that compensatory models actually fit real test data better (Ansley, 1984; Bolt & Lall, 2003). In addition, estimation difficulty for the partially compensatory model hinders its development and application. As a result, compensatory models are more prevalent in the current literature, and they are the only ones considered in this study.

The compensatory form of the multidimensional three-parameter logistic (M3PL) model is given by Reckase (2009) as

P(u_{ij} = 1 \mid \boldsymbol{\theta}_j, \mathbf{a}_i, d_i, c_i) = c_i + (1 - c_i)\,\frac{e^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_i}}{1 + e^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_i}},    (2.9)

where P(u_{ij} = 1 | θ_j, a_i, d_i, c_i) is the probability of a correct response to item i by person j; u_{ij} is the response to item i by person j (1 is correct and 0 is incorrect); θ_j is a row vector of person j's abilities in an m-dimensional space; a_i is a row vector of discrimination parameters for item i; d_i is a scalar that is related to item difficulty; and c_i is the guessing parameter for item i. In equation (2.9), the exponent of e is a linear function of the θ's plus the intercept term d, that is, a_iθ_j' + d_i. The addition of the θ's reflects the compensatory nature of the model. If c_i is assumed to be 0 for all the items, the M3PL model becomes the multidimensional two-parameter logistic (M2PL) model, which is defined as

P(u_{ij} = 1 \mid \boldsymbol{\theta}_j, \mathbf{a}_i, d_i) = \frac{e^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_i}}{1 + e^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_i}}.    (2.10)
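The compensatory nature of equation (2.10) is easy to see numerically: under an equal-loading item, an examinee two units above average on one dimension and two units below on the other has the same probability of success as an examinee at the origin. The parameter values in the sketch below are arbitrary.

```python
import numpy as np

def p_m2pl(theta, a, d):
    """Compensatory M2PL IRF, equation (2.10): the exponent is the linear combination a*theta' + d."""
    return 1 / (1 + np.exp(-(np.dot(a, theta) + d)))

a, d = np.array([1.0, 1.0]), 0.0
print(round(p_m2pl(np.array([1.0,  1.0]), a, d), 3))   # high on both dimensions: 0.881
print(round(p_m2pl(np.array([2.0, -2.0]), a, d), 3))   # strengths cancel: 0.5
print(round(p_m2pl(np.array([0.0,  0.0]), a, d), 3))   # same probability as at the origin: 0.5
```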
The multidimensional extension of the Rasch model is not simply the M2PL model with all the a-parameters set to 1.0, in the way that the Rasch model relates to the 2PL model in the UIRT case. The consequence of setting all the a-parameters to 1.0 is that a_iθ_j' + d_i becomes (θ_{j1} + θ_{j2} + ... + θ_{jm}) + d_i. Therefore, the M2PL model is reduced to a unidimensional Rasch model with θ = θ_{j1} + θ_{j2} + ... + θ_{jm}.

The multidimensional Rasch model in the current literature was proposed by Adams et al. (1997). The model they specified is for the general case that includes both dichotomously and polytomously scored test items. Reckase (2009) provides the dichotomous case of Adams's model, which is

P(u_{ij} = 1 \mid \boldsymbol{\theta}_j, \mathbf{a}_i, d_i) = \frac{e^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_i}}{1 + e^{\mathbf{a}_i\boldsymbol{\theta}_j' + d_i}}.    (2.11)

Equations (2.10) and (2.11) appear to be identical. The only difference between the two is the way that the a_i-vector is specified. In (2.10), a_i is a characteristic of item i that is estimated from the data. In (2.11), a_i is a characteristic of item i that is specified by the test developer. Adams et al. (1997) specified two variations for the model: between-item and within-item dimensionality. For between-item dimensionality, the a_i-vector has elements that are all zeros except for one element. For the two-dimensional case, an a_i-vector of [1 0] or [0 1] would indicate between-item dimensionality. The vector [1 0] specifies that the item is only affected by the ability level on dimension 1, and the vector [0 1] specifies that the item is only affected by the ability level on dimension 2. For within-item dimensionality, the a_i-vector has more than one nonzero element. A specification for within-item dimensionality might have a vector such as [1 1], indicating that the item is affected equally by both dimensions. In some literature (Reckase, 2009; Segall, 1996; Yao, 2013), the feature of between-item dimensionality is called simple structure, and within-item dimensionality is called non-simple structure.

For a compensatory MIRT model, in order to make the a_i- and d_i-parameters more meaningful, Reckase (1985) and Reckase and McKinley (1991) developed two statistics to interpret the characteristics of the test items: multidimensional discrimination (MDISC) and multidimensional difficulty (MDIFF). They are defined as

MDISC_i = \sqrt{\mathbf{a}_i\mathbf{a}_i'},    (2.12)

MDIFF_i = \frac{-d_i}{MDISC_i},    (2.13)

where the parameters are defined as before. MDISC_i is the slope of the item response surface at its steepest point and indicates the discriminating power of the item. MDIFF_i is the distance from the origin to the point of the steepest slope. It represents the multidimensional difficulty of the item: high values indicate difficult items and low values indicate easy items. Thus, the MDISC and MDIFF values for a MIRT model are analogous to the item discrimination and item difficulty values for a UIRT model.
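A short computation makes the two statistics concrete. The item parameters below are arbitrary; they simply show that a three-dimensional item with equal loadings of 1 and d = -1.2 has MDISC = sqrt(3) ≈ 1.73 and MDIFF ≈ 0.69.

```python
import numpy as np

def mdisc(a):
    """Multidimensional discrimination, equation (2.12)."""
    return float(np.sqrt(np.dot(a, a)))

def mdiff(a, d):
    """Multidimensional difficulty, equation (2.13)."""
    return -d / mdisc(a)

a, d = np.array([1.0, 1.0, 1.0]), -1.2           # arbitrary three-dimensional item
print("MDISC =", round(mdisc(a), 3))             # sqrt(3) ≈ 1.732
print("MDIFF =", round(mdiff(a, d), 3))          # 1.2 / 1.732 ≈ 0.693, a moderately hard item
```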
The concept of information used in UIRT can also be generalized to the multidimensional case. The definition of information for a MIRT model is the same as the definition for a UIRT model, except that the information for MIRT is an m × m matrix, denoted by I(θ). The {r, s} element of this matrix is denoted by I_{rs}(θ). For the M3PL model, the diagonal elements of I(θ) are (Segall, 1996)

I_{rr}(\boldsymbol{\theta}) = \sum_{i \in v} \frac{a_{ri}^2\,Q_i(\boldsymbol{\theta})\,[P_i(\boldsymbol{\theta}) - c_i][c_i P_i(\boldsymbol{\theta}) - P_i^2(\boldsymbol{\theta})]}{P_i^2(\boldsymbol{\theta})(1 - c_i)^2},    (2.14)

and the off-diagonal elements are

I_{rs}(\boldsymbol{\theta}) = \sum_{i \in v} \frac{a_{ri}\,a_{si}\,Q_i(\boldsymbol{\theta})\,[P_i(\boldsymbol{\theta}) - c_i][c_i P_i(\boldsymbol{\theta}) - P_i^2(\boldsymbol{\theta})]}{P_i^2(\boldsymbol{\theta})(1 - c_i)^2},    (2.15)

where a_{ri} is the r-th element of the a_i-vector for item i, a_{si} is the s-th element, and the other symbols are used as previously defined. For the M2PL and the multidimensional Rasch model, the information matrix for item i can be simplified to

\mathbf{I}_i(\boldsymbol{\theta}) = P_i(\boldsymbol{\theta})\,Q_i(\boldsymbol{\theta})
\begin{bmatrix} a_{i1}^2 & \cdots & a_{i1}a_{im} \\ \vdots & \ddots & \vdots \\ a_{i1}a_{im} & \cdots & a_{im}^2 \end{bmatrix}.    (2.16)

For the multidimensional Rasch model, the a_i'a_i matrix in (2.16) consists only of 0's and 1's.

2.3.2 Generalization of UCAT to MCAT

The merging of MIRT and CAT has been an intriguing direction to explore. When unidimensional algorithms are generalized to the multidimensional case, a great deal of complexity is added. Luecht (1996) pointed out that unlike a unidimensional CAT (UCAT), which merely tries to locate examinees on a latent ability scale, a multidimensional CAT (MCAT) must locate examinees on a plane or a hyperplane and administer items that minimize the joint estimation errors for those ability estimates. Although an MCAT is much more complicated than a UCAT, researchers (e.g., Segall, 1996; Wang & Chen, 2004; Yao, 2012; Mao, Luo & Zhou, 2013) have demonstrated that MCAT is worth the added complications, as MCAT often yields better measurement efficiency than UCAT.

To generalize UCAT to MCAT, Reckase (2009) suggested four basic components to be addressed: (1) item pool development, (2) item selection method implementation, (3) examinees' ability estimation, and (4) stopping rule determination. In practice, practical constraints (i.e., content balancing and item exposure control) are also important components for MCAT. Because the desired features of the item pool depend on the other four, the procedures for item selection are presented first in Section 2.3.3, followed by the ability estimation methods in Section 2.3.4, stopping rules in Section 2.3.5, and practical constraints in Section 2.3.6. The development of the multidimensional item pool is described in Chapter 3.

2.3.3 Item Selection Methods for Multidimensional CAT

Item selection is crucial for UCAT as well as for MCAT. If the selected items provide only little information for ability estimation, the adaptive test will not function well. Like the unidimensional item selection methods, the multidimensional methods are also based on maximizing or minimizing some criterion value at the current ability estimates. The maximum determinant of the Fisher information matrix (D-Optimality) method, the Bayesian D-Optimality method, and the maximum KL information method are introduced in this section.

The D-Optimality method, proposed by Segall (1996), can be considered the multidimensional extension of the maximum Fisher information method for UCAT. Suppose k-1 items have already been administered to an examinee and the k-th item is to be determined. The D-Optimality method selects the k-th item that maximizes the quantity

\left| I_{S_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}}) \right|,    (2.17)

where I_{S_{k-1}}(θ̂) is the sum of the information for the previous k-1 items, and I_{i_k}(θ̂) is the item information for the k-th candidate item, as defined in (2.14)-(2.16). Note that |·| in (2.17) denotes the determinant of a matrix. The process for selecting the next item is thus to identify the item whose information matrix, when added to the current test information matrix, results in the largest value for the determinant of the sum (Reckase, 2009).
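The selection rule in (2.17) can be sketched directly, here using the simplified item information in (2.16) for Rasch-type 0/1 loading vectors. The pool is randomly generated and the ability estimate is held fixed purely to show the selection mechanics; note that at the very first selection every candidate determinant is zero, because a single rank-one information matrix is singular, which is exactly the issue raised in the next paragraph.

```python
import numpy as np

def item_info(theta, a, d):
    """Item information matrix for the M2PL / multidimensional Rasch model, equation (2.16)."""
    p = 1 / (1 + np.exp(-(a @ theta + d)))
    return p * (1 - p) * np.outer(a, a)

def d_optimal_pick(theta_hat, pool_a, pool_d, administered):
    """Select the item whose information, added to the accumulated test information,
    maximizes the determinant of the sum, as in equation (2.17)."""
    m = len(theta_hat)
    test_info = np.zeros((m, m))
    for j in administered:
        test_info += item_info(theta_hat, pool_a[j], pool_d[j])
    best, best_det = None, -np.inf
    for i in range(len(pool_d)):
        if i in administered:
            continue
        det = np.linalg.det(test_info + item_info(theta_hat, pool_a[i], pool_d[i]))
        if det > best_det:
            best, best_det = i, det
    return best

rng = np.random.default_rng(4)
pool_a = rng.integers(0, 2, size=(40, 2)).astype(float)   # 0/1 loading vectors (simple or non-simple)
pool_a[pool_a.sum(axis=1) == 0] = 1.0
pool_d = rng.normal(0, 1, 40)

theta_hat, administered = np.zeros(2), []
for _ in range(5):
    administered.append(d_optimal_pick(theta_hat, pool_a, pool_d, administered))
print("first five D-optimal selections:", administered)
```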
Yao (2012) pointed out that the D-Optimality method has an undesirable quality: toward the beginning of the MCAT, the information matrix may not have full rank, so the quantity in (2.17) equals 0. However, this issue can be remedied by applying the Bayesian version of D-Optimality to the problem of item selection. The Bayesian D-Optimality method (Segall, 1996) takes a prior distribution into account. It selects the k-th item that maximizes

\left| I_{S_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}}) + \boldsymbol{\Phi}^{-1} \right|,    (2.18)

where Φ is the variance-covariance matrix of the examinees' multidimensional abilities (the prior) and Φ^{-1} is its inverse. For the first few items, the Bayesian D-Optimality method is expected to select different items than the D-Optimality method, but as the test length increases, the two methods should become similar.

The maximum KL information method for MCAT was first presented by Veldkamp and van der Linden (2002). This method is an extension of the Chang and Ying (1996) method for UCAT, which was designed to solve the issue of selecting proper items when ability is poorly estimated in the early stage of the test. When only one item is considered, the KL information is given by

K_i(\boldsymbol{\theta}, \boldsymbol{\theta}_0) = P_i(\boldsymbol{\theta}_0)\,\ln\!\left[\frac{P_i(\boldsymbol{\theta}_0)}{P_i(\boldsymbol{\theta})}\right] + \left[1 - P_i(\boldsymbol{\theta}_0)\right]\ln\!\left[\frac{1 - P_i(\boldsymbol{\theta}_0)}{1 - P_i(\boldsymbol{\theta})}\right].    (2.19)

The item selection rule presented by Veldkamp and van der Linden (2002) is to select the item that maximizes

K_i^{B}(\hat{\boldsymbol{\theta}}_{k-1}) = \int K_i(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}_{k-1})\,f(\boldsymbol{\theta} \mid u_1, \ldots, u_{k-1})\,d\boldsymbol{\theta},    (2.20)

where K_i^B(θ̂_{k-1}) is the Bayesian posterior expected information after k-1 items, and f(θ | u_1, ..., u_{k-1}) is the posterior density after k-1 items. The implementation of the maximum KL information method requires a very long CPU time because of the calculation of two integrals: the first is needed to evaluate f(θ | u_1, ..., u_{k-1}), and the second is shown in (2.20). For this reason, item selection based on KL information is not considered in this study.

2.3.4 Ability Estimation Methods for Multidimensional CAT

The ultimate goal of most MCATs is to estimate the multidimensional abilities of examinees. Assume an item has been selected using one of the item selection methods described in Section 2.3.3, and an examinee has provided a response to this item. An ability estimation method is then used to update the estimate of the examinee's ability. The two general classes of ability estimation methods for MCAT are maximum likelihood and Bayesian. These two methods are described in this section.

For the maximum likelihood estimation (MLE) method (Segall, 1996), the MIRT ability is estimated by finding the mode θ̂ that maximizes the likelihood function L(u | θ), i.e.,

\frac{\partial}{\partial\boldsymbol{\theta}} \ln L(\boldsymbol{u} \mid \boldsymbol{\theta}) = 0.    (2.21)

Using the Newton-Raphson method, suppose θ^{(j)} is the current approximation to the value that maximizes ln L(u | θ); then

\boldsymbol{\theta}^{(j+1)} = \boldsymbol{\theta}^{(j)} - \boldsymbol{\delta}^{(j)},    (2.22)

where δ^{(j)} is the m × 1 vector defined as

\boldsymbol{\delta}^{(j)} = \left[H(\boldsymbol{\theta}^{(j)})\right]^{-1} \frac{\partial}{\partial\boldsymbol{\theta}} \ln L\left(\boldsymbol{u} \mid \boldsymbol{\theta}^{(j)}\right).    (2.23)

The H(θ^{(j)}) in (2.23) is an m × m matrix of second derivatives evaluated at θ^{(j)}. The diagonal elements of H(θ) take the form

H_{rr}(\boldsymbol{\theta}) = \sum_{i \in v} \frac{a_{ri}^2\,Q_i(\boldsymbol{\theta})\,[P_i(\boldsymbol{\theta}) - c_i][c_i u_i - P_i^2(\boldsymbol{\theta})]}{P_i^2(\boldsymbol{\theta})(1 - c_i)^2},    (2.24)

and the off-diagonal elements are of the form

H_{rs}(\boldsymbol{\theta}) = \sum_{i \in v} \frac{a_{ri}\,a_{si}\,Q_i(\boldsymbol{\theta})\,[P_i(\boldsymbol{\theta}) - c_i][c_i u_i - P_i^2(\boldsymbol{\theta})]}{P_i^2(\boldsymbol{\theta})(1 - c_i)^2}.    (2.25)

The ∂/∂θ ln L(u | θ^{(j)}) in (2.23) is an m × 1 vector of partial derivatives of ln L(u | θ^{(j)}), with the r-th element defined as

\frac{\partial}{\partial\theta_r} \ln L(\boldsymbol{u} \mid \boldsymbol{\theta}) = \sum_{i \in v} \frac{a_{ri}\,[P_i(\boldsymbol{\theta}) - c_i][u_i - P_i(\boldsymbol{\theta})]}{(1 - c_i)\,P_i(\boldsymbol{\theta})}.    (2.26)
With all the terms in (2.23) defined, the Newton-Raphson update can be applied repeatedly to obtain \(\boldsymbol{\theta}^{(j+1)}\) until \(\boldsymbol{\delta}^{(j)}\) becomes sufficiently small.

Similar to the unidimensional MLE, the multidimensional MLE suffers from infinite estimates in the early stage of the MCAT (Diao, 2009; Reckase, 2009). Reckase (2009) also pointed out that the minimum number of test items needed to obtain finite estimates for a three-dimensional MCAT is three, but the actual number needed in an MCAT can be larger than that. To overcome this problem, a Bayesian procedure can be considered.

The Bayesian method (Segall, 1996) is similar to the MLE method, except that the function being maximized is the posterior, the product of the likelihood and the prior divided by the marginal probability of the responses:

\[
f(\boldsymbol{\theta}\mid\boldsymbol{u}) = \frac{L(\boldsymbol{u}\mid\boldsymbol{\theta})\, f(\boldsymbol{\theta})}{f(\boldsymbol{u})},
\tag{2.27}
\]

where \(L(\boldsymbol{u}\mid\boldsymbol{\theta})\) is the likelihood function, \(f(\boldsymbol{\theta})\) is the prior distribution of \(\boldsymbol{\theta}\), and \(f(\boldsymbol{u})\) is the marginal probability of \(\boldsymbol{u}\). Segall (1996) defined \(f(\boldsymbol{\theta})\) as a multivariate normal distribution with mean vector \(\boldsymbol{\mu}\) and variance-covariance matrix \(\boldsymbol{\Phi}\). Because Yao (2012) found that the mode of the posterior distribution (MAP) yields better precision and requires less computation time than the expectation of the posterior (EAP), only the MAP procedure is described in this study.

The mode of the posterior distribution is obtained by maximizing the natural logarithm of the posterior distribution, i.e., by solving

\[
\frac{\partial}{\partial\boldsymbol{\theta}} \ln f(\boldsymbol{\theta}\mid\boldsymbol{u}) = \boldsymbol{0},
\tag{2.28}
\]

where \(f(\boldsymbol{\theta}\mid\boldsymbol{u})\) is defined in (2.27). Because (2.28) has no explicit solution, an iterative numerical procedure such as the Newton-Raphson procedure must be used. If \(\boldsymbol{\theta}^{(j)}\) is the current approximation to the maximizer of \(\ln f(\boldsymbol{\theta}\mid\boldsymbol{u})\), then

\[
\boldsymbol{\theta}^{(j+1)} = \boldsymbol{\theta}^{(j)} - \boldsymbol{\delta}^{(j)},
\tag{2.29}
\]

where \(\boldsymbol{\delta}^{(j)}\) is the m×1 vector defined as

\[
\boldsymbol{\delta}^{(j)} = \bigl[J(\boldsymbol{\theta}^{(j)})\bigr]^{-1} \frac{\partial}{\partial\boldsymbol{\theta}} \ln f\bigl(\boldsymbol{\theta}^{(j)}\mid\boldsymbol{u}\bigr).
\tag{2.30}
\]

The \(J(\boldsymbol{\theta}^{(j)})\) in (2.30) is an m×m matrix of second derivatives evaluated at \(\boldsymbol{\theta}^{(j)}\). The diagonal elements of \(J(\boldsymbol{\theta})\) take the form

\[
J_{rr}(\boldsymbol{\theta}) = \sum_{i \in v} \frac{a_{ir}^2\, Q_i(\boldsymbol{\theta})\,[P_i(\boldsymbol{\theta}) - c_i]\,[c_i u_i - P_i^2(\boldsymbol{\theta})]}{P_i^2(\boldsymbol{\theta})\,(1 - c_i)^2} - \phi^{rr},
\tag{2.31}
\]

where \(\phi^{rr}\) is the r-th diagonal element of \(\boldsymbol{\Phi}^{-1}\). The off-diagonal elements of \(J(\boldsymbol{\theta})\) are of the form

\[
J_{rs}(\boldsymbol{\theta}) = \sum_{i \in v} \frac{a_{ir}\, a_{is}\, Q_i(\boldsymbol{\theta})\,[P_i(\boldsymbol{\theta}) - c_i]\,[c_i u_i - P_i^2(\boldsymbol{\theta})]}{P_i^2(\boldsymbol{\theta})\,(1 - c_i)^2} - \phi^{rs},
\tag{2.32}
\]

where \(\phi^{rs}\) is the (r, s) element of \(\boldsymbol{\Phi}^{-1}\). The \(\frac{\partial}{\partial\boldsymbol{\theta}} \ln f(\boldsymbol{\theta}^{(j)}\mid\boldsymbol{u})\) in (2.30) is an m×1 vector of partial derivatives of \(\ln f(\boldsymbol{\theta}^{(j)}\mid\boldsymbol{u})\), with the r-th element defined as

\[
\frac{\partial}{\partial\theta_r} \ln f(\boldsymbol{\theta}\mid\boldsymbol{u}) = \sum_{i \in v} \frac{a_{ir}\,[P_i(\boldsymbol{\theta}) - c_i]\,[u_i - P_i(\boldsymbol{\theta})]}{(1 - c_i)\,P_i(\boldsymbol{\theta})} - \left[\frac{\partial}{\partial\theta_r}(\boldsymbol{\theta} - \boldsymbol{\mu})'\right]\boldsymbol{\Phi}^{-1}(\boldsymbol{\theta} - \boldsymbol{\mu}),
\tag{2.33}
\]

where \(\frac{\partial}{\partial\theta_r}(\boldsymbol{\theta} - \boldsymbol{\mu})'\) denotes a 1×m vector with the r-th element equal to 1 and all other elements equal to 0. With all the terms in (2.29) defined, \(\boldsymbol{\theta}^{(j+1)}\) can be obtained repeatedly until \(\boldsymbol{\delta}^{(j)}\) becomes sufficiently small.
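As an illustration of the MAP updating just described, the following sketch specializes (2.29) through (2.33) to the multidimensional Rasch model (all \(c_i = 0\) and fixed a-parameters), the model used later in this study. It is a minimal hypothetical sketch, not the study's implementation; the function name, arguments, and convergence settings are assumptions.

```python
import numpy as np

def map_update(theta, a_mat, d_vec, responses, phi_inv, mu, n_iter=20, tol=1e-6):
    """One MAP estimate via Newton-Raphson for the multidimensional Rasch model
    (c_i = 0, fixed a-parameters), following the structure of (2.29)-(2.33)."""
    theta = theta.astype(float).copy()
    for _ in range(n_iter):
        z = a_mat @ theta + d_vec
        p = 1.0 / (1.0 + np.exp(-z))
        # Gradient: sum_i a_i (u_i - P_i) minus the prior term (2.33 with c_i = 0).
        grad = a_mat.T @ (responses - p) - phi_inv @ (theta - mu)
        # Hessian: -sum_i P_i Q_i a_i a_i' minus the prior precision (2.31)-(2.32).
        hess = -(a_mat.T * (p * (1.0 - p))) @ a_mat - phi_inv
        delta = np.linalg.solve(hess, grad)
        theta = theta - delta
        if np.max(np.abs(delta)) < tol:
            break
    return theta
```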
2.3.5 Stopping Rules for MCAT

The stopping rules for a UCAT fall into two groups: fixed length and variable length. The fixed-length stopping rule is very easy to adapt to an MCAT. Under the fixed-length rule, the total number of items to be administered to each examinee is pre-determined based on the purposes of the test and practical considerations. When that number of items has been administered, the CAT stops and the final ability estimate is computed. Because the fixed-length stopping rule is easy to implement, most MCATs in the research literature use fixed length as their stopping rule (e.g., Diao, 2009; Segall, 1996; Wang & Chang, 2011; Yao, 2012).

A variable-length CAT controls the test length using a statistical criterion. For example, in a UCAT, if the standard error of measurement for the θ estimate falls below a critical value, the test stops and the final θ estimate is reported. A variable-length CAT therefore administers a different number of items to different examinees. Yao (2013) proposed two stopping rules for MCAT, the standard error (SE) and the predicted standard error (PSE). Her results showed that the PSE yields slightly worse results than the SE, but with fewer items. A detailed description of these two methods can be found in Yao's paper.

2.3.6 Practical Constraints for MCAT

Content balancing and item exposure control are as important to MCAT as to UCAT. Among the numerous content balancing methods for UCAT, the shadow test approach was the first to be successfully implemented in MCAT, by Veldkamp and van der Linden (2002). Because the shadow test approach requires an existing master pool, it is not applicable in this study. The Maximum Priority Index (MPI) method is another content balancing method that has been implemented in MCAT by Frey, Cheng, and Seitz (2011), and it has also been used in Yao (2012) and Yao (2013). According to Yao (2012), the MPI index for item i is defined by

\[
PI_i = \prod_{l=1}^{D} f_{il}^{\,c_{il}},
\tag{2.34}
\]

where \(c_{il}\) is the (i, l) element of the J×D constraint matrix \(C\), indicating the loading information for item i on dimension l: if item i loads on dimension l, \(c_{il} = 1\); otherwise \(c_{il} = 0\). Suppose the percentage of items in each content area is fixed. Then \(f_{il}\) is defined by

\[
f_{il} = \frac{X_l - x_l}{X_l},
\tag{2.35}
\]

where \(X_l\) is the number of items that should be administered from dimension l and \(x_l\) is the number of such items selected so far. At the beginning, when no item has been selected from dimension l, \(f_{il}\) is 1, and it decreases as \(x_l\) increases. When \(x_l = X_l\), \(f_{il} = 0\) and no more items are selected from that dimension. The MPI is implemented by multiplying \(PI_i\) by the item selection criterion. For example, for the D-Optimality method, item \(i = k\) is selected if \(|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})| \cdot PI_k\) has the maximum value among all the items in the item pool.

The MPI method has also been used for item exposure control in MCAT in Yao (2012, 2013). Suppose the maximum exposure rate of item i is fixed at \(R_i\). At each selection step, let \(n_i\) be the number of examinees to whom item i has already been administered. Then the index for item exposure control is defined by

\[
f_{ie} = \frac{R_i - n_i/N}{R_i},
\tag{2.36}
\]

where N is the total number of examinees and \(n_i/N\) is the actual exposure rate of item i. This index ensures that no item is selected whose exposure rate exceeds the predefined rate \(R_i\). To implement the MPI for item exposure control, Yao (2012, 2013) multiplied the \(f_{ie}\) in (2.36) by the item selection criterion. The results show that the MPI can effectively control the item exposure rate and increase the item pool usage to 100% under several item selection methods, including the D-Optimality method. Although 100% item pool usage is desirable, that number is inflated by a misuse of the MPI.
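The priority-index bookkeeping in (2.34) through (2.36) can be sketched as follows. This is a hypothetical Python illustration, not code from this study; the function names and the example quotas are assumptions, and (2.34) is rendered as a product of the factors raised to the \(c_{il}\) powers.

```python
import numpy as np

def content_factor(X_quota, x_given):
    """f_il = (X_l - x_l) / X_l  (equation 2.35), per dimension."""
    return (X_quota - x_given) / X_quota

def exposure_factor(R_max, n_admin, N_examinees):
    """f_ie = (R_i - n_i/N) / R_i  (equation 2.36), per item."""
    return (R_max - n_admin / N_examinees) / R_max

def priority_index(c_row, f_values):
    """PI_i = prod_l f_il^{c_il}  (equation 2.34) for one item's loading row."""
    return np.prod(np.power(f_values, c_row))

# Example: an item loading on dimension 1 only, with 15 items required per
# dimension, 6 already given from dimension 1 and 8 from dimension 2.
c_row = np.array([1, 0])
f_vals = content_factor(np.array([15.0, 15.0]), np.array([6.0, 8.0]))
pi = priority_index(c_row, f_vals)   # = (9/15)^1 * (7/15)^0 = 0.6
# The selection criterion (e.g., the D-Optimality determinant) would then be
# multiplied by pi, and/or by the exposure factor, before picking the maximum.
```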
For a two-dimensional CAT, suppose that after 20 items have been administered, the \(|I_{s_{20}}(\hat{\boldsymbol{\theta}}) + I_{i_{21}}(\hat{\boldsymbol{\theta}})|\) values for all items in the item pool range from 3.39 to 3.82. If the \(f_{ie}\) in (2.36) for the best item is smaller than 0.88 (\(f_{ie} = .88\) implies \(n_i/N = .12R_i\), which is much smaller than \(R_i\)), then the weighted value \(|I_{s_{20}}(\hat{\boldsymbol{\theta}}) + I_{i_{21}}(\hat{\boldsymbol{\theta}})| \cdot f_{ie}\) for the item with the largest determinant is at most 3.82 × 0.88 ≈ 3.36, which is below 3.39, the smallest unweighted criterion value in the pool. That is to say, if \(f_{ie}\) is multiplied by the selection criterion of the D-Optimality method, the best item available in the pool will not be selected, even though its actual exposure rate is much smaller than the maximum rate \(R_i\).

The reason the MPI method functions properly in UCAT but not in MCAT lies in the item selection criterion. The minimum value of the Fisher information is close to 0, but the minimum value of \(|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})|\) is not 0. This issue can be solved by rescaling the item selection criterion and then multiplying \(f_{ie}\) by the rescaled criterion instead of the criterion itself.

In this study, a non-linear method is used to rescale the criterion of the D-Optimality method. First, a percentile rank is calculated for all the items available in the item pool, that is,

\[
\bigl|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})\bigr|_{RS} = \mathrm{Percentile}\Bigl(\bigl|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})\bigr|\Bigr),
\tag{2.37}
\]

where \(|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})|_{RS}\) denotes the rescaled criterion. Second, the item with the maximum value of \(|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})|_{RS} \cdot f_{ie}\) is selected. This item exposure control procedure is referred to as the "Modified MPI" in this study.

Yao (2012) also used the Sympson-Hetter (SH) method for item exposure control in an MCAT, but she did not recommend it because it is very time consuming to create the "exposure-control table," and the computation time increases exponentially with the number of dimensions. Therefore, the SH method is not considered for item exposure control in this study.

Chapter 3 p-Optimality Method and the Extension to MCAT

This chapter first introduces the concept of a p-optimal item pool in Section 3.1. Section 3.2 then presents the p-optimality method for describing an item pool and its application to item pool design using the unidimensional Rasch model. Finally, the extension of the method to MCAT item pool design based on the multidimensional Rasch model is discussed in detail in Section 3.3.

3.1 From Optimal Item Pool to p-Optimal Item Pool

Before introducing the details of the p-optimal item pool, it is important to define the optimal item pool. Reckase (2010) defined the best possible, or optimal, item pool as one in which, whenever the CAT item selection algorithm searches for a test item to administer, exactly the item that is desired is available. If a desired item is always available for every item selection, then the item pool can be considered optimal.

Suppose that a fixed-length UCAT is based on the unidimensional Rasch model and uses maximum Fisher information for item selection. For this type of UCAT, the maximum Fisher information method selects items whose difficulty parameter, \(b_i\), is exactly equal to the current ability estimate \(\hat{\theta}\). This is because the information function for the unidimensional Rasch model, \(I_i(\theta) = P_i(\theta)Q_i(\theta)\), reaches its maximum value of 0.25 when \(b_i = \theta_j\).
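As a quick numerical check of that statement, the following sketch (hypothetical Python, not part of the study) evaluates the Rasch information function at the maximizing point and at a point 0.65 units away; the latter value anticipates the 90%-of-maximum criterion discussed in the next section.

```python
import numpy as np

def rasch_info(theta, b):
    """Fisher information P*Q for the unidimensional Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

print(rasch_info(0.0, 0.0))           # 0.25, the maximum, reached when b = theta
print(rasch_info(0.65, 0.0))          # about 0.2254
print(rasch_info(0.65, 0.0) / 0.25)   # about 0.90, i.e., roughly 90% of the maximum
```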
An optimal item pool for this CAT procedure is one that always has an item in the pool with a b-parameter exactly equal to \(\hat{\theta}\), for every item selection and for every examinee. Because θ is a continuous variable with an infinite number of values on the θ scale, an item pool whose items exactly match every possible \(\hat{\theta}\) would have to consist of an infinite number of items.

To make the concept of the optimal item pool realistic for practical item pool design, the p-optimal item pool (Reckase, 2010) was introduced to approximate an optimal pool with a pool of smaller size and little loss of the specified characteristic (i.e., item information). Reckase (2010) referred to this item pool design method as the p-optimality method, and defined the p-optimal item pool as an item pool "that always has an item available for selection that p% matches the desired characteristics specified by the item selection routine for the CAT." The implementation of the p-optimality method in a UCAT based on the unidimensional Rasch model is described below.

3.2 p-Optimal Item Pool Design for UCAT

For the UCAT described above, a p-optimal item pool will always have an available item that can provide at least p% of the maximum Fisher information at the current θ estimate. Figure 3.1 shows the Fisher information function for a test item based on the unidimensional Rasch model. The horizontal scale is θ − b so that the results generalize to all values of θ. The information reaches its maximum value when θ − b = 0, that is, when θ = b. Instead of requiring items with maximum information to always be available in the item pool, it might be acceptable to relax the criterion to at least 90% of the maximum information. That is, instead of requiring items with \(b = \hat{\theta}\), an item with a b-parameter .65 units away from \(\hat{\theta}\) also meets the criterion (see Figure 3.1). Therefore, if an item pool always has an available item whose b-parameter is within .65 units of \(\hat{\theta}\), the item pool can be said to be .9-optimal, because the available item can provide at least 90% of the maximum possible information for ability estimation. This way of describing the design of an item pool is called p-optimal, for proportion of maximum optimality (Reckase, 2010).

Figure 3.1: Information Function for a Test Item Fit by the Unidimensional Rasch Model

Such a p-optimal pool is designed by the following steps:

1) Specify the characteristics of the CAT program, such as the IRT model, test length, item selection method, ability estimation method, and stopping rule. In the example here, the UCAT is based on the unidimensional Rasch model, selects items using the maximum Fisher information method, estimates ability by MLE, and has a test length fixed at 30 items.

2) Randomly sample an examinee from the target examinee population and generate the first optimal item. The optimal item is an item whose b-parameter equals the initial value of \(\hat{\theta}\) for this examinee.

3) Generate a response to this item based on this examinee's true θ. A random number is first drawn from the Uniform(0, 1) distribution. If the random number is less than the probability of this examinee answering the item correctly, a correct response is assigned; otherwise, an incorrect response is assigned.

4) Update \(\hat{\theta}\) using the MLE method based on the response generated in step 3.

5) Generate the next optimal item with a b-parameter equal to the updated \(\hat{\theta}\).
6) Repeat the process of response generation, ability estimation, and optimal item generation until the stopping rule is satisfied.

7) Classify all the generated optimal items into "item bins." Item bins are defined as intervals on the b-parameter scale. For a .9-optimal pool, the criterion is that the b-parameter is within .65 units of \(\hat{\theta}\); to meet this criterion, the width of the item bin is set to .65. In this case, the first item bin is centered on the zero point and ranges from -.325 to .325, and the remaining item bins are determined by stepping off in either direction.

8) Document the number of items in each item bin for this examinee.

9) Repeat steps 2 to 8 for another examinee. The union of the numbers of items in each bin forms the p-optimal pool for these two examinees (see Table 3.1). The union, rather than the sum, is taken because the items used for the first examinee can also be used for the second.

10) Repeat this process for a large number of examinees. The union of items across all the examinees is the end product of the p-optimal pool design.

The end product of the p-optimal item pool design is a bin-count table, which gives the number of items in each item bin. This bin-count table can be used as guidance for item creation: if items can be created to match the bin-count table, the item pool is deemed to be p-optimal. A more detailed description of this method can be found in Reckase (2010).

Table 3.1: The p-optimal pool for two examinees
Item bin     -3   -2.4  -1.8  -1.2  -0.6   0    0.6  1.2  1.8  2.4   3
Examinee 1    0    0    10    13     7     0    0    0    0    0     0
Examinee 2    0    0     0     9    15     6    0    0    0    0     0
Union         0    0    10    13    15     6    0    0    0    0     0
Note: the values in the first row are the central points of the item bins; the values in the remaining rows are the numbers of items in each item bin.

3.3 Extending the p-Optimality Method to MCAT

As discussed in Chapter 2, the desired features of an item pool depend on the item selection method, ability estimation method, stopping rule, and constraints such as content balancing and item exposure control. The p-optimality method for item pool design described above also depends on the choice of these methods. Therefore, the first step of extending the p-optimality method to MCAT is to determine the characteristics of the MCAT program. The psychometric model, item selection method, ability estimation method, stopping rule, and constraints for the MCAT considered in this study are defined below.

First, the multidimensional Rasch model defined by equation (2.11) serves as the psychometric model for the MCAT in this study. There are two reasons for choosing the multidimensional Rasch model. The first is that the idea of p-optimal item pool design was proposed for a UCAT based on the unidimensional Rasch model, so it is straightforward to choose the multidimensional Rasch model when extending this idea to MCAT for the first time. The second is that the multidimensional Rasch model is relatively simple compared with the M2PL and M3PL models defined by (2.9) and (2.10). Because the a-parameters are fixed in the multidimensional Rasch model, determining the optimal item is much easier than when the a-parameters are not fixed (Gu, 2007). For these two reasons, this study focuses only on p-optimal item pools based on the multidimensional Rasch model. Future studies can extend this method to more complex MIRT models.
Second, the D-optimality method (Segall, 1996) is used to select items in this study. The D-optimality method can be considered the multidimensional extension of maximum Fisher information for UCAT; hence, the method of optimal item generation can be extended to the multidimensional context in a fairly straightforward fashion. The p-optimal item pool design in this study is therefore based on the D-optimality item selection method only.

Third, the Bayesian MAP (Segall, 1996) is the ability estimation method for the MCAT in this study. The Bayesian MAP is used because Yao (2013) noted, in one of her unpublished manuscripts, that the Bayesian MAP yields better precision than the MLE and performs similarly to or better than the Bayesian EAP. In addition, because the Bayesian MAP solves the problem of infinite ability estimates early in an MCAT, it is adopted for the p-optimal item pool design in this study.

Fourth, the stopping rule in this study is the fixed-length rule. The variable-length stopping rule is not considered here for two reasons. First, Reckase (2010) demonstrated that the p-optimal item pool design for a fixed-length CAT can easily be modified for use in a variable-length CAT, so there is no need to describe both in this study. Second, the fixed-length rule is relatively easy to build into the p-optimal item pool design procedure.

Fifth, no content balancing constraint is implemented in this study. The reason is that the D-optimality item selection method itself balances the number of items administered from each dimension. In a two-dimensional MCAT, for example, if more items are selected from Dimension 1, there will be more information in the direction of \(\theta_1\) and less in the direction of \(\theta_2\); the D-optimality method will then select items from Dimension 2 until the information in the direction of \(\theta_2\) is no longer the smaller of the two. Therefore, when the test is completed, the numbers of items from each dimension are expected to be very similar even though no content balancing is implemented. In some operational testing programs, the number of items per content area is deliberately unequal because some content areas require more instructional time. In that situation content balancing is necessary and can be built into the p-optimal item pool design procedure; that situation, however, is not considered in this study.

Sixth, the p-optimal item pool design with and without item exposure control is compared in this study, to answer the third research question. The Modified Maximum Priority Index described in Chapter 2 is used for item exposure control.

In the following sections, the p-optimal item pool design for MCAT is first demonstrated on the simplest case: an MCAT measuring a two-dimensional ability, \((\theta_1, \theta_2)\), using items fit by the two-dimensional Rasch model with simple structure, and without item exposure control. For this specific MCAT, there are only two clusters of items in the item pool: items in Cluster 1 measure only \(\theta_1\), with \(\boldsymbol{a}_i = (1, 0)\), and items in Cluster 2 measure only \(\theta_2\), with \(\boldsymbol{a}_i = (0, 1)\). According to equation (2.11), the two-dimensional Rasch model can be written as

\[
P(\boldsymbol{\theta}) = \frac{e^{a_{1i}\theta_{1j} + a_{2i}\theta_{2j} + d_i}}{1 + e^{a_{1i}\theta_{1j} + a_{2i}\theta_{2j} + d_i}}.
\tag{3.1}
\]

The simple structure simplifies (3.1) into
\[
P(\boldsymbol{\theta}) =
\begin{cases}
P_1(\boldsymbol{\theta}) = \dfrac{e^{\theta_{1j} + d_i}}{1 + e^{\theta_{1j} + d_i}}, & \text{for items from Cluster 1} \\[2.5ex]
P_2(\boldsymbol{\theta}) = \dfrac{e^{\theta_{2j} + d_i}}{1 + e^{\theta_{2j} + d_i}}, & \text{for items from Cluster 2.}
\end{cases}
\tag{3.2}
\]

The method of optimal item generation, the extension of the item bins, and the interpretation of the p-optimal item pool for this specific MCAT are demonstrated in Sections 3.3.1 to 3.3.3. An example of the p-optimal item pool design for this MCAT is presented in Section 3.3.4, and the p-optimal item pool design for MCAT with exposure control is introduced in Section 3.3.5.

3.3.1 Optimal Item Generation

For the UCAT, the optimal item is the one that maximizes the information function at the current \(\hat{\theta}\). For the MCAT described above, according to equation (2.17), the k-th optimal item is the one that maximizes the quantity

\[
\bigl|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})\bigr|,
\tag{3.3}
\]

where \(I_{s_{k-1}}(\hat{\boldsymbol{\theta}})\) is the sum of the information matrices for the k-1 items that have been administered,

\[
I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) = I_{i_1}(\hat{\boldsymbol{\theta}}) + I_{i_2}(\hat{\boldsymbol{\theta}}) + \cdots + I_{i_{k-1}}(\hat{\boldsymbol{\theta}}),
\tag{3.4}
\]

and \(I_{i_k}(\hat{\boldsymbol{\theta}})\) is the information matrix for the k-th item to be administered. According to equation (2.16), \(I_{i_k}(\hat{\boldsymbol{\theta}})\) for the two-dimensional Rasch model with simple structure specializes to

\[
I_{i_k}(\hat{\boldsymbol{\theta}}) =
\begin{cases}
P_1 Q_1 \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} P_1 Q_1 & 0 \\ 0 & 0 \end{pmatrix}, & \text{for items from Cluster 1} \\[2.5ex]
P_2 Q_2 \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & P_2 Q_2 \end{pmatrix}, & \text{for items from Cluster 2,}
\end{cases}
\tag{3.5}
\]

where \(P_1\) and \(P_2\) are defined in equation (3.2), \(Q_1 = 1 - P_1\), and \(Q_2 = 1 - P_2\). Suppose that among the k-1 administered items, \(k_1\) are from Cluster 1 and \(k_2\) are from Cluster 2, where \(k_1 + k_2 = k - 1\). Substituting (3.5) into (3.4), we obtain

\[
I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) =
\begin{pmatrix}
\sum_{i=1}^{k_1} P_{1i}Q_{1i} & 0 \\
0 & \sum_{i=1}^{k_2} P_{2i}Q_{2i}
\end{pmatrix}.
\tag{3.6}
\]

Again, substituting (3.5) and (3.6) into (3.3), we obtain

\[
\bigl|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})\bigr| =
\begin{cases}
\left|\begin{pmatrix}
\sum_{i=1}^{k_1} P_{1i}Q_{1i} + P_{1i_k}Q_{1i_k} & 0 \\
0 & \sum_{i=1}^{k_2} P_{2i}Q_{2i}
\end{pmatrix}\right|, & \text{if the $k$-th item is from Cluster 1} \\[3.5ex]
\left|\begin{pmatrix}
\sum_{i=1}^{k_1} P_{1i}Q_{1i} & 0 \\
0 & \sum_{i=1}^{k_2} P_{2i}Q_{2i} + P_{2i_k}Q_{2i_k}
\end{pmatrix}\right|, & \text{if the $k$-th item is from Cluster 2.}
\end{cases}
\tag{3.7}
\]

Evaluating the determinants, (3.7) becomes

\[
\bigl|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})\bigr| =
\begin{cases}
\left(\sum_{i=1}^{k_1} P_{1i}Q_{1i}\right)\left(\sum_{i=1}^{k_2} P_{2i}Q_{2i}\right) + \left(\sum_{i=1}^{k_2} P_{2i}Q_{2i}\right) P_{1i_k}Q_{1i_k}, & \text{if the $k$-th item is from Cluster 1} \\[2ex]
\left(\sum_{i=1}^{k_1} P_{1i}Q_{1i}\right)\left(\sum_{i=1}^{k_2} P_{2i}Q_{2i}\right) + \left(\sum_{i=1}^{k_1} P_{1i}Q_{1i}\right) P_{2i_k}Q_{2i_k}, & \text{if the $k$-th item is from Cluster 2.}
\end{cases}
\tag{3.8}
\]

Because \(\sum_{i=1}^{k_1} P_{1i}Q_{1i}\) and \(\sum_{i=1}^{k_2} P_{2i}Q_{2i}\) are constant across all potential k-th items, maximizing \(|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})|\) is equivalent to maximizing \(P_{1i_k}Q_{1i_k}\) or \(P_{2i_k}Q_{2i_k}\). Based on the two-dimensional Rasch model with simple structure defined in equation (3.2), \(P_{1i_k}Q_{1i_k}\) is maximized when \(P_{1i_k} = 0.5\), that is, when \(-d_i = \theta_{1j}\). Similarly, \(P_{2i_k}Q_{2i_k}\) is maximized when \(-d_i = \theta_{2j}\). Therefore, the optimal candidate for the k-th item is either an item from Cluster 1 with \(-d_i = \theta_{1j}\) or an item from Cluster 2 with \(-d_i = \theta_{2j}\). To determine which one is truly optimal, it is only necessary to compare \(\sum_{i=1}^{k_2} P_{2i}Q_{2i}\) with \(\sum_{i=1}^{k_1} P_{1i}Q_{1i}\), because the first term in (3.8) is the same in both cases. If the following inequality holds,

\[
\sum_{i=1}^{k_2} P_{2i}Q_{2i} > \sum_{i=1}^{k_1} P_{1i}Q_{1i},
\tag{3.9}
\]

the optimal item is from Cluster 1, with \(\boldsymbol{a}_i = (1, 0)\) and \(-d_i = \theta_{1j}\).
If instead

\[
\sum_{i=1}^{k_2} P_{2i}Q_{2i} < \sum_{i=1}^{k_1} P_{1i}Q_{1i},
\tag{3.10}
\]

the optimal item is from Cluster 2, with \(\boldsymbol{a}_i = (0, 1)\) and \(-d_i = \theta_{2j}\). If the two terms are equal, the optimal item is picked at random. In other words, after k-1 items have been administered, if the test information in the direction of Dimension 1 is smaller, the k-th optimal item is an item measuring Dimension 1 with \(-d_i = \hat{\theta}_1\); if the test information in the direction of Dimension 2 is smaller, the k-th optimal item measures Dimension 2 with \(-d_i = \hat{\theta}_2\). The information from the previously administered items determines which cluster the optimal item comes from, and the current θ estimate determines its d-parameter.

3.3.2 Interpretation for the "p-Optimal"

For a unidimensional .9-optimal item pool, items that can provide at least 90% of the maximum possible Fisher information are always available for selection. For the unidimensional Rasch model, because the Fisher information is P*Q, ".9-optimal" means the selected item yields at least 90% of the maximum possible value of P*Q.

For the MCAT in this study, the item selection method is D-optimality. Suppose the D-optimality method selects the k-th optimal item from Cluster 1. Compared with the other items, this item has the maximum value of \(|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})| = \bigl(\sum_{i=1}^{k_1} P_{1i}Q_{1i}\bigr)\bigl(\sum_{i=1}^{k_2} P_{2i}Q_{2i}\bigr) + \bigl(\sum_{i=1}^{k_2} P_{2i}Q_{2i}\bigr)P_{1i_k}Q_{1i_k}\). Factoring out \(\sum_{i=1}^{k_2} P_{2i}Q_{2i}\), this becomes

\[
\bigl|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})\bigr| = \left(\sum_{i=1}^{k_2} P_{2i}Q_{2i}\right)\left[\left(\sum_{i=1}^{k_1} P_{1i}Q_{1i}\right) + P_{1i_k}Q_{1i_k}\right].
\tag{3.11}
\]

Here, because the maximized determinant is not simply P*Q, achieving 90% of the maximum value of \(P_{1i_k}Q_{1i_k}\) is no longer equivalent to achieving 90% of the maximum determinant. Therefore, a ".9-optimal" item pool no longer implies that items providing at least 90% of the maximum determinant of \(I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})\) are always available.

In fact, the \(\sum_{i=1}^{k_1} P_{1i}Q_{1i}\) in (3.11) is the sum of the information for all administered Cluster 1 items in the direction of \(\theta_1\), and \(P_{1i_k}Q_{1i_k}\) is the information of the k-th item in the direction of \(\theta_1\); the same interpretation applies to items in Cluster 2. As noted above, the D-optimality method selects the item that adds the maximum information to the current test information in the direction with the minimum information. If the item pool is .9-optimal, the selected item is the one that adds at least 90% of the maximum possible information to the current test information in that direction. Therefore, the interpretation of a "p-optimal" item pool in MCAT is that items that can add at least a p-proportion of the maximum possible information to the current test information, in the direction of minimum information, are always available in the item pool.

3.3.3 Extending the "bin" concept

In UCAT, item bins are created by dividing the b-parameter scale into intervals. These item bins are referred to as "b-bins," since they are defined on the b-parameter scale. As mentioned in Chapter 2, the d-parameter in an MIRT model is an intercept term related to both item difficulty and item discrimination. The item difficulty measure in MIRT models is the MDIFF, whose value has the same interpretation as the b-parameter in UIRT models. Therefore, the "MDIFF-bin," rather than a "d-bin," is used for the optimal item pool design.
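Before turning to bin sizes, the optimal-item generation rule of Section 3.3.1 can be summarized in a short sketch that also returns the MDIFF value used for binning. This is a hypothetical Python illustration, not the study's code; the function name and the input representation (the accumulated information sums on each dimension) are assumptions of the example.

```python
import numpy as np

def generate_optimal_item(theta_hat, info_dim1, info_dim2):
    """Optimal item for the two-dimensional Rasch model with simple structure:
    pick the dimension with less accumulated information (3.9)-(3.10) and set
    -d equal to the current ability estimate on that dimension, so the new
    item's P*Q equals the maximum value of 0.25."""
    if info_dim1 < info_dim2:
        a, d = np.array([1.0, 0.0]), -theta_hat[0]   # Cluster 1 item
    elif info_dim1 > info_dim2:
        a, d = np.array([0.0, 1.0]), -theta_hat[1]   # Cluster 2 item
    else:                                            # tie: pick a cluster at random
        a, d = (np.array([1.0, 0.0]), -theta_hat[0]) if np.random.rand() < 0.5 \
               else (np.array([0.0, 1.0]), -theta_hat[1])
    mdiff = -d   # for simple structure, MDIFF equals -d
    return a, d, mdiff
```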
For the two-dimensional Rasch model with simple structure defined in (3.2), the item response function (IRF) for items from Cluster 1 is the same as the IRF for the unidimensional Rasch model with \(\theta = \theta_1\); similarly, the IRF for items in Cluster 2 is the same as the IRF for the unidimensional Rasch model with \(\theta = \theta_2\). Therefore, Figure 3.1 can also be used here to determine the size of the MDIFF-bin. For a .9-optimal item pool, if an item from Cluster 1 is selected, its d-parameter should be within .65 units of the negative of the current \(\theta_1\) estimate; if an item from Cluster 2 is selected, its d-parameter should be within .65 units of \(-\hat{\theta}_2\). The width of the interval on the d-parameter scale is therefore .65, and because MDIFF equals \(-d_i\) for the two-dimensional Rasch model with simple structure, the MDIFF-bin size is also .65. For a .86-optimal pool, the interval on the d-parameter scale is 0.8, so the MDIFF-bin size is 0.8; for a .96-optimal pool, the MDIFF-bin size is 0.4.

3.3.4 An example of the p-optimal item pool design for MCAT

For the MCAT described above, items are fit by the two-dimensional Rasch model with simple structure and the test length is fixed at 30. Suppose two examinees have taken this MCAT, with true abilities (0.7, 1.5) and (-1.1, -1.0), respectively. For each examinee, the first item is randomly chosen either from Cluster 1 with \(\boldsymbol{a}_i = (1, 0)\) and \(-d_i\) exactly equal to the starting value of \(\theta_1\), or from Cluster 2 with \(\boldsymbol{a}_i = (0, 1)\) and \(-d_i\) exactly equal to the starting value of \(\theta_2\). A response to this item is then generated using the two-dimensional Rasch model, and \(\hat{\boldsymbol{\theta}}\) is updated with the Bayesian MAP method. The process of selecting the next item is:

1) Form two candidate items: one from Cluster 1 with \(\boldsymbol{a}_i = (1, 0)\) and \(-d_i\) exactly equal to \(\hat{\theta}_1\), and one from Cluster 2 with \(\boldsymbol{a}_i = (0, 1)\) and \(-d_i\) exactly equal to \(\hat{\theta}_2\).

2) Compute the value of \(|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})|\) for the two candidates.

3) The optimal item is the one with the larger value of \(|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})|\).

The simulation continues until the test length reaches 30 items. The distributions of the MDIFF values of the administered items for these two examinees are shown in Figure 3.2 and Figure 3.3, respectively; the distributions use an MDIFF-bin width of 0.6 to tally the number of items required in each bin. For both examinees, 15 items come from Cluster 1 and 15 from Cluster 2.

Figure 3.2: Item distributions for the examinee with true ability (0.7, 1.5)
Figure 3.3: Item distributions for the examinee with true ability (-1.1, -1.0)

A comparison of the two distributions shows that the items selected for these two examinees have some items in common, meaning the second examinee can reuse items that were administered to the first examinee. Therefore, rather than needing 30 + 30 = 60 items, the p-optimal item pool for these two examinees requires only 56 items, the number of items in the union of the two sets. Figure 3.4 displays the distribution of the 56 items; half are from Cluster 1 and half from Cluster 2.

Figure 3.4: Item distributions for the two examinees

When a third examinee takes the test, the set of items required for that examinee can be determined in the same way, and the size and distribution of the p-optimal item pool can be updated by taking the union of the items for the three examinees. This process continues until the number of items no longer increases. Figure 3.5 illustrates how the required item pool increases in size as the number of examinees increases.

Figure 3.5: Increase in required pool size as the number of examinees increases
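The union-of-items bookkeeping described in this example can be sketched as follows. This is a hypothetical Python illustration, not the study's code; the data structure (a counter keyed by cluster and MDIFF-bin center) is an assumption, and the toy numbers simply reuse bin counts from Table 3.1 for illustration.

```python
from collections import Counter

def pool_union(bin_counts_per_examinee):
    """The pool needs, in each (cluster, bin) cell, as many items as the most
    demanding single examinee required, so the running pool is the cell-wise
    maximum over examinees."""
    pool = Counter()
    sizes = []
    for counts in bin_counts_per_examinee:   # one Counter per simulated examinee
        for cell, n in counts.items():       # cell = (cluster label, bin center)
            pool[cell] = max(pool[cell], n)
        sizes.append(sum(pool.values()))     # pool size after adding this examinee
    return pool, sizes

# Toy illustration with two examinees and a single cluster:
e1 = Counter({("a=(1,0)", -1.8): 10, ("a=(1,0)", -1.2): 13, ("a=(1,0)", -0.6): 7})
e2 = Counter({("a=(1,0)", -1.2): 9, ("a=(1,0)", -0.6): 15, ("a=(1,0)", 0.0): 6})
pool, sizes = pool_union([e1, e2])
# pool holds 10, 13, 15, and 6 items in the four bins, as in the union row of Table 3.1.
```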
For the example given here, the item pool size reaches an asymptote of 340 items after about 3,000 examinees.

As in the UCAT case, the end product of the p-optimal item pool design for MCAT is a bin-count table, which gives the number of items in each MDIFF-bin for each dimension. The actual p-optimal item pool used for test operation can then be created from this bin-count table.

3.3.5 p-Optimal Item Pool Design for MCAT with Exposure Control

If no item exposure control is implemented, the union of the optimal items over a large number of examinees is the blueprint for developing the operational p-optimal item pool. If item exposure control is implemented in the adaptive test, a post-simulation adjustment (Gu, 2007) is applied after the p-optimal item pool design process to make sure there are sufficient items in the bins from which items are selected most often.

This study sets a maximum item exposure rate, R, for all items in the item pool. The item exposure rate is the number of times an item is administered divided by the total number of examinees. During the p-optimal item pool design process the actual exposure rate of each item is not available, but the number of administrations of items from each MDIFF-bin can be documented. Suppose N is the total number of examinees used in the p-optimal item pool design process, \(m_j\) is the number of items in the j-th MDIFF-bin, and \(s_j\) is the number of times an item from the j-th MDIFF-bin was administered. The expected item exposure rate, \(\bar{r}_j\), for each item in the j-th MDIFF-bin is

\[
\bar{r}_j = \frac{s_j / m_j}{N}.
\tag{3.12}
\]

Compare \(\bar{r}_j\) with R for j = 1, 2, …, J, where J is the total number of MDIFF-bins. If \(\bar{r}_j\) is smaller than R, the number of items in the j-th MDIFF-bin is sufficient and no post-simulation adjustment is necessary. If \(\bar{r}_j\) is larger than R, the number of items in the j-th MDIFF-bin is insufficient and an adjustment is needed.

To ensure \(\bar{r}_j \le R\), the predicted number of items needed in the j-th MDIFF-bin, \(\hat{m}_j\), is calculated as

\[
\hat{m}_j = \frac{s_j}{R\,N'},
\tag{3.13}
\]

where \(N'\) is the total number of examinees who will take the MCAT. The post-simulation adjustment replaces \(m_j\) with \(\hat{m}_j\) for all MDIFF-bins with \(\bar{r}_j > R\). In other words, it sets the number of items in the j-th MDIFF-bin to

\[
M_j = \max\{m_j, \hat{m}_j\}.
\tag{3.14}
\]

If \(M_j\) is not an integer, it is rounded up to the next integer.
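The post-simulation adjustment in (3.12) through (3.14) amounts to a few arithmetic steps per bin, sketched below. This is a hypothetical Python illustration, not the study's code; the function name and the example numbers are assumptions.

```python
import math

def adjust_bin_counts(m, s, R, N, N_prime):
    """Post-simulation adjustment: m[j] = items designed into bin j, s[j] =
    administrations from bin j during the design simulation with N examinees,
    R = maximum exposure rate, N_prime = number of operational examinees."""
    adjusted = []
    for m_j, s_j in zip(m, s):
        r_bar = (s_j / m_j) / N          # expected per-item exposure rate (3.12)
        if r_bar > R:                    # bin too small: predict the needed count (3.13)
            m_hat = s_j / (R * N_prime)
            m_j = max(m_j, m_hat)        # (3.14)
        adjusted.append(math.ceil(m_j))  # round up to the next integer
    return adjusted

# Example: a bin of 6 items administered 4,500 times over N = 3,000 design
# examinees has r_bar = 0.25 > R = 0.2, so it is enlarged.
print(adjust_bin_counts([6], [4500], R=0.2, N=3000, N_prime=3000))  # -> [8]
```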
3.3.6 p-Optimal Item Pool Design for MCAT with Non-Simple Structure

Suppose a third cluster of items that measures \(\theta_1\) and \(\theta_2\) equally, with \(\boldsymbol{a}_i = (1, 1)\), is added to the MCAT described above. This MCAT then has a non-simple structure, and the two-dimensional Rasch model can be written as

\[
P(\boldsymbol{\theta}) =
\begin{cases}
P_1(\boldsymbol{\theta}) = \dfrac{e^{\theta_{1j} + d_i}}{1 + e^{\theta_{1j} + d_i}}, & \text{for items from Cluster 1} \\[2.5ex]
P_2(\boldsymbol{\theta}) = \dfrac{e^{\theta_{2j} + d_i}}{1 + e^{\theta_{2j} + d_i}}, & \text{for items from Cluster 2} \\[2.5ex]
P_3(\boldsymbol{\theta}) = \dfrac{e^{\theta_{1j} + \theta_{2j} + d_i}}{1 + e^{\theta_{1j} + \theta_{2j} + d_i}}, & \text{for items from Cluster 3,}
\end{cases}
\tag{3.15}
\]

and \(I_{i_k}(\hat{\boldsymbol{\theta}})\) can be specified as

\[
I_{i_k}(\hat{\boldsymbol{\theta}}) =
\begin{cases}
P_1 Q_1 \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} P_1 Q_1 & 0 \\ 0 & 0 \end{pmatrix}, & \text{for items from Cluster 1} \\[2.5ex]
P_2 Q_2 \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & P_2 Q_2 \end{pmatrix}, & \text{for items from Cluster 2} \\[2.5ex]
P_3 Q_3 \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} P_3 Q_3 & P_3 Q_3 \\ P_3 Q_3 & P_3 Q_3 \end{pmatrix}, & \text{for items from Cluster 3.}
\end{cases}
\tag{3.16}
\]

Suppose that among the k-1 administered items, \(k_1\) are from Cluster 1, \(k_2\) from Cluster 2, and \(k_3\) from Cluster 3, where \(k_1 + k_2 + k_3 = k - 1\). To simplify the notation, let \(A = \sum_{i=1}^{k_1} P_{1i}Q_{1i}\), \(B = \sum_{i=1}^{k_2} P_{2i}Q_{2i}\), and \(C = \sum_{i=1}^{k_3} P_{3i}Q_{3i}\) denote the information accumulated from the administered Cluster 1, 2, and 3 items, respectively. Substituting (3.16) into (3.4), we obtain

\[
I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) =
\begin{pmatrix}
A + C & C \\
C & B + C
\end{pmatrix}.
\tag{3.17}
\]

Note that the off-diagonal elements of \(I_{s_{k-1}}(\hat{\boldsymbol{\theta}})\) are no longer zero because of the Cluster 3 items. Substituting (3.16) and (3.17) into (3.3) and expanding the determinant, we obtain

\[
\bigl|I_{s_{k-1}}(\hat{\boldsymbol{\theta}}) + I_{i_k}(\hat{\boldsymbol{\theta}})\bigr| =
\begin{cases}
AB + AC + BC + (B + C)\,P_{1i_k}Q_{1i_k}, & \text{if the $k$-th item is from Cluster 1} \\[1ex]
AB + AC + BC + (A + C)\,P_{2i_k}Q_{2i_k}, & \text{if the $k$-th item is from Cluster 2} \\[1ex]
AB + AC + BC + (A + B)\,P_{3i_k}Q_{3i_k}, & \text{if the $k$-th item is from Cluster 3.}
\end{cases}
\tag{3.18}
\]

To determine the optimal item in this case, the amounts of information in three directions must be compared: 1) \(A\) is the amount of information in the direction of \(\theta_1\); 2) \(B\) is the amount of information in the direction of \(\theta_2\); and 3) \(C\) is the amount of information in the direction of the 45-degree line (see Figure 3.6). Directions 1, 2, and 3 in Figure 3.6 are the directions best measured by items from Clusters 1, 2, and 3, respectively, that is, the directions with the maximum discriminating power. If the information in the direction of \(\theta_1\) is the smallest (i.e., \(B + C\) is larger than \(A + C\) and \(A + B\)), the optimal item is from Cluster 1 with \(\boldsymbol{a}_i = (1, 0)\) and \(-d_i = \theta_{1j}\). If the information in the direction of \(\theta_2\) is the smallest (i.e., \(A + C\) is the largest), the optimal item is from Cluster 2 with \(\boldsymbol{a}_i = (0, 1)\) and \(-d_i = \theta_{2j}\). If the information in the direction of the 45-degree line is the smallest (i.e., \(A + B\) is the largest), the optimal item is from Cluster 3 with \(\boldsymbol{a}_i = (1, 1)\) and \(-d_i = \theta_{1j} + \theta_{2j}\). If the three terms are equal, the optimal item is picked at random.

Figure 3.6: The test information on three directions
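The three-cluster decision rule just described can be summarized in a short sketch. It is a hypothetical Python illustration, not the study's code; the function names and the use of the shorthand A, B, C for the three information sums are assumptions of the example.

```python
import numpy as np

def choose_cluster(A, B, C):
    """The added term in (3.18) for a candidate from Cluster 1/2/3 is proportional
    to (B+C), (A+C), and (A+B), so the optimal item comes from the cluster whose
    multiplier is largest, i.e., the direction whose own information is smallest."""
    multipliers = np.array([B + C, A + C, A + B])
    best = np.flatnonzero(multipliers == multipliers.max())
    return int(np.random.choice(best)) + 1   # 1, 2, or 3; ties broken at random

def optimal_item(theta_hat, A, B, C):
    """Return (a-vector, d) for the optimal item under the rule above."""
    cluster = choose_cluster(A, B, C)
    if cluster == 1:
        return np.array([1.0, 0.0]), -theta_hat[0]
    if cluster == 2:
        return np.array([0.0, 1.0]), -theta_hat[1]
    return np.array([1.0, 1.0]), -(theta_hat[0] + theta_hat[1])
```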
Because the d-parameter of an optimal item from Cluster 3 equals \(-(\theta_{1j} + \theta_{2j})\), the d-parameter scale for Cluster 3 items differs from that for Clusters 1 and 2: a two-unit distance on the d-parameter scale for Cluster 3 items corresponds to a one-unit distance on the d-parameter scale for Cluster 1 and 2 items. Therefore, to meet the criterion of the p-optimal item pool, the width of the d-bin for Cluster 3 items should be twice the width used for Clusters 1 and 2. Because this study adopts the MDIFF-bin rather than the d-bin, and \(\mathrm{MDIFF} = -d_i/\sqrt{2}\) for Cluster 3 items, the width of the MDIFF-bin for Cluster 3 items is \(\sqrt{2}\) times the width of the MDIFF-bin for Cluster 1 and 2 items. For the .9-optimal item pool, the MDIFF-bin width for Clusters 1 and 2 is 0.65, and for Cluster 3 it is \(\sqrt{2} \times 0.65 \approx 0.92\). For MCATs of higher dimensionality, the MDIFF-bin width for items measuring more than one dimension can be determined in the same way.

Chapter 4 Study Design and Procedures

This chapter first defines the algorithms for the multidimensional computerized adaptive testing (MCAT) in Section 4.1. Section 4.2 then describes the simulation study used to compare the p-optimal item pools with item pools existing in the literature. The criteria for item pool comparison are introduced in Section 4.3.

4.1 MCAT Algorithms

The MCAT in this study is based on the multidimensional Rasch model defined by (2.11). Three test specifications are considered:

• Test Specification 1: two-dimension simple structure. The item pool consists of two clusters of items: items from Cluster 1 with \(\boldsymbol{a}_i = (1, 0)\) measure only \(\theta_1\), and items from Cluster 2 with \(\boldsymbol{a}_i = (0, 1)\) measure only \(\theta_2\).

• Test Specification 2: three-dimension simple structure. The item pool consists of three clusters of items: items from Cluster 1 with \(\boldsymbol{a}_i = (1, 0, 0)\) measure only \(\theta_1\), items from Cluster 2 with \(\boldsymbol{a}_i = (0, 1, 0)\) measure only \(\theta_2\), and items from Cluster 3 with \(\boldsymbol{a}_i = (0, 0, 1)\) measure only \(\theta_3\).

• Test Specification 3: three-dimension non-simple structure. The item pool consists of four clusters of items: items from Cluster 1 with \(\boldsymbol{a}_i = (1, 0, 0)\) measure only \(\theta_1\), items from Cluster 2 with \(\boldsymbol{a}_i = (0, 1, 0)\) measure only \(\theta_2\), items from Cluster 3 with \(\boldsymbol{a}_i = (0, 0, 1)\) measure only \(\theta_3\), and items from Cluster 4 with \(\boldsymbol{a}_i = (1, 1, 1)\) measure \(\theta_1\), \(\theta_2\), and \(\theta_3\) equally.

For the MCAT simulation in this study, items are selected by the D-optimality method, θ is estimated using the Bayesian MAP method, and the test length is fixed at 30 items. The prior for the Bayesian MAP is the multivariate normal distribution of the true θ (Segall, 1996). MCATs with and without item exposure control are both considered. For the MCAT with exposure control, the maximum item exposure rate is fixed at 0.2, and the Modified MPI method is used to keep the exposure rate of every item in the pool below 0.2. Detailed descriptions of the D-optimality, Bayesian MAP, and Modified MPI methods can be found in Chapters 2 and 3.

4.2 Simulation Procedure

This study is carried out in four major phases. In the first phase, a p-optimal item pool for each test specification is designed and its bin-count table is created. In the second phase, the actual p-optimal item pools are developed from the bin-count tables created in the previous phase. In the third phase, a baseline pool for each test specification is developed for comparison purposes. In the fourth phase, a simulation study is carried out to evaluate the performance of the MCAT using a p-optimal item pool against the MCAT using a baseline pool.
Phase I. P-Optimal Item Pool Design

Based on the test specifications and adaptive algorithms described in Section 4.1, p-optimal item pools are designed to guarantee that every item requested by the item selection rule is available for administration. As described in Chapter 3, the design of the p-optimal item pools should also be based on the characteristics of the target examinee population. In this study, the examinee population of the CAT-ASVAB in Segall (1996) is adopted to design the p-optimal item pools. The CAT-ASVAB measures nine content areas, and each content area is treated as one dimension; the correlations among the nine dimensions range from 0.2 to 0.9. Because the MCAT in this study measures only a two- or three-dimensional ability, two or three content areas from the CAT-ASVAB are selected for use. To investigate how the correlation among dimensions affects the p-optimal item pool design, both moderately correlated and highly correlated content areas are selected; the low-correlation condition is not considered because it is rare in educational assessments. For the moderate-correlation condition, the three content areas are Arithmetic Reasoning (AR), Word Knowledge (WK), and Electronics Information (EI). For the high-correlation condition, the three content areas are General Science (GS), Word Knowledge (WK), and Paragraph Comprehension (PC). The ability mean vectors and variance-covariance matrices for these content areas are shown in Table 4.1.

To design the p-optimal item pool for each condition, 3,000 examinees were randomly sampled from the multivariate normal distribution with the mean vector and variance-covariance matrix described in Table 4.1. A sample of 3,000 is used because, as shown in Figure 3.5, the size of the p-optimal item pool reaches its asymptote after about 3,000 examinees. For each examinee, all items administered in each cluster were allocated to MDIFF-bins. Two bin sizes, .4 and .8, corresponding to .96- and .86-optimal pools respectively, were considered. In total, 24 p-optimal item pools (3 test specifications × 2 correlations × 2 bin sizes × with or without exposure control) are designed in this study. To reduce sampling error, 100 replications were conducted, and the final bin-count table for each p-optimal item pool is the average over its 100 replications. Table 4.2 shows the bin-count table for the .96-optimal item pool for the MCAT with three-dimension non-simple structure, moderate correlation among dimensions, and no exposure control, and Table 4.3 shows the bin-count table for the .86-optimal item pool for the same MCAT. There are 17 MDIFF-bins for the .96-optimal item pool and 9 MDIFF-bins for the .86-optimal item pool.

Table 4.1: Mean and covariance matrix for the two examinee populations
                               Moderate Correlation                      High Correlation
                               2-dimension      3-dimension              2-dimension      3-dimension
Dimensions                     AR and WK        AR, WK, and EI           GS and WK        GS, WK, and PC
Mean Vector                    [0, 0]           [0, 0, 0]                [0, 0]           [0, 0, 0]
Variance-Covariance Matrix     [1    .61]       [1    .61   .64]         [1    .91]       [1    .91   .81]
                               [.61  1  ]       [.61  1     .72]         [.91  1  ]       [.91  1     .88]
                                                [.64  .72   1  ]                          [.81  .88   1  ]
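Drawing the Phase I design sample amounts to sampling from the multivariate normal distributions in Table 4.1. The sketch below is a hypothetical Python illustration, not the study's code; it uses the moderate-correlation, three-dimension values from Table 4.1, and the random seed is an arbitrary assumption.

```python
import numpy as np

# Moderate-correlation, three-dimension population (AR, WK, EI) from Table 4.1.
mean = np.zeros(3)
cov = np.array([[1.00, 0.61, 0.64],
                [0.61, 1.00, 0.72],
                [0.64, 0.72, 1.00]])

rng = np.random.default_rng(seed=1)   # arbitrary seed, for reproducibility only
thetas = rng.multivariate_normal(mean, cov, size=3000)
# Each row of `thetas` is one simulated examinee's true (theta_1, theta_2, theta_3),
# which drives one run of the optimal-item generation procedure of Chapter 3.
```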
Table 4.2: Bin count for a .96-optimal item pool
MDIFF:          -3.2 -2.8 -2.4 -2.0 -1.6 -1.2 -0.8 -0.4  0   0.4  0.8  1.2  1.6  2.0  2.4  2.8  3.2
a = (1, 0, 0):    1    3    4    5    6    7    7    7    8    7    7    7    6    5    4    3    1
a = (0, 1, 0):    1    3    4    5    6    7    7    7    7    7    7    7    6    5    4    3    1
a = (0, 0, 1):    2    3    4    5    6    7    7    7    8    7    7    7    6    6    5    3    1
MDIFF:          -5.6 -4.9 -4.2 -3.5 -2.8 -2.1 -1.4 -0.7  0   0.7  1.4  2.1  2.8  3.5  4.2  4.9  5.6
a = (1, 1, 1):    1    3    5    6    7    7    8    8    8    8    7    7    7    6    5    3    1
Note: the values in the MDIFF rows are the central points of the item bins; the values in the remaining rows are the numbers of items in each item bin.

Table 4.3: Bin count for a .86-optimal item pool
MDIFF:          -3.2 -2.4 -1.6 -0.8  0   0.8  1.6  2.4  3.2
a = (1, 0, 0):    2    5    7    8    8    8    7    5    2
a = (0, 1, 0):    2    5    7    8    8    8    7    5    2
a = (0, 0, 1):    2    5    7    8    8    8    7    5    3
MDIFF:          -5.6 -4.2 -2.8 -1.4  0   1.4  2.8  4.2  5.6
a = (1, 1, 1):    2    6    8    9    9    9    8    6    2
Note: this table is interpreted in the same way as Table 4.2.

Phase II. P-Optimal Item Pool Development

With the bin-count tables for the 24 p-optimal item pools created, the item pools themselves can be developed. In practice, real items would be written to match the bin-count tables; in this study, items are simulated. Items within each MDIFF-bin are set to be equally spaced. For example, if there are 8 items in the central MDIFF-bin for items with \(\boldsymbol{a}_i = (1, 0, 0)\), then 8 items are generated with MDIFF values equally spaced from -0.2 to 0.2. The MDIFF values are then converted to d-parameters according to equations (2.9) and (2.10). In this way, the 24 p-optimal item pools are developed by simulation from the bin-count tables created in the previous phase.

Phase III. Baseline Pool Development

To evaluate the 24 p-optimal item pools, baseline pools are needed as the basis for comparison. Previous studies (e.g., Gu, 2007; He, 2010; Reckase, 2010) used existing operational item pools for this purpose. However, no operational MCAT program exists so far, so an operational multidimensional item pool is not available. Therefore, the item pools used for MCAT in research articles are adopted in this study as the baseline pools. Some of the multidimensional item pools in the current literature were modified from corresponding unidimensional operational item pools; for instance, Segall (1996) and Yao (2012, 2013) created multidimensional item pools based on the operational item pool for the CAT-ASVAB. Other multidimensional item pools in the literature were created by pure simulation, such as the item pool used in van der Linden (1996, 1999). In this study, because the target examinee population and content areas are based on the CAT-ASVAB, it is natural to develop the baseline pools from the CAT-ASVAB as well.

There are three test specifications for the 24 p-optimal item pools in this study, and item pools with different test specifications cannot be compared with one another. Therefore, three baseline pools, one for each test specification, are created based on the CAT-ASVAB. Yao (2013) provided a detailed description of the multidimensional item pool for the CAT-ASVAB, including its size and item distribution; the development of the three baseline pools based on Yao's description is given below. For Test Specification 1 (two-dimensional simple structure), the baseline pool consists of 480 items, with 240 items from each of the two clusters.
In this study, the MCAT based on this test specification administers 15 items from each cluster to each examinee. In the CAT-ASVAB, 15 AR items and 15 WK items are administered, and the number of AR or WK items in the operational pool is around 240; this is the rationale for setting the size of the baseline pool to 2 × 240 = 480 for Test Specification 1. For Test Specification 2 (three-dimensional simple structure), the baseline pool consists of 480 items, with 160 items from each of the three clusters. For Test Specification 3 (three-dimensional non-simple structure), the baseline pool consists of 560 items, with 140 items from each of the four clusters. Similar reasoning determines the pool sizes for Test Specifications 2 and 3. The mean and standard deviation (SD) of the MDIFF values for the items in the three baseline pools are presented in Table 4.4.

Table 4.4: Item Statistics for the Three Baseline Pools
             2-dimension simple        3-dimension simple        3-dimension non-simple
             structure                 structure                 structure
             N      Mean    SD         N      Mean    SD         N      Mean    SD
Cluster 1    240    -0.76   2.55       160    -0.76   2.55       140    -0.76   2.55
Cluster 2    240    -0.35   3.07       160    -0.35   3.07       140    -0.35   3.07
Cluster 3                              160    -0.17   2.12       140    -0.17   2.12
Cluster 4                                                        140     0.10   2.58

Phase IV. Simulation Study

A simulation study is conducted to compare the performance of the MCAT using the p-optimal item pools against the MCAT using the baseline pools. The algorithm for the MCAT is described in Section 4.1. Two types of examinee distributions were used. First, to evaluate MCAT performance in general, 5,000 examinees are randomly sampled from the multivariate normal distribution with the mean vector and variance-covariance matrix specified in Table 4.1. Second, to evaluate MCAT performance at specific θ points, 100 examinees are generated at each of several θ points. The 29 θ points for the two-dimensional case are displayed in Figure 4.1. No points in the upper-left or lower-right regions are selected because, given that \(\theta_1\) and \(\theta_2\) are moderately or highly correlated, examinees are very unlikely to have a very high value of \(\theta_1\) and a very low value of \(\theta_2\), or vice versa.

Figure 4.1: The 29 θ Points for the Two Dimensional MCAT

The 37 θ points for the three-dimensional case are displayed in Table 4.5 and Figure 4.2. Again, because \(\theta_1\), \(\theta_2\), and \(\theta_3\) are correlated, only a limited number of points in the three-dimensional space are selected.

Table 4.5: The 37 θ Points for the Three Dimensional MCAT
Figure 4.2: The 37 θ Points for the Three Dimensional MCAT

4.3 Evaluation Criteria

The performance of the MCAT is evaluated in terms of the precision of the ability estimates and the item pool utilization. The evaluation criteria for precision include the Pearson product-moment correlation between the true θ and the estimated θ, the bias, and the root mean squared error (RMSE). The bias and RMSE are defined as

\[
\mathrm{Bias} = \sum_{i=1}^{n} \frac{\hat{\boldsymbol{\theta}}_i - \boldsymbol{\theta}_i}{n},
\tag{4.1}
\]

\[
\mathrm{RMSE} = \sqrt{\sum_{i=1}^{n} \frac{\bigl(\hat{\boldsymbol{\theta}}_i - \boldsymbol{\theta}_i\bigr)^2}{n}},
\tag{4.2}
\]

where n is the sample size.
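These two precision criteria can be computed directly from the matrices of true and estimated abilities, as in the hypothetical sketch below (not the study's code; the per-dimension reporting is an assumption of the example).

```python
import numpy as np

def bias_rmse(theta_hat, theta_true):
    """Bias (4.1) and RMSE (4.2), computed separately for each dimension;
    inputs are n-by-m arrays of estimated and true abilities."""
    err = theta_hat - theta_true
    bias = err.mean(axis=0)                    # (4.1), one value per dimension
    rmse = np.sqrt((err ** 2).mean(axis=0))    # (4.2), one value per dimension
    return bias, rmse
```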
For item pool utilization, the evaluation criteria are the overall pool usage, the test overlap rate, and the percentages of items with various exposure rates. As Chang and Ying (1999) proposed, the efficiency of overall item pool usage can be measured by the discrepancy between the observed and expected item exposure rates, summarized by the χ²-type statistic

\[
\chi^2 = \sum_{j=1}^{N} \frac{(r_j - L/N)^2}{L/N},
\tag{4.3}
\]

where \(r_j\) is the observed exposure rate of item j, L is the test length, and N is the number of items in the item pool. A low χ² value implies that the items in the pool are used more evenly.

Test overlap also describes item exposure and has been used as an item pool security index. The overlap rate is defined as the average proportion of items that two randomly selected examinees have in common (Way, 1998):

\[
R = \frac{T / C_n^2}{\sum_{i=1}^{n} L_i / n},
\tag{4.4}
\]

where T is the total number of items shared across the \(C_n^2\) pairs of the n examinees, and \(\sum_{i=1}^{n} L_i / n\) is the average number of items administered per examinee. In practice, an overlap rate below 15% is desired.

The item exposure rate is the ratio of the number of administrations of an item to the total number of examinees. In this study, the percentages of over- and under-exposed items in each item pool are also reported: a rate higher than 0.2 is regarded as overexposed (Segall, Moreno, & Hetter, 1997), and a rate lower than 0.02 is regarded as underexposed (Gu, 2007).

Chapter 5 Simulation Results

The simulation results are summarized in two parts. The first part presents the general characteristics of the 24 p-optimal item pools and how these characteristics are affected by test specification, exposure control, correlation among dimensions, and bin size. The second part describes the performance of the MCAT using each p-optimal item pool and how this performance compares with that of the MCAT using the baseline pools.

5.1 Item Pool Characteristics

Because the primary purpose of this study is to design and develop p-optimal item pools for MCAT, the results of the item pool development are presented first. The general characteristics of the p-optimal item pools and the baseline pools are summarized and compared in Section 5.1.1, and the item distributions for the 24 p-optimal item pools are described in Section 5.1.2.

5.1.1 Summary of Item Pool Characteristics

The summary characteristics, including pool size and the mean and standard deviation (SD) of item difficulty, for the .96-optimal and .86-optimal item pools are presented in Tables 5.1 and 5.2, respectively. The twelve .96-optimal item pools are based on a bin size of 0.4, and the twelve .86-optimal item pools on a bin size of 0.8; the characteristics of the three baseline pools are also presented in the two tables. All the .96-optimal item pools, as shown in Table 5.1, have smaller pool sizes than the baseline pools. For the 2-dimension simple structure and 3-dimension simple structure cases, the pool sizes of the .96-optimal item pools are about 110 to 150 items smaller than those of the baseline pools; for the 3-dimension non-simple structure case, they are about 150 to 190 items smaller.
Table 5.1: Summary for the .96-optimal item pools and baseline pools
                                                    High Correlation         Moderate Correlation
Test Specification       Statistics                 No Exp.    Exposure      No Exp.    Exposure      Baseline
                                                    Control    Control       Control    Control       Pool
2-dimension              Pool size                  369        371           328        333           480
Simple Structure         Mean of Difficulty         0.01       -0.01         0.02       -0.01         -0.52
                         SD of Difficulty           1.65       1.63          1.57       1.55          2.75
3-dimension              Pool size                  366        370           322        330           480
Simple Structure         Mean of Difficulty         0.02       0.00          0.00       0.01          -0.32
                         SD of Difficulty           1.61       1.58          1.51       1.48          2.72
3-dimension              Pool size                  407        407           363        369           560
Non-simple Structure     Mean of Difficulty         -0.01      -0.01         -0.01      0.00          -0.32
                         SD of Difficulty           2.09       2.09          1.95       1.94          2.68

Table 5.2: Summary for the .86-optimal item pools and baseline pools
                                                    High Correlation         Moderate Correlation
Test Specification       Statistics                 No Exp.    Exposure      No Exp.    Exposure      Baseline
                                                    Control    Control       Control    Control       Pool
2-dimension              Pool size                  206        252           192        246           480
Simple Structure         Mean of Difficulty         0.00       0.00          0.00       0.00          -0.52
                         SD of Difficulty           1.82       1.66          1.70       1.55          2.75
3-dimension              Pool size                  207        251           190        236           480
Simple Structure         Mean of Difficulty         -0.02      0.00          -0.03      0.00          -0.32
                         SD of Difficulty           1.76       1.60          1.63       1.47          2.72
3-dimension              Pool size                  233        272           216        253           560
Non-simple Structure     Mean of Difficulty         0.00       0.01          0.01       0.00          -0.32
                         SD of Difficulty           2.28       2.14          2.11       1.95          2.68

The average difficulty level (i.e., the mean of MDIFF) for all the .96-optimal item pools is around zero. This is as expected, because the mean ability of the target examinee population is zero and the p-optimal item pools are developed for that population. The mean difficulty level for all the baseline pools is slightly below zero, suggesting that the items in the baseline pools are, on average, easier than the items in the p-optimal item pools. A comparison of the SDs for the .96-optimal item pools and the baseline pools suggests that items in the baseline pools are more widely distributed.

All the .86-optimal item pools, as shown in Table 5.2, also have smaller pool sizes than the baseline pools; each is about half, or less than half, the size of the corresponding baseline pool. The mean difficulty level for all the .86-optimal item pools is around zero, and the SD of item difficulty for the .86-optimal item pools is also smaller than that of the baseline pools.

The comparison between the .96- and .86-optimal item pools shows the effect of bin size on the p-optimal item pool design. First, the .96-optimal item pools are much larger than the .86-optimal item pools; for the conditions without item exposure control, the .96-optimal pools are about twice the size of the .86-optimal pools. Therefore, a smaller bin size results in a larger item pool, consistent with the UCAT results in Reckase (2003). Second, the SD of item difficulty for the .96-optimal item pools is slightly smaller than that for the .86-optimal item pools. Although the ranges of item difficulty for the .96- and .86-optimal item pools are similar, the proportion of difficult or easy items is slightly higher in the .86-optimal item pools, and thus their SD values are larger. For example, 6% of the items have MDIFF larger than 2.8 in the .86-optimal item pool for the condition of 2-dimension simple structure, high correlation, and no exposure control, while only 4% do in the .96-optimal item pool for the same MCAT. In addition to bin size, the test specification also affects the p-optimal item pool design.
The pool sizes for all the p-optimal item pools based on the 2- and 3-dimension simple structures are very similar, except that there is a 5-item difference between the two .96-optimal item pools with moderate correlation and no exposure control. The p-optimal item pools with the test specification of 3-dimension non-simple structure consist of about 10-12% more items than those for the other two test specifications. Therefore, if the test length is the same, adding one more cluster of items that measure a different content area does not require a larger item pool (e.g., going from the 2-dimension simple structure to the 3-dimension simple structure); however, if the added items measure more than one content area (e.g., going from the 3-dimension simple structure to the 3-dimension non-simple structure), the pool size needs to be increased.

In addition to the pool size, the SD of item difficulty is also affected by the test specifications. The SD in the 2-dimension simple structure condition is slightly larger than the SD in the 3-dimension simple structure condition. The items in the p-optimal item pools based on both types of test specification have the same difficulty range, but the proportion of difficult and easy items is slightly larger for the 2-dimension simple structure condition. The SD for the 3-dimension non-simple structure is much larger than the SD for the other two test specifications. This is because the item difficulty for items measuring all three content areas (with a_i = (1,1,1)) is more spread out.

This study also examines how the correlation among dimensions affects the design of the p-optimal item pool. Tables 5.1 and 5.2 show that if dimensions are highly correlated, the pool size and the SD of item difficulty are larger than under the condition in which dimensions are moderately correlated. This is because, when dimensions are highly correlated, a slightly larger number of examinees will have very high ability in all dimensions, and thus more difficult items are needed in the item pool. For a similar reason, more easy items are also needed in the item pool when dimensions are highly correlated.

If item exposure control is implemented in the MCAT, a larger item pool is necessary. Similar results can be found in Gu (2007), He (2012), and Zhou (2013) for UCAT. For the .96-optimal item pools, given that the pool size is already over 350 items, adding item exposure control increases the pool size by fewer than 10 items. For the .86-optimal item pools, about 40-50 more items are added to the item pool in order to minimize the item exposure rate while still providing precise ability estimation. Because items with a difficulty level around zero have a higher chance of being selected (as more examinees are located in the middle), those additional items are all added to the MDIFF-bins in the middle, and therefore the SD values for the p-optimal item pools with item exposure control decrease.

In summary, the characteristics of the p-optimal item pools change with different bin sizes, test specifications, correlations among dimensions, and whether item exposure control is implemented. A larger item pool is necessary if the bin size decreases, the test becomes non-simple structure, dimensions are highly correlated, or item exposure control is considered.

5.1.2 Item distribution for p-optimal item pools

Each of the p-optimal item pools consists of items from more than one cluster. The number of items in each cluster for the .96- and .86-optimal item pools is presented in Tables 5.3 and 5.4, respectively.
For the 2-dimension simple structure case, half of the items are from Cluster 1 and the other half are from Cluster 2. For the 3-dimension simple structure case, one third of the items are from each cluster. For the 3-dimension non-simple structure case, there are equal numbers of items from Clusters 1-3 and slightly more items from Cluster 4.

Table 5.3: Item distribution for the .96-optimal item pools

                                                   High Correlation       Moderate Correlation
Test Specification     Number of Items             No Exp.     Exposure   No Exp.     Exposure
                                                   Control     Control    Control     Control
2-dimension            Items with a = (1,0)        184         185        164         167
Simple Structure       Items with a = (0,1)        185         186        164         166
                       Total                       369         371        328         333
3-dimension            Items with a = (1,0,0)      122         124        106         105
Simple Structure       Items with a = (0,1,0)      123         124        111         107
                       Items with a = (0,0,1)      121         122        113         110
                       Total                       366         370        330         322
3-dimension            Items with a = (1,0,0)      100         100        88          89
Non-simple Structure   Items with a = (0,1,0)      102         101        87          89
                       Items with a = (0,0,1)      100         100        91          92
                       Items with a = (1,1,1)      105         106        97          99
                       Total                       407         407        363         369

Table 5.4: Item distribution for the .86-optimal item pools

                                                   High Correlation       Moderate Correlation
Test Specification     Number of Items             No Exp.     Exposure   No Exp.     Exposure
                                                   Control     Control    Control     Control
2-dimension            Items with a = (1,0)        103         126        96          123
Simple Structure       Items with a = (0,1)        103         126        96          123
                       Total                       206         252        192         246
3-dimension            Items with a = (1,0,0)      68          83         60          76
Simple Structure       Items with a = (0,1,0)      71          85         64          80
                       Items with a = (0,0,1)      68          83         66          80
                       Total                       207         251        190         236
3-dimension            Items with a = (1,0,0)      56          66         52          61
Non-simple Structure   Items with a = (0,1,0)      56          66         52          61
                       Items with a = (0,0,1)      56          65         53          61
                       Items with a = (1,1,1)      65          75         59          70
                       Total                       233         272        216         253

For the 2- and 3-dimension simple structure cases, the items are distributed equally across the clusters because the D-Optimality method selects the same number of items from each cluster. Based on equation (3.8), when an item from Cluster 1 is administered, the test information in the direction of dimension 1 will be slightly larger than that of dimension 2, and an item from Cluster 2 will be selected next. After this item is administered and the ability estimate is updated, the test information in the direction of dimension 2 will be larger than that of dimension 1, and an item from Cluster 1 will be selected. Therefore, items from the clusters take turns being selected. Occasionally, two items from the same cluster are administered successively. If this happens, two items from another cluster will be selected to balance the test information between the two directions.

Figure 5.1: The direction of the information for items with a = (1,1,1)

For the 3-dimension non-simple structure case, items measuring all three dimensions are included in the item pool as a fourth cluster. Items from Cluster 4 provide 1 unit of information in the direction of the (theta_1, theta_2, theta_3) composite (see Figure 5.1) and sqrt(3)/3 unit of information in the direction of each of theta_1, theta_2, and theta_3. Items from Clusters 1-3 provide 1 unit of information in the direction of theta_1, theta_2, or theta_3 and only a small amount of information in the direction of the composite. Suppose three items, one from Cluster 1, one from Cluster 2, and one from Cluster 3, have been administered. At this point, the information in the direction of the diagonal is the smallest; thus, the fourth item is chosen from Cluster 4. Then, items from Clusters 1-3 are selected next.
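To make this directional reasoning concrete, the sketch below evaluates the standard MIRT information of a single item in a given direction, (a'u)^2 P(theta)[1 - P(theta)] for a unit vector u. The ability point theta = (0, 0, 0), the intercept d = 0, and the specific items are illustrative assumptions; this is a minimal sketch of the general formula, not the item selection code used in this study.

```python
import numpy as np

def m2pl_prob(a, d, theta):
    """Probability of a correct response under the multidimensional 2PL."""
    return 1.0 / (1.0 + np.exp(-(np.dot(a, theta) + d)))

def directional_info(a, d, theta, u):
    """Information of one item in the direction of u: (a'u)^2 * P * (1 - P),
    with u rescaled to unit length."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    p = m2pl_prob(a, d, theta)
    return float(np.dot(a, u) ** 2 * p * (1.0 - p))

theta = np.zeros(3)                     # examinee at the center of the ability space
d = 0.0                                 # medium-difficulty item (illustrative)
composite = np.ones(3)                  # direction of the (1,1,1) composite
axis1 = np.array([1.0, 0.0, 0.0])       # direction of theta_1 alone

cluster4_item = np.array([1.0, 1.0, 1.0])   # item measuring all three dimensions
cluster1_item = np.array([1.0, 0.0, 0.0])   # simple-structure item for dimension 1

print(directional_info(cluster4_item, d, theta, axis1))      # 0.25 along theta_1
print(directional_info(cluster1_item, d, theta, composite))  # ~0.083 along the composite
```

With these illustrative numbers, a Cluster 4 item contributes 0.25 along each coordinate axis, whereas a Cluster 1 item contributes only about 0.083 along the composite, which is consistent with Clusters 1-3 occasionally being skipped in the selection rotation described next.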
Most of the time, items from the four clusters take turns being selected. Because the amount of information that items from Cluster 4 provide in the direction of theta_1, theta_2, or theta_3 is larger than the amount of information that items from Clusters 1-3 provide in the direction of the composite, Clusters 1-3 may sometimes be skipped in a rotation. Therefore, in this study, about 8 to 9 items from Cluster 4 and about 7 to 8 items from each of Clusters 1, 2, and 3 are given to each examinee. Because more items from Cluster 4 are administered, more Cluster 4 items should be available in the item pool.

The distributions for the .96- and .86-optimal item pools without exposure control (two-dimension simple structure, high correlation) are presented in Figures 5.2 and 5.3, respectively. Each bar in the figures represents the number of items in an MDIFF-bin. For both item pools, the distribution of item difficulty is flatter than a normal distribution. Half of the items are from Cluster 1 and the other half are from Cluster 2. Figures 5.4 and 5.5 present the distributions for the .96- and .86-optimal item pools with exposure control (two-dimension simple structure, high correlation), respectively. For both item pools, items are distributed from -3.2 to 3.2, with many more items located in the middle bins. Figures 5.2 and 5.4 differ only in the central MDIFF-bin: 15 items in Figure 5.2 versus 17 items in Figure 5.4. They look different because the scales of the y-axes differ. For the .86-optimal item pool, the difference between Figures 5.3 and 5.5 lies in the three bins in the middle. Because of the item exposure control, the number of items in the central MDIFF-bin is doubled in Figure 5.5. The distributions for the p-optimal item pools in the other conditions have similar shapes and are therefore not presented here. The number of items in each MDIFF-bin for all 24 p-optimal item pools can be found in the Appendix.

Figure 5.2: Item distribution for the .96-optimal item pool without exposure control (two-dimension simple structure, high correlation)
Figure 5.3: Item distribution for the .86-optimal item pool without exposure control (two-dimension simple structure, high correlation)
Figure 5.4: Item distribution for the .96-optimal item pool with exposure control (two-dimension simple structure, high correlation)
Figure 5.5: Item distribution for the .86-optimal item pool with exposure control (two-dimension simple structure, high correlation)

5.2 Performance of the p-Optimal Item Pools

The previous section described the characteristics of the p-optimal item pools and how these characteristics change with the MCAT design (including bin size, test specification, correlation, and exposure control). In this section, the performance of the MCAT using the p-optimal item pools is evaluated based on the simulation results. Two questions are addressed: (1) how does the performance of the MCAT using the p-optimal item pools compare with that of the MCAT using the baseline pools? and (2) how do the MCAT designs influence the performance of the MCAT using the p-optimal item pools?
The simulation results for Test Specification 1 (two-dimension simple structure) are presented first, in Section 5.2.1 (high correlation condition) and Section 5.2.2 (moderate correlation condition), followed by Test Specification 2 (three-dimension simple structure) in Sections 5.2.3 (high correlation) and 5.2.4 (moderate correlation), and Test Specification 3 (three-dimension non-simple structure) in Sections 5.2.5 (high correlation) and 5.2.6 (moderate correlation).

5.2.1 Performance for item pools based on Test Specification 1 (high correlation)

Tables 5.5 and 5.6 present the results of ability estimation and item pool utilization for the .96-optimal item pool, the .86-optimal item pool, and the baseline pool under the two-dimension simple structure test specification with theta_1 and theta_2 highly correlated. The results in Table 5.5 are for the condition without item exposure control; Table 5.6 is for the condition with item exposure control. In both tables, there are two values for bias, RMSE, and correlation, representing the results for (theta_1, theta_2).

Under the condition without item exposure control (see Table 5.5), the two p-optimal item pools and the baseline pool show no bias in the theta estimates. The RMSEs are all 0.40, and the correlations between the estimated and true theta are around 0.91. The average test information is also very similar among the three item pools. The amount of information in the direction of theta_1 and theta_2 is around 3.59. This value is very high for the MCAT in this study, because 15 items from each cluster are administered and the maximum amount of information an item can provide is 0.25. Because of the simple structure, the off-diagonal values of the information matrix are zero. In general, the results suggest that the .96- and .86-optimal item pools provide accurate estimation of theta, and the level of accuracy is the same as for the baseline pool.

Table 5.5: The performance of the .96- and .86-optimal pool and the baseline pool without exposure control (2-dimension simple structure, high correlation)

Statistics                               .96-optimal pool    .86-optimal pool    Baseline pool
Bias                                     (0.00, 0.00)        (0.00, 0.00)        (0.00, 0.00)
RMSE                                     (0.40, 0.40)        (0.40, 0.40)        (0.40, 0.40)
Correlation                              (0.91, 0.91)        (0.91, 0.91)        (0.92, 0.91)
Average test information                 diag(3.59, 3.60)    diag(3.58, 3.59)    diag(3.59, 3.60)
Overall pool usage                       29.03               32.31               60.92
Overlap rate                             0.16                0.30                0.19
% of overexposed items (r > 0.2)         11%                 34%                 9%
% of underexposed items (r < 0.02)       35%                 33%                 54%

Table 5.6: The performance of the .96- and .86-optimal pool and the baseline pool with exposure control (2-dimension simple structure, high correlation)

Statistics                               .96-optimal pool    .86-optimal pool    Baseline pool
Bias                                     (0.00, 0.00)        (0.00, 0.00)        (0.00, 0.00)
RMSE                                     (0.41, 0.41)        (0.41, 0.41)        (0.41, 0.42)
Correlation                              (0.91, 0.91)        (0.91, 0.91)        (0.91, 0.91)
Average test information                 diag(3.34, 3.35)    diag(3.28, 3.28)    diag(3.30, 3.10)
Overall pool usage                       5.02                2.19                13.38
Overlap rate                             0.09                0.13                0.09
% of overexposed items (r > 0.2)         0%                  0%                  0%
% of underexposed items (r < 0.02)       6%                  0%                  26%

Table 5.5 also presents the results for item pool usage. The overall pool usage index for the .96-optimal item pool is slightly smaller than that of the .86-optimal item pool, and the index for the baseline pool is about twice that of the .96- and .86-optimal item pools.
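The overall pool usage index and the overlap rate reported in these tables follow Equations 4.3 and 4.4, and the over- and under-exposure percentages use the 0.2 and 0.02 cutoffs. The following minimal sketch shows how such statistics could be computed; the exposure counts and test records in the example are made up for illustration and are not the study's simulation data.

```python
import numpy as np
from itertools import combinations

def pool_usage_chi2(exposure_rates, test_length):
    """Chi-square pool-usage index (Equation 4.3): discrepancy between observed
    exposure rates and the uniform target rate L/N."""
    r = np.asarray(exposure_rates, dtype=float)
    expected = test_length / r.size
    return np.sum((r - expected) ** 2 / expected)

def overlap_rate(administered):
    """Average proportion of common items for pairs of examinees (Equation 4.4).
    `administered` is a list of sets of item ids, one set per examinee."""
    n = len(administered)
    shared = sum(len(a & b) for a, b in combinations(administered, 2))
    n_pairs = n * (n - 1) / 2
    avg_test_length = sum(len(a) for a in administered) / n
    return (shared / n_pairs) / avg_test_length

def exposure_summary(exposure_rates, high=0.2, low=0.02):
    """Percentages of over- and under-exposed items."""
    r = np.asarray(exposure_rates, dtype=float)
    return (r > high).mean() * 100, (r < low).mean() * 100

# Hypothetical example: 5 examinees, each taking a 4-item test from a 10-item pool.
tests = [set(t) for t in ([0, 1, 2, 3], [0, 1, 4, 5], [2, 3, 6, 7],
                          [0, 2, 4, 6], [1, 3, 5, 8])]
counts = np.zeros(10)
for t in tests:
    for j in t:
        counts[j] += 1
rates = counts / len(tests)              # observed exposure rate for each item

print(pool_usage_chi2(rates, test_length=4))
print(overlap_rate(tests))
print(exposure_summary(rates))
```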
Because a small overall pool usage index implies that more items in the item pool are fully used, the results suggest that the .96-optimal item pool has slightly better usage than the .86-optimal item pool, and the two p-optimal item pools have much better usage than the baseline pool. More specifically, for the .96-optimal item pool, the overlap rate is 0.16, indicating that two randomly selected examinees receive about 16% of their items in common, and the percentages of overexposed and underexposed items are 11% and 35%, respectively. For the .86-optimal item pool, the results are: 30% of items overlap, 34% are overexposed, and 33% are underexposed. Because more items from the .86-optimal item pool are overlapped and overexposed, the .86-optimal item pool is less secure than the .96-optimal item pool. This finding is reasonable because the .86-optimal item pool contains only 206 items, while the .96-optimal item pool has 369 items. The overlap rate for the baseline pool is 0.19, which is slightly higher than that of the .96-optimal item pool and lower than that of the .86-optimal item pool. Although a smaller percentage of items (9%) from the baseline pool are overexposed, more than half of the items (54%) are rarely used. This implies that many items in the baseline pool are wasted. In brief, based on these pool usage results, the item pool usage for the .96- and .86-optimal item pools is much better than for the baseline pool.

When item exposure control is implemented (see Table 5.6), similar results can be observed: the two p-optimal item pools provide ability estimation as accurate as the baseline pool and yield better item pool usage than the baseline pool. Compared with the condition without item exposure control, item exposure control results in only a 0.01 to 0.02 increase in the RMSE and about a 0.3 decrease in the average test information. The reason why item exposure control barely affects the ability estimation is that the p-optimal item pool design takes the item exposure rate into account and ensures an adequate number of items is available for selection. For the item pool usage, when item exposure control is implemented, no item is overexposed, and the percentages of underexposed and overlapped items also decrease. The .86-optimal item pool has been fully used in this condition, with no item underexposed. The overall pool usage indices for the .96-optimal item pool, the .86-optimal item pool, and the baseline pool are 5.02, 2.19, and 13.38, respectively. These values are much smaller than under the condition without item exposure control. Thus, item exposure control can effectively improve the item pool usage and reduce the item exposure rate without obvious loss in the accuracy of ability estimation.

In addition to the overall performance, the conditional bias and RMSE at the 29 (theta_1, theta_2) points are also calculated in this study to evaluate the ability estimation at each theta point. The results are presented as contour plots, where each contour curve connects points with the same bias or RMSE value. The conditional bias at each theta point is plotted in Figures 5.6 and 5.7, for the MCAT without and with item exposure control, respectively. In each figure, the two plots at the top (subplots a and b) present the conditional bias for theta_1 and theta_2 for the .96-optimal item pool; the two plots in the middle (subplots c and d) present the conditional bias for the .86-optimal item pool; and subplots e and f at the bottom present the conditional bias for the baseline pool.
The conditional RMSE is plotted in Figures 5.8 and 5.9 in the same manner. The red points in the contour plots represent the 29 (theta_1, theta_2) points.

Figure 5.6: Conditional bias for the theta estimates without exposure control (2-dimension simple structure, high correlation); panels (a)-(f) show the bias for theta_1 and theta_2 for the .96-optimal, .86-optimal, and baseline (comparison) pools
Figure 5.7: Conditional bias for the theta estimates with exposure control (2-dimension simple structure, high correlation)
Figure 5.8: Conditional RMSE for the theta estimates without exposure control (2-dimension simple structure, high correlation)
Figure 5.9: Conditional RMSE for the theta estimates with exposure control (2-dimension simple structure, high correlation)

Under the condition without item exposure control (see Figure 5.6 for bias and Figure 5.8 for RMSE), it is obvious that the plots for the .96-optimal item pool, the .86-optimal item pool, and the baseline pool are very similar. This finding supports the results for the overall bias and RMSE and also suggests that the p-optimal item pools can provide ability estimation as accurate as the baseline pool at each theta point. In general, larger bias and RMSE occur when theta_1 and theta_2 are very large or very small, which corresponds to the upper right and lower left corners of the contour plots. In addition to the value of theta, the difference between theta_1 and theta_2 also affects the estimation accuracy. More specifically, when theta_1 is within (-1, 1) and theta_2 is near theta_1, the bias for theta_1 is close to 0 and the RMSE is less than 0.4. Negative bias and large RMSE appear when the value of theta_1 increases and the difference between theta_1 and theta_2 increases. For example, at points (3, 1) and (3, 2) in the plots, the bias for theta_1 is about -1.0 and the RMSE for theta_1 is about 1.0. Meanwhile, positive bias and large RMSE appear when the value of theta_1 decreases and the difference between theta_1 and theta_2 increases. At points (-3, -1) and (-3, -2), the bias for theta_1 is about 1.0 and the RMSE for theta_1 is about 1.0. Similar results for theta_2 can be observed from the right panels of Figures 5.6 and 5.8. When theta_2 is within (-1, 1) and theta_1 is near theta_2, the bias and RMSE for theta_2 are very small. When the value of theta_2 becomes more extreme and theta_1 is away from theta_2, large bias and RMSE values appear. This finding is probably due to the Bayesian MAP estimation method. As described in Chapter 2, the Bayesian method uses the distribution of the true theta as the prior. In this condition, the true theta has a mean vector of (0, 0) and a high correlation between theta_1 and theta_2.
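A minimal numerical sketch of this shrinkage effect is given below. The item parameters, simulated responses, and prior correlations (0.8 for the high-correlation case and 0.5 for the moderate case) are hypothetical and are intended only to illustrate the mechanism; this is not the estimation code used in the study.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def map_estimate(a, d, responses, prior_cov):
    """MAP estimate of theta for the multidimensional 2PL with a
    multivariate normal prior centered at the zero vector."""
    prior_prec = np.linalg.inv(prior_cov)

    def neg_log_posterior(theta):
        p = 1.0 / (1.0 + np.exp(-(a @ theta + d)))
        p = np.clip(p, 1e-9, 1 - 1e-9)                 # guard against log(0)
        loglik = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
        log_prior = -0.5 * theta @ prior_prec @ theta  # up to a constant
        return -(loglik + log_prior)

    return minimize(neg_log_posterior, np.zeros(a.shape[1])).x

# A 30-item simple-structure test: 15 items load on each dimension.
a = np.vstack([np.repeat([[1.0, 0.0]], 15, axis=0),
               np.repeat([[0.0, 1.0]], 15, axis=0)])
d = rng.uniform(-2.0, 2.0, size=30)          # hypothetical intercepts
true_theta = np.array([3.0, 1.0])            # examinee far from the diagonal
responses = rng.binomial(1, 1.0 / (1.0 + np.exp(-(a @ true_theta + d))))

high = np.array([[1.0, 0.8], [0.8, 1.0]])        # assumed high-correlation prior
moderate = np.array([[1.0, 0.5], [0.5, 1.0]])    # assumed moderate-correlation prior

print(map_estimate(a, d, responses, high))       # pulled more toward the diagonal
print(map_estimate(a, d, responses, moderate))   # pulled less strongly
```

For the same response pattern, the higher the prior correlation, the more the difference between the two estimated abilities is penalized, which is the shrinkage described next.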
The prior shrinks the ability estimates toward the middle and reduces the difference between theta_1 and theta_2. In this study, the overall test length is 30, so about 15 items are selected from each cluster. The effect of the likelihood function is probably not strong enough to overcome the effect of the prior. If the test length were further increased, the effect of the likelihood function would eventually dominate the effect of the prior and therefore reduce the bias and RMSE in these extreme cases.

When item exposure control is implemented, similar findings can be observed from Figures 5.7 and 5.9. Again, there is nearly no difference between the two p-optimal item pools, or between the p-optimal item pools and the baseline pool. The results support the findings based on the overall bias and RMSE and further suggest that the MCATs using the three item pools perform similarly in terms of ability estimation at the 29 theta points. In addition, larger bias and RMSE occur when theta_1 and theta_2 are very large or very small and when theta_1 and theta_2 are away from each other. A comparison between the conditions with and without item exposure control shows that, when item exposure control is built in, the magnitude of the bias and RMSE at some extreme points becomes larger. This increase in estimation error is due to the item exposure control. Because the exposure control prevents the most informative items from being selected too frequently, the information available for ability estimation decreases slightly. When information decreases, the prior plays a more important role in the ability estimation. Thus, the measurement error at extreme theta points becomes larger if item exposure control is added to the item selection process.

In summary, this section presented the results for the MCAT with the test specification of two-dimension simple structure and with high correlation between theta_1 and theta_2. In general, the p-optimal item pools perform similarly to the baseline pool in terms of both the overall and the conditional accuracy of ability estimation, but the p-optimal item pools save over 100 items and have better item pool usage. When item exposure control is implemented, the item exposure rate and item overlap rate can be controlled very well, and the p-optimal item pools can still provide reliable ability estimation with a relatively small pool size.

5.2.2 Performance for item pools based on Test Specification 1 (moderate correlation)

The results for the MCAT with the same test specification, but with theta_1 and theta_2 moderately correlated, are presented in Tables 5.7 and 5.8. The results in Table 5.7 are for the condition without item exposure control; Table 5.8 is for the condition with item exposure control. In both tables, there are two values for bias, RMSE, and correlation, representing the results for (theta_1, theta_2).

Under the condition without item exposure control (see Table 5.7), the p-optimal item pools and the baseline pool show nearly no bias in the theta estimates. The RMSEs are all around 0.46, and the correlations between the estimated and true theta are around 0.88. The average test information is also very similar among the three item pools.
The amount of information in the direction of theta_1 and theta_2 is around 3.58. In general, the results suggest that the .96- and .86-optimal item pools provide accurate estimation of theta, and the level of accuracy is the same as for the baseline pool.

Table 5.7: The performance of the .96- and .86-optimal pool and the baseline pool without exposure control (2-dimension simple structure, moderate correlation)

Statistics                               .96-optimal pool    .86-optimal pool    Baseline pool
Bias                                     (-0.01, 0.00)       (-0.01, 0.00)       (-0.01, 0.01)
RMSE                                     (0.45, 0.46)        (0.45, 0.46)        (0.45, 0.46)
Correlation                              (0.89, 0.89)        (0.89, 0.89)        (0.89, 0.89)
Average test information                 diag(3.58, 3.58)    diag(3.57, 3.58)    diag(3.58, 3.58)
Overall pool usage                       28.47               31.69               66.65
Overlap rate                             0.18                0.32                0.20
% of overexposed items (r > 0.2)         16%                 34%                 10%
% of underexposed items (r < 0.02)       32%                 29%                 55%

Table 5.8: The performance of the .96- and .86-optimal pool and the baseline pool with exposure control (2-dimension simple structure, moderate correlation)

Statistics                               .96-optimal pool    .86-optimal pool    Baseline pool
Bias                                     (-0.01, 0.00)       (-0.01, 0.00)       (-0.01, 0.00)
RMSE                                     (0.46, 0.47)        (0.47, 0.47)        (0.47, 0.48)
Correlation                              (0.88, 0.88)        (0.88, 0.88)        (0.88, 0.87)
Average test information                 diag(3.31, 3.31)    diag(3.27, 3.29)    diag(3.29, 3.02)
Overall pool usage                       3.55                1.59                13.48
Overlap rate                             0.10                0.13                0.09
% of overexposed items (r > 0.2)         0%                  0%                  0%
% of underexposed items (r < 0.02)       0%                  0%                  26%

Table 5.7 also presents the results for item pool usage. The overall pool usage index for the .96-optimal item pool is slightly smaller than that of the .86-optimal item pool, and the index for the baseline pool is more than twice that of the .96- and .86-optimal item pools. The results suggest that the .96-optimal item pool has slightly better usage than the .86-optimal item pool, and the two optimal item pools have much better usage than the baseline pool. More specifically, for the .96-optimal item pool, the overlap rate is 0.18, and the percentages of overexposed and underexposed items are 16% and 32%, respectively. For the .86-optimal item pool, the results are: 32% of items overlap, 34% are overexposed, and 29% are underexposed. Because more items from the .86-optimal item pool are overlapped and overexposed, the .86-optimal item pool is less secure than the .96-optimal item pool. The overlap rate for the baseline pool is 0.20, which is slightly higher than that of the .96-optimal item pool and lower than that of the .86-optimal item pool. Although a smaller percentage of items (10%) from the baseline pool are overexposed, more than half of the items (55%) are rarely used. This implies that many items in the baseline pool are wasted. In brief, based on these pool usage results, the item pool usage for the .96- and .86-optimal item pools is much better than for the baseline pool.

When item exposure control is implemented (see Table 5.8), similar results can be observed: the two p-optimal item pools provide ability estimation as accurate as the baseline pool and yield better item pool usage than the baseline pool. Compared with the condition without item exposure control, item exposure control results in only a 0.01 to 0.02 increase in the RMSE and about a 0.3 decrease in the average test information. For the item pool usage, when item exposure control is implemented, no item is overexposed, and the percentages of underexposed and overlapped items also decrease. The .96- and .86-optimal item pools have been fully used, with no item underexposed.
The overall pool usage indices for the .96-optimal item pool, the .86-optimal item pool, and the baseline pool are 3.55, 1.59, and 13.48, respectively. These values are much smaller than under the condition without item exposure control. Thus, the results suggest that item exposure control can effectively improve the item pool usage and reduce the item exposure rate without obvious loss in the accuracy of ability estimation.

In addition to the overall performance, the conditional bias and RMSE at the 29 (theta_1, theta_2) points are also calculated in this study to evaluate the ability estimation at each theta point. The conditional bias at each theta point is plotted in Figures 5.10 and 5.11, for the MCAT without and with item exposure control, respectively. The conditional RMSE is plotted in Figures 5.12 and 5.13.

Under the condition without item exposure control (see Figure 5.10 for bias and Figure 5.12 for RMSE), it is obvious that the plots for the .96-optimal item pool, the .86-optimal item pool, and the baseline pool are very similar. This finding supports the results for the overall bias and RMSE and also suggests that the p-optimal item pools can provide ability estimation as accurate as the baseline pool at each theta point. Similar to the results in Section 5.2.1, larger bias and RMSE occur when theta_1 and theta_2 are very large or very small, which corresponds to the upper right and lower left corners of the contour plots. In addition to the value of theta, the difference between theta_1 and theta_2 also affects the estimation accuracy. More specifically, when theta_1 is within (-1, 1) and theta_2 is near theta_1, the bias for theta_1 is close to 0 and the RMSE is less than 0.4. Negative bias and large RMSE appear when the value of theta_1 increases and the difference between theta_1 and theta_2 increases. For example, at points (3, 1) and (3, 2) in the plots, the bias for theta_1 is about -0.7 and the RMSE for theta_1 is about 0.8. Meanwhile, positive bias and large RMSE appear when the value of theta_1 decreases and the difference between theta_1 and theta_2 increases. At points (-3, -1) and (-3, -2), the bias for theta_1 is about 0.7 and the RMSE for theta_1 is about 0.8. Similar results for theta_2 can be observed from the right panels of Figures 5.10 and 5.12. When theta_2 is within (-1, 1) and theta_1 is near theta_2, the bias and RMSE for theta_2 are very small. When the value of theta_2 becomes more extreme and theta_1 is away from theta_2, large bias and RMSE values appear.
Figure 5.10: Conditional bias for the theta estimates without exposure control (2-dimension simple structure, moderate correlation); panels (a)-(f) show the bias for theta_1 and theta_2 for the .96-optimal, .86-optimal, and baseline (comparison) pools
Figure 5.11: Conditional bias for the theta estimates with exposure control (2-dimension simple structure, moderate correlation)
Figure 5.12: Conditional RMSE for the theta estimates without exposure control (2-dimension simple structure, moderate correlation)
Figure 5.13: Conditional RMSE for the theta estimates with exposure control (2-dimension simple structure, moderate correlation)

Comparing the contour plots in this section with the plots in Section 5.2.1 (where theta_1 and theta_2 are highly correlated) shows that the pattern of the contour plots is the same, but the magnitude of the bias and RMSE is smaller. When the correlation between theta_1 and theta_2 decreases, the prior reduces the difference between theta_1 and theta_2 less strongly. Therefore, when theta_1 and theta_2 are moderately correlated, the bias and RMSE values are slightly smaller at those points where theta_1 and theta_2 are away from each other, compared with the condition in which theta_1 and theta_2 are highly correlated.

When item exposure control is implemented, similar findings can be observed from Figures 5.11 and 5.13. Again, there is nearly no difference between the two p-optimal item pools, or between the p-optimal item pools and the baseline pool. The results support the findings based on the overall bias and RMSE and further suggest that the three item pools perform similarly in terms of ability estimation at the 29 theta points. In addition, larger bias and RMSE occur when theta_1 and theta_2 are very large or very small and when theta_1 and theta_2 are away from each other. Similar to the results under the high correlation condition in Section 5.2.1, when item exposure control is built in, the magnitude of the bias and RMSE at some extreme points becomes larger.

In summary, this section presented the results for the MCAT with the test specification of two-dimension simple structure and with moderate correlation between theta_1 and theta_2. The p-optimal item pools perform similarly to the baseline pool in terms of the accuracy of ability estimation, but the p-optimal item pools save over 140 items and have better item pool usage. When item exposure control is implemented, the p-optimal item pools can still provide accurate ability estimation while the item exposure rate and item overlap rate are well controlled.
In general, the findings in this section are similar to those in the previous section. A closer comparison between the two sections reveals that the measurement error in this section is slightly larger. This result is due to the magnitude of the correlation between theta_1 and theta_2. Unlike the UIRT model, which estimates theta_1 and theta_2 one at a time, the MIRT model estimates theta_1 and theta_2 simultaneously by borrowing information from one dimension for the other. When theta_1 and theta_2 are highly correlated, more variance in theta_1 can be explained by theta_2, so more information can be borrowed for ability estimation. When the correlation between theta_1 and theta_2 decreases, the amount of information that can be borrowed decreases accordingly, and therefore the RMSE of the theta estimates increases. In addition to the accuracy of ability estimation, the pool usage for the two p-optimal item pools in this section is also slightly better. This is probably because of the pool size: when the correlation between theta_1 and theta_2 decreases, the pool size decreases as well, and a smaller item pool is more likely to be fully used.

5.2.3 Performance for item pools based on Test Specification 2 (high correlation)

The results for the MCAT based on the three-dimension simple structure, with theta_1, theta_2, and theta_3 highly correlated, are presented in Tables 5.9 and 5.10. The results in Table 5.9 are for the condition without item exposure control; Table 5.10 is for the condition with item exposure control. In both tables, there are three values for bias, RMSE, and correlation, representing the results for (theta_1, theta_2, theta_3).

Under the condition without item exposure control (see Table 5.9), the p-optimal item pools and the baseline pool show nearly no bias on average. The RMSE ranges from 0.41 to 0.46, and the correlations between the estimated and true theta are around 0.90. The average test information is also very similar among the three item pools. The amount of information in the direction of theta_1, theta_2, and theta_3 is around 2.39. This value is very high for the three-dimensional MCAT in this study, because only 10 items from each cluster are administered and the maximum amount of information an item can provide is 0.25.
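As a quick check on this ceiling (a sketch based on the standard item information matrix of the multidimensional 2PL with unit discriminations, consistent with the 0.25 figure quoted above, rather than a derivation taken from this study): a simple-structure item with a_i equal to a coordinate vector e_k contributes only to the k-th diagonal element of the information matrix, and that contribution is at most P(1 - P) <= 1/4.

\[
I_i(\theta) \;=\; P_i(\theta)\,\bigl[1 - P_i(\theta)\bigr]\, a_i a_i^{\prime},
\qquad
\max_{P} P(1-P) \;=\; \tfrac{1}{4},
\qquad
10 \times 0.25 \;=\; 2.5 \;\ge\; 2.39 .
\]

Ten such items per dimension therefore bound the test information along each axis at 2.5, so the observed value of about 2.39 is close to the attainable ceiling; the corresponding bound for the two-dimensional tests reported earlier is 15 x 0.25 = 3.75, against observed values near 3.59. The off-diagonal elements vanish because e_k e_k' has a single nonzero entry.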
In general, the results suggest that the .96- and .86-optimal item pools provide accurate estimation of theta, and the level of accuracy is the same as for the baseline pool.

Table 5.9: The performance of the .96- and .86-optimal pool and the baseline pool without exposure control (3-dimension simple structure, high correlation)

Statistics                               .96-optimal pool          .86-optimal pool          Baseline pool
Bias                                     (-0.01, 0.00, 0.00)       (0.00, 0.00, 0.00)        (0.00, 0.00, 0.00)
RMSE                                     (0.44, 0.42, 0.45)        (0.44, 0.41, 0.45)        (0.44, 0.41, 0.46)
Correlation                              (0.90, 0.91, 0.89)        (0.90, 0.91, 0.89)        (0.90, 0.91, 0.89)
Average test information                 diag(2.39, 2.40, 2.40)    diag(2.38, 2.39, 2.39)    diag(2.39, 2.40, 2.40)
Overall pool usage                       28.47                     31.69                     66.65
Overlap rate                             0.18                      0.32                      0.20
% of overexposed items (r > 0.2)         16%                       34%                       10%
% of underexposed items (r < 0.02)       32%                       29%                       55%

Table 5.10: The performance of the .96- and .86-optimal pool and the baseline pool with exposure control (3-dimension simple structure, high correlation)

Statistics                               .96-optimal pool          .86-optimal pool          Baseline pool
Bias                                     (-0.01, 0.00, 0.01)       (0.00, 0.00, 0.01)        (0.00, 0.00, 0.01)
RMSE                                     (0.45, 0.43, 0.47)        (0.46, 0.44, 0.47)        (0.46, 0.43, 0.47)
Correlation                              (0.89, 0.90, 0.89)        (0.89, 0.90, 0.88)        (0.89, 0.90, 0.89)
Average test information                 diag(2.17, 2.18, 2.17)    diag(2.13, 2.15, 2.13)    diag(2.08, 2.03, 2.15)
Overall pool usage                       3.55                      1.59                      13.48
Overlap rate                             0.10                      0.13                      0.09
% of overexposed items (r > 0.2)         0%                        0%                        0%
% of underexposed items (r < 0.02)       0%                        0%                        26%

Table 5.9 also presents the results for item pool usage. Compared with the MCAT based on Test Specification 1 in Sections 5.2.1 and 5.2.2, similar conclusions can be drawn from Table 5.9. The item pool usage for the .96-optimal item pool is slightly better than that of the .86-optimal item pool, and the two p-optimal item pools are used much better than the baseline pool.

When item exposure control is implemented (see Table 5.10), similar results can be observed: the two p-optimal item pools provide ability estimation as accurate as the baseline pool and yield better item pool usage than the baseline pool. Compared with the condition without item exposure control, item exposure control results in only a 0.01 to 0.03 increase in the RMSE and about a 0.3 decrease in the average test information. For the item pool usage, when item exposure control is implemented, no item is overexposed, and the percentages of underexposed and overlapped items also decrease. The two p-optimal item pools have been fully used, with no item underexposed. The comparison between the conditions with and without item exposure control suggests that item exposure control can effectively improve the item pool usage and reduce the item exposure rate without obvious loss in the accuracy of ability estimation.

In addition to the overall performance, the conditional bias and RMSE at the 37 (theta_1, theta_2, theta_3) points are also calculated in this study to evaluate the ability estimation at each theta point. The three-dimensional bias and RMSE cannot be shown in a contour plot. The conditional bias for each theta point is presented in Tables 5.11 and 5.12, for the MCAT without and with item exposure control, respectively. In each table, the conditional bias is color coded by its value: negative bias is colored in blue, positive bias in red, and deeper color represents larger bias. The conditional RMSE is presented in Tables 5.13 and 5.14 in the same manner; small RMSE is colored in green and large RMSE in red.
Under the condition without item exposure control (see Table 5.11 for bias and 5.13 for RMSE), the conditional bias and RMSE for the .96-, .86-optimal item pool, and the baseline pool are quite similar. This finding supports the results for the overall bias and RMSE, and also 84 Table 5.11: Conditional Bias for the 𝜽𝜽 estimates without exposure control (3-dimension simple structure, high correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 .96 -3 -3 -3 0.48 0.52 0.56 0.41 0.44 0.50 0.53 0.51 0.56 -3 -3 -2 0.65 0.60 0.69 0.66 0.61 0.68 -0.06 -0.07 -0.05 -3 -2 -3 0.82 0.80 0.76 -0.13 -0.13 -0.19 0.81 0.79 0.74 -3 -2 -2 0.94 0.91 0.93 0.06 0.03 0.08 0.20 0.16 0.21 -2 -3 -3 -0.07 -0.02 -0.11 0.72 0.77 0.72 0.72 0.72 0.72 -2 -3 -2 0.14 0.02 0.05 0.96 0.86 0.89 0.12 0.08 0.10 -2 -2 -3 0.16 0.15 0.10 0.07 0.05 0.01 0.93 0.90 0.89 -2 -2 -2 0.29 0.30 0.28 0.23 0.27 0.26 0.26 0.33 0.27 -2 -2 -1 0.45 0.46 0.46 0.48 0.52 0.49 -0.23 -0.18 -0.24 -2 -1 -2 0.56 0.54 0.62 -0.36 -0.39 -0.27 0.52 0.54 0.60 -2 -1 -1 0.72 0.76 0.72 -0.12 -0.07 -0.12 -0.01 0.07 -0.01 -1 -2 -2 -0.19 -0.23 -0.24 0.58 0.56 0.49 0.50 0.49 0.45 -1 -2 -1 -0.09 -0.05 -0.09 0.73 0.79 0.75 -0.12 -0.03 -0.06 -1 0 -1 0.32 0.42 0.38 -0.58 -0.47 -0.51 0.29 0.36 0.31 -1 0 0 0.58 0.60 0.57 -0.25 -0.21 -0.25 -0.16 -0.12 -0.14 0 -1 -1 -0.38 -0.41 -0.40 0.42 0.37 0.42 0.32 0.30 0.32 0 -1 0 -0.25 -0.27 -0.33 0.61 0.62 0.55 -0.27 -0.20 -0.26 0 0 -1 -0.18 -0.16 -0.15 -0.26 -0.22 -0.22 0.52 0.55 0.50 0 0 0 0.07 0.01 -0.03 0.05 0.02 -0.03 0.03 0.03 -0.02 0 0 1 0.12 0.20 0.15 0.16 0.29 0.22 -0.58 -0.49 -0.52 0 1 0 0.24 0.27 0.26 -0.61 -0.59 -0.62 0.24 0.25 0.19 0 1 1 0.47 0.48 0.39 -0.32 -0.33 -0.38 -0.28 -0.29 -0.26 1 0 0 -0.56 -0.58 -0.51 0.26 0.26 0.30 0.16 0.16 0.17 1 0 1 -0.42 -0.38 -0.41 0.48 0.53 0.48 -0.35 -0.32 -0.35 1 2 1 0.09 0.08 0.10 -0.77 -0.75 -0.76 0.06 0.05 0.03 1 2 2 0.29 0.27 0.25 -0.52 -0.50 -0.54 -0.46 -0.47 -0.45 2 1 1 -0.73 -0.67 -0.73 0.08 0.14 0.11 -0.06 0.00 0.01 2 1 2 -0.54 -0.54 -0.56 0.34 0.30 0.33 -0.56 -0.62 -0.58 2 2 1 -0.46 -0.50 -0.46 -0.48 -0.51 -0.52 0.25 0.20 0.19 2 2 2 -0.32 -0.30 -0.30 -0.28 -0.27 -0.27 -0.32 -0.33 -0.33 2 2 3 -0.15 -0.13 -0.15 -0.05 -0.04 -0.07 -0.91 -0.87 -0.92 2 3 2 -0.11 -0.09 -0.07 -0.96 -0.92 -0.91 -0.17 -0.10 -0.08 2 3 3 0.04 0.05 0.11 -0.72 -0.72 -0.68 -0.69 -0.66 -0.67 3 2 2 -0.95 -0.91 -1.00 -0.08 -0.03 -0.11 -0.25 -0.20 -0.21 3 2 3 -0.72 -0.74 -0.75 0.25 0.18 0.20 -0.66 -0.72 -0.76 3 3 2 -0.64 -0.61 -0.68 -0.65 -0.62 -0.68 0.08 0.10 -0.03 3 3 3 -0.49 -0.57 -0.48 -0.40 -0.50 -0.43 -0.48 -0.57 -0.47 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 85 Table 5.12: Conditional Bias for the 𝜽𝜽 estimates with exposure control (3-dimension simple structure, high correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 .96 -3 -3 -3 0.65 0.79 0.71 0.61 0.72 0.64 0.69 0.80 0.69 -3 -3 -2 0.78 0.94 0.82 0.78 0.94 0.81 0.04 0.14 0.05 -3 -2 -3 0.84 1.03 0.82 -0.12 0.07 -0.12 0.86 1.02 0.87 -3 -2 -2 1.02 1.11 0.93 0.12 0.19 0.04 0.22 0.31 0.20 -2 -3 -3 -0.04 0.02 0.07 0.79 0.86 0.89 0.80 0.89 0.88 -2 -3 -2 0.13 0.24 0.16 1.00 1.14 1.03 0.14 0.28 0.19 -2 -2 -3 0.19 0.24 0.32 0.11 0.16 0.21 1.01 1.09 1.10 -2 -2 -2 0.35 0.45 0.35 0.32 0.40 0.33 0.36 0.45 0.38 -2 -2 -1 0.54 0.60 0.52 0.57 0.61 0.51 -0.21 -0.14 -0.24 -2 -1 -2 0.65 0.60 0.63 -0.29 -0.34 -0.28 0.60 0.59 0.64 -2 -1 -1 0.79 0.84 0.79 -0.10 -0.02 -0.06 -0.04 0.05 0.06 -1 -2 -2 -0.23 -0.21 -0.21 0.56 0.63 0.59 0.53 0.56 0.56 -1 -2 -1 -0.05 -0.12 -0.02 0.80 0.80 
0.85 -0.05 -0.02 0.03 -1 0 -1 0.47 0.34 0.44 -0.42 -0.54 -0.48 0.43 0.38 0.39 -1 0 0 0.64 0.64 0.65 -0.21 -0.18 -0.22 -0.15 -0.09 -0.14 0 -1 -1 -0.34 -0.46 -0.36 0.47 0.38 0.45 0.40 0.33 0.39 0 -1 0 -0.21 -0.22 -0.25 0.64 0.66 0.66 -0.21 -0.19 -0.18 0 0 -1 -0.09 -0.15 -0.20 -0.16 -0.21 -0.26 0.65 0.59 0.54 0 0 0 -0.07 -0.06 -0.08 -0.04 -0.06 -0.10 -0.02 -0.05 -0.08 0 0 1 0.17 0.12 0.24 0.22 0.16 0.32 -0.58 -0.60 -0.46 0 1 0 0.26 0.23 0.27 -0.62 -0.66 -0.61 0.21 0.21 0.23 0 1 1 0.46 0.47 0.37 -0.35 -0.36 -0.44 -0.31 -0.31 -0.36 1 0 0 -0.56 -0.64 -0.61 0.27 0.23 0.24 0.21 0.14 0.16 1 0 1 -0.45 -0.48 -0.42 0.44 0.43 0.48 -0.41 -0.45 -0.38 1 2 1 0.00 0.06 0.04 -0.85 -0.81 -0.84 0.00 0.04 0.01 1 2 2 0.24 0.26 0.20 -0.58 -0.57 -0.60 -0.58 -0.56 -0.55 2 1 1 -0.73 -0.86 -0.82 0.14 0.04 0.06 0.02 -0.01 -0.03 2 1 2 -0.68 -0.70 -0.67 0.26 0.25 0.25 -0.61 -0.68 -0.66 2 2 1 -0.54 -0.63 -0.54 -0.57 -0.67 -0.55 0.16 0.07 0.23 2 2 2 -0.35 -0.47 -0.35 -0.32 -0.44 -0.30 -0.38 -0.50 -0.36 2 2 3 -0.21 -0.32 -0.25 -0.13 -0.24 -0.17 -1.03 -1.13 -1.09 2 3 2 -0.18 -0.24 -0.16 -1.04 -1.11 -1.05 -0.19 -0.27 -0.15 2 3 3 0.01 -0.11 -0.04 -0.80 -0.92 -0.88 -0.82 -0.93 -0.90 3 2 2 -1.02 -1.12 -1.06 -0.10 -0.20 -0.15 -0.21 -0.31 -0.27 3 2 3 -0.91 -1.05 -0.92 0.06 -0.07 0.06 -0.90 -1.04 -0.92 3 3 2 -0.76 -0.94 -0.84 -0.74 -0.94 -0.82 0.05 -0.14 -0.05 3 3 3 -0.65 -0.84 -0.66 -0.60 -0.77 -0.62 -0.71 -0.84 -0.71 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 86 Table 5.13: Conditional RMSE for the 𝜽𝜽 estimates without exposure control (3-dimension simple structure, high correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 .96 .86 -3 -3 -3 0.58 0.63 0.68 0.52 0.57 0.62 0.63 0.63 0.66 -3 -3 -2 0.72 0.71 0.77 0.74 0.70 0.77 0.36 0.34 0.41 -3 -2 -3 0.89 0.91 0.83 0.39 0.43 0.37 0.89 0.89 0.80 -3 -2 -2 1.02 0.98 1.00 0.38 0.35 0.35 0.44 0.38 0.41 -2 -3 -3 0.42 0.37 0.38 0.82 0.85 0.79 0.79 0.81 0.79 -2 -3 -2 0.38 0.39 0.35 1.01 0.95 0.94 0.36 0.45 0.34 -2 -2 -3 0.39 0.40 0.38 0.33 0.33 0.35 0.99 0.96 0.96 -2 -2 -2 0.47 0.44 0.45 0.42 0.42 0.41 0.47 0.50 0.43 -2 -2 -1 0.56 0.58 0.55 0.57 0.63 0.57 0.40 0.40 0.41 -2 -1 -2 0.66 0.65 0.71 0.51 0.55 0.46 0.62 0.67 0.72 -2 -1 -1 0.80 0.81 0.82 0.39 0.31 0.41 0.38 0.33 0.41 -1 -2 -2 0.43 0.38 0.44 0.68 0.63 0.59 0.62 0.59 0.57 -1 -2 -1 0.38 0.37 0.35 0.81 0.87 0.82 0.40 0.32 0.37 -1 0 -1 0.46 0.56 0.51 0.67 0.59 0.63 0.47 0.52 0.48 -1 0 0 0.67 0.69 0.67 0.41 0.41 0.45 0.37 0.38 0.39 0 -1 -1 0.50 0.56 0.53 0.53 0.50 0.54 0.49 0.48 0.48 0 -1 0 0.42 0.43 0.48 0.69 0.72 0.64 0.44 0.42 0.44 0 0 -1 0.38 0.39 0.42 0.40 0.39 0.43 0.59 0.65 0.62 0 0 0 0.37 0.37 0.36 0.37 0.35 0.32 0.40 0.37 0.33 0 0 1 0.40 0.45 0.40 0.37 0.51 0.42 0.67 0.64 0.62 0 1 0 0.43 0.44 0.43 0.71 0.68 0.71 0.46 0.43 0.41 0 1 1 0.59 0.61 0.52 0.46 0.49 0.51 0.45 0.48 0.47 1 0 0 0.68 0.66 0.59 0.47 0.40 0.42 0.43 0.36 0.38 1 0 1 0.53 0.53 0.56 0.58 0.64 0.60 0.51 0.47 0.50 1 2 1 0.35 0.34 0.37 0.84 0.81 0.83 0.34 0.31 0.38 1 2 2 0.47 0.46 0.41 0.62 0.62 0.63 0.59 0.59 0.56 2 1 1 0.82 0.75 0.81 0.38 0.38 0.36 0.38 0.39 0.37 2 1 2 0.64 0.66 0.66 0.47 0.48 0.47 0.63 0.72 0.68 2 2 1 0.58 0.61 0.58 0.58 0.63 0.61 0.42 0.43 0.40 2 2 2 0.48 0.45 0.49 0.44 0.44 0.47 0.46 0.48 0.52 2 2 3 0.40 0.38 0.36 0.37 0.34 0.34 0.97 0.93 0.98 2 3 2 0.40 0.41 0.34 1.03 0.99 0.97 0.42 0.40 0.36 2 3 3 0.36 0.40 0.37 0.81 0.81 0.77 0.78 0.76 0.75 3 2 2 1.02 0.99 1.06 0.35 0.38 0.38 0.43 0.44 0.42 3 2 3 0.81 0.82 0.84 0.45 0.40 0.44 0.75 0.80 0.83 3 3 2 0.75 0.70 0.76 0.74 0.70 
0.76 0.36 0.37 0.41 3 3 3 0.61 0.72 0.60 0.55 0.64 0.54 0.63 0.68 0.58 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 87 Table 5.14: Conditional RMSE for the 𝜽𝜽 estimates with exposure control (3-dimension simple structure, high correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 .96 .86 -3 -3 -3 0.74 0.90 0.84 0.71 0.84 0.79 0.79 0.91 0.82 -3 -3 -2 0.88 1.03 0.92 0.88 1.02 0.92 0.42 0.43 0.45 -3 -2 -3 0.91 1.11 0.93 0.37 0.39 0.46 0.93 1.10 0.98 -3 -2 -2 1.10 1.18 1.01 0.45 0.44 0.40 0.46 0.51 0.44 -2 -3 -3 0.39 0.43 0.45 0.87 0.97 1.00 0.88 0.99 0.99 -2 -3 -2 0.41 0.47 0.43 1.07 1.20 1.09 0.40 0.47 0.42 -2 -2 -3 0.42 0.45 0.50 0.40 0.42 0.43 1.08 1.15 1.15 -2 -2 -2 0.51 0.57 0.55 0.49 0.55 0.53 0.52 0.58 0.55 -2 -2 -1 0.65 0.72 0.64 0.68 0.74 0.62 0.44 0.44 0.41 -2 -1 -2 0.76 0.73 0.76 0.49 0.55 0.50 0.72 0.74 0.75 -2 -1 -1 0.87 0.94 0.89 0.36 0.41 0.41 0.40 0.39 0.42 -1 -2 -2 0.44 0.43 0.41 0.69 0.73 0.68 0.67 0.68 0.64 -1 -2 -1 0.35 0.40 0.36 0.87 0.89 0.92 0.34 0.37 0.38 -1 0 -1 0.59 0.51 0.62 0.56 0.67 0.65 0.59 0.53 0.58 -1 0 0 0.72 0.72 0.76 0.39 0.37 0.44 0.36 0.35 0.43 0 -1 -1 0.49 0.58 0.51 0.57 0.52 0.58 0.53 0.48 0.54 0 -1 0 0.45 0.45 0.46 0.75 0.76 0.76 0.41 0.40 0.40 0 0 -1 0.37 0.42 0.40 0.37 0.43 0.42 0.76 0.73 0.66 0 0 0 0.35 0.37 0.44 0.34 0.36 0.43 0.36 0.37 0.42 0 0 1 0.42 0.37 0.49 0.41 0.39 0.51 0.70 0.68 0.61 0 1 0 0.44 0.48 0.51 0.71 0.77 0.75 0.44 0.46 0.47 0 1 1 0.60 0.58 0.52 0.53 0.49 0.57 0.48 0.47 0.52 1 0 0 0.67 0.73 0.71 0.45 0.43 0.42 0.45 0.41 0.40 1 0 1 0.56 0.61 0.60 0.54 0.57 0.64 0.52 0.57 0.56 1 2 1 0.35 0.44 0.38 0.92 0.91 0.93 0.36 0.39 0.38 1 2 2 0.42 0.51 0.44 0.68 0.70 0.71 0.69 0.69 0.68 2 1 1 0.81 0.94 0.90 0.39 0.37 0.36 0.35 0.34 0.37 2 1 2 0.81 0.83 0.77 0.49 0.49 0.46 0.74 0.80 0.75 2 2 1 0.66 0.73 0.69 0.68 0.76 0.71 0.40 0.36 0.49 2 2 2 0.52 0.65 0.51 0.50 0.63 0.46 0.54 0.67 0.52 2 2 3 0.41 0.48 0.43 0.39 0.43 0.40 1.10 1.20 1.16 2 3 2 0.42 0.46 0.42 1.10 1.18 1.11 0.42 0.48 0.41 2 3 3 0.39 0.47 0.36 0.89 1.04 0.95 0.91 1.05 0.97 3 2 2 1.08 1.20 1.14 0.37 0.46 0.44 0.42 0.54 0.50 3 2 3 1.00 1.18 0.99 0.38 0.53 0.39 0.96 1.17 1.00 3 3 2 0.88 1.04 0.96 0.85 1.05 0.96 0.42 0.48 0.48 3 3 3 0.73 0.94 0.81 0.67 0.87 0.77 0.76 0.94 0.84 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 88 suggests the p-optimal item pools can provide as accurate ability estimation as the baseline pool at each 𝜽𝜽 point. In general, larger bias and RMSE occurs when 𝜃𝜃1 , 𝜃𝜃2 , and 𝜃𝜃3 are very large or very small, which is the top and the bottom of each table. In addition to the value of 𝜽𝜽, the difference between two 𝜃𝜃′𝑠𝑠 also affects the estimation accuracy. More specifically, when 𝜃𝜃1 is around 0, and 𝜃𝜃2 and 𝜃𝜃3 are near 𝜃𝜃1 , the bias for 𝜃𝜃1 is close to 0 and the RMSE is less than 0.4. Negative bias and large RMSE appear when the value of 𝜃𝜃1 increases and the difference between 𝜃𝜃1 and 𝜃𝜃2 , and between 𝜃𝜃1 and 𝜃𝜃3 , increases. For example, at point (3, 2, 2) in the table, the bias for 𝜃𝜃1 is almost -1.0 and RMSE for 𝜃𝜃1 is around 1.0. Meanwhile, positive bias and large RMSE appear when the value of 𝜃𝜃1 decreases and the difference between 𝜃𝜃1 and 𝜃𝜃2 , and between 𝜃𝜃1 and 𝜃𝜃3 , increases. At point (-3, -2, -2), the bias for 𝜃𝜃1 is about 0.93 and RMSE for 𝜃𝜃1 is around 1.0. Similar results for 𝜃𝜃2 can be observed from the three columns in the middle of Table 5.11 and 5.13. 
When theta_2 is around 0 and theta_1 and theta_3 are near theta_2, the bias and RMSE for theta_2 are very small. When the value of theta_2 becomes more extreme and theta_1 and theta_3 are away from theta_2, large bias and RMSE values appear. Again, similar results can be found for theta_3 in the three columns on the right side of Tables 5.11 and 5.13.

As described in Section 5.2.1, this finding is probably due to the Bayesian MAP estimation method. The prior for the theta estimation is a multivariate normal distribution with a mean vector of (0, 0, 0) and a high correlation among theta_1, theta_2, and theta_3. The prior shrinks the ability estimates toward the middle and reduces the differences among the theta's. Under this condition, the overall test length is 30, so about 10 items are selected from each cluster. The effect of the likelihood function is relatively weak compared to the effect of the prior. If the test length were further increased, the effect of the likelihood function would eventually dominate the effect of the prior and therefore reduce the bias and RMSE in these extreme cases.

When item exposure control is implemented, similar findings can be observed from Tables 5.12 and 5.14. Again, there is nearly no difference between the two p-optimal item pools, or between the p-optimal item pools and the baseline pool. The results support the findings based on the overall bias and RMSE and further suggest that the three item pools perform similarly in terms of ability estimation at the 37 theta points. In addition, larger bias and RMSE occur when theta_1, theta_2, and theta_3 are very large or very small, and when the theta's are away from each other. A comparison between the conditions with and without item exposure control shows that, when item exposure control is built in, the magnitude of the bias and RMSE at some extreme points becomes larger. This increase in estimation error is due to the item exposure control. As explained in Section 5.2.1, because the exposure control prevents the most informative items from being selected too frequently, the information available for ability estimation decreases slightly. Thus, the measurement error at extreme theta points becomes larger if item exposure control is built into the item selection process.

In summary, this section presented the results for the MCAT with the test specification of three-dimension simple structure and with high correlation among theta_1, theta_2, and theta_3. In general, the p-optimal item pools perform similarly to the baseline pool in terms of both the overall and the conditional accuracy of ability estimation, but the p-optimal item pools save about 100 items and have better item pool usage. When item exposure control is implemented, the item exposure rate and item overlap rate can be controlled very well, and the p-optimal item pools can still provide reliable ability estimation with a relatively small pool size.

5.2.4 Performance for item pools based on Test Specification 2 (moderate correlation)

The results for the MCAT with the same test specification, but with theta_1, theta_2, and theta_3 moderately correlated, are presented in Tables 5.15 and 5.16.
The results in Table 5.15 are for the condition without item exposure control; Table 5.16 is for the condition with item exposure control. In both tables, there are three values for bias, RMSE, and correlation, representing the results for (theta_1, theta_2, theta_3).

Table 5.15: The performance of the .96- and .86-optimal pool and the baseline pool without exposure control (3-dimension simple structure, moderate correlation)

Statistics                               .96-optimal pool          .86-optimal pool          Baseline pool
Bias                                     (0.01, 0.00, 0.00)        (0.01, 0.01, 0.00)        (0.00, 0.00, 0.00)
RMSE                                     (0.52, 0.49, 0.49)        (0.51, 0.49, 0.49)        (0.52, 0.49, 0.49)
Correlation                              (0.86, 0.87, 0.87)        (0.86, 0.87, 0.87)        (0.86, 0.87, 0.87)
Average test information                 diag(2.38, 2.39, 2.39)    diag(2.38, 2.38, 2.38)    diag(2.38, 2.38, 2.39)
Overall pool usage                       29.38                     31.95                     67.46
Overlap rate                             0.18                      0.33                      0.20
% of overexposed items (r > 0.2)         15%                       35%                       8%
% of underexposed items (r < 0.02)       31%                       29%                       53%

Table 5.16: The performance of the .96- and .86-optimal pool and the baseline pool with exposure control (3-dimension simple structure, moderate correlation)

Statistics                               .96-optimal pool          .86-optimal pool          Baseline pool
Bias                                     (0.01, -0.01, 0.00)       (0.00, 0.00, 0.00)        (0.00, 0.00, 0.00)
RMSE                                     (0.54, 0.51, 0.50)        (0.54, 0.51, 0.50)        (0.55, 0.52, 0.50)
Correlation                              (0.85, 0.86, 0.86)        (0.84, 0.86, 0.86)        (0.84, 0.85, 0.86)
Average test information                 diag(2.14, 2.16, 2.17)    diag(2.12, 2.15, 2.15)    diag(2.07, 2.00, 2.15)
Overall pool usage                       1.50                      0.47                      9.82
Overlap rate                             0.10                      0.13                      0.08
% of overexposed items (r > 0.2)         0%                        0%                        0%
% of underexposed items (r < 0.02)       0%                        0%                        21%

Under the condition without item exposure control (see Table 5.15), the p-optimal item pools and the baseline pool show nearly no bias in the theta estimates. The RMSEs are all around 0.50, and the correlations between the estimated and true theta are around 0.87. The average test information is also very similar among the three item pools. The amount of information in the direction of each theta is around 2.38. In general, the results suggest that the .96- and .86-optimal item pools provide accurate estimation of theta, and the level of accuracy is the same as for the baseline pool.

Table 5.15 also presents the results for item pool usage. The overall pool usage index for the .96-optimal item pool is slightly smaller than that of the .86-optimal item pool, and the index for the baseline pool is more than twice that of the .96- and .86-optimal item pools. The results suggest that the .96-optimal item pool is used slightly better than the .86-optimal item pool, and the two optimal item pools are used much better than the baseline pool. More specifically, for the .96-optimal item pool, the overlap rate is 0.18, and the percentages of overexposed and underexposed items are 15% and 31%, respectively. For the .86-optimal item pool, the results are: 33% of items overlap, 35% are overexposed, and 29% are underexposed. Because more items from the .86-optimal item pool are overlapped and overexposed, the .86-optimal item pool is less secure than the .96-optimal item pool. The overlap rate for the baseline pool is 0.20, which is slightly higher than that of the .96-optimal item pool and lower than that of the .86-optimal item pool. Although a smaller percentage of items (8%) from the baseline pool are overexposed, more than half of the items (53%) are rarely used. This implies that many items in the baseline pool are wasted.
In brief, based on these pool usage results, the item pool usage for the .96- and .86-optimal item pools is much better than for the baseline pool.

When item exposure control is implemented (see Table 5.16), similar results can be observed: the two p-optimal item pools provide ability estimation as accurate as the baseline pool and yield better item pool usage than the baseline pool. Compared with the condition without item exposure control, item exposure control results in only a 0.01 to 0.03 increase in the RMSE, about a 0.01 to 0.02 decrease in the correlation, and about a 0.3 decrease in the average test information. For the item pool usage, when item exposure control is implemented, no item is overexposed, and the percentages of underexposed and overlapped items also decrease. The .96- and .86-optimal item pools have been fully used, with no item underexposed. The overall pool usage indices for the .96-optimal item pool, the .86-optimal item pool, and the baseline pool are 1.50, 0.47, and 9.82, respectively. These values are much smaller than under the condition without item exposure control. Thus, item exposure control can effectively improve the item pool usage and reduce the item exposure rate without obvious loss in the accuracy of ability estimation.

In addition to the overall pool performance, the conditional bias and RMSE at the 37 (theta_1, theta_2, theta_3) points are also reported. The conditional bias for each theta point is presented in Tables 5.17 and 5.18, for the MCAT without and with exposure control, respectively. Negative bias is colored in blue and positive bias in red, with deeper color representing larger bias. The conditional RMSE is presented in Tables 5.19 and 5.20 in the same manner; small RMSE is colored in green and large RMSE in red.

Under the condition without item exposure control (see Table 5.17 for bias and Table 5.19 for RMSE), the conditional bias and RMSE for the .96-optimal item pool, the .86-optimal item pool, and the baseline pool are quite similar. This finding supports the results for the overall bias and RMSE and also suggests that the p-optimal item pools can provide ability estimation as accurate as the baseline pool at each theta point. Similar to the condition in which the dimensions are highly correlated, larger bias and RMSE occur when theta_1, theta_2, and theta_3 are very large or very small. The difference between two theta's also affects the estimation accuracy.
More specifically, when 𝜃𝜃1 is around 0, and 𝜃𝜃2 and 𝜃𝜃3 are 93 Table 5.17: Conditional Bias for the 𝜽𝜽 estimates without exposure control (3-dimension simple structure, moderate correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 .96 -3 -3 -3 0.61 0.60 0.68 0.48 0.59 0.67 0.48 0.55 0.57 -3 -3 -2 0.70 0.78 0.74 0.70 0.75 0.74 0.01 0.03 0.05 -3 -2 -3 0.78 0.71 0.70 0.11 0.10 0.09 0.77 0.70 0.73 -3 -2 -2 0.89 0.83 0.84 0.29 0.22 0.26 0.20 0.17 0.20 -2 -3 -3 0.15 0.13 0.15 0.65 0.66 0.69 0.64 0.70 0.69 -2 -3 -2 0.33 0.26 0.28 0.86 0.80 0.84 0.20 0.19 0.18 -2 -2 -3 0.20 0.23 0.18 0.18 0.28 0.11 0.83 0.83 0.77 -2 -2 -2 0.42 0.32 0.33 0.42 0.30 0.29 0.40 0.29 0.27 -2 -2 -1 0.48 0.54 0.44 0.44 0.56 0.50 -0.16 -0.13 -0.16 -2 -1 -2 0.48 0.50 0.49 -0.19 -0.10 -0.14 0.45 0.46 0.52 -2 -1 -1 0.60 0.60 0.70 -0.04 0.08 0.09 -0.02 0.00 0.03 -1 -2 -2 -0.06 -0.06 0.03 0.41 0.37 0.51 0.44 0.35 0.52 -1 -2 -1 0.03 0.09 0.00 0.57 0.62 0.54 -0.02 -0.04 -0.10 -1 0 -1 0.31 0.27 0.26 -0.27 -0.24 -0.39 0.36 0.30 0.31 -1 0 0 0.44 0.39 0.45 -0.09 -0.14 -0.08 -0.06 -0.11 -0.03 0 -1 -1 -0.22 -0.28 -0.24 0.31 0.24 0.21 0.33 0.23 0.24 0 -1 0 -0.06 -0.04 -0.09 0.45 0.49 0.45 -0.24 -0.13 -0.13 0 0 -1 -0.16 -0.08 -0.14 -0.19 -0.14 -0.11 0.46 0.48 0.47 0 0 0 0.06 -0.06 0.01 0.09 -0.04 -0.04 0.09 -0.02 -0.01 0 0 1 0.15 0.10 0.09 0.18 0.15 0.16 -0.47 -0.48 -0.45 0 1 0 0.10 0.12 0.16 -0.46 -0.43 -0.37 0.14 0.13 0.18 0 1 1 0.13 0.25 0.21 -0.29 -0.29 -0.28 -0.36 -0.29 -0.27 1 0 0 -0.39 -0.40 -0.40 0.09 0.09 0.10 0.09 0.07 0.15 1 0 1 -0.27 -0.22 -0.32 0.25 0.29 0.26 -0.38 -0.28 -0.37 1 2 1 -0.11 -0.04 -0.04 -0.68 -0.61 -0.68 -0.08 -0.01 -0.03 1 2 2 0.04 0.00 0.06 -0.47 -0.45 -0.47 -0.48 -0.47 -0.49 2 1 1 -0.62 -0.67 -0.60 -0.06 -0.04 -0.05 -0.01 -0.07 -0.01 2 1 2 -0.45 -0.44 -0.41 0.11 0.09 0.13 -0.45 -0.49 -0.49 2 2 1 -0.47 -0.52 -0.54 -0.48 -0.52 -0.49 0.17 0.10 0.16 2 2 2 -0.30 -0.31 -0.38 -0.38 -0.35 -0.38 -0.32 -0.35 -0.31 2 2 3 -0.21 -0.30 -0.32 -0.13 -0.24 -0.20 -0.80 -0.86 -0.83 2 3 2 -0.33 -0.30 -0.27 -0.86 -0.81 -0.87 -0.20 -0.16 -0.16 2 3 3 -0.11 -0.14 -0.17 -0.62 -0.63 -0.62 -0.64 -0.71 -0.61 3 2 2 -0.91 -0.87 -0.91 -0.24 -0.28 -0.22 -0.22 -0.24 -0.23 3 2 3 -0.76 -0.68 -0.75 -0.16 -0.05 -0.11 -0.77 -0.72 -0.74 3 3 2 -0.76 -0.73 -0.71 -0.75 -0.69 -0.76 -0.02 -0.03 -0.11 3 3 3 -0.65 -0.57 -0.56 -0.56 -0.53 -0.61 -0.56 -0.52 -0.56 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 94 Table 5.18: Conditional Bias for the 𝜽𝜽 estimates with exposure control (3-dimension simple structure, moderate correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 .96 -3 -3 -3 0.81 0.90 0.73 0.69 0.82 0.69 0.69 0.84 0.66 -3 -3 -2 0.98 1.05 0.97 0.94 1.02 0.94 0.17 0.17 0.20 -3 -2 -3 0.92 0.99 0.89 0.22 0.20 0.17 0.94 0.99 0.81 -3 -2 -2 1.06 1.19 1.00 0.36 0.44 0.30 0.29 0.39 0.24 -2 -3 -3 0.28 0.24 0.18 0.88 0.95 0.86 0.86 0.90 0.82 -2 -3 -2 0.35 0.40 0.37 1.01 1.09 1.02 0.22 0.33 0.21 -2 -2 -3 0.43 0.38 0.31 0.32 0.30 0.32 1.04 1.09 0.93 -2 -2 -2 0.43 0.48 0.44 0.37 0.43 0.45 0.35 0.43 0.39 -2 -2 -1 0.58 0.54 0.58 0.57 0.56 0.55 -0.10 -0.19 -0.14 -2 -1 -2 0.61 0.54 0.57 -0.05 -0.15 -0.13 0.55 0.53 0.51 -2 -1 -1 0.69 0.78 0.67 0.05 0.17 0.09 0.08 0.11 0.09 -1 -2 -2 -0.08 0.00 -0.02 0.51 0.57 0.51 0.53 0.60 0.52 -1 -2 -1 0.14 0.15 0.11 0.74 0.79 0.75 0.05 0.08 0.01 -1 0 -1 0.41 0.32 0.38 -0.20 -0.30 -0.33 0.40 0.36 0.34 -1 0 0 0.42 0.45 0.40 -0.12 -0.13 -0.16 -0.11 -0.12 -0.15 0 -1 -1 -0.25 -0.27 -0.20 0.29 0.29 0.35 0.31 0.30 0.34 0 -1 0 -0.10 -0.15 
-0.02 0.50 0.49 0.55 -0.14 -0.18 -0.13 0 0 -1 -0.17 -0.10 -0.16 -0.15 -0.13 -0.22 0.49 0.51 0.49 0 0 0 -0.02 -0.07 0.05 0.07 0.00 0.05 0.00 0.02 0.08 0 0 1 0.22 0.14 0.15 0.29 0.16 0.22 -0.44 -0.47 -0.42 0 1 0 0.13 0.21 0.10 -0.46 -0.47 -0.51 0.22 0.16 0.15 0 1 1 0.28 0.25 0.21 -0.31 -0.29 -0.36 -0.34 -0.30 -0.34 1 0 0 -0.40 -0.43 -0.49 0.14 0.16 0.13 0.17 0.18 0.11 1 0 1 -0.28 -0.37 -0.32 0.34 0.25 0.28 -0.33 -0.40 -0.39 1 2 1 -0.09 -0.17 -0.10 -0.78 -0.75 -0.74 -0.03 -0.06 -0.09 1 2 2 -0.03 0.00 0.03 -0.58 -0.60 -0.55 -0.54 -0.62 -0.54 2 1 1 -0.65 -0.73 -0.72 -0.07 -0.13 -0.08 -0.03 -0.10 -0.03 2 1 2 -0.60 -0.64 -0.57 0.06 0.07 0.05 -0.63 -0.66 -0.64 2 2 1 -0.63 -0.56 -0.65 -0.67 -0.58 -0.60 0.10 0.14 0.07 2 2 2 -0.53 -0.51 -0.44 -0.39 -0.45 -0.46 -0.38 -0.48 -0.47 2 2 3 -0.32 -0.40 -0.40 -0.28 -0.37 -0.31 -1.04 -1.11 -1.07 2 3 2 -0.32 -0.43 -0.35 -1.01 -1.14 -1.06 -0.26 -0.39 -0.31 2 3 3 -0.19 -0.32 -0.34 -0.84 -1.04 -0.89 -0.84 -1.00 -0.89 3 2 2 -1.05 -1.18 -1.07 -0.34 -0.40 -0.32 -0.28 -0.34 -0.33 3 2 3 -0.91 -0.99 -0.92 -0.13 -0.24 -0.18 -0.88 -1.00 -0.92 3 3 2 -0.92 -1.05 -0.98 -0.95 -1.04 -0.95 -0.19 -0.25 -0.17 3 3 3 -0.79 -0.99 -0.74 -0.69 -0.90 -0.70 -0.71 -0.87 -0.68 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 95 Table 5.19: Conditional RMSE for the 𝜽𝜽 estimates without exposure control (3-dimension simple structure, moderate correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .96 .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 -3 -3 -3 0.74 0.74 0.78 0.62 0.73 0.77 0.60 0.70 0.72 -3 -3 -2 0.81 0.87 0.85 0.80 0.86 0.84 0.45 0.39 0.41 -3 -2 -3 0.88 0.80 0.82 0.45 0.41 0.39 0.85 0.79 0.82 -3 -2 -2 0.97 0.91 0.93 0.47 0.49 0.46 0.44 0.44 0.44 -2 -3 -3 0.41 0.41 0.43 0.77 0.76 0.80 0.74 0.79 0.82 -2 -3 -2 0.53 0.49 0.50 0.96 0.88 0.91 0.43 0.45 0.45 -2 -2 -3 0.44 0.43 0.43 0.50 0.49 0.41 0.93 0.91 0.85 -2 -2 -2 0.63 0.51 0.51 0.58 0.49 0.47 0.55 0.51 0.49 -2 -2 -1 0.63 0.66 0.60 0.57 0.66 0.63 0.43 0.41 0.41 -2 -1 -2 0.62 0.62 0.61 0.46 0.40 0.41 0.61 0.59 0.68 -2 -1 -1 0.72 0.72 0.81 0.40 0.38 0.38 0.41 0.41 0.40 -1 -2 -2 0.42 0.41 0.37 0.55 0.53 0.65 0.58 0.48 0.65 -1 -2 -1 0.44 0.40 0.42 0.67 0.72 0.69 0.40 0.38 0.40 -1 0 -1 0.48 0.44 0.47 0.47 0.45 0.57 0.52 0.49 0.51 -1 0 0 0.61 0.54 0.65 0.40 0.46 0.41 0.35 0.40 0.40 0 -1 -1 0.44 0.48 0.49 0.45 0.43 0.41 0.50 0.42 0.45 0 -1 0 0.44 0.36 0.43 0.60 0.61 0.59 0.50 0.36 0.35 0 0 -1 0.43 0.40 0.42 0.47 0.36 0.43 0.59 0.65 0.61 0 0 0 0.41 0.41 0.42 0.37 0.37 0.39 0.41 0.36 0.41 0 0 1 0.45 0.39 0.44 0.43 0.38 0.43 0.63 0.60 0.60 0 1 0 0.36 0.38 0.40 0.62 0.59 0.51 0.37 0.46 0.41 0 1 1 0.41 0.47 0.45 0.50 0.51 0.45 0.52 0.49 0.46 1 0 0 0.57 0.55 0.57 0.46 0.36 0.43 0.43 0.38 0.48 1 0 1 0.48 0.46 0.47 0.48 0.45 0.49 0.58 0.47 0.56 1 2 1 0.39 0.41 0.41 0.79 0.72 0.77 0.39 0.41 0.41 1 2 2 0.37 0.46 0.42 0.60 0.59 0.61 0.60 0.61 0.63 2 1 1 0.73 0.79 0.71 0.39 0.41 0.40 0.38 0.41 0.35 2 1 2 0.58 0.64 0.60 0.44 0.42 0.42 0.58 0.66 0.63 2 2 1 0.63 0.68 0.63 0.62 0.63 0.63 0.48 0.37 0.37 2 2 2 0.52 0.49 0.53 0.55 0.50 0.55 0.48 0.54 0.50 2 2 3 0.39 0.50 0.50 0.37 0.45 0.42 0.88 0.93 0.91 2 3 2 0.47 0.49 0.50 0.92 0.87 0.95 0.47 0.39 0.46 2 3 3 0.47 0.38 0.46 0.73 0.76 0.76 0.74 0.81 0.74 3 2 2 1.02 0.96 1.00 0.47 0.49 0.48 0.47 0.47 0.49 3 2 3 0.87 0.79 0.85 0.42 0.38 0.42 0.89 0.82 0.82 3 3 2 0.84 0.82 0.81 0.84 0.78 0.84 0.38 0.40 0.43 3 3 3 0.78 0.68 0.71 0.66 0.63 0.70 0.69 0.63 0.66 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 96 Table 5.20: Conditional 
RMSE for the 𝜽𝜽 estimates with exposure control (3-dimension simple structure, moderate correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .96 .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 -3 -3 -3 0.90 0.97 0.84 0.80 0.92 0.80 0.78 0.92 0.77 -3 -3 -2 1.07 1.11 1.07 1.02 1.11 1.02 0.47 0.47 0.44 -3 -2 -3 1.01 1.06 0.99 0.51 0.42 0.40 1.05 1.05 0.90 -3 -2 -2 1.14 1.26 1.09 0.51 0.58 0.49 0.50 0.54 0.45 -2 -3 -3 0.50 0.49 0.54 0.96 1.03 1.00 0.95 0.98 0.95 -2 -3 -2 0.55 0.60 0.56 1.09 1.17 1.12 0.43 0.55 0.48 -2 -2 -3 0.61 0.57 0.49 0.54 0.48 0.52 1.11 1.16 1.01 -2 -2 -2 0.58 0.63 0.61 0.53 0.60 0.64 0.52 0.59 0.57 -2 -2 -1 0.71 0.67 0.72 0.73 0.69 0.68 0.45 0.48 0.39 -2 -1 -2 0.76 0.67 0.71 0.43 0.46 0.41 0.71 0.68 0.64 -2 -1 -1 0.82 0.86 0.81 0.40 0.45 0.43 0.44 0.41 0.45 -1 -2 -2 0.46 0.42 0.44 0.64 0.69 0.66 0.70 0.74 0.66 -1 -2 -1 0.41 0.45 0.41 0.86 0.89 0.85 0.42 0.38 0.39 -1 0 -1 0.57 0.53 0.56 0.48 0.54 0.54 0.58 0.55 0.56 -1 0 0 0.58 0.61 0.55 0.43 0.43 0.43 0.43 0.39 0.39 0 -1 -1 0.49 0.46 0.46 0.48 0.49 0.57 0.52 0.48 0.57 0 -1 0 0.40 0.46 0.43 0.65 0.63 0.69 0.44 0.45 0.44 0 0 -1 0.42 0.42 0.46 0.46 0.45 0.45 0.67 0.68 0.65 0 0 0 0.41 0.42 0.44 0.44 0.37 0.38 0.44 0.40 0.39 0 0 1 0.45 0.44 0.45 0.52 0.42 0.52 0.61 0.65 0.60 0 1 0 0.45 0.47 0.43 0.62 0.62 0.66 0.47 0.42 0.43 0 1 1 0.49 0.46 0.48 0.49 0.47 0.56 0.52 0.46 0.55 1 0 0 0.64 0.60 0.70 0.44 0.44 0.45 0.45 0.50 0.45 1 0 1 0.52 0.55 0.54 0.52 0.48 0.53 0.53 0.56 0.61 1 2 1 0.44 0.46 0.43 0.89 0.88 0.87 0.40 0.40 0.44 1 2 2 0.47 0.47 0.43 0.73 0.75 0.68 0.69 0.74 0.69 2 1 1 0.76 0.88 0.84 0.39 0.47 0.42 0.38 0.49 0.36 2 1 2 0.73 0.82 0.71 0.40 0.47 0.42 0.75 0.80 0.75 2 2 1 0.76 0.72 0.76 0.77 0.73 0.71 0.40 0.43 0.45 2 2 2 0.64 0.66 0.66 0.56 0.65 0.64 0.52 0.62 0.63 2 2 3 0.55 0.64 0.58 0.50 0.62 0.50 1.12 1.19 1.14 2 3 2 0.53 0.64 0.54 1.09 1.23 1.16 0.50 0.58 0.56 2 3 3 0.48 0.60 0.54 0.93 1.17 0.98 0.93 1.14 0.99 3 2 2 1.13 1.27 1.15 0.52 0.58 0.56 0.45 0.56 0.56 3 2 3 1.00 1.10 1.03 0.42 0.52 0.44 0.96 1.09 1.00 3 3 2 1.01 1.16 1.08 1.04 1.14 1.07 0.47 0.55 0.47 3 3 3 0.89 1.11 0.88 0.79 1.04 0.80 0.84 1.01 0.82 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 97 near 𝜃𝜃1 , the bias for 𝜃𝜃1 is close to 0 and the RMSE is around 0.4. Negative bias and large RMSE appear when the value of 𝜃𝜃1 increases and the difference between 𝜃𝜃1 and 𝜃𝜃2 , and between 𝜃𝜃1 and 𝜃𝜃3 , increases. For example, at point (3, 2, 2) in the table, the bias for 𝜃𝜃1 is around -0.74 and RMSE for 𝜃𝜃1 is around 0.79. Meanwhile, positive bias and large RMSE appear when the value of 𝜃𝜃1 decreases and the difference between 𝜃𝜃1 and 𝜃𝜃2 , and between 𝜃𝜃1 and 𝜃𝜃3 , increases. At point (-3, -2, -2), the bias for 𝜃𝜃1 is about 0.74 and RMSE for 𝜃𝜃1 is around 0.79. Similar results for 𝜃𝜃2 can be observed from the three columns in the middle of Table 5.17 and 5.19. When 𝜃𝜃2 is around 0, and 𝜃𝜃1 and 𝜃𝜃3 is near 𝜃𝜃2 , the bias and RMSE for 𝜃𝜃2 is very small. When the value of 𝜃𝜃2 becomes more extreme and 𝜃𝜃1 and 𝜃𝜃3 is away from 𝜃𝜃2 , large bias and RMSE values appear. Again, similar results can be found for 𝜃𝜃3 from the three columns on the right side of Table 5.17 and 5.19. As described in Section 5.2.1, this finding is probably due to the Bayesian MAP estimation method. 
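To see why MAP estimation with a correlated prior produces this inward bias at extreme or discrepant θ points, consider a minimal sketch of the estimator, assuming a compensatory multidimensional Rasch likelihood and a zero-mean multivariate normal prior; the function and variable names are illustrative, not the code used in this study. With a high prior correlation, the prior pulls the estimate toward its mean and toward equality across dimensions.

import numpy as np
from scipy.optimize import minimize

def map_estimate(responses, A, d, prior_cov):
    """responses: 0/1 vector; A: items x dims discrimination matrix;
    d: item intercepts; prior_cov: prior covariance of theta."""
    prior_prec = np.linalg.inv(prior_cov)

    def neg_log_post(theta):
        z = A @ theta + d
        p = 1.0 / (1.0 + np.exp(-z))
        loglik = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
        logprior = -0.5 * theta @ prior_prec @ theta   # zero-mean MVN prior
        return -(loglik + logprior)

    return minimize(neg_log_post, np.zeros(A.shape[1]), method="BFGS").x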
Comparing the conditional bias and RMSE in this section with the results in Section 5.2.3, where θ1, θ2, and θ3 are highly correlated, shows that the pattern of the tables is the same, but the magnitude of the bias and RMSE at the extreme and discrepant points is smaller in this section. When the correlation among θ1, θ2, and θ3 decreases, the prior only weakly reduces the differences among θ1, θ2, and θ3. Therefore, the bias and RMSE values are slightly smaller at those points where θ1 and θ2 are far from each other, compared with the condition in which θ1, θ2, and θ3 are highly correlated.

When item exposure control is implemented, similar findings can be observed from Tables 5.18 and 5.20. Again, there is nearly no difference between the two p-optimal item pools, or between the p-optimal item pools and the baseline pool. The results suggest that the three item pools perform similarly in terms of the ability estimation at the 37 θ points. In addition, larger bias and RMSE also occur when θ1, θ2, and θ3 are very large or very small, and when the θ's are far from each other. Similar to the results under the high correlation condition in Section 5.2.3, when item exposure control is built in, the magnitude of the bias and RMSE at some extreme points becomes larger.

In summary, this section presents the results for the MCAT with the test specification of three-dimension simple structure and with moderate correlation among θ1, θ2, and θ3. The p-optimal item pools perform similarly to the baseline pool in terms of the accuracy of ability estimation, but the p-optimal item pools save over 140 items and have better item pool usage. When item exposure control is implemented, the p-optimal item pools can still provide accurate ability estimation while the item exposure rate and item overlap rate are well controlled. In general, the findings from this section are similar to those of the previous section, in which θ1, θ2, and θ3 are highly correlated. A close comparison between these two sections reveals that the measurement error in this section is slightly larger. This result is due to the correlation among θ1, θ2, and θ3. As explained in Section 5.2.2, the MIRT model estimates all the θ's simultaneously by borrowing information from one dimension to another. When θ1, θ2, and θ3 are highly correlated, more information can be borrowed for ability estimation. When the correlation decreases, the amount of information that can be borrowed is reduced, and therefore the RMSE increases. In addition to the accuracy of ability estimation, the pool usage for the two p-optimal item pools in this section is also slightly better. This is probably because of the pool size: when the correlation decreases, the pool size decreases as well, and a smaller item pool is more likely to be fully used.

5.2.5 Performance for item pools based on Test Specification 3 (high correlation)

The results of the ability estimates and item pool utilization for the .96-optimal item pool, the .86-optimal item pool, and the baseline pool, based on the three-dimension non-simple structure test specification with θ1, θ2, and θ3 highly correlated, are presented in Tables 5.21 and 5.22. The results in Table 5.21 are under the condition that no item exposure control is implemented; Table 5.22 is for the condition with item exposure control. In both tables, there are three values for bias, RMSE, and correlation, representing the results for (θ1, θ2, θ3).
Under the condition without item exposure control (see Table 5.21), the p-optimal item pools and the baseline pool show no bias in the θ estimates. Also, the RMSEs are between 0.31 and 0.37, and the correlations between the estimated θ and true θ are around 0.94. The average test information for the .96-optimal item pool and the baseline pool is very similar, but the information for the .86-optimal item pool is slightly smaller. The amount of information in the direction of θ1, θ2, and θ3 (i.e., the values on the diagonal) is about 3.50 for the .96-optimal item pool and the baseline pool, and about 3.39 for the .86-optimal item pool. These values are over one unit higher than the values under the three-dimension simple structure case. The additional information comes from the items in Cluster 4 with a = (1, 1, 1). Because these items measure all the θ's, they provide information in the direction of θ1, θ2, and θ3, as well as in the direction of the diagonal of the three-dimensional space (see Figure 5.1). For this reason, the off-diagonal values in the information matrix are no longer zero. The off-diagonal values represent the amount of information in the direction of the θ1-θ2 composite, the θ1-θ3 composite, and the θ2-θ3 composite. In general, the results suggest that the .96- and .86-optimal item pools can provide accurate estimation of θ, and the level of accuracy is the same as that of the baseline pool, although the average test information for the .86-optimal item pool is slightly smaller than that of the other two.

Table 5.21 also presents the results for item pool usage. Compared with the MCATs based on Test Specifications 1 and 2 in Sections 5.2.1 to 5.2.4, similar results can be drawn from Table 5.21. The item pool usage for the .96-optimal item pool is slightly better than that of the .86-optimal item pool, and the two p-optimal item pools are much better used than the baseline pool.

Table 5.21: The performance of the .96- and .86-optimal pool and the baseline pool without exposure control (3-dimension non-simple structure, high correlation)

Statistics                            .96-optimal pool       .86-optimal pool       Baseline pool
Bias                                  (0.00, 0.00, 0.00)     (0.00, 0.00, -0.01)    (-0.01, 0.00, 0.00)
RMSE                                  (0.35, 0.31, 0.37)     (0.35, 0.31, 0.37)     (0.35, 0.31, 0.37)
Correlation                           (0.94, 0.95, 0.93)     (0.94, 0.95, 0.93)     (0.94, 0.95, 0.93)
Average test information
    .96-optimal pool:  [3.50 1.75 1.75; 1.75 3.48 1.75; 1.75 1.75 3.50]
    .86-optimal pool:  [3.40 1.61 1.61; 1.61 3.37 1.61; 1.61 1.61 3.39]
    Baseline pool:     [3.52 1.77 1.77; 1.77 3.49 1.77; 1.77 1.77 3.52]
Overall pool usage                    28.06                  28.69                  35.26
Overlap rate                          0.14                   0.25                   0.12
% of overexposed items (r > 0.2)      3%                     30%                    1%
% of underexposed items (r < 0.02)    33%                    31%                    43%

Table 5.22: The performance of the .96- and .86-optimal pool and the baseline pool with exposure control (3-dimension non-simple structure, high correlation)

Statistics                            .96-optimal pool       .86-optimal pool       Baseline pool
Bias                                  (0.00, 0.00, 0.00)     (0.00, 0.00, 0.00)     (0.00, 0.00, 0.00)
RMSE                                  (0.37, 0.33, 0.39)     (0.37, 0.33, 0.39)     (0.36, 0.32, 0.38)
Correlation                           (0.93, 0.94, 0.92)     (0.93, 0.94, 0.92)     (0.93, 0.95, 0.93)
Average test information
    .96-optimal pool:  [2.94 1.30 1.30; 1.30 2.94 1.30; 1.30 1.30 2.94]
    .86-optimal pool:  [2.89 1.30 1.30; 1.30 2.89 1.30; 1.30 1.30 2.89]
    Baseline pool:     [3.18 1.61 1.61; 1.61 3.13 1.61; 1.61 1.61 3.24]
Overall pool usage                    3.64                   1.65                   8.26
Overlap rate                          0.08                   0.12                   0.07
% of overexposed items (r > 0.2)      0%                     0%                     0%
% of underexposed items (r < 0.02)    1%                     0%                     17%
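The structure of the average test information matrices in Tables 5.21 and 5.22 follows from how each item contributes information. Under the compensatory multidimensional Rasch form used for the items in this study, P = 1 / (1 + exp(-(a'θ + d))), a single item's Fisher information contribution is P(1 - P) a a'. The sketch below (illustrative names, not the study's code) shows that an item with a = (1, 0, 0) adds information only to the first diagonal entry, whereas an item with a = (1, 1, 1) also fills the off-diagonal entries.

import numpy as np

def item_information(a, d, theta):
    """Fisher information matrix contributed by one item at theta."""
    a = np.asarray(a, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(a @ np.asarray(theta, dtype=float) + d)))
    return p * (1.0 - p) * np.outer(a, a)

theta = np.zeros(3)
print(item_information([1, 0, 0], 0.0, theta))  # nonzero only in entry [0, 0]
print(item_information([1, 1, 1], 0.0, theta))  # nonzero everywhere, incl. off-diagonals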
When item exposure control is implemented (see Table 5.22), similar results can be observed: the two p-optimal item pools provide as accurate ability estimates as the baseline pool, and they yield better item pool usage than the baseline pool. Compared with the condition without item exposure control, item exposure control only results in a 0.01 to 0.02 increase in RMSE, a 0.01 decrease in correlation, and about a 0.5 decrease in average test information. For item pool usage, when item exposure control is implemented, no item is overexposed, and the percentages of underexposed and overlapped items also decrease. The two p-optimal item pools have been fully used: no item from the .86-optimal item pool is underexposed, and only 1% of the items from the .96-optimal item pool are underexposed. The comparison between the conditions with and without item exposure control suggests that item exposure control can effectively increase the item pool usage and reduce the item exposure rate without an obvious loss in the accuracy of ability estimation.

In addition to the overall pool performance, the conditional bias and RMSE at the 37 (θ1, θ2, θ3) points are also calculated. The conditional bias for each θ point is presented in Tables 5.23 and 5.24, for the MCAT without and with item exposure control, respectively. Negative bias is colored in blue and positive bias in red; deeper color represents larger bias. The conditional RMSE is presented in Tables 5.25 and 5.26 in the same manner: small RMSE is colored in green and large RMSE in red.

Under the condition without item exposure control (see Table 5.23 for bias and Table 5.25 for RMSE), the conditional bias and RMSE for the .96-optimal item pool, the .86-optimal item pool, and the baseline pool are quite similar. This finding supports the results for the overall bias and RMSE, and suggests that the p-optimal item pools can provide as accurate ability estimation as the baseline pool at each θ point. Similar to the results from Test Specification 2 (three-dimension simple structure), larger bias and RMSE occur when θ1, θ2, and θ3 are very large or very small, that is, at the top and the bottom of each table. The difference between θ1 and θ2, and between θ1 and θ3, also affects the estimation accuracy.
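For reference, the conditional bias and RMSE at a given true θ point are simple summaries of the estimation errors over the simulees generated at that point. A minimal sketch, assuming est is an (n_replications x 3) array of θ estimates for one of the 37 points and true_theta is the corresponding length-3 true vector (illustrative names):

import numpy as np

def conditional_bias_rmse(est, true_theta):
    err = est - np.asarray(true_theta)          # estimation errors per dimension
    bias = err.mean(axis=0)                     # conditional bias for each theta
    rmse = np.sqrt((err ** 2).mean(axis=0))     # conditional RMSE for each theta
    return bias, rmse

# Example: bias, rmse = conditional_bias_rmse(est, (-3, -2, -2))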
102 Table 5.23: Conditional Bias for the 𝜽𝜽 estimates without exposure control (3-dimension non-simple structure, high correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 .96 -3 -3 -3 0.40 0.41 0.24 0.33 0.33 0.19 0.42 0.43 0.28 -3 -3 -2 0.59 0.52 0.44 0.59 0.53 0.44 -0.15 -0.21 -0.30 -3 -2 -3 0.55 0.60 0.50 -0.36 -0.35 -0.46 0.61 0.61 0.50 -3 -2 -2 0.73 0.75 0.74 -0.19 -0.15 -0.18 -0.02 0.00 -0.05 -2 -3 -3 -0.29 -0.27 -0.38 0.51 0.51 0.42 0.49 0.53 0.44 -2 -3 -2 -0.13 -0.13 -0.14 0.71 0.73 0.70 -0.11 -0.13 -0.13 -2 -2 -3 -0.02 -0.02 -0.10 -0.14 -0.12 -0.19 0.73 0.75 0.72 -2 -2 -2 0.08 0.13 0.11 0.05 0.10 0.07 0.11 0.19 0.14 -2 -2 -1 0.30 0.37 0.33 0.31 0.36 0.35 -0.46 -0.44 -0.39 -2 -1 -2 0.39 0.38 0.43 -0.54 -0.55 -0.51 0.40 0.37 0.41 -2 -1 -1 0.62 0.62 0.64 -0.23 -0.27 -0.25 -0.11 -0.18 -0.18 -1 -2 -2 -0.47 -0.43 -0.42 0.36 0.40 0.41 0.29 0.38 0.38 -1 -2 -1 -0.22 -0.22 -0.18 0.64 0.64 0.68 -0.22 -0.18 -0.19 -1 0 -1 0.38 0.31 0.36 -0.54 -0.61 -0.55 0.32 0.29 0.39 -1 0 0 0.57 0.56 0.57 -0.26 -0.29 -0.29 -0.16 -0.18 -0.20 0 -1 -1 -0.48 -0.55 -0.49 0.36 0.29 0.33 0.26 0.24 0.25 0 -1 0 -0.32 -0.27 -0.30 0.58 0.62 0.59 -0.26 -0.26 -0.29 0 0 -1 -0.22 -0.23 -0.21 -0.29 -0.29 -0.28 0.50 0.51 0.52 0 0 0 0.02 -0.01 0.04 0.04 -0.02 0.00 0.05 -0.03 0.00 0 0 1 0.19 0.21 0.24 0.26 0.27 0.29 -0.53 -0.54 -0.52 0 1 0 0.26 0.26 0.31 -0.63 -0.61 -0.59 0.23 0.25 0.25 0 1 1 0.46 0.52 0.45 -0.34 -0.33 -0.36 -0.27 -0.29 -0.29 1 0 0 -0.50 -0.52 -0.54 0.35 0.32 0.30 0.25 0.22 0.18 1 0 1 -0.32 -0.33 -0.37 0.59 0.55 0.55 -0.30 -0.36 -0.29 1 2 1 0.22 0.19 0.22 -0.65 -0.68 -0.64 0.19 0.17 0.20 1 2 2 0.41 0.42 0.42 -0.41 -0.41 -0.38 -0.36 -0.39 -0.32 2 1 1 -0.63 -0.64 -0.66 0.23 0.25 0.22 0.11 0.15 0.12 2 1 2 -0.42 -0.42 -0.43 0.52 0.50 0.51 -0.42 -0.44 -0.42 2 2 1 -0.33 -0.33 -0.32 -0.35 -0.36 -0.34 0.41 0.42 0.44 2 2 2 -0.12 -0.11 -0.14 -0.06 -0.08 -0.09 -0.12 -0.14 -0.17 2 2 3 0.02 0.09 0.08 0.13 0.16 0.19 -0.77 -0.72 -0.71 2 3 2 0.17 0.13 0.10 -0.68 -0.74 -0.75 0.15 0.04 0.05 2 3 3 0.19 0.27 0.35 -0.60 -0.49 -0.47 -0.57 -0.48 -0.48 3 2 2 -0.70 -0.74 -0.75 0.21 0.14 0.18 0.07 0.01 0.03 3 2 3 -0.63 -0.56 -0.51 0.35 0.37 0.46 -0.59 -0.60 -0.51 3 3 2 -0.56 -0.50 -0.41 -0.57 -0.49 -0.42 0.18 0.21 0.34 3 3 3 -0.42 -0.32 -0.20 -0.34 -0.26 -0.14 -0.44 -0.37 -0.22 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 103 Table 5.24: Conditional Bias for the 𝜽𝜽 estimates with exposure control (3-dimension non-simple structure, high correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 .96 -3 -3 -3 0.44 0.65 0.50 0.38 0.61 0.45 0.45 0.72 0.53 -3 -3 -2 0.67 0.81 0.65 0.66 0.80 0.66 -0.12 -0.01 -0.13 -3 -2 -3 0.75 0.88 0.71 -0.23 -0.12 -0.26 0.77 0.86 0.74 -3 -2 -2 0.90 0.96 0.82 -0.03 0.01 -0.10 0.07 0.13 0.05 -2 -3 -3 -0.20 -0.09 -0.20 0.65 0.78 0.64 0.70 0.79 0.64 -2 -3 -2 0.02 -0.01 -0.05 0.89 0.91 0.84 0.03 0.03 0.00 -2 -2 -3 0.09 0.14 0.08 -0.01 0.06 -0.03 0.88 1.01 0.90 -2 -2 -2 0.22 0.32 0.23 0.18 0.26 0.19 0.25 0.30 0.23 -2 -2 -1 0.46 0.51 0.39 0.47 0.52 0.42 -0.31 -0.30 -0.34 -2 -1 -2 0.52 0.53 0.46 -0.43 -0.43 -0.50 0.52 0.54 0.47 -2 -1 -1 0.72 0.71 0.74 -0.17 -0.23 -0.17 -0.06 -0.14 -0.08 -1 -2 -2 -0.37 -0.33 -0.42 0.47 0.53 0.41 0.44 0.53 0.39 -1 -2 -1 -0.20 -0.18 -0.11 0.69 0.72 0.76 -0.20 -0.14 -0.14 -1 0 -1 0.38 0.41 0.39 -0.56 -0.54 -0.56 0.38 0.36 0.33 -1 0 0 0.61 0.57 0.54 -0.27 -0.30 -0.34 -0.23 -0.22 -0.24 0 -1 -1 -0.50 -0.48 -0.49 0.35 0.38 0.35 0.29 0.31 0.27 0 -1 0 -0.32 -0.24 -0.29 0.61 0.64 0.61 -0.25 -0.28 -0.31 0 0 
-1 -0.21 -0.22 -0.28 -0.26 -0.29 -0.30 0.57 0.54 0.57 0 0 0 -0.03 0.00 -0.01 -0.02 0.00 -0.01 -0.01 0.01 0.00 0 0 1 0.19 0.18 0.20 0.27 0.21 0.25 -0.53 -0.62 -0.61 0 1 0 0.24 0.31 0.30 -0.64 -0.59 -0.59 0.26 0.30 0.30 0 1 1 0.51 0.47 0.51 -0.34 -0.39 -0.34 -0.28 -0.33 -0.29 1 0 0 -0.60 -0.63 -0.57 0.28 0.26 0.30 0.19 0.18 0.20 1 0 1 -0.40 -0.39 -0.31 0.51 0.55 0.60 -0.39 -0.38 -0.32 1 2 1 0.11 0.14 0.16 -0.77 -0.75 -0.75 0.08 0.10 0.10 1 2 2 0.38 0.33 0.37 -0.47 -0.52 -0.48 -0.46 -0.50 -0.46 2 1 1 -0.75 -0.78 -0.75 0.17 0.12 0.17 0.10 0.03 0.12 2 1 2 -0.49 -0.54 -0.46 0.46 0.42 0.48 -0.49 -0.55 -0.44 2 2 1 -0.43 -0.51 -0.46 -0.46 -0.52 -0.48 0.35 0.29 0.32 2 2 2 -0.25 -0.32 -0.23 -0.20 -0.30 -0.18 -0.29 -0.35 -0.22 2 2 3 -0.05 -0.16 -0.11 0.04 -0.10 0.01 -0.86 -1.04 -0.92 2 3 2 0.01 -0.11 -0.03 -0.88 -1.00 -0.91 -0.03 -0.12 -0.05 2 3 3 0.14 0.11 0.11 -0.67 -0.77 -0.75 -0.67 -0.81 -0.76 3 2 2 -0.86 -0.99 -0.88 0.07 -0.03 0.07 -0.07 -0.17 -0.06 3 2 3 -0.70 -0.91 -0.77 0.29 0.09 0.23 -0.71 -0.91 -0.78 3 3 2 -0.66 -0.77 -0.69 -0.66 -0.75 -0.66 0.11 0.04 0.16 3 3 3 -0.59 -0.59 -0.57 -0.53 -0.55 -0.51 -0.62 -0.63 -0.61 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 104 Table 5.25: Conditional RMSE for the 𝜽𝜽 estimates without exposure control (3-dimension non-simple structure, high correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .96 .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 -3 -3 -3 0.48 0.50 0.37 0.42 0.42 0.32 0.51 0.50 0.39 -3 -3 -2 0.63 0.59 0.51 0.64 0.59 0.51 0.31 0.34 0.42 -3 -2 -3 0.60 0.67 0.56 0.43 0.44 0.51 0.66 0.67 0.56 -3 -2 -2 0.78 0.80 0.79 0.31 0.31 0.31 0.24 0.29 0.28 -2 -3 -3 0.41 0.39 0.47 0.59 0.57 0.50 0.56 0.61 0.54 -2 -3 -2 0.29 0.29 0.28 0.75 0.77 0.74 0.29 0.31 0.28 -2 -2 -3 0.25 0.22 0.29 0.28 0.24 0.33 0.77 0.79 0.78 -2 -2 -2 0.29 0.30 0.29 0.29 0.26 0.24 0.30 0.32 0.31 -2 -2 -1 0.38 0.43 0.42 0.37 0.42 0.42 0.53 0.50 0.46 -2 -1 -2 0.47 0.46 0.49 0.59 0.61 0.55 0.48 0.47 0.50 -2 -1 -1 0.68 0.66 0.68 0.33 0.35 0.35 0.27 0.30 0.34 -1 -2 -2 0.52 0.50 0.50 0.43 0.47 0.47 0.40 0.47 0.45 -1 -2 -1 0.32 0.32 0.31 0.68 0.67 0.73 0.33 0.31 0.34 -1 0 -1 0.47 0.39 0.44 0.59 0.65 0.59 0.41 0.39 0.46 -1 0 0 0.62 0.62 0.63 0.36 0.37 0.38 0.31 0.32 0.33 0 -1 -1 0.54 0.61 0.55 0.42 0.35 0.41 0.35 0.34 0.37 0 -1 0 0.41 0.38 0.38 0.63 0.66 0.62 0.37 0.35 0.38 0 0 -1 0.32 0.34 0.32 0.36 0.37 0.36 0.56 0.56 0.58 0 0 0 0.23 0.23 0.26 0.22 0.23 0.24 0.24 0.25 0.24 0 0 1 0.29 0.32 0.35 0.35 0.36 0.38 0.60 0.60 0.58 0 1 0 0.36 0.35 0.42 0.67 0.65 0.64 0.36 0.37 0.36 0 1 1 0.54 0.57 0.50 0.43 0.39 0.41 0.40 0.38 0.38 1 0 0 0.55 0.58 0.58 0.42 0.39 0.35 0.35 0.34 0.29 1 0 1 0.40 0.40 0.45 0.62 0.59 0.59 0.39 0.44 0.38 1 2 1 0.33 0.29 0.32 0.68 0.72 0.68 0.32 0.30 0.32 1 2 2 0.51 0.49 0.49 0.52 0.47 0.44 0.47 0.47 0.41 2 1 1 0.68 0.69 0.70 0.33 0.34 0.32 0.27 0.32 0.28 2 1 2 0.50 0.51 0.48 0.57 0.56 0.56 0.49 0.55 0.51 2 2 1 0.43 0.40 0.41 0.42 0.42 0.41 0.48 0.49 0.50 2 2 2 0.25 0.28 0.25 0.24 0.23 0.26 0.31 0.27 0.29 2 2 3 0.24 0.25 0.29 0.27 0.27 0.34 0.82 0.76 0.77 2 3 2 0.29 0.33 0.30 0.72 0.79 0.79 0.31 0.29 0.28 2 3 3 0.34 0.36 0.45 0.65 0.54 0.54 0.63 0.56 0.56 3 2 2 0.74 0.79 0.79 0.31 0.30 0.29 0.27 0.29 0.28 3 2 3 0.67 0.61 0.57 0.41 0.45 0.52 0.64 0.66 0.58 3 3 2 0.62 0.57 0.50 0.61 0.55 0.50 0.31 0.35 0.45 3 3 3 0.51 0.46 0.37 0.44 0.41 0.31 0.53 0.50 0.36 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 105 Table 5.26: Conditional RMSE for the 𝜽𝜽 estimates with exposure control (3-dimension 
non-simple structure, high correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .96 .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 -3 -3 -3 0.59 0.80 0.59 0.54 0.76 0.56 0.59 0.84 0.63 -3 -3 -2 0.75 0.91 0.74 0.74 0.90 0.75 0.38 0.38 0.38 -3 -2 -3 0.82 0.99 0.78 0.41 0.48 0.42 0.84 0.98 0.82 -3 -2 -2 0.96 1.02 0.89 0.35 0.37 0.37 0.35 0.43 0.34 -2 -3 -3 0.42 0.38 0.41 0.74 0.85 0.73 0.78 0.86 0.73 -2 -3 -2 0.36 0.39 0.36 0.97 0.98 0.90 0.41 0.39 0.30 -2 -2 -3 0.36 0.40 0.38 0.32 0.38 0.35 0.94 1.08 0.96 -2 -2 -2 0.36 0.48 0.37 0.35 0.42 0.33 0.42 0.45 0.36 -2 -2 -1 0.55 0.62 0.50 0.56 0.62 0.52 0.45 0.49 0.46 -2 -1 -2 0.61 0.61 0.55 0.53 0.53 0.58 0.60 0.64 0.54 -2 -1 -1 0.78 0.78 0.79 0.35 0.41 0.31 0.35 0.34 0.28 -1 -2 -2 0.50 0.47 0.53 0.57 0.63 0.50 0.55 0.61 0.49 -1 -2 -1 0.34 0.40 0.27 0.75 0.80 0.80 0.35 0.39 0.32 -1 0 -1 0.49 0.52 0.46 0.64 0.62 0.60 0.49 0.48 0.43 -1 0 0 0.68 0.66 0.60 0.38 0.44 0.42 0.38 0.40 0.35 0 -1 -1 0.58 0.59 0.55 0.46 0.51 0.43 0.42 0.45 0.39 0 -1 0 0.46 0.38 0.39 0.69 0.71 0.66 0.40 0.40 0.40 0 0 -1 0.37 0.39 0.38 0.39 0.43 0.40 0.63 0.63 0.63 0 0 0 0.30 0.32 0.23 0.30 0.31 0.23 0.32 0.33 0.26 0 0 1 0.39 0.34 0.36 0.41 0.34 0.38 0.60 0.68 0.66 0 1 0 0.40 0.44 0.39 0.71 0.66 0.65 0.41 0.44 0.40 0 1 1 0.59 0.57 0.59 0.44 0.48 0.45 0.41 0.43 0.42 1 0 0 0.66 0.70 0.63 0.40 0.39 0.41 0.35 0.34 0.33 1 0 1 0.52 0.49 0.41 0.59 0.62 0.66 0.51 0.48 0.44 1 2 1 0.30 0.38 0.29 0.82 0.83 0.79 0.31 0.36 0.27 1 2 2 0.54 0.46 0.49 0.60 0.60 0.56 0.58 0.59 0.55 2 1 1 0.81 0.84 0.80 0.35 0.34 0.32 0.35 0.35 0.29 2 1 2 0.58 0.62 0.54 0.55 0.53 0.56 0.58 0.63 0.54 2 2 1 0.53 0.59 0.56 0.56 0.59 0.58 0.48 0.44 0.48 2 2 2 0.39 0.44 0.38 0.37 0.43 0.34 0.45 0.48 0.39 2 2 3 0.37 0.37 0.36 0.35 0.37 0.32 0.92 1.10 0.96 2 3 2 0.40 0.43 0.32 0.95 1.09 0.97 0.37 0.41 0.36 2 3 3 0.41 0.43 0.32 0.77 0.87 0.80 0.77 0.91 0.82 3 2 2 0.92 1.06 0.94 0.37 0.36 0.32 0.38 0.39 0.34 3 2 3 0.79 0.99 0.82 0.47 0.41 0.38 0.80 0.99 0.84 3 3 2 0.78 0.88 0.77 0.78 0.87 0.74 0.43 0.42 0.35 3 3 3 0.69 0.76 0.66 0.64 0.73 0.60 0.73 0.79 0.69 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 106 When item exposure control is implemented, similar findings can be observed from Table 5.24 and 5.26. The results suggest the three item pools perform similarly in terms of the ability estimation on the 37 𝜽𝜽 points. In addition, larger bias and RMSE also occurs when 𝜃𝜃1 , 𝜃𝜃2 , and 𝜃𝜃3 are very large or very small, and when 𝜃𝜃’s are away from each other. A comparison between the condition with and without item exposure control shows, when item exposure control is built in, the magnitude of the bias and RMSE at some extreme points becomes larger. The reason why the item exposure control increases the estimation error is explained in previous sections. In summary, this section present the results for the MCAT with the test specification of three- dimension non-simple structure and with high correlation among 𝜃𝜃1 , 𝜃𝜃2 , and 𝜃𝜃3 . In general, the p-optimal item pools perform similarly as the baseline pool in terms of both overall and conditional accuracy of ability estimation, but the p-optimal item pools can save over 100 items and have a better item pool usage. When item exposure control is implemented, the item exposure rate and item overlap rate can be controlled very well. The p-optimal item pools still can provide reliable ability estimation with a relatively small pool size. 
A comparison of the results between Test Specifications 2 and 3 suggests that θ can be estimated more accurately under the three-dimension non-simple structure condition. Under the simple structure condition, the RMSE values range from 0.42 to 0.47 for θ1, θ2, and θ3, and the correlation is about 0.90; under the non-simple structure condition, the RMSE is less than 0.4 and the correlation is about 0.94. The increase in estimation accuracy is primarily due to the items with a = (1, 1, 1). Because those items provide more information than the items that measure only one θ, there is more information available for ability estimation, and an increase in information results in a decrease in measurement error. Another possible explanation for the estimation accuracy is the pool size: the item pools with non-simple structure have about 40 more items than the item pools with simple structure. A larger item pool is expected to yield more accurate ability estimation, because there are more items available for selection.

5.2.6 Performance for item pools based on Test Specification 3 (moderate correlation)

The results for the MCAT based on the test specification of three-dimension non-simple structure, with θ1, θ2, and θ3 moderately correlated, are presented in Tables 5.27 and 5.28. The results in Table 5.27 are under the condition without item exposure control; Table 5.28 is for the condition with item exposure control. In both tables, there are three values for bias, RMSE, and correlation, representing the results for (θ1, θ2, θ3).

Under the condition without item exposure control (see Table 5.27), the p-optimal item pools and the baseline pool show no bias in the θ estimates. Also, the RMSEs are between 0.42 and 0.46, and the correlations between the estimated θ and true θ are around 0.89. The average test information for the .96-optimal item pool and the baseline pool is very similar, but the information for the .86-optimal item pool is slightly smaller. The amount of information in the direction of θ1, θ2, and θ3 (i.e., the values on the diagonal) is about 3.49 for the .96-optimal item pool and the baseline pool, and about 3.38 for the .86-optimal item pool. In general, the results suggest that the .96- and .86-optimal item pools can provide accurate estimation of θ, and the level of accuracy is the same as that of the baseline pool, although the average test information for the .86-optimal item pool is slightly smaller than that of the other two.

Table 5.27 also presents the results for item pool usage. Compared with the MCAT based on Test Specification 3 with high correlation, similar results can be drawn from Table 5.27. The item pool usage for the .96-optimal item pool is slightly better than that of the .86-optimal item pool, and the two p-optimal item pools are much better used than the baseline pool.
Table 5.27: The performance of the .96- and .86-optimal pool and the baseline pool without exposure control (3-dimension non-simple structure, moderate correlation)

Statistics                            .96-optimal pool       .86-optimal pool       Baseline pool
Bias                                  (0.01, 0.00, 0.00)     (0.00, 0.00, 0.00)     (0.00, 0.00, 0.00)
RMSE                                  (0.46, 0.43, 0.42)     (0.46, 0.43, 0.42)     (0.46, 0.43, 0.42)
Correlation                           (0.89, 0.90, 0.90)     (0.89, 0.90, 0.90)     (0.89, 0.90, 0.91)
Average test information
    .96-optimal pool:  [3.49 1.74 1.74; 1.74 3.48 1.74; 1.74 1.74 3.48]
    .86-optimal pool:  [3.39 1.60 1.60; 1.60 3.38 1.60; 1.60 1.60 3.38]
    Baseline pool:     [3.51 1.76 1.76; 1.76 3.49 1.76; 1.76 1.76 3.50]
Overall pool usage                    28.82                  28.14                  40.25
Overlap rate                          0.16                   0.27                   0.13
% of overexposed items (r > 0.2)      12%                    32%                    2%
% of underexposed items (r < 0.02)    33%                    29%                    45%

Table 5.28: The performance of the .96- and .86-optimal pool and the baseline pool with exposure control (3-dimension non-simple structure, moderate correlation)

Statistics                            .96-optimal pool       .86-optimal pool       Baseline pool
Bias                                  (0.00, -0.01, 0.00)    (0.00, 0.00, 0.00)     (-0.01, 0.00, 0.00)
RMSE                                  (0.48, 0.44, 0.44)     (0.48, 0.45, 0.43)     (0.47, 0.44, 0.43)
Correlation                           (0.88, 0.89, 0.90)     (0.88, 0.89, 0.90)     (0.88, 0.89, 0.90)
Average test information
    .96-optimal pool:  [2.96 1.33 1.33; 1.33 2.96 1.33; 1.33 1.33 2.97]
    .86-optimal pool:  [2.92 1.33 1.33; 1.33 2.93 1.33; 1.33 1.33 2.94]
    Baseline pool:     [3.15 1.60 1.60; 1.60 3.09 1.60; 1.60 1.60 3.23]
Overall pool usage                    3.09                   1.12                   9.02
Overlap rate                          0.09                   0.12                   0.07
% of overexposed items (r > 0.2)      0%                     0%                     0%
% of underexposed items (r < 0.02)    0%                     0%                     21%

When item exposure control is implemented (see Table 5.28), similar results can be observed: the two p-optimal item pools provide as accurate ability estimates as the baseline pool, and they yield better item pool usage than the baseline pool. Compared with the condition without item exposure control, item exposure control only results in a 0.01 to 0.02 increase in RMSE, a 0.01 decrease in correlation, and about a 0.5 decrease in average test information. For item pool usage, when item exposure control is implemented, no item is overexposed, and the percentages of underexposed and overlapped items also decrease. The two p-optimal item pools have been fully used, and no item from either p-optimal item pool is underexposed. The comparison between the conditions with and without item exposure control suggests that item exposure control can effectively increase the item pool usage and reduce the item exposure rate without an obvious loss in the accuracy of ability estimation.

In addition to the overall pool performance, the conditional bias and RMSE at the 37 (θ1, θ2, θ3) points are also calculated. The conditional bias for each θ point is presented in Tables 5.29 and 5.30, for the MCAT without and with exposure control, respectively. In each table, the conditional bias is color coded based on its value: negative bias is colored in blue and positive bias in red, and deeper color represents larger bias. The conditional RMSE is presented in Tables 5.31 and 5.32 in the same manner: small RMSE is colored in green and large RMSE in red.

Under the condition without item exposure control (see Table 5.29 for bias and Table 5.31 for RMSE), the conditional bias and RMSE for the .96-optimal item pool, the .86-optimal item pool, and the baseline pool are quite similar. This finding supports the results for the overall bias and RMSE, and also suggests that the p-optimal item pools can provide as accurate ability estimation as the baseline pool at each θ point.
Similar to the results in previous sections, larger bias and RMSE occurs when 𝜃𝜃1 , 𝜃𝜃2 , and 𝜃𝜃3 are very large or very small, which is the top and the bottom of each table. The difference between 𝜃𝜃1 and 𝜃𝜃2 , and between 𝜃𝜃1 and 𝜃𝜃3 , also affects the estimation accuracy. When item exposure control is implemented, similar findings can be observed from Table 5.30 and 5.32. The results suggest the three item pools perform similarly in terms of the ability 110 Table 5.29: Conditional Bias for the 𝜽𝜽 estimates without exposure control (3-dimension non-simple structure, moderate correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 .96 -3 -3 -3 0.58 0.54 0.36 0.48 0.45 0.31 0.51 0.43 0.31 -3 -3 -2 0.65 0.61 0.54 0.66 0.61 0.47 -0.03 -0.13 -0.25 -3 -2 -3 0.65 0.54 0.49 -0.10 -0.10 -0.27 0.59 0.55 0.43 -3 -2 -2 0.75 0.72 0.66 0.01 0.03 0.04 0.03 -0.01 -0.05 -2 -3 -3 0.03 0.00 -0.18 0.50 0.56 0.38 0.56 0.50 0.41 -2 -3 -2 0.08 0.01 0.03 0.66 0.65 0.63 0.00 -0.01 -0.09 -2 -2 -3 0.12 0.10 -0.03 -0.01 0.00 -0.06 0.70 0.71 0.64 -2 -2 -2 0.16 0.13 0.16 0.16 0.17 0.11 0.20 0.08 0.13 -2 -2 -1 0.29 0.33 0.30 0.33 0.30 0.32 -0.39 -0.37 -0.39 -2 -1 -2 0.39 0.35 0.31 -0.31 -0.36 -0.33 0.38 0.38 0.31 -2 -1 -1 0.50 0.53 0.50 -0.16 -0.07 -0.06 -0.14 -0.14 -0.12 -1 -2 -2 -0.22 -0.28 -0.30 0.30 0.37 0.27 0.28 0.34 0.28 -1 -2 -1 -0.12 -0.08 -0.08 0.54 0.50 0.50 -0.19 -0.19 -0.13 -1 0 -1 0.20 0.18 0.24 -0.40 -0.43 -0.39 0.24 0.23 0.24 -1 0 0 0.44 0.37 0.42 -0.15 -0.15 -0.19 -0.22 -0.17 -0.15 0 -1 -1 -0.37 -0.34 -0.34 0.23 0.24 0.20 0.22 0.26 0.23 0 -1 0 -0.13 -0.14 -0.14 0.46 0.46 0.45 -0.21 -0.27 -0.28 0 0 -1 -0.12 -0.25 -0.17 -0.22 -0.23 -0.24 0.44 0.48 0.45 0 0 0 0.00 0.02 -0.01 0.01 0.00 -0.03 0.03 -0.02 -0.02 0 0 1 0.17 0.19 0.21 0.21 0.23 0.17 -0.51 -0.46 -0.44 0 1 0 0.19 0.12 0.11 -0.44 -0.48 -0.46 0.25 0.19 0.22 0 1 1 0.27 0.33 0.30 -0.19 -0.24 -0.20 -0.23 -0.21 -0.26 1 0 0 -0.40 -0.42 -0.46 0.19 0.15 0.11 0.15 0.16 0.18 1 0 1 -0.25 -0.25 -0.21 0.40 0.40 0.44 -0.34 -0.29 -0.27 1 2 1 0.11 0.14 0.08 -0.52 -0.49 -0.51 0.20 0.17 0.20 1 2 2 0.26 0.26 0.28 -0.29 -0.37 -0.29 -0.23 -0.37 -0.29 2 1 1 -0.55 -0.50 -0.56 0.09 0.07 0.07 0.08 0.12 0.12 2 1 2 -0.42 -0.32 -0.31 0.35 0.34 0.34 -0.36 -0.33 -0.34 2 2 1 -0.34 -0.38 -0.38 -0.40 -0.34 -0.41 0.37 0.35 0.35 2 2 2 -0.12 -0.15 -0.16 -0.14 -0.15 -0.11 -0.06 -0.16 -0.10 2 2 3 -0.01 -0.08 0.03 0.03 0.03 0.10 -0.65 -0.64 -0.61 2 3 2 -0.01 -0.08 -0.01 -0.63 -0.66 -0.62 0.09 0.03 0.13 2 3 3 0.05 0.02 0.15 -0.56 -0.54 -0.37 -0.55 -0.50 -0.37 3 2 2 -0.61 -0.75 -0.64 -0.03 -0.04 0.04 0.01 0.00 0.07 3 2 3 -0.53 -0.67 -0.50 0.11 0.05 0.22 -0.53 -0.63 -0.45 3 3 2 -0.61 -0.62 -0.51 -0.59 -0.62 -0.46 0.13 0.12 0.23 3 3 3 -0.51 -0.53 -0.38 -0.47 -0.48 -0.22 -0.43 -0.43 -0.20 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 111 Table 5.30: Conditional Bias for the 𝜽𝜽 estimates with exposure control (3-dimension non-simple structure, moderate correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 .96 -3 -3 -3 0.57 0.77 0.64 0.54 0.69 0.52 0.53 0.70 0.51 -3 -3 -2 0.80 0.88 0.72 0.77 0.83 0.73 -0.12 -0.03 -0.15 -3 -2 -3 0.74 0.82 0.73 -0.07 -0.04 -0.07 0.68 0.81 0.67 -3 -2 -2 0.87 0.97 0.84 0.08 0.13 0.17 0.05 0.15 0.09 -2 -3 -3 0.04 0.08 0.05 0.70 0.79 0.70 0.66 0.77 0.66 -2 -3 -2 0.16 0.24 0.19 0.87 0.96 0.85 0.11 0.16 0.04 -2 -2 -3 0.25 0.24 0.15 0.12 0.21 0.06 0.92 1.01 0.76 -2 -2 -2 0.31 0.35 0.25 0.25 0.37 0.25 0.21 0.31 0.23 -2 -2 -1 0.44 0.51 0.42 0.49 0.53 0.46 -0.25 -0.26 -0.28 -2 -1 -2 
0.50 0.49 0.47 -0.29 -0.22 -0.29 0.46 0.46 0.44 -2 -1 -1 0.59 0.66 0.59 -0.08 -0.08 -0.06 -0.09 -0.07 -0.09 -1 -2 -2 -0.23 -0.22 -0.15 0.43 0.44 0.43 0.47 0.43 0.39 -1 -2 -1 -0.01 0.00 -0.06 0.62 0.67 0.58 -0.12 -0.03 -0.09 -1 0 -1 0.29 0.29 0.25 -0.41 -0.41 -0.42 0.34 0.32 0.26 -1 0 0 0.42 0.49 0.49 -0.17 -0.13 -0.16 -0.19 -0.13 -0.18 0 -1 -1 -0.32 -0.31 -0.27 0.31 0.27 0.28 0.31 0.30 0.25 0 -1 0 -0.20 -0.10 -0.18 0.47 0.47 0.53 -0.21 -0.24 -0.17 0 0 -1 -0.22 -0.26 -0.15 -0.22 -0.29 -0.24 0.49 0.53 0.49 0 0 0 -0.02 0.03 0.01 0.00 0.01 0.01 -0.04 0.00 0.03 0 0 1 0.27 0.22 0.14 0.24 0.25 0.30 -0.49 -0.42 -0.45 0 1 0 0.17 0.19 0.21 -0.47 -0.47 -0.53 0.22 0.21 0.25 0 1 1 0.33 0.41 0.30 -0.32 -0.29 -0.29 -0.36 -0.21 -0.31 1 0 0 -0.44 -0.44 -0.46 0.22 0.17 0.23 0.22 0.22 0.22 1 0 1 -0.32 -0.29 -0.34 0.33 0.41 0.39 -0.36 -0.30 -0.33 1 2 1 0.05 0.03 0.06 -0.56 -0.69 -0.61 0.19 0.07 0.13 1 2 2 0.22 0.12 0.21 -0.43 -0.47 -0.44 -0.40 -0.50 -0.42 2 1 1 -0.64 -0.63 -0.60 0.02 0.08 0.08 0.07 0.04 0.10 2 1 2 -0.47 -0.54 -0.45 0.22 0.20 0.25 -0.46 -0.59 -0.43 2 2 1 -0.51 -0.48 -0.46 -0.45 -0.56 -0.47 0.24 0.26 0.29 2 2 2 -0.32 -0.43 -0.33 -0.28 -0.28 -0.22 -0.23 -0.35 -0.27 2 2 3 -0.18 -0.26 -0.14 -0.11 -0.22 -0.07 -0.87 -0.98 -0.91 2 3 2 -0.15 -0.27 -0.11 -0.86 -1.03 -0.86 -0.11 -0.19 -0.05 2 3 3 -0.02 -0.05 -0.06 -0.69 -0.80 -0.78 -0.71 -0.80 -0.77 3 2 2 -0.81 -1.01 -0.89 -0.03 -0.26 -0.16 -0.04 -0.19 -0.13 3 2 3 -0.78 -0.89 -0.81 0.02 -0.11 -0.03 -0.73 -0.85 -0.79 3 3 2 -0.72 -0.93 -0.76 -0.70 -0.88 -0.81 0.12 -0.03 -0.05 3 3 3 -0.61 -0.77 -0.70 -0.56 -0.71 -0.66 -0.56 -0.68 -0.66 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 112 Table 5.31: Conditional RMSE for the 𝜽𝜽 estimates without exposure control (3-dimension non-simple structure, moderate correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .96 .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 -3 -3 -3 0.68 0.64 0.50 0.60 0.58 0.46 0.63 0.54 0.45 -3 -3 -2 0.74 0.68 0.64 0.72 0.68 0.57 0.32 0.31 0.40 -3 -2 -3 0.74 0.63 0.60 0.31 0.35 0.44 0.67 0.63 0.56 -3 -2 -2 0.83 0.79 0.75 0.28 0.32 0.29 0.27 0.31 0.29 -2 -3 -3 0.34 0.30 0.41 0.62 0.66 0.50 0.66 0.59 0.52 -2 -3 -2 0.37 0.33 0.33 0.73 0.74 0.72 0.32 0.35 0.34 -2 -2 -3 0.31 0.35 0.35 0.30 0.29 0.30 0.78 0.78 0.71 -2 -2 -2 0.41 0.35 0.34 0.37 0.36 0.34 0.36 0.36 0.34 -2 -2 -1 0.45 0.47 0.45 0.41 0.42 0.44 0.49 0.45 0.48 -2 -1 -2 0.52 0.48 0.45 0.40 0.49 0.44 0.48 0.49 0.43 -2 -1 -1 0.61 0.63 0.59 0.31 0.34 0.29 0.33 0.35 0.34 -1 -2 -2 0.42 0.41 0.45 0.41 0.50 0.41 0.40 0.49 0.43 -1 -2 -1 0.33 0.34 0.30 0.62 0.59 0.58 0.37 0.34 0.32 -1 0 -1 0.38 0.39 0.36 0.50 0.53 0.47 0.39 0.38 0.38 -1 0 0 0.53 0.48 0.53 0.33 0.32 0.36 0.36 0.30 0.34 0 -1 -1 0.50 0.46 0.44 0.38 0.42 0.40 0.39 0.38 0.38 0 -1 0 0.35 0.35 0.34 0.55 0.55 0.55 0.36 0.40 0.41 0 0 -1 0.38 0.39 0.36 0.36 0.38 0.40 0.53 0.57 0.54 0 0 0 0.28 0.32 0.32 0.30 0.29 0.31 0.27 0.30 0.31 0 0 1 0.34 0.38 0.35 0.35 0.37 0.35 0.59 0.54 0.53 0 1 0 0.34 0.34 0.37 0.54 0.56 0.55 0.39 0.36 0.37 0 1 1 0.44 0.44 0.44 0.38 0.38 0.37 0.36 0.35 0.38 1 0 0 0.51 0.54 0.55 0.37 0.34 0.28 0.34 0.35 0.33 1 0 1 0.39 0.40 0.37 0.48 0.48 0.56 0.47 0.42 0.38 1 2 1 0.35 0.35 0.35 0.61 0.59 0.61 0.35 0.37 0.36 1 2 2 0.42 0.41 0.44 0.42 0.49 0.40 0.38 0.49 0.43 2 1 1 0.62 0.58 0.66 0.32 0.35 0.31 0.30 0.32 0.34 2 1 2 0.51 0.48 0.48 0.48 0.46 0.48 0.47 0.48 0.45 2 2 1 0.48 0.50 0.50 0.53 0.48 0.53 0.49 0.48 0.45 2 2 2 0.37 0.41 0.39 0.33 0.39 0.33 0.34 0.35 0.33 2 2 3 0.33 0.32 0.38 0.36 0.37 0.32 0.71 0.69 0.68 2 3 2 0.36 0.35 0.34 0.70 0.74 
0.68 0.30 0.31 0.34 2 3 3 0.36 0.34 0.37 0.63 0.61 0.48 0.64 0.58 0.50 3 2 2 0.67 0.81 0.70 0.33 0.32 0.33 0.33 0.36 0.34 3 2 3 0.63 0.74 0.61 0.31 0.33 0.41 0.64 0.70 0.53 3 3 2 0.71 0.70 0.60 0.65 0.70 0.57 0.36 0.34 0.40 3 3 3 0.65 0.65 0.52 0.61 0.60 0.37 0.57 0.55 0.37 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 113 Table 5.32: Conditional RMSE for the 𝜽𝜽 estimates with exposure control (3-dimension non-simple structure, moderate correlation) 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 37 Points .96 .86 C .96 .86 C .96 .86 C 𝜃𝜃1 𝜃𝜃2 𝜃𝜃3 -3 -3 -3 0.69 0.88 0.73 0.65 0.82 0.65 0.64 0.82 0.63 -3 -3 -2 0.92 0.96 0.85 0.88 0.95 0.82 0.48 0.41 0.43 -3 -2 -3 0.83 0.91 0.82 0.43 0.40 0.40 0.79 0.91 0.76 -3 -2 -2 0.95 1.06 0.93 0.35 0.40 0.38 0.40 0.43 0.36 -2 -3 -3 0.35 0.44 0.33 0.81 0.90 0.78 0.77 0.89 0.73 -2 -3 -2 0.45 0.45 0.40 0.95 1.06 0.93 0.41 0.48 0.36 -2 -2 -3 0.49 0.48 0.37 0.41 0.41 0.33 0.99 1.08 0.84 -2 -2 -2 0.51 0.54 0.47 0.47 0.54 0.43 0.44 0.51 0.43 -2 -2 -1 0.58 0.62 0.56 0.59 0.63 0.57 0.44 0.43 0.42 -2 -1 -2 0.61 0.62 0.59 0.47 0.44 0.47 0.60 0.58 0.52 -2 -1 -1 0.69 0.75 0.68 0.36 0.38 0.33 0.32 0.38 0.32 -1 -2 -2 0.44 0.43 0.42 0.58 0.55 0.53 0.59 0.56 0.51 -1 -2 -1 0.36 0.34 0.35 0.69 0.75 0.66 0.36 0.38 0.33 -1 0 -1 0.48 0.42 0.42 0.54 0.52 0.51 0.49 0.48 0.40 -1 0 0 0.54 0.61 0.56 0.39 0.37 0.35 0.38 0.35 0.35 0 -1 -1 0.49 0.49 0.45 0.43 0.44 0.41 0.45 0.46 0.41 0 -1 0 0.40 0.38 0.38 0.60 0.59 0.63 0.39 0.41 0.36 0 0 -1 0.41 0.42 0.36 0.39 0.45 0.43 0.57 0.62 0.60 0 0 0 0.39 0.35 0.38 0.33 0.30 0.29 0.32 0.31 0.33 0 0 1 0.41 0.41 0.38 0.40 0.44 0.43 0.57 0.52 0.54 0 1 0 0.37 0.39 0.38 0.59 0.58 0.62 0.36 0.39 0.39 0 1 1 0.47 0.52 0.43 0.45 0.44 0.40 0.49 0.42 0.43 1 0 0 0.54 0.55 0.56 0.40 0.34 0.38 0.39 0.40 0.37 1 0 1 0.48 0.46 0.47 0.47 0.53 0.54 0.50 0.46 0.42 1 2 1 0.38 0.35 0.30 0.67 0.78 0.71 0.43 0.36 0.36 1 2 2 0.46 0.41 0.38 0.55 0.60 0.55 0.53 0.61 0.53 2 1 1 0.74 0.73 0.69 0.35 0.39 0.36 0.34 0.34 0.34 2 1 2 0.59 0.66 0.58 0.41 0.42 0.44 0.58 0.71 0.56 2 2 1 0.61 0.63 0.58 0.54 0.67 0.59 0.42 0.46 0.46 2 2 2 0.47 0.58 0.48 0.46 0.48 0.40 0.43 0.51 0.44 2 2 3 0.46 0.50 0.36 0.41 0.49 0.34 0.96 1.06 0.97 2 3 2 0.48 0.49 0.37 0.94 1.11 0.93 0.44 0.39 0.35 2 3 3 0.40 0.43 0.36 0.79 0.89 0.87 0.81 0.90 0.86 3 2 2 0.90 1.09 0.97 0.40 0.48 0.41 0.40 0.46 0.39 3 2 3 0.91 1.00 0.87 0.44 0.43 0.33 0.86 0.96 0.87 3 3 2 0.83 1.05 0.85 0.80 0.97 0.88 0.43 0.45 0.40 3 3 3 0.71 0.89 0.80 0.66 0.80 0.76 0.67 0.79 0.75 Note: .96 represents .96-optimal pool; .86 represents .86-optimal pool; C represents baseline pool 114 estimation on the 37 𝜽𝜽 points. In addition, larger bias and RMSE also occurs when 𝜃𝜃1 , 𝜃𝜃2 , and 𝜃𝜃3 are very large or very small, and when 𝜃𝜃’s are away from each other. A comparison between the condition with and without item exposure control shows, when item exposure control is built in, the magnitude of the bias and RMSE at some extreme points becomes larger. The reason why the item exposure control increases the estimation error is explained in the previous sections. In summary, this section presents the results for the MCAT with the test specification of three-dimension non-simple structure and with moderate correlation between 𝜃𝜃1 , 𝜃𝜃2 , and 𝜃𝜃3 . In general, the p-optimal item pools perform similarly as the baseline pool in terms of both overall and conditional accuracy of ability estimation, but the p-optimal item pools can save over 100 items and have a better item pool usage. 
When item exposure control is implemented, the item exposure rate and item overlap rate can be controlled very well. The p-optimal item pools can still provide reliable ability estimation with a relatively small pool size.

A comparison between the high correlation and moderate correlation conditions for Test Specification 3 suggests that the measurement error noticeably increases as the correlation among dimensions decreases: the RMSE increases by about 0.1, and the correlation between estimated and true θ decreases by about 0.05. One possible explanation is that, when the correlation decreases, the amount of information that can be borrowed among the θ's is reduced, and thus the estimation accuracy decreases. Although the measurement error for Test Specification 3 with moderate correlation is larger, it is still smaller than the error for Test Specification 2 with moderate correlation. When θ1, θ2, and θ3 are moderately correlated, adding the cluster of items with a = (1, 1, 1) decreases the RMSE by about 0.05 and increases the correlation by about 0.03 on average. Therefore, item pools with a non-simple structure yield more accurate ability estimation than item pools with a simple structure.

Chapter 6 Discussion and Conclusion

In this chapter, the simulation results and their implications are discussed. Section 6.1 first summarizes the findings from the simulation study and addresses the research questions. Section 6.2 presents a discussion of the results. The implications for item pool development and management are then described in Section 6.3. Finally, the limitations and suggestions for future research are discussed in Section 6.4.

6.1 Summary of Results

This study aimed to generalize the p-optimal item pool design method (Reckase, 2003 & 2007) to multidimensional CAT (MCAT). The item pool is called “p-optimal” because the pool design is specifically tailored to a particular adaptive test; because of this, no single p-optimal item pool is universally optimal. The characteristics of the p-optimal item pool are determined by a number of factors, such as the examinee population and the algorithms used by the adaptive test. Therefore, this study not only designs p-optimal item pools for MCAT, but also examines how the p-optimal item pool is affected by the test specifications, item exposure control, the correlation among dimensions, and the bin size. The results of the simulation study are summarized below.

A total of 24 p-optimal item pools were designed and then developed in this study. Generally speaking, the item difficulty (i.e., the MDIFF value) was symmetrically distributed, with more items located in the middle of the MDIFF scale and fewer items toward each end. The standard deviation of the MDIFF values was 1.5 to 2.3 times larger than the standard deviation of the target examinee population, and the distribution of the MDIFF values was flatter than a standard normal distribution.

The performance of the MCAT using the 24 p-optimal item pools was evaluated by comparison with the MCAT using baseline pools through a simulation study. The results showed that the MCAT using the p-optimal item pools and the MCAT using the baseline pools performed very similarly in terms of ability estimation accuracy, but the pool size for the p-optimal item pools was more than 100 items smaller than that of the baseline pools. In addition, the item pool usage for all the p-optimal item pools was better than that of the baseline pools.
Specifically, when the bin size increased from 0.4 to 0.8, the item pool size decreased by 40% on average. The bin size also determined how much information the best available item in the item pool could provide for ability estimation. A bin size of 0.4 implies that the best available item can provide at least 96% of the maximum possible information, and therefore the item pool is called the .96-optimal item pool. Similarly, a bin size of 0.8 implies a .86-optimal item pool. This is the reason why the average test information yielded by the .86-optimal item pools was smaller than that of the .96-optimal item pools in the simulation study. Even though the pool size and the average test information for the .86-optimal item pools were smaller, the MCATs using the two types of item pools performed very similarly in terms of the accuracy of ability estimation. Similar findings were observed for a unidimensional CAT in Reckase (2010). Because of the smaller pool size, the item overexposure rate and the item overlap rate for the .86-optimal item pools were larger than those for the .96-optimal item pools.

The 24 p-optimal item pools were designed based on three test specifications. The pool sizes for the two-dimension simple structure condition and the three-dimension simple structure condition were very similar. For the two-dimensional case, half of the items in the item pool measured θ1 and the other half measured θ2; for the three-dimensional case, one third of the items measured each of the three θ's. Therefore, when the test length was the same, the pool size for the p-optimal item pools did not change when a cluster of items measuring a different ability was added to the test. However, when the test specification changed from simple structure to non-simple structure, the pool size for the p-optimal item pools increased by about 9%. For the three-dimension non-simple structure case, the proportion of items measuring three abilities was slightly larger than that of items measuring only one ability.

The measurement error for ability estimation yielded by the p-optimal item pools was within an acceptable range for all three test specifications. The error in the two-dimension simple structure condition was slightly smaller than in the three-dimension simple structure condition, because one more θ was estimated in the three-dimensional test while the overall test length was the same for the two tests. The error in the three-dimension non-simple structure condition was also smaller than in the simple structure condition, because the items measuring three abilities provided more information for θ estimation.

A unique factor that influenced the functioning of the MCAT is the correlation among the θ's. When the abilities were highly correlated, the size of the p-optimal item pool was about 10% larger than when the abilities were moderately correlated. Those additional items were mainly located toward each end of the MDIFF scale, with relatively high or low item difficulty. That is to say, for an MCAT measuring highly correlated abilities, a larger number of difficult and easy items should be created for the p-optimal item pool. The ability estimation accuracy in the high correlation condition was better than in the moderate correlation condition. Similar results can be found for multidimensional linear tests and MCAT in Liu (2007), Segall (2005), Yao (2010), and Yao and Boughton (2007).
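As a check on the bin-size labels summarized above, the relative information of a Rasch-type item whose difficulty is a given distance away from the examinee's location can be computed directly: the item's information at a distance delta is P(1 - P) with P = 1 / (1 + exp(-delta)), and its maximum is 0.25 at delta = 0. A small sketch (not tied to the exact geometry of the multidimensional bins) shows that a distance equal to the bin size reproduces the .96 and .86 values.

import math

def relative_information(delta):
    p = 1.0 / (1.0 + math.exp(-delta))
    return 4.0 * p * (1.0 - p)      # information relative to the maximum of 0.25

print(round(relative_information(0.4), 2))  # 0.96 -> ".96-optimal" pool
print(round(relative_information(0.8), 2))  # 0.86 -> ".86-optimal" pool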
When item exposure control was built into the item selection process, the most informative items were not selected too frequently. In this situation, to ensure the ability estimation accuracy of the adaptive test, another, equally informative item should be available in the item pool. If item exposure control is necessary for an MCAT, the p-optimal item pool design can take the item exposure rate into account and adjust the number of items within each MDIFF bin. The goal is to make sure that there is a sufficient number of items in the p-optimal item pool to ensure both ability estimation accuracy and test security. Based on the simulation results, when the bin size was 0.4, item exposure control had nearly no influence on the pool size. When the bin size was 0.8, about 20% more items were needed if item exposure control was implemented; these additional items were all located in the middle of the MDIFF scale, with item difficulty close to 0. When item exposure control was implemented, the measurement error yielded by all the p-optimal item pools increased only slightly. This finding suggests that the p-optimal item pool design is able to balance ability estimation accuracy and test security.

6.2 Discussion of Results

The p-optimal item pools produced in this study were collections of items that meet all the predetermined psychometric specifications and that target a predetermined examinee population. van der Linden (1999) provided three criteria for an optimal item pool: 1) an optimal item pool should be sufficiently large to allow several thousand overlapping subtests to be drawn from its items; 2) an optimal item pool should consist of items spanning the entire range of item difficulty relative to the population of interest; and 3) an optimal item pool should consist of an appropriate mix of high- and low-discriminating items to lower the item creation cost while meeting the needs of ability estimation accuracy.

The first criterion addresses the issue of item pool size. The findings from the simulation study suggest that the size of the p-optimal item pools was affected by a number of factors. For different MCAT programs, the lower limit of the optimal item pool size is different. For example, Stocking (1994) recommended that the item pool size for a high-stakes CAT should be approximately 12 times the test length. A longer CAT requires a larger item pool. Also, for a high-stakes CAT, the item exposure rate is an important issue for test validity and security. When item exposure control is implemented, the item pool should consist of a larger number of items in order to prevent items from being overexposed to examinees. Because a larger item pool tends to address all of these issues, many adaptive testing programs develop a very large item pool for operational use. However, a larger item pool does not necessarily increase the ability estimation accuracy. Instead, the pool usage for a very large item pool might be undesirable. In this study, for example, the three baseline pools contained more than 100 more items than the corresponding p-optimal item pools. According to the simulation results, the baseline pools yielded a similar level of measurement accuracy to the p-optimal item pools. When item exposure control was not implemented, about half of the items in the baseline pools had exposure rates of less than 2%; when item exposure control was implemented, 20% of the items were still underexposed. Those underexposed items were wasted because they were very unlikely to be selected. Therefore, item pool design should seek a balance between the demand for a larger item pool and the potential risk of items being wasted in a larger item pool.
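Returning to the bin-level adjustment for exposure control described earlier in this section, one simple way such an adjustment could be computed is sketched below. This is only an illustration under stated assumptions, not necessarily the exact rule used in this study: if the items falling in a given MDIFF bin are expected to be administered n_b times across N simulees, and no item may exceed an exposure rate of r_max, then the bin needs at least ceil(n_b / (r_max * N)) items.

import math

def items_needed(bin_administrations, n_examinees, r_max=0.20):
    """Minimum item count for one bin under an exposure cap of r_max."""
    return math.ceil(bin_administrations / (r_max * n_examinees))

# Example: a bin used 600 times across 2,000 simulees with r_max = .20
print(items_needed(600, 2000))   # -> at least 2 items in this bin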
The second criterion concerns the range of item difficulty. As stated in the criterion, the range of item difficulty for an optimal item pool is determined by the examinee population. The standard deviation of item difficulty for the p-optimal item pools in this study was 1.5 to 2.3 times larger than the standard deviation of the examinees' abilities, and the item difficulty ranged from -4.0 to 4.0. Similar results were found by Gu (2009), Reckase (2010), and Zhou (2012) for unidimensional p-optimal item pools. For the baseline pools in this study, the standard deviation of item difficulty was more than 2.5 times larger than that of the ability distribution. The baseline pools contained a number of items with extremely high or extremely low item difficulty. These items are useful for examinees with very high or very low ability; however, since such examinees are rare in the population, most of those extreme items are underexposed. Therefore, although an optimal item pool should span the entire range of item difficulty, only a few very difficult or very easy items are needed.

In addition to the examinee population, the range of item difficulty also depends on the purpose of the test. For licensure exams, the purpose of the test is to classify examinees into two or more categories. If the cut score is in the middle of the θ scale, a large number of items with item difficulty near the middle of the scale should be included in the item pool to ensure that the measurement error at the cut score is sufficiently low. In this situation, it is acceptable to drop items with very high or very low item difficulty from the item pool, because they do not contribute much to the measurement accuracy at the cut score. However, if the purpose of the test is to select gifted students, or to identify low-achieving students, a large number of difficult items or easy items, respectively, should be added to the item pool.

The third criterion addresses the issue of item discrimination. Because the MCAT in this study is based on the multidimensional Rasch model, the magnitude of the item discrimination is fixed, but the direction of the item discrimination is not, and that direction affects both test precision and item creation cost. If an item loads on only one dimension, for instance θ1, the direction that is best measured by the item is along θ1; in other words, the item can discriminate only among examinees who vary on θ1. If an item loads on more than one dimension, such as an item from Cluster 4 with a = (1, 1, 1), the direction that is best measured by the item is along the θ1, θ2, and θ3 composite. Such an item most effectively discriminates among examinees located at different points along the θ1, θ2, and θ3 composite line, and it discriminates moderately effectively among examinees who vary on θ1, θ2, or θ3 alone.
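The notion of a best-measured direction can be made concrete with the standard multidimensional descriptors in Reckase (2009): MDISC is the length of the a-vector, MDIFF = -d/MDISC, and the direction cosines are a_k/MDISC. The following minimal sketch (the function name is illustrative) shows why an a = (1, 1, 1) item points along the equal-weight composite, about 54.7 degrees from each axis, while an a = (1, 0, 0) item points straight along θ1.

```python
import math

def item_direction(a, d):
    """Reckase-style descriptors for a compensatory MIRT item with linear
    component a·theta + d: MDISC (overall discrimination), MDIFF (difficulty
    along the best-measured direction), and the angles, in degrees, between
    that direction and each coordinate axis."""
    mdisc = math.sqrt(sum(ak ** 2 for ak in a))
    mdiff = -d / mdisc
    angles = [round(math.degrees(math.acos(ak / mdisc)), 1) for ak in a]
    return mdisc, mdiff, angles

print(item_direction(a=(1, 0, 0), d=0.5))  # -> MDISC 1.0, MDIFF -0.5, angles [0.0, 90.0, 90.0]
print(item_direction(a=(1, 1, 1), d=0.5))  # -> MDISC 1.73, MDIFF -0.29, angles [54.7, 54.7, 54.7] (approx.)
```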
The simulation results in this study suggested that, for a test with simple structure, the p-optimal item pool should contain the same proportion of items measuring each θ in order to meet the ability estimation accuracy requirement for all the θ's. Compared with tests with simple structure, including items with a = (1, 1, 1) in the item pool yielded better ability estimation accuracy but also increased the pool size, and a larger item pool costs more to create. Moreover, compared with items measuring only one ability, items with a = (1, 1, 1) are relatively more difficult to write and cost more to create. Although these items are desirable from a psychometric perspective, they might not be the best choice in practice once the cost of item creation is considered.

Overall, the p-optimal item pools developed in this study met these three criteria, because the item pool design process considered the features of the examinee population, ability estimation accuracy, item pool usage, and the purpose of the test. The size of the p-optimal item pools was sufficient for a large number of examinees, the items spanned the entire range of item difficulty, and the pools yielded acceptable ability estimation accuracy and fairly good item pool usage. Even though item creation cost is not directly addressed in the item pool design process, it can be controlled by adding a content balancing constraint. For example, if there is an upper limit on the proportion of expensive items in the item pool, content balancing algorithms can keep the expensive items from being selected too frequently and can therefore control the proportion of expensive items in the designed item pool.

6.3 Implications

The end product of the p-optimal item pool design for MCAT is a bin-count table, which gives the proportion of items from each cluster and the minimum number of items in each item bin. The bin-count table serves as an instructive guide for item creation, item pool development, and item pool management. Similar to the function of a test blueprint for a linear paper-and-pencil test, the bin-count table is a target for item creation: item writers should create items that meet its requirements. Items measuring only one ability can be treated as unidimensional items, so item writers can create them in the same way they create unidimensional items. Items measuring more than one ability, however, can be difficult to write. When creating items that measure more than one ability, the first thing to consider is the direction that is best measured by the item. Items with a = (1, 1, 1) should measure the three abilities with the same level of discriminating power. In practice, this is very hard to control, because more than one strategy may be used to solve an item, and different strategies may require different combinations of the three abilities. Therefore, it might be helpful to provide item writers with examples and instructions on how to write items measuring multidimensional abilities.

Even if we assume a set of items with a = (1, 1, 1) is successfully created, these items still cannot be guaranteed to function the same way for different groups of examinees. As emphasized in Reckase (2009), "dimensionality is a property of the data matrix, not the test." Although these items are sensitive to differences along the three dimensions, the response data matrix may not be three-dimensional unless there is an adequate amount of variation in the examinee sample along each dimension. Because dimensionality is sample-specific, the quality of the examinee sample for field testing is very important. If the sample is not representative, the characteristics of the multidimensional items may be greatly affected.
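As an illustration of how the bin-count table can steer item writing and pool assembly, the following minimal sketch tallies a draft pool against bin targets and reports the remaining gaps. The helper names are invented; the bin centers and targets are copied from the a = (1, 0) row of Table A.2 purely as an example.

```python
import math
from collections import Counter

BIN_CENTERS = [-3.2, -2.4, -1.6, -0.8, 0.0, 0.8, 1.6, 2.4, 3.2]
# Targets for one cluster, patterned after the a = (1, 0) row of Table A.2.
TARGETS = {((1, 0), c): n for c, n in zip(BIN_CENTERS, [6, 11, 13, 14, 15, 14, 13, 11, 6])}

def mdiff_bin(a, d):
    """Assign an item (a-vector a, intercept d) to the nearest MDIFF bin center."""
    mdiff = -d / math.sqrt(sum(ak ** 2 for ak in a))
    return min(BIN_CENTERS, key=lambda c: abs(c - mdiff))

def shortfalls(draft_pool):
    """Report how many items are still missing from each (cluster, bin) cell
    of the bin-count table, given a draft pool of (a, d) tuples."""
    have = Counter((a, mdiff_bin(a, d)) for a, d in draft_pool)
    return {cell: need - have[cell] for cell, need in TARGETS.items() if have[cell] < need}

# A draft pool with only three theta_1 items written so far.
print(shortfalls([((1, 0), 0.1), ((1, 0), -0.9), ((1, 0), 2.3)]))
```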
Because items measuring more than one dimension are expensive to create and may be unstable in practice, a p-optimal item pool with simple structure might be easier to develop. For an item pool consisting only of items that measure a single ability, some may argue for fitting a unidimensional IRT model and treating each cluster of items as one content area. It is feasible to do so, but the advantages of using a multidimensional IRT model are apparent. First, if a unidimensional IRT model is fitted to this item pool, the assumption of unidimensionality might be violated because the items measure different content areas. Second, if subscores are reported to examinees, MCAT will yield more accurate subscores than UCAT. Third, MCAT can estimate all the θ's simultaneously, whereas UCAT must estimate each θ separately, one at a time; MCAT is therefore more efficient for subscore reporting than UCAT. Because of these advantages, multidimensional p-optimal item pools are more desirable than unidimensional item pools.

In practice, operational item pools are always being renewed: obsolete items are removed from time to time, and new items are added accordingly. Veldkamp and van der Linden (2000) identified monitoring item usage and replenishing new items as two important tasks for item pool management. The p-optimal item pool design presented in this study can be adapted for use in item pool management. If an item located in bin X is retired, a new item should be added to bin X. Because items within each bin are considered equivalent in terms of the amount of information they provide for ability estimation, the new item does not need to be identical to the old item; rather, any item that fits into bin X can be used to replace it. In this way, the concept of the item bin can reduce the cost of item replenishment. In addition, when there is a need to create a master pool that supplies several operational item pools, the p-optimal item pool design can be used to design the master pool as well: if the master pool needs to supply N operational item pools, the size of the master pool should be at least N times that of the p-optimal item pool.
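The replenishment rule can be made concrete with a minimal sketch under the same bin conventions as the Appendix tables; the function names are invented. A replacement only needs to come from the retired item's cluster and land in the retired item's MDIFF bin, and a master pool feeding N operational pools simply scales every bin count by N.

```python
import math

BIN_WIDTH = 0.8  # illustrative; matches the .86-optimal designs

def mdiff(a, d):
    """Multidimensional difficulty of an item with a-vector a and intercept d."""
    return -d / math.sqrt(sum(ak ** 2 for ak in a))

def fits_bin(item, cluster, bin_center, width=BIN_WIDTH):
    """True if a candidate item belongs to the given cluster and its MDIFF
    falls inside the bin centered at bin_center."""
    a, d = item
    return a == cluster and abs(mdiff(a, d) - bin_center) <= width / 2

# Candidate replacements for a retired a = (1, 0) item from the bin centered at 0.
candidates = [((0, 1), -0.3), ((1, 0), 0.2), ((1, 0), 1.5)]
print([c for c in candidates if fits_bin(c, cluster=(1, 0), bin_center=0.0)])
# -> [((1, 0), 0.2)]

# A master pool supplying 3 operational pools scales every bin count by 3,
# e.g., a bin that needs 15 items in one operational pool needs at least 45.
```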
6.4 Limitations and Future Studies

The results of this study demonstrated the advantages of using the p-optimal item pool design to develop item pools for several MCATs with different features. The results indicate that p-optimal item pools can ensure ability estimation accuracy as well as good item pool usage. This conclusion, however, is restricted by the fact that the items were fit by the multidimensional Rasch model. The item discrimination parameters for the multidimensional Rasch model are fixed by test developers rather than estimated from the data matrix. If inaccurate item discrimination parameters are assigned to some items in the item pool, the extent to which ability estimation accuracy would be affected is unknown. Future research can examine the consequences of item discrimination being inaccurately specified. In addition, compared with the multidimensional Rasch model, the multidimensional 2PL or 3PL model tends to fit data better in practice. Gu (2007) generalized the p-optimal item pool design method to the unidimensional 3PL model, and it would also be worthwhile to generalize Gu's methodology to the multidimensional 2PL or 3PL model.

This study is also based on the assumption that examinees are multivariate normally distributed. In reality, the distribution of examinees is not always normal, and the expected distribution may not always match the actual one. The question raised is how robust the p-optimal item pool design is to violations of the assumed shape of the examinee distribution. That is, if a p-optimal item pool is developed based on a multivariate normal distribution but the actual examinees are not normally distributed, how will the performance of an MCAT using this item pool be affected? Future research can investigate this issue through simulation.

Two bin sizes were considered in this study for the p-optimal item pool design. An increase in the bin size results in a smaller p-optimal item pool. The .86-optimal item pools in this study yielded a similar level of ability estimation accuracy to the .96-optimal item pools when item exposure control was not implemented. If the item exposure rate is not an important issue, a smaller item pool is desirable because it costs less to create. Therefore, it might be interesting to investigate how large the bin size can become before the MCAT no longer functions well. The results could help determine an appropriate bin size in the future.

The items in this study are purely dichotomous. As educational measurement moves toward the next generation of assessments, new item types have emerged and brought significant challenges for test developers. For example, new item types have been created for the Smarter Balanced tests and will be used operationally; performance tasks are one such type. A performance task usually requires students to follow several steps to accomplish it; each step can be treated as one item, and the entire task is considered a testlet. At this point, it is still unknown how to develop a p-optimal item pool for an adaptive test consisting of this item type. Since a number of states will soon adopt the Smarter Balanced assessments to replace their current K-12 large-scale standardized assessments, the quality of the item pool is an important issue from both psychometric and policy perspectives. Therefore, the p-optimal item pool design for new item types is a promising direction for future research as well.
APPENDIX

Table A.1: Bin count table for the .96-optimal item pool (Test Specification 1, high correlation, without item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0) 3 7 9 11 13 13 14 15 15 14 14 13 12 11 9 7 4
a = (0, 1) 3 7 9 11 13 13 14 15 15 15 14 13 12 11 9 7 4

Table A.2: Bin count table for the .86-optimal item pool (Test Specification 1, high correlation, without item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0) 6 11 13 14 15 14 13 11 6
a = (0, 1) 6 11 13 14 15 14 13 11 6

Table A.3: Bin count table for the .96-optimal item pool (Test Specification 1, moderate correlation, without item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0) 2 5 8 9 11 12 13 14 14 14 14 12 11 10 8 5 2
a = (0, 1) 2 5 8 9 11 12 13 14 14 14 14 12 11 10 8 5 2

Table A.4: Bin count table for the .86-optimal item pool (Test Specification 1, moderate correlation, without item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0) 3 10 13 14 15 14 13 10 4
a = (0, 1) 4 10 13 14 15 14 13 10 3

Table A.5: Bin count table for the .96-optimal item pool (Test Specification 2, high correlation, without item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0, 0) 2 4 6 7 8 9 10 10 10 10 10 9 8 7 6 4 2
a = (0, 1, 0) 2 4 6 7 8 9 10 10 10 10 10 9 8 7 6 4 3
a = (0, 0, 1) 1 4 6 7 8 9 10 10 10 10 10 9 8 7 6 4 2

Table A.6: Bin count table for the .86-optimal item pool (Test Specification 2, high correlation, without item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0, 0) 3 7 9 10 10 10 9 7 3
a = (0, 1, 0) 4 8 9 10 10 10 9 8 3
a = (0, 0, 1) 3 7 9 10 10 10 9 7 3

Table A.7: Bin count table for the .96-optimal item pool (Test Specification 2, moderate correlation, without item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0, 0) 0 3 5 6 7 8 9 10 10 9 9 8 7 6 5 2 1
a = (0, 1, 0) 1 3 5 6 7 8 9 10 10 10 9 8 7 6 5 3 1
a = (0, 0, 1) 1 3 5 6 7 8 9 10 10 10 9 9 7 6 5 3 1

Table A.8: Bin count table for the .86-optimal item pool (Test Specification 2, moderate correlation, without item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0, 0) 1 6 8 10 10 10 8 6 1
a = (0, 1, 0) 2 7 9 10 10 10 9 6 1
a = (0, 0, 1) 2 7 9 10 10 10 9 7 2

Table A.9: Bin count table for the .96-optimal item pool (Test Specification 3, high correlation, without item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0, 0) 2 4 5 6 7 7 7 8 8 8 7 7 7 6 5 4 2
a = (0, 1, 0) 3 4 5 6 7 7 7 8 8 8 7 7 7 6 5 4 3
a = (0, 0, 1) 2 4 5 6 7 7 7 8 8 8 7 7 7 6 5 4 2
MDIFF -5.6 -4.9 -4.2 -3.5 -2.8 -2.1 -1.4 -0.7 0 0.7 1.4 2.1 2.8 3.5 4.2 4.9 5.6
a = (1, 1, 1) 3 4 6 6 7 7 8 8 8 8 8 7 7 6 5 4 3

Table A.10: Bin count table for the .86-optimal item pool (Test Specification 3, high correlation, without item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0, 0) 3 6 7 8 8 8 7 6 3
a = (0, 1, 0) 3 6 7 8 8 8 7 6 3
a = (0, 0, 1) 3 6 7 8 8 8 7 6 3
MDIFF -5.6 -4.2 -2.8 -1.4 0 1.4 2.8 4.2 5.6
a = (1, 1, 1) 4 7 8 9 9 9 8 7 4

Table A.11: Bin count table for the .96-optimal item pool (Test Specification 3, moderate correlation, without item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0, 0) 1 3 4 5 6 7 7 7 8 7 7 7 6 5 4 3 1
a = (0, 1, 0) 1 3 4 5 6 7 7 7 7 7 7 7 6 5 4 3 1
a = (0, 0, 1) 2 3 4 5 6 7 7 7 8 7 7 7 6 6 5 3 1
MDIFF -5.6 -4.9 -4.2 -3.5 -2.8 -2.1 -1.4 -0.7 0 0.7 1.4 2.1 2.8 3.5 4.2 4.9 5.6
a = (1, 1, 1) 1 3 5 6 7 7 8 8 8 8 7 7 7 6 5 3 1

Table A.12: Bin count table for the .86-optimal item pool (Test Specification 3, moderate correlation, without item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0, 0) 2 5 7 8 8 8 7 5 2
a = (0, 1, 0) 2 5 7 8 8 8 7 5 2
a = (0, 0, 1) 2 5 7 8 8 8 7 5 3
MDIFF -5.6 -4.2 -2.8 -1.4 0 1.4 2.8 4.2 5.6
a = (1, 1, 1) 2 6 8 9 9 9 8 6 2

Table A.13: Bin count table for the .96-optimal item pool (Test Specification 1, high correlation, with item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0) 4 6 9 11 12 13 14 15 17 15 14 13 12 11 9 7 3
a = (0, 1) 4 7 9 11 12 13 14 15 17 15 14 13 12 11 9 7 3

Table A.14: Bin count table for the .86-optimal item pool (Test Specification 1, high correlation, with item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0) 6 11 13 17 32 17 13 11 6
a = (0, 1) 6 11 13 17 32 17 13 11 6

Table A.15: Bin count table for the .96-optimal item pool (Test Specification 1, moderate correlation, with item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0) 2 5 8 10 11 12 13 14 18 14 13 12 11 9 8 5 2
a = (0, 1) 2 5 8 9 11 12 13 14 18 14 13 12 11 10 7 5 2

Table A.16: Bin count table for the .86-optimal item pool (Test Specification 1, moderate correlation, with item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0) 4 10 13 18 33 18 13 10 4
a = (0, 1) 4 10 13 18 33 18 13 10 4

Table A.17: Bin count table for the .96-optimal item pool (Test Specification 2, high correlation, with item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0, 0) 2 4 6 7 8 9 10 10 12 10 10 9 8 7 6 4 2
a = (0, 1, 0) 2 4 6 7 8 9 10 10 12 10 10 9 8 7 6 4 2
a = (0, 0, 1) 1 4 6 7 8 9 10 10 12 10 10 9 8 7 6 4 1

Table A.18: Bin count table for the .86-optimal item pool (Test Specification 2, high correlation, with item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0, 0) 3 7 9 12 21 12 9 7 3
a = (0, 1, 0) 3 8 9 12 21 12 9 8 3
a = (0, 0, 1) 3 7 9 12 21 12 9 7 3

Table A.19: Bin count table for the .96-optimal item pool (Test Specification 2, moderate correlation, with item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0, 0) 0 2 4 6 7 8 9 10 12 10 9 8 7 6 5 2 1
a = (0, 1, 0) 1 3 5 6 7 9 9 10 12 10 9 8 7 6 5 3 1
a = (0, 0, 1) 1 3 5 6 8 9 9 10 12 10 9 9 7 6 5 3 1

Table A.20: Bin count table for the .86-optimal item pool (Test Specification 2, moderate correlation, with item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0, 0) 1 6 8 12 22 12 8 6 1
a = (0, 1, 0) 2 6 9 12 22 12 9 6 2
a = (0, 0, 1) 2 6 9 12 22 12 9 6 2

Table A.21: Bin count table for the .96-optimal item pool (Test Specification 3, high correlation, with item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0, 0) 2 4 5 6 7 7 7 8 8 8 7 7 7 6 5 4 2
a = (0, 1, 0) 3 4 5 6 7 7 7 8 8 7 7 7 7 6 5 4 3
a = (0, 0, 1) 2 4 5 6 7 7 7 8 8 8 7 7 7 6 5 4 2
MDIFF -5.6 -4.9 -4.2 -3.5 -2.8 -2.1 -1.4 -0.7 0 0.7 1.4 2.1 2.8 3.5 4.2 4.9 5.6
a = (1, 1, 1) 3 4 6 6 7 7 8 8 9 8 8 7 7 6 5 4 3

Table A.22: Bin count table for the .86-optimal item pool (Test Specification 3, high correlation, with item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0, 0) 3 6 7 9 15 9 7 6 4
a = (0, 1, 0) 4 6 7 9 14 9 7 6 4
a = (0, 0, 1) 3 6 7 9 15 9 7 6 3
MDIFF -5.6 -4.2 -2.8 -1.4 0 1.4 2.8 4.2 5.6
a = (1, 1, 1) 4 7 8 10 17 10 8 7 4

Table A.23: Bin count table for the .96-optimal item pool (Test Specification 3, moderate correlation, with item exposure control)
MDIFF -3.2 -2.8 -2.4 -2 -1.6 -1.2 -0.8 -0.4 0 0.4 0.8 1.2 1.6 2 2.4 2.8 3.2
a = (1, 0, 0) 1 3 4 5 6 7 7 7 9 7 7 7 6 5 4 3 1
a = (0, 1, 0) 1 3 4 5 6 7 7 7 8 7 7 7 6 6 4 3 1
a = (0, 0, 1) 1 3 5 6 6 7 7 7 8 7 7 7 6 6 5 3 1
MDIFF -5.6 -4.9 -4.2 -3.5 -2.8 -2.1 -1.4 -0.7 0 0.7 1.4 2.1 2.8 3.5 4.2 4.9 5.6
a = (1, 1, 1) 1 3 5 6 7 7 8 8 10 8 7 7 7 6 5 3 1

Table A.24: Bin count table for the .86-optimal item pool (Test Specification 3, moderate correlation, with item exposure control)
MDIFF -3.2 -2.4 -1.6 -0.8 0 0.8 1.6 2.4 3.2
a = (1, 0, 0) 2 5 7 9 15 9 7 5 2
a = (0, 1, 0) 2 5 7 9 15 9 7 5 2
a = (0, 0, 1) 2 5 7 9 15 9 7 5 2
MDIFF -5.6 -4.2 -2.8 -1.4 0 1.4 2.8 4.2 5.6
a = (1, 1, 1) 2 6 8 10 18 10 8 6 2

REFERENCES

Adams, R. J., Wilson, M., & Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-24.
Ansley, T. N. (1984). Using a unidimensional latent trait model with multidimensional data: An empirical investigation of robustness. Unpublished doctoral dissertation, University of Iowa, Iowa City, IA.
Bejar, I. I., & Weiss, D. J. (1979). Computer programs for scoring test data with item characteristic curve models (Research Report 79-1). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-472). Reading, MA: Addison-Wesley.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431-444.
Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27(6), 395-414.
Buyske, S. (2005). Optimal design in educational testing. In M. P. F. Berger & W. K. Wong (Eds.), Applied optimal designs (pp. 1-19). West Sussex, UK: Wiley.
Common Core State Standards Initiative (2010). Common Core State Standards for Mathematics. Washington, DC: CCSSO & National Governors Association.
Chang, H.-H., & Stout, W. (1993). The asymptotic posterior normality of the latent trait in an IRT model. Psychometrika, 58, 37-52.
Chang, H.-H., & Ying, Z. (1996). A Global Information Approach to Computerized Adaptive Testing. Applied Psychological Measurement, 20(3), 213-229.
Chang, H.-H., & Ying, Z. (1999). a-Stratified Multistage Computerized Adaptive Testing. Applied Psychological Measurement, 23(3), 211-222.
Chen, S.-Y., & Ankenmann, R. D. (2004). Effects of practical constraints on item selection rules at the early stages of computerized adaptive testing. Journal of Educational Measurement, 41(2), 149-174.
Chen, S.-Y., Ankenmann, R. D., & Chang, H.-H. (2000). A Comparison of Item Selection Rules at the Early Stages of Computerized Adaptive Testing. Applied Psychological Measurement, 24(3), 241-255.
Cheng, Y., & Chang, H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical and Statistical Psychology, 63, 369-383.
Childs, R. A., & Oppler, S. H. (2000). Implications of Test Dimensionality for Unidimensional IRT Scoring: An Investigation of a High-Stakes Testing Program. Educational and Psychological Measurement, 60(6), 939-955.
Diao, Q. (2009). Comparison of ability estimation and item selection methods in MCAT (Unpublished doctoral dissertation). Michigan State University, East Lansing, MI.
Frey, A., Cheng, Y., & Seitz, N. (2011). Content Balancing with the Maximum Priority Index Method in Multidimensional Adaptive Testing. Paper presented at the 2011 annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Frey, A., & Seitz, N.-N. (2009). Multidimensional adaptive testing in educational and psychological measurement: Current state and future challenges. Studies in Educational Evaluation, 35, 89-94.
Georgiadou, E., Triantafillou, E., & Economides, A. A. (2007). A review of item exposure control strategies for computerized adaptive testing from 1983 to 2005. The Journal of Technology, Learning, and Assessment, 5(8).
Gordon Commission (2013). To assess, to teach, to learn: A vision for the future of assessment. Princeton, NJ: The Gordon Commission. Retrieved from http://www.gordoncommission.org/rsc/pdfs/gordon_commission_technical_report.pdf
Gu, L. (2007). Designing optimal item pools for computerized adaptive tests with exposure controls (Unpublished doctoral dissertation). Michigan State University.
He, W. (2010). Optimal item pool design for a highly constrained computerized adaptive test (Unpublished doctoral dissertation). Michigan State University.
Hetter, R., & Sympson, B. (1997). Item exposure control in CAT-ASVAB. In W. Sands, B. Waters, & J. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141-144). Washington, DC: American Psychological Association.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79-86.
Leung, C. K., Chang, H., & Hau, K. T. (2003). Incorporation of content balancing requirements in stratification designs for computerized adaptive testing. Educational and Psychological Measurement, 63, 257-270.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Luecht, R. M. (1996). Multidimensional computerized adaptive testing in a certification or licensure context. Applied Psychological Measurement, 20, 389-404.
Luecht, R. M. (1998). A framework for exploring and controlling risks associated with test item exposure over time. Paper presented at the annual meeting of the National Council on Measurement in Education.
Mao, L., Luo, X., & Zhou, X. (2013). The Comparison of the Unidimensional and Multidimensional CAT in terms of Composite Score Estimation. Paper presented at the 2013 Annual Meeting of the American Educational Research Association, San Francisco, CA.
Mulder, J., & van der Linden, W. (2009). Multidimensional Adaptive Testing with Optimal Design Criteria for Item Selection. Psychometrika, 74(2), 273-296.
National Assessment of Educational Progress (NAEP). (2010). Retrieved from http://nces.ed.gov/nationsreportcard/tdw/analysis/2007/scaling_determination_correlations_math2007conditional.asp
Patsula, L. N., & Steffan, M. (1997). Maintaining item and test security in a CAT environment: A simulation study. Paper presented at the annual meeting of the National Council on Measurement in Education.
Rasch, G. (1962). On general laws and the meaning of measurement in psychology. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 4, 321-334.
Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9(4), 401-412.
Reckase, M. D. (1989). Adaptive testing: The evolution of a good idea. Educational Measurement: Issues and Practice, 8(3), 11-15.
Reckase, M. D. (2003). Item pool design for computerized adaptive tests. Paper presented at the 2003 Annual Meeting of the National Council on Measurement in Education, Chicago, IL.
Reckase, M. D. (2009). Multidimensional Item Response Theory. New York, NY: Springer.
Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15(4), 361-373.
Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331-354.
Segall, D. O. (2010). Principles of Multidimensional Adaptive Testing. In Elements of Adaptive Testing (pp. 57-75). New York, NY: Springer.
Segall, D. O., Moreno, K. E., & Hetter, D. H. (1997). Item pool development and evaluation. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 117-130). Washington, DC: American Psychological Association.
Seitz, N.-N., & Frey, A. (2013). The sequential probability ratio test for multidimensional adaptive testing with between-item multidimensionality. Psychological Test and Assessment Modeling, 55(1), 105-123.
Smarter Balanced Assessment Consortium (2013). On Track and Moving Forward: The Smarter Balanced Assessment System. Retrieved from http://www.smarterbalanced.org/resourcesevents/publications-resources/
Song, T. (2010). The effect of fitting a unidimensional IRT model to multidimensional data in content-balanced computerized adaptive testing (Unpublished doctoral dissertation). Michigan State University.
Stocking, M. L. (1994). Three practical issues for modern adaptive testing item pools (ETS Research Report No. 94-05). Princeton, NJ: Educational Testing Service.
Svetina, D. (2013). Assessing Dimensionality of Noncompensatory Multidimensional Item Response Theory With Complex Structures. Educational and Psychological Measurement, 73(2), 312-338.
Swanson, D., & Stocking, M. L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17, 277-292.
Sympson, J. (1978). A model for testing with multidimensional items. In D. J. Weiss (Ed.), Proceedings of the 1977 computerized adaptive testing conference. Minneapolis, MN: University of Minnesota.
Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. Paper presented at the 27th Annual Meeting of the Military Testing Association, San Diego, CA.
Urry, V. W. (1977). Tailored testing: A successful application of latent trait theory. Journal of Educational Measurement, 14(2), 181-196.
van der Linden, W. J. (1999). Multidimensional adaptive testing with a minimum error-variance criterion. Journal of Educational and Behavioral Statistics, 24(4), 398-412.
van der Linden, W. J. (2005). Linear models for optimal test design. New York, NY: Springer.
van der Linden, W. J., & Glas, C. A. W. (Eds.). (2010). Elements of Adaptive Testing. New York, NY: Springer.
van der Linden, W. J., & Reese, L. (1998). A model for optimal constrained adaptive testing. Applied Psychological Measurement, 22(3), 259-270.
P., & van der Linden, W. J. (2000). Designing item pools for computerized adaptive testing. In Computerized Adaptive Testing: Theory and Practice (pp. 149–166). Dordrecht: Kluwer. Veldkamp, B., & Linden, W. (2002). Multidimensional adaptive testing with constraints on test content. Psychometrika, 67(4), 575–588. Wainer, H. (1990). Computerized adaptive testing: A primer. In. Hillsdale, NJ: Lawrence Erlbaum Associates. Wainer, H. (2000). Computerized adaptive testing: A primer. Mahwah, N. J.: Lawrence Erlbaum Associates. Wang, C., & Chang, H. (2011). Item selection in multidimensional computer adaptive testing— gaining information from different angles. Psychometrika, 76(3), 363–384. Wang, W. C., & Chen, P. H. (2004). Implementation and Measurement Efficiency of Multidimensional Computerized Adaptive Testing. Applied Psychological Measurement, 28, 450–480. Way, W. D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17, 17-27. Weiss, A.A., 1982, Asymptotic theory for ARCH models: Stability, estimation and testing, Discussion paper 82-36 (University of California, San Diego, CA). Wu, M., & Adams, R. (2006). Modelling mathematics problem solving item responses using a multidimensional IRT model. Mathematics Education Research Journal, 18(2), 93–113. Yao, L. (2012). Multidimensional CAT item selection methods for domain scores and composite scores with item exposure control and content constraints. Paper presented at the 2012 Annual Meeting of the National Council on Measurement in Education, Vancouver, Canada. 138 Yao, L. (2013). Comparing the Performance of Five Multidimensional CAT Selection Procedures With Different Stopping Rules. Applied Psychological Measurement, 37(1), 3–23. 139