"(w r A'. S‘- 1" 0.y.’ 0.2.: «a... - ~,L "Y ' . WEN? ' .U. m ‘1; . i' ' ‘32 2: 22:23-2- A. I’l" w... r» 22.2 fl.» ' '4’. '3' “I If: '._-— w... - if: i '2: 4 .5216”: 2..., . .h - *3: 33.“. 3F”. In ‘ :49: c .1 w. 5. a E 2 Hon .5? at”: 3 in: a" ' 2;. 522 a... sum ‘ 2%» .... N. .-.w .u. fiQ‘n *— ? - I .' 5 £1. ‘rfi "Q- n-' ’ .. c , mm» ' rx‘l‘ - f :«n» ox. 2:: 3', .. mm, .‘x’ u nu»...- W .— .""4 n o.- m. >=: -: “ «MN .1“ -. u. 2... \ .31. . a: ...- ‘- m.- 3. m,“ w This is to certify that the dissertation entitled DEVELOPMENT AND EVALUATION OF AN AUTOMATED MULTIDIMENSIONAL TEST ASSEMBLY METHOD BASED ON QUADRATIC KNAPSACK PROGRAMMING presented by Raymond Mapuranga has been accepted towards fulfillment of the requirements for the Ph.D. degree in CEPSE 7er4) £9 [Qua/2, Major Professofs Signature I/BI/O'7 / / . Date MSU is an Affirmative Action/Equal Opportunity Institution - --—-— u-n---—n-c-n—Q-I-I-O-u-l—.-o--.-.-—.-l-Q--.--.-.--C-n-O-I-C‘o-D-I-C-O’I-O_O-Q-h-.-.-.--._ _-_._.-.—-—-—._-—_- .-. L E :11: L :Ij‘ti {Y I‘H'Cliii‘ in 53139 L,“ ‘i: iirz: it‘y PLACE IN RETURN BOX to remove this checkout from your record. TO AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATEDUE DAIEDUE DATEDUE AUG 2 9 2009 JUL 0 1 2009 1m; 0 :3 MO L, . 092109 6/07 p:lCIRC/DateDue.indd-p.1 DEVELOPMENT AND EVALUATION OF AN AUTOMATED MULTIDIMENSIONAL TEST ASSEMBLY METHOD BASED ON QUADRATIC KNAPSACK PROGRAMMING By Raymond Mapuranga A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Counseling, Educational Psychology, and Special Education 2007 ACKNOWLEDGMENTS I would like to thank Mark Reckase for his tremendous help, direction and support. The other dissertation committee members: Kimberly Maier, Richard Houang and Suzanne Wilson, also provided outstanding help and direction. Michael Jodoin made significant contributions during the entire duration of this project. Blakely and Adam’s wonderful friendship and encouragement made the completion of this work possible. I am indebted to Dinoj for his technical help and superior insights. Some other people that were extremely helpful are: Joseph Martineau, Catherine McClellan, Lixiong Gu, Jonathan Manalo and Leah Walker. Thanks to Max for his love and kindness. A special thank you to my parents for making sacrifices that have allowed me to reach this point in my formal education. The financial support of the Spencer Research Training Grant from Michigan State University and the ETS Graduate Fellowship allowed the timely completion of this work. iii TABLE OF CONTENTS LIST OF TABLES ............................................................................................................ vii LIST OF FIGURES ............................................................ A ............................................. viii KEY TO ABBREVIATIONS .......................................................................................... xiii CHAPTER 1 INTRODUCTION ........................................................................................ 1 1.1 Test Specifications ..................................................................................................... 2 1.2 Combinatorial Optimization and Heuristic Techniques ............................................ 4 1.3 Item Response Theory and Dimensionality ............................................................... 
1.4 Test Assembly Process
1.5 Achievement and Certification Tests
1.6 Motivation
1.7 Purpose

CHAPTER 2 LITERATURE REVIEW
2.1 Multidimensional Two Parameter Logistic Model
2.2 Item Vectors
2.3 Item Content Clusters
2.4 Multidimensional Information
2.4.1 Ackerman Information
2.4.2 Reckase Information
2.4.3 Difference between Reckase and Ackerman Information
2.5 D-optimality Criterion
2.6 0-1 Linear Programming
2.6.1 Veldkamp's Automated Multidimensional Test Assembly Approach
2.7 Quadratic Knapsack Programming

CHAPTER 3 METHODOLOGY
3.1 Modeling the Test Assembly Problem
3.2 The QKP Objective Function
3.3 The Quadratic Knapsack Programming Heuristic
3.3.1 Solving Nonlinear Problems
3.3.2 Phases of the QKP Heuristic
3.3.3 Details of the QKP Heuristic
3.4 Random Item Selection Method
3.5 Item Pool Simulations
3.6 Test Length
3.7 Simulees
3.8 Test Assembly Model
3.9 Unidimensional Item Response Theory Simulations
3.10 Data Analyses and Evaluation Criteria
3.10.1 Computational Efficiency
3.10.2 Maximum Information
3.10.3 Pairwise Comparisons Using Clamshell Plots
3.10.4 Relative Efficiency

CHAPTER 4 RESULTS
4.1 Simulated Data
4.2 D-optimality
4.3 Comparison of Constructed Tests
4.4 Computational Efficiency
4.5 Maximum Information
4.6 Pairwise Comparisons Using Clamshell Plots
4.7 Relative Efficiency

CHAPTER 5 DISCUSSION AND CONCLUSIONS
5.1 Unique Contributions
5.2 Computational Experiments and Evaluation Criteria
5.2.1 Item Pool and Ability Simulations
5.2.2 Computations of D-optimality and Characteristics of Assembled Tests
5.2.3 Computational Efficiency
5.2.4 MAXINFO and Pairwise Comparisons with Clamshell Plots
5.2.5 Unidimensional Relative Efficiency of Assembled Tests
5.3 Limitations
5.4 Future Research

REFERENCES

LIST OF TABLES

2.1 Item Parameters for a 20-Item Test
3.1 Item Characteristics for Low, Moderate and High Discrimination Item Pools
4.1 Descriptive Statistics of Parameters for 300-Length Item Pools
4.2 Descriptive Statistics of Clusters for Low, Moderate and High Discrimination Item Pools of Length 300
4.3 Descriptive Statistics of Parameters for 900-Length Item Pools
4.4 Descriptive Statistics of Clusters for Low, Moderate and High Discrimination Item Pools of Length 900
4.5 Items Selected for 21-Item QKP, LP and RND Assembled Tests Using the 300-Item Low Discrimination Item Pool
4.6 Items Selected for 21-Item QKP, LP and RND Assembled Tests Using the 900-Item Low Discrimination Item Pool

LIST OF FIGURES

1.1 Sample Specifications for a Fictitious Mathematics Test Blueprint
1.2 The Coding of Items in a Pool as 0-1 Decision Variables
1.3 Schematic of the Test Assembly Process
2.1 Item Response Surface for the M2PLM
2.2 Sample Plot of Item Vectors
2.3 Angular Distance in Two-Dimensional Space
2.4 Dendrogram of Item Content Clusters Based on Item Parameters in Table 2.1
4.1 D-Optimality Criterion Results for the 300-Length Low Discrimination Item Pool
4.2 D-Optimality Criterion Results for the 300-Length Moderate Discrimination Item Pool
4.3 D-Optimality Criterion Results for the 300-Length High Discrimination Item Pool
4.4 D-Optimality Criterion Results for the 900-Length Low Discrimination Item Pool
4.5 D-Optimality Criterion Results for the 900-Length Moderate Discrimination Item Pool
4.6 D-Optimality Criterion Results for the 900-Length High Discrimination Item Pool
4.7 Real Time Performance of Each Assembly Method for the 300-Length Low Discrimination Item Pool
4.8 CPU Time Performance of Each Assembly Method for the 300-Length Low Discrimination Item Pool
4.9 Real Time Performance of Each Assembly Method for the 300-Length Moderate Discrimination Item Pool
4.10 CPU Time Performance of Each Assembly Method for the 300-Length Moderate Discrimination Item Pool
4.11 Real Time Performance of Each Assembly Method for the 300-Length High Discrimination Item Pool
4.12 CPU Time Performance of Each Assembly Method for the 300-Length High Discrimination Item Pool
4.13 Real Time Performance of Each Assembly Method for the 900-Length Low Discrimination Item Pool
4.14 CPU Time Performance of Each Assembly Method for the 900-Length Low Discrimination Item Pool
4.15 Real Time Performance of Each Assembly Method for the 900-Length Moderate Discrimination Item Pool
4.16 CPU Time Performance of Each Assembly Method for the 900-Length Moderate Discrimination Item Pool
4.17 Real Time Performance of Each Assembly Method for the 900-Length High Discrimination Item Pool
4.18 CPU Time Performance of Each Assembly Method for the 900-Length High Discrimination Item Pool
4.19 MAXINFO Results for a 300-Length Low Discrimination Item Pool
4.20 MAXINFO Results for a 300-Length Moderate Discrimination Item Pool
4.21 MAXINFO Results for a 300-Length High Discrimination Item Pool
4.22 MAXINFO Results for a 900-Length Low Discrimination Item Pool
4.23 MAXINFO Results for a 900-Length Moderate Discrimination Item Pool
4.24 MAXINFO Results for a 900-Length High Discrimination Item Pool
4.25 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Low Discrimination 300-Length Item Pool
4.26 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Low Discrimination 300-Length Item Pool
4.27 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Low Discrimination 300-Length Item Pool
4.28 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Low Discrimination 300-Length Item Pool
4.29 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool
4.30 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool
4.31 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool
4.32 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool
4.33 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the High Discrimination 300-Length Item Pool
4.34 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the High Discrimination 300-Length Item Pool
4.35 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the High Discrimination 300-Length Item Pool
4.36 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the High Discrimination 300-Length Item Pool
4.37 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Low Discrimination 900-Length Item Pool
4.38 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Low Discrimination 900-Length Item Pool
4.39 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Low Discrimination 900-Length Item Pool
4.40 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Low Discrimination 900-Length Item Pool
4.41 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool
4.42 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool
4.43 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool
4.44 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool
4.45 Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the High Discrimination 900-Length Item Pool
4.46 Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the High Discrimination 900-Length Item Pool
4.47 Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the High Discrimination 900-Length Item Pool
4.48 Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the High Discrimination 900-Length Item Pool
4.49 Test Information Functions for 21-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.50 Efficiency Functions for 21-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.51 Test Information Functions for 63-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.52 Efficiency Functions for 63-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.53 Test Information Functions for 21-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.54 Efficiency Functions for 21-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.55 Test Information Functions for 63-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.56 Efficiency Functions for 63-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.57 Test Information Functions for 21-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.58 Efficiency Functions for 21-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.59 Test Information Functions for 63-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.60 Efficiency Functions for 63-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.61 Test Information Functions for 21-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.62 Efficiency Functions for 21-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.63 Test Information Functions for 63-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.64 Efficiency Functions for 63-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods
4.65 Test Information Functions for 21-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.66 Efficiency Functions for 21-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.67 Test Information Functions for 63-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.68 Efficiency Functions for 63-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods
4.69 Test Information Functions for 21-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.70 Efficiency Functions for 21-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.71 Test Information Functions for 63-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods
4.72 Efficiency Functions for 63-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods

KEY TO ABBREVIATIONS

3PLM      Three-parameter logistic model
ATA       Automated test assembly
AMTA      Automated multidimensional test assembly
CV        Coefficient of variation
IRT       Item response theory
LP        Linear programming
M2PLM     Multidimensional two-parameter logistic model
MAXINFO   Maximum information
MDIFF     Multidimensional difficulty
MDISC     Multidimensional discrimination
MINF      Multidimensional information
MIRT      Multidimensional item response theory
QKP       Quadratic knapsack programming
RE        Relative efficiency
RND       Random item selection algorithm
UIRT      Unidimensional item response theory

CHAPTER 1 INTRODUCTION

In recent years, research in psychometrics has given more attention to the optimal assembly of tests in a fully automated fashion (van der Linden, 1998, 2005; Luecht, 1998). This process, called automated test assembly (ATA), can be described as a set of methods in which test items are selected from an item pool (also called an item bank) in order to construct one or more tests that meet both statistical (e.g., information, reliability, discrimination, difficulty) and substantive (i.e., content and cognitive classifications) test specifications. In practice, the statistical and substantive properties tend to fall into two classes: first, constraints, which are test specifications that must be met for an acceptable test to be assembled; and second, goals, which are those specifications the test should meet as well as possible, given the available items. Normally, substantive specifications form constraints, and statistical specifications, within some broad acceptability criteria, form goals (van der Linden, 1998; Veldkamp, 2002).
The word "automated" of course refers to the use of computers to facilitate this process. Hence, ATA can be viewed as an optimization method in which the idea is to optimize some statistical test property (e.g., test information, test reliability, item discrimination, or item difficulty) for a constructed test, either by maximizing or minimizing that property or by setting it equal to some predefined value. ATA, otherwise known as optimal test design or automated test construction, is a relatively new method in test construction (van der Linden, 2005). The conceptualization and impetus for using this method came from the suggestions of several researchers who espoused the use of optimization techniques for assembling tests (e.g., Feuerman & Weiss, 1973; Votaw, 1952; Yen, 1983). The first attempts at ATA gave acceptable solutions, but did not fully automate assembly, since the intervention of test assemblers was required in order for solutions to be found (e.g., Theunissen, 1985).

The most feasible solution to ATA was first presented by van der Linden and Timminga (1989). This paper showed that the various types of content and statistical criteria which govern the construction of a test could be formulated by the test developer in an ATA model in the same way these criteria had traditionally been specified. These criteria were expressed in the form of linear equalities or inequalities. The latter criteria (also called constraints) defined limitations to which assembly of the test should adhere. Additionally, in this ATA model, the goal for the development of a particular test (e.g., to maximize test information, to maximize test reliability, or to fit a specific target information function) could be specified as an "objective function". Therefore, when the goal of a test is defined by the objective function and the test specifications are defined as constraints, van der Linden and Timminga (1989) showed that computers could be used to assemble a test that optimally met the needs of the test developer. This remarkable innovation marked the birth of ATA.

1.1 Test Specifications

Test specifications are the rules that govern ATA. These rules are defined by statements that formulate requirements for attributes and/or characteristics of a test or its items. They are classified as: a) constraints, if they require an attribute or characteristic to satisfy an upper or lower bound; or b) objectives, if the characteristics or attributes are required to take a maximum or minimum value (van der Linden, 2005). A succinct classification of types of test specifications is as follows (Davey, 2005):

1. Statistical. These often specify numbers of items in various difficulty and discrimination categories and can also be based on averaged statistics, IRT information, test reliability, etc. Statistical rules usually comprise the ATA objective function.

2. Intangible. Such specifications occur in situations where test developers or content experts give a qualitative evaluation of test items. Good examples of this specification are item enemies, which are pairs of questions that should not be included together in a test.

3. Substantive. These specify numbers of items in each of various categories. They usually allow a permissible range, either because it is required or because it does not make a difference. Specification of these items is often based on marginal totals, as shown below:

              Problem    Data          Operations   Total
              Solving    Sufficiency
  Arithmetic                                        5
  Algebra                                           3
  Geometry                                          5
  Statistics                                        2
  Total       5          5             5            15

Figure 1.1: Sample Specifications for a Fictitious Mathematics Test Blueprint

4. Formal. These are very similar to substantive rules, and some are crossed in the same multi-way table. They are sometimes less lenient with regard to permissible range (e.g., the number of items allowed in each section of a test).

5. Item eligibility. These occur when rules govern the re-use of items across forms and administrations. Such specifications can be simple (e.g., "Never!") or complicated (e.g., CAT item exposure rules).
1.2 Combinatorial Optimization and Heuristic Techniques

Combinatorial optimization models are mathematical formulations used to find an optimal combination of a given set of resources in order to meet a desired objective. These models are most useful in situations where some or all of the resources can be represented as integer values. In other words, the models can be used to search through a discrete set of units so that the best subset meeting a number of specified constraints is found. These models are often called integer programming models, with the term "programming" referring to the planning of decisions that need to be made when only a finite number of alternative possibilities exist (Hoffman & Padberg, 1996; Nemhauser & Wolsey, 1988). In ATA, this implies constructing an optimal test by selecting test items from an item pool so that the resulting test meets all the predetermined test specifications (van der Linden, 2005). For optimal test assembly models, only two choices of integers exist, and they are characterized as yes and no decisions as follows:

x_i = \begin{cases} 1, & \text{if item } i \text{ is included in the test} \\ 0, & \text{if item } i \text{ is excluded from the test} \end{cases}

Hence, a string of zeros and ones can be used to code this test assembly problem, as shown in Figure 1.2 below. Figure 1.2 adds clarity to the description of constraints given above. For example, suppose that our constraint was to construct a test of length m. Since our decision variables can be coded as either 0 or 1, this test length constraint can be represented using summation notation as

\sum_{i=1}^{I} x_i = m, \quad (1.1)

meaning that setting the sum of the x_i variables to the desired value m meets the desired test length constraint. This is standard notation which appears in recent ATA literature (e.g., van der Linden, 2005).

  Item       1     2     ...   i     ...   I
  Variable   x_1   x_2   ...   x_i   ...   x_I
  Test 1     0     0     ...   0     ...   0
  Test 2     1     0     ...   0     ...   0
  Test 3     1     1     ...   0     ...   0
  ...
  Test n     1     1     ...   1     ...   1

Figure 1.2: The Coding of Items in a Pool as 0-1 Decision Variables (van der Linden, 2005)
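To make the 0-1 coding concrete, the following minimal Python sketch enumerates every decision vector that satisfies the test-length constraint of Equation (1.1) for a toy pool and keeps the best one under a simple linear objective. The pool values are hypothetical, and brute-force enumeration is shown only to illustrate the coding of Figure 1.2; it is not the solution method developed in this dissertation.

```python
from itertools import combinations

# The 0-1 coding of Figure 1.2 in miniature: each candidate test is a vector of
# decision variables x_i in {0, 1}, and Equation (1.1) requires sum(x) == m.
# Item "utility" values and the linear objective are hypothetical.
pool_info = [0.42, 0.31, 0.55, 0.18, 0.47, 0.29]   # one value per item in the pool
I, m = len(pool_info), 3                           # pool size, required test length

best_x, best_value = None, float("-inf")
for chosen in combinations(range(I), m):           # all x with sum_i x_i == m
    x = [1 if i in chosen else 0 for i in range(I)]
    value = sum(x[i] * pool_info[i] for i in range(I))
    if value > best_value:
        best_x, best_value = x, value

print(best_x, best_value)  # picks the three highest-utility items
```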
As noted previously, the general test construction model is comprised only of linear constraints, and therefore integer programming techniques can be used to arrive at a solution. Thus, the combinatorial optimization problem to solve in ATA entails choosing the most appropriate or relevant subset of test items from an item pool until an optimal test has been assembled. Once the combinatorial optimization model has been created, several methods can be used to solve it. There are at least three different ways of solving integer programming problems (Hoffman & Padberg, 1996):

a) Enumerative techniques. These techniques find the best solutions by enumerating the possibilities available for solving the problem and discarding less desirable choices. Branch-and-bound algorithms are an example of how this method has been used in ATA. Here, "branching" refers to the cataloging of all possible solutions, and "bounding" is the comparison and elimination of possible solutions using known upper and lower bounds (e.g., Adema, 1992).

b) Lagrangian relaxation and decomposition techniques. These techniques incorporate constraints into the objective function in a Lagrangian fashion and solve subproblems iteratively until optimal values are found. They are complex because special attention has to be given to the integrality property of the functions involved (e.g., Veldkamp, 1998).

c) Cutting plane algorithms based on polyhedral combinatorics. Here, the set of constraints is replaced with alternative formulations of the potential solutions. These methods have not yet been applied to ATA problems and present a potential area for future research.

Several other heuristic techniques have also been used for solving ATA problems. Veldkamp (2005) defines heuristic algorithms as procedures used for the construction of a solution to a test assembly problem based on a plausible intuitive idea. The most notable ATA heuristics are NWADH (Luecht, 1998) and WDM (Stocking & Swanson, 1998; Swanson & Stocking, 1993). These heuristics assemble tests sequentially by adding one item at a time until the desired overall value (e.g., maximum information, maximum reliability) of the test has been fulfilled. Hence, the utility of each item is based on the extent to which it increases the optimality of the test's targeted statistical properties. It is important to remember that, due to the sequential selection of items, the utility of a particular item depends on the state of the test at the moment when it is added.
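The sequential logic these heuristics share can be sketched as a simple greedy loop. The sketch below is a generic illustration of sequential selection only, not the published WDM or NWADH procedures; the item structure and the additive objective are hypothetical.

```python
# Generic greedy sketch of sequential ATA heuristics: items are added one at a
# time, each time taking the item whose addition most improves the current
# objective, so an item's utility depends on the state of the test so far.
def greedy_assemble(pool, m, objective):
    """pool: candidate items; m: test length; objective: f(list_of_items) -> float."""
    test, available = [], list(pool)
    for _ in range(m):
        # Re-evaluate every remaining item against the current partial test.
        best = max(available, key=lambda item: objective(test + [item]))
        test.append(best)
        available.remove(best)
    return test

# Hypothetical usage with a simple additive objective:
items = [{"id": 1, "info": 0.4}, {"id": 2, "info": 0.7}, {"id": 3, "info": 0.5}]
print(greedy_assemble(items, 2, lambda t: sum(i["info"] for i in t)))  # ids 2, 3
```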
The preceding summary of combinatorial optimization and heuristic techniques illustrates that a large number of methods exist for solving ATA problems. In practice, combinations of heuristic and standard optimization approaches are often used (Hoffman & Padberg, 1996). Hence, given that ATA is a relatively new area of psychometrics, it is important that research focus on implementing other optimization techniques to improve on existing methods.

1.3 Item Response Theory and Dimensionality

In educational assessment, examinee performance on a test item is typically scored 1 when the examinee answers the item correctly and 0 when the examinee answers it incorrectly. These item responses are said to be dichotomously scored. A set of mathematical models is then used to model test item characteristics and examinee ability from the resulting matrices of 0s and 1s. These types of models belong to the general statistical framework called item response theory (IRT). In IRT it is assumed that the performance of an examinee on a test can be predicted in terms of his or her latent traits or abilities. When it is assumed that a single ability underlies test performance, the IRT models are said to be unidimensional (Lord, 1980; Hambleton & Swaminathan, 1984). An example of a unidimensional item response theory (UIRT) model is the three-parameter logistic model (3PLM), which models the probability, P_i(θ_j), of a correct response on item i for an examinee with ability θ_j. It can be represented mathematically as follows (Lord, 1980):

P_i(\theta_j) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]}, \quad (1.2)

where a_i is the item discrimination parameter, which gives the rate of change of the probability of a correct response with changes in ability level; b_i is the item difficulty parameter; and c_i is the lower asymptote parameter (also known as the guessing parameter), which is the probability of a correct response approached when the abilities assessed by the items are very low (i.e., approach -∞).
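Equation (1.2) translates directly into a short function, as sketched below. The parameter values in the example are hypothetical.

```python
import numpy as np

# Equation (1.2) transcribed directly: theta is the examinee ability; a, b, and c
# are the discrimination, difficulty, and lower-asymptote parameters defined above.
def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PLM."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item: even a very low-ability examinee keeps probability near c = 0.2.
print(p_3pl(theta=0.0, a=1.2, b=-0.5, c=0.2))   # about 0.72
print(p_3pl(theta=-6.0, a=1.2, b=-0.5, c=0.2))  # approaches the lower asymptote 0.2
```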
While unidimensional models provide a good method for modeling latent traits, advances in psychometric theory and practice have shown that many educational and psychological tests are inherently multidimensional (Ackerman et al., 2003; Reckase, 1997). In other words, two or more latent traits are thought to govern the response process. Good examples of items exhibiting this property are mathematical "story problems": to answer such items correctly, both mathematical and verbal skills are required (Reckase, 1985). This extension of IRT to a multidimensional framework is modeled using multidimensional item response theory (MIRT) models. The MIRT model used in this study will be described in Section 2.1.

1.4 Test Assembly Process

IRT is the psychometric theory that commonly underlies the test assembly process (van der Linden, 1998). In this process, three main steps are followed: (a) an IRT model is chosen and the test items in the item bank are calibrated; (b) the desired properties of the test are specified (e.g., test length, categorical test attributes such as item content and cognitive level, or quantitative specifications such as word count or the number of minutes it takes an examinee to answer a test question); and (c) a model and/or algorithm that selects items from the item bank is formulated so that the test specifications are met (van der Linden, 1998; Veldkamp, 2002a, 2002b). A combinatorial optimization or heuristic technique (or a combination of both), called the ATA algorithm, is often used to complete the last step.

Although an ATA algorithm significantly streamlines the process of test assembly, it is merely one component of a very complicated process of quality control as well as human and system-based evaluation. Figure 1.3 presents the entire cycle of test assembly, which begins when items are constructed by item writers and ends when those items are retired from use. After new items are constructed based on specifications such as item type and item content, they are placed in an item bank. An item bank is a collection of items that may be easily accessed for use when assembling tests. In established testing programs, item banks are constructed only after accurate IRT calibration and thorough goodness-of-fit testing. These steps are completed to ensure that items are measuring the same or comparable constructs with known values for the item parameters (van der Linden, 2005).

[Figure 1.3: Schematic of the Test Assembly Process. A cycle linking new items (item content and types), the item bank (item history and statistics), test assembly (volumes, item use type, item statistics, administration schedule), human and system-based evaluation, and the flagging of problem items and retired items.]

The main advantage of using IRT for creating item banks is that new items are always calibrated on the same scale as existing or retired items. This is possible because the item statistics obtained do not depend on the sample of examinees used in the calibration of test items. It is advantageous to maintain the same scale because this facilitates the comparability of test scores across different years or test forms. Moreover, if item banks consist of items that have valid content with statistically reliable and stable item parameter estimates, the quality of tests produced using item banks is superior to that of tests produced by test developers preparing test items themselves (Hambleton & Swaminathan, 1984).

When a test is assembled using an ATA algorithm, only items that fulfill specific requirements, such as exposure, security constraints, content categories, total time available to take the test, and item type, are made available. Therefore, this assembly pool consists only of items that are eligible for use on the particular test form(s) under construction. These test specifications are often formalized in the ATA model as test constraints that need to be fulfilled in order for an optimal test to be constructed. Note that this description loosely meets the requirements of both computer adaptive and paper-and-pencil tests. The computer is used to assemble initial drafts of test forms. These test forms are then iteratively reviewed and revised by committees and returned to the ATA algorithm until a test that meets all required specifications and constraints has been constructed. This process, although tedious, is a marked improvement on old methods in which humans constructed entire tests from start to finish.

1.5 Achievement and Certification Tests

Several MIRT studies have shown that multidimensional models are better at explaining the structure of data than unidimensional models (e.g., McKinley & Reckase, 1983; Muraki & Engelhard, 1985). Further, it has been argued that it is advantageous to use MIRT for increasing measurement efficiency and measurement precision, especially in situations where tests report multiple correlated scores (Luecht, 1996; Segall, 1996). Examples of such testing contexts include K-12 assessments and certification tests, as explained below.

Certification tests determine whether an individual has mastered a unit of instruction or skill, and they provide information about what an individual knows, not how his or her performance compares to the norm group. Many certification tests report scores from different subscales that may be correlated and cover a large domain of integrated knowledge (Luecht, 1996). These tests also cover numerous combinations of crossed and nested specifications even though the purpose of the test is to report a single test score (e.g., Federation of State Medical Boards & National Board of Medical Examiners, 1996). Hence, these tests have the dual purpose of making pass/fail decisions and reporting subscores based on sets of items that come from various categories (Luecht, 1996; Segall, 1996). Therefore, certification tests are likely candidates for calibration and construction using MIRT.

In K-12 state assessments, the use of multiple skills is even more apparent due to the variety of content standards and specifications that need to be met in each test. In other words, in practice many test frameworks require that test items measure several inter-related (and often highly correlated) content strands or content areas (e.g., Michigan Educational Assessment Program, 2003). These test construction frameworks can be problematic when content is confounded with item difficulty (Segall, 1996).
Research has shown that in K-12 state assessments, content strands or subscales in a test can be clearly defined using MIRT (e.g., Martineau et al., 2006). Hence, it would ideally be more helpful to teachers and educators to be able to understand student achievement not only as a single score, but as sets of scores on the different dimensions of the test. Overall, there is compelling substantive and statistical evidence to support MIRT as the best approach for understanding, reporting, and assembling tests.

1.6 Motivation

ATA methods for multidimensional tests have recently been formulated (van der Linden, 1996; Veldkamp, 2002a). These methods have only been applied to ATA situations in which the intent was to maximize test information. This maximization used the variance of the ability estimators to formulate the objective function and, in conjunction with several test constraints, allowed optimal multidimensional tests to be constructed. However, an inherent disadvantage of this formulation is that, since the resulting objective function is nonlinear, it was necessary to convert it into a linear form so the model would be suitable for 0-1 linear programming (van der Linden, 1996, 1998; Veldkamp, 2002a, 2002b).

Using 0-1 linear programming (LP) can be problematic when the mathematical functions involved are not linear. Researchers have shown that some degree of nonlinearity is the rule and not the exception in many situations where mathematical programming is used. Even a slight degree of nonlinearity can cause LP calculations to differ substantially from the true optimum (Baumol & Bushnell, 1967). Thus, errors may arise when a linear programming calculation is used to solve a problem involving any level of nonlinearity. Typically, when LP methods are applied in ATA, researchers compensate for nonlinearity by using linear approximations to convert the mathematical functions involved into variations suitable for LP (e.g., Veldkamp, 2002; van der Linden, 1996).

An example most pertinent to this study used a Taylor series approximation to convert the nonlinear objective function into a linear form, a process called linearization (Veldkamp, 2002a). Unfortunately, however, linearization introduces numerical, computational, and mathematical error. Furthermore, linearization is a problem because the inappropriate use of linearity assumptions leads to model misspecifications. Research suggests that, in a mathematical programming context, the curvature of these mathematical functions can produce distortions when a linear approximation is used. That is, the LP calculation can easily produce a final solution that is worse than the intermediate solutions obtained in preceding computational iterations (Baumol & Bushnell, 1967). Additionally, it has been shown that linearization techniques have the disadvantage of increasing the problem size and computational requirements (Glover, 1975; Glover & Woolsey, 1974; Walters, 1967; Zangwill, 1965). Hence, there is a need to find alternative methods for automated multidimensional test assembly (AMTA), as articulated in the section that follows.

1.7 Purpose

Reducing the mathematical and computational error that results from linearization is desirable. Hence, an ideal approach to this problem requires the formulation of a solution which exploits the specialized mathematical structure of the automated multidimensional test assembly model when the intention is to maximize information.
Therefore, the purpose of this dissertation is to develop and evaluate a new method for AMTA that is not reliant on linearization and yet can assemble tests that maximize multidimensional test information, increasing measurement precision while simultaneously reducing both mathematical and computational error. This new approach incorporates elements of the combinatorial optimization and heuristic techniques described previously.

In Chapter 2, the multidimensional item response theory (MIRT) model is introduced, along with associated statistics and evaluation criteria of test quality. Additionally, an optimal design theory criterion and the mathematical programming methods of LP and QKP are described. In Chapter 3, an objective function suitable for QKP is derived and an algorithm for selecting test items that maximize multidimensional information is developed. A computational experiment for comparing the QKP, LP, and random (RND) test assembly methods is described along with various evaluation factors and criteria. Results from the computational experiments and the performance of the QKP algorithm are presented in Chapter 4. Finally, in Chapter 5 I discuss the study's conclusions and suggest future directions for this work.

CHAPTER 2 LITERATURE REVIEW

An overview of ATA and its associated psychometric and procedural aspects was provided in Chapter 1. The purpose of Chapter 2 is to give a detailed description of the methods and procedures pertinent to this study. Specifically, this chapter reviews multidimensional item response theory (MIRT) and the design of multidimensional tests based on Birnbaum's (1968) framework. Additionally, an optimality criterion suitable for maximizing multidimensional information (MINF) is reviewed, along with the 0-1 linear programming (LP) approach to automated multidimensional test assembly (AMTA) established by Veldkamp (2002a). Lastly, a new combinatorial optimization approach to AMTA, called quadratic knapsack programming (QKP), is described and its suitability for multidimensional test assembly is justified.

2.1 Multidimensional Two Parameter Logistic Model

This study only considers the multidimensional two-parameter logistic model (Ackerman et al., 2003; Reckase, 1997) as a way to illustrate this new method. However, it should be noted that extensions to other MIRT models are possible and would provide interesting directions for future research. For simplicity, only the case of two θs (θ_1, θ_2) will be illustrated, where each θ represents an unknown trait parameter. The multidimensional two-parameter logistic model (M2PLM) can be represented as follows (van der Linden, 1996):

P_i(\theta_1, \theta_2) = P(U_{ij} = 1 \mid a_{1i}, a_{2i}, d_i, \theta_1, \theta_2) = \frac{\exp(a_{1i}\theta_1 + a_{2i}\theta_2 + d_i)}{1 + \exp(a_{1i}\theta_1 + a_{2i}\theta_2 + d_i)}, \quad (2.1)

where U_{ij} is a response variable that takes the value 1 if the response of person j = 1, ..., M to item i = 1, ..., N is correct, and 0 otherwise; (a_{1i}, a_{2i}) are the discrimination parameters of item i for θ_1 and θ_2, respectively; and d_i is a scalar parameter related to the difficulty of the item. In this study, it is assumed that the item parameters (a_{1i}, a_{2i}, and d_i) are known, and the M2PLM is used to estimate θ_{1j} and θ_{2j} from a set of dichotomous responses, as discussed in Section 1.3 of the preceding chapter.
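Equation (2.1) is likewise straightforward to compute. The following sketch evaluates the M2PLM response probability; the item parameters in the example are hypothetical.

```python
import numpy as np

# Equation (2.1) transcribed directly: a1 and a2 are the item's discrimination
# parameters for theta1 and theta2, and d is the scalar related to difficulty.
def p_m2pl(theta1, theta2, a1, a2, d):
    """Probability of a correct response under the M2PLM."""
    z = a1 * theta1 + a2 * theta2 + d
    return np.exp(z) / (1.0 + np.exp(z))

# Hypothetical item: for an examinee at the origin of the ability space,
# the probability reduces to exp(d) / (1 + exp(d)).
print(p_m2pl(0.0, 0.0, a1=1.5, a2=0.2, d=1.0))  # = exp(1)/(1+exp(1)), about 0.73
```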
There are several statistical indices associated with measurement and calibration in the MIRT framework. First, MDISC enables items to be compared on a general measure of quality. It gives the relationship of the item response to the combination of dimensions that yields the strongest relationship, as follows (Reckase & McKinley, 1991; Reckase, 1997):

\text{MDISC}_i = \left[ \sum_{k=1}^{m} a_{ik}^2 \right]^{1/2}, \quad (2.2)

where a_{ik} is the ith item's discrimination on the kth dimension (i.e., θ_1 and θ_2). Second, the difficulty of MIRT items is indexed with a statistic called MDIFF (Reckase, 1997):

\text{MDIFF}_i = \frac{-d_i}{\left[ \sum_{k=1}^{m} a_{ik}^2 \right]^{1/2}}. \quad (2.3)

MDIFF gives the distance from the origin of the θ-space to the point of steepest slope in a direction from the origin. It has the same interpretation as the b-parameter in UIRT.

Assume each item i is represented by a vector a_i that consists of the elements of a row of the matrix a, an n by m matrix containing the a_{ik} and a_{jk} discrimination parameters. The angle between the directions of steepest slope for two items can be formulated as (Reckase et al., 2000):

\alpha_{ij} = \arccos \left[ \frac{\sum_{k=1}^{m} a_{ik} a_{jk}}{\left( \sum_{k=1}^{m} a_{ik}^2 \right)^{1/2} \left( \sum_{k=1}^{m} a_{jk}^2 \right)^{1/2}} \right], \quad (2.4)

where α_{ij} is the angle between the lines from the origin of the space to the points of steepest slope for items i and j. Equation (2.4) is based on the direction cosines, which represent the direction of greatest slope from the origin. These cosines specify the directions of measurement in multidimensional space (Reckase, 1997).
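A small sketch shows how Equations (2.2) through (2.4) operate on the discrimination vectors. The item vectors below are hypothetical two-dimensional examples.

```python
import numpy as np

# Direct transcriptions of Equations (2.2)-(2.4).  `a` is an item's vector of
# discrimination parameters (a_i1, ..., a_im) and `d` its difficulty scalar.
def mdisc(a):
    """Multidimensional discrimination, Equation (2.2)."""
    return float(np.sqrt(np.sum(np.asarray(a) ** 2)))

def mdiff(a, d):
    """Multidimensional difficulty, Equation (2.3)."""
    return -d / mdisc(a)

def angle_between(a_i, a_j):
    """Angle (degrees) between the steepest-slope directions of items i and j, Eq. (2.4)."""
    cos_alpha = np.dot(a_i, a_j) / (mdisc(a_i) * mdisc(a_j))
    return float(np.degrees(np.arccos(np.clip(cos_alpha, -1.0, 1.0))))

# Hypothetical items: one oriented mostly toward theta1, one toward theta2.
print(mdisc([1.5, 0.2]), mdiff([1.5, 0.2], 1.0))   # about 1.51 and -0.66
print(angle_between([1.5, 0.2], [0.1, 1.4]))       # about 78 degrees
```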
Equation (2.1) defines a surface which gives the probability of a correct response to a test item. It operates as a function of the location of examinees in the ability space specified by the θ-vector. Assuming there are only two statistical constructs (θ_1 and θ_2) underlying performance on a particular psychological trait or educational achievement domain, the probability response surface can be represented graphically as shown in Figure 2.1.

[Figure 2.1: Item Response Surface for the M2PLM with parameters a1 = 1.5, a2 = 0.2, and d = 1; the vertical axis gives the probability of a correct response over the (θ_1, θ_2) plane.]

The item response surface (IRS) of Figure 2.1 has the following parameter values: a1 = 1.5, a2 = 0.2, and d = 1. It is monotonically increasing and is bounded by 0 and 1, with its height representing the probability of answering the item correctly given the two abilities (i.e., θ_1 and θ_2). Additionally, multidimensional item discrimination (MDISC) is a scalar, while multidimensional difficulty (MDIFF) is a composite of item discrimination and difficulty on each dimension. Recall that in UIRT, discrimination and difficulty are simply scalars.

2.2 Item Vectors

Essential information about a test item can be displayed graphically using item vectors. These vectors are used to represent the difficulty of items, the direction of maximum discrimination, and the items' discriminating power (Reckase, 1985). Based on the MIRT model of Equation (2.1), the direction of these vectors shows the weighted combination, or composite, of abilities for which each item provides the best measurement. Items with similar direction cosines measure the same set of abilities (Ackerman et al., 2003; Reckase, Ackerman & Carlson, 1988).

In Figure 2.2, a sample plot of item vectors is shown. These item vectors vary in difficulty: items in the upper right quadrant are more difficult than items in the lower left quadrant. The items in this diagram are generally aligned in two distinct directions called content clusters. One cluster consists of items that are closer to the θ_1-axis, while the other cluster consists of items that are closer to the θ_2-axis. This implies that the two content clusters measure two different, yet related, sets of constructs, since their directions are not orthogonal. Note that this description only applies to situations in which a test is sensitive to only two skills or dimensions.

The vectors of Figure 2.2 have three specific characteristics. Item discrimination is represented by the length of the vector and is indexed by the MDISC value of Equation (2.2). The location of the base of the vector in multidimensional space signifies difficulty, as given by MDIFF in Equation (2.3). Lastly, the angular direction of each item relative to the positive θ_1-axis represents the location of the item as indexed by the direction cosine formula of Equation (2.4).

[Figure 2.2: Sample Plot of Item Vectors, plotted in the (θ_1, θ_2) plane.]

Figure 2.3 further illustrates the concept of angles between test items. The diagram indicates that the angle between two item vectors defines a plane in multidimensional space, since the vectors originate from the origin and have terminal points. The angle α_ij represents the angle between item i and item j, and it is related to the direction cosine by the formula of Equation (2.4).

[Figure 2.3: The Angular Distance in Two-Dimensional Space; two item vectors plotted against Dimension 1 (θ_1) and Dimension 2 (θ_2), separated by the angle α_ij.]

2.3 Item Content Clusters

The preceding description of item content clusters was given to illustrate that, since test items can be represented as vectors, they can be used to identify sets of items that measure similar combinations of skills. Moreover, achievement and certification tests are constructed so that they are efficient at measuring combinations, or clusters, of skills. Hence, when item vectors point in the same direction, they measure the same combination of knowledge and skills, and the angle between items evaluates the difference in the knowledge or skills they measure. For example, when the angle between two items is 0 degrees, those items measure the same combination of skills; if the angle is 90 degrees, they measure completely different skills (Reckase & Martineau, 2004).

For example, in the National Assessment of Educational Progress (NAEP) frameworks, assessments are constructed to fulfill multiple subscales. A typical example is the 2006 NAEP framework for Economics, which contained three subscales: i) Market Economy, ii) National Economy, and iii) International Economy (National Assessment Governing Board, 2005). If test items are correctly matched to each of these three subscales, then it would be expected that three distinct "item content clusters" of similarly aligned sets of items could be distinguished in multidimensional space using the M2PLM or other MIRT models.
The similarities, or patterns of proximities (i.e., measures of closeness or association), among item vectors can be understood through multivariate statistical procedures. These patterns or similarities can then be used to identify sets of items that measure the same combinations of abilities. Measures of proximity are indexed using the direction cosines given in Equation (2.4). Cluster analysis has been used successfully to identify the aforementioned item patterns (Kim, 2001; Martineau et al., 2006; Miller & Hirsch, 1992). Kim (2001) determined an optimal cluster analysis method for identifying groups of items with similar orientation in the θ-space. The grouping of items is then analyzed qualitatively with the purpose of assigning substantive meanings to the resulting content clusters (or groups of items). Hence, in an achievement test, it would be expected that groups of items which measure the same content appear in the same groups or "content clusters" (e.g., Martineau et al., 2006).

An example from Reckase (2006) can be used to illustrate the concept of content clusters using cluster analysis. Table 2.1 shows the item parameters for 20 test items that are two-dimensional (i.e., have two a-parameters). Using direction cosines, these 20 items can be subdivided into smaller and more meaningful groups according to their similarity. That is, a dendrogram can be used to display test items that measure similar skills, as shown in Figure 2.4. At an average angle of 1.5, there are two distinguishable clusters; at an average angle of 0.5, there are four distinguishable clusters; and so on. In an actual test, a content expert would be able to look at the items grouped by the dendrogram and define or classify the common content measured by each cluster (see Martineau et al., 2005).

Table 2.1: Item parameters for a 20-item test (Reckase, 2006)

  Item   a1     a2     d       Item   a1     a2     d
  1      1.81   .86    1.46    11     .24    1.14   -.95
  2      1.22   .02    .17     12     .51    1.21   -1.00
  3      1.57   .36    .67     13     .76    .59    -.96
  4      .71    .53    .44     14     .01    1.94   -1.92
  5      .86    .19    .10     15     .39    1.77   -1.57
  6      1.72   .18    .44     16     .76    .99    -1.36
  7      1.86   .29    .38     17     .49    1.10   -.81
  8      1.33   .34    .69     18     .29    1.10   -.99
  9      1.19   1.57   .17     19     .48    1.00   -1.56
  10     2.00   .00    .38     20     .42    .75    -1.61

[Figure 2.4: Dendrogram of item content clusters based on the item parameters in Table 2.1 (vertical axis: item number; horizontal axis: average angle between items/clusters).]
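A minimal sketch of this cluster-analysis step is given below: pairwise angles between item vectors (Equation 2.4) are treated as proximities and fed to an agglomerative clustering routine. Average linkage is used purely for illustration; the optimal method identified by Kim (2001) may differ, and the six items are borrowed from Table 2.1 for convenience.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Items 1, 2, 10 and 11, 14, 15 from Table 2.1: the first three point mostly
# along theta1, the last three mostly along theta2.
a = np.array([[1.81, 0.86], [1.22, 0.02], [2.00, 0.00],
              [0.24, 1.14], [0.01, 1.94], [0.39, 1.77]])

norms = np.linalg.norm(a, axis=1)                        # MDISC of each item
cosines = (a @ a.T) / np.outer(norms, norms)             # cos(alpha_ij), Eq. (2.4)
angles = np.degrees(np.arccos(np.clip(cosines, -1, 1)))  # pairwise angle matrix
np.fill_diagonal(angles, 0.0)

tree = linkage(squareform(angles), method="average")     # dendrogram structure
print(fcluster(tree, t=2, criterion="maxclust"))         # e.g., [1 1 1 2 2 2]
```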
Hence "ln L" in Equation (2.5) simply represents the natural log of this likelihood function, and "-E" is the negative expected value of the second derivative of this likelihood function. The test information is the sum of the item information functions (Lord, 1980; Reckase & McKinley, 1991) and can be represented as follows:

$$I_T(\theta) = \sum_{i=1}^{n} I_i(\theta) \qquad (2.6)$$

where n is the number of items in the test. However, when there are two dimensions, Fisher's information becomes a matrix, which for two θs (θ1, θ2) is obtained by computing the negative expected value of the second derivative of the log of the likelihood function (Kendall & Stuart, 1967; van der Linden, 1996) as follows:

$$I(\theta_1, \theta_2) = -E\begin{pmatrix} \dfrac{\partial^2 \ln L}{\partial \theta_1^2} & \dfrac{\partial^2 \ln L}{\partial \theta_1 \partial \theta_2} \\[2mm] \dfrac{\partial^2 \ln L}{\partial \theta_2 \partial \theta_1} & \dfrac{\partial^2 \ln L}{\partial \theta_2^2} \end{pmatrix} \qquad (2.7)$$

where the likelihood function, L, of Equation (2.1) can be written as (Ackerman, 1994):

$$L(\mathbf{u} \mid \boldsymbol{\theta}) = \prod_{i=1}^{n} L(u_i \mid \theta_1, \theta_2) = \prod_{i=1}^{n} P_i(\boldsymbol{\theta})^{u_i} Q_i(\boldsymbol{\theta})^{1-u_i} \qquad (2.8)$$

The multidimensional information (called Ackerman information in this paper) in Equation (2.7) can be written as:

$$I(\theta_1, \theta_2) = \begin{pmatrix} \sum_{i=1}^{n} a_{1i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) & \sum_{i=1}^{n} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \\[2mm] \sum_{i=1}^{n} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) & \sum_{i=1}^{n} a_{2i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \end{pmatrix} \qquad (2.9)$$

Also note a property of the maximum likelihood estimator of θ: the estimate is asymptotically normal with mean θ and variance

$$V(\hat{\theta}_1, \hat{\theta}_2 \mid \theta_1, \theta_2) = \left[I(\theta_1, \theta_2)\right]^{-1} \qquad (2.10)$$

In the multidimensional case, this is a variance-covariance matrix. Therefore, to optimize measurement precision, a function of Fisher's information matrix has to be maximized or the variance-covariance matrix has to be minimized (Veldkamp, 2002a). The latter fact holds since IRT models are special forms of nonlinear regression models (Lord, 1980).

2.4.2 Reckase Information

As noted previously, measurement precision is evaluated using the concept of information in IRT. The reciprocal of the information function is the asymptotic variance of the maximum likelihood estimate of ability. Hence, the larger the amount of information measured, the smaller the asymptotic variance and therefore the higher the measurement precision (Ackerman, Gierl & Walker, 2003; Lord, 1980).

Using the concept of multidimensional information (MINF), the measurement precision of a test can be assessed. MINF is related to MDISC because if an item has a high value of MDISC, it will give a large amount of information. However, MINF and MDISC differ since MINF indexes the capability of the item to discriminate at each point in the θ-space, rather than just at the steepest point of the IRS (Reckase & McKinley, 1991). The slope in a particular direction is defined using a directional derivative as follows (Reckase & McKinley, 1991):

$$\nabla_\alpha P(\boldsymbol{\theta}) = \frac{\partial P(\boldsymbol{\theta})}{\partial \theta_1} \cos\alpha_1 + \frac{\partial P(\boldsymbol{\theta})}{\partial \theta_2} \cos\alpha_2 + \cdots + \frac{\partial P(\boldsymbol{\theta})}{\partial \theta_n} \cos\alpha_n \qquad (2.11)$$

where α is the vector of angles with the coordinate axes in the θ-space, αi (i = 1, ..., n) is an element of that vector, θ is the vector of abilities defining a point in the space, and θi (i = 1, ..., n) is an element of that vector. For a single item, the directional derivative of the item response surface (IRS) is:

$$\nabla_\alpha P_i(\boldsymbol{\theta}) = a_{i1} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \cos\alpha_1 + a_{i2} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \cos\alpha_2 + \cdots + a_{in} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \cos\alpha_n = P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \sum_{k=1}^{n} a_{ik} \cos\alpha_k \qquad (2.12)$$

Since MINF is a direct generalization of the UIRT concept of information (Lord, 1980), it can be expressed as follows (Reckase & McKinley, 1991):
$$I_{\alpha i}(\boldsymbol{\theta}) = \frac{\left[P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \sum_{k=1}^{n} a_{ik} \cos\alpha_k\right]^2}{P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta})} = P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) \left[\sum_{k=1}^{n} a_{ik} \cos\alpha_k\right]^2 \qquad (2.13)$$

To obtain MINF for an entire test, the item information of Equation (2.13) is summed across the number of items in the test, such that $I_T(\boldsymbol{\theta})$ is the sum of item information functions, similar to the UIRT form given in Equation (2.6). When test information is computed using Reckase information, the same direction must be used for all items.

2.4.3 Difference between Reckase and Ackerman Information

As noted previously, in this study Ackerman information was used for mathematical derivations involving multidimensional information, while Reckase information was used for evaluating the quality of assembled tests. Although these two indices both index multidimensional information, they are fundamentally different. Reckase information is basically a multidimensional critical ratio, comparable to Lord's (1980, p. 69) derivation of the score information function (Ackerman, 1994). For the M2PLM, Ackerman (1994) explicates this critical ratio as "a measure of how effective test score x is at discriminating between a trait level (θ1, θ2) and a trait level 'close by' along a line through (θ1, θ2) at an angle α."

Reckase information cannot be used as a substitute for Ackerman information when calculating information for more than one test item. This substitution is not possible because Equation (2.12) of Reckase information does not account for the lack of local independence when a particular direction is specified (Ackerman, 1994). This implies that the resulting covariance among the traits is improperly accounted for. Ackerman information, on the other hand, is formulated from a variance-covariance matrix and can properly account for the resulting covariance when a particular direction is specified. Since IRT models can be viewed as nonlinear regression models and experimental design methods can be used in optimization, Ackerman information facilitates mathematical derivations that involve multidimensional information when the purpose is to explain the variance-covariance structure of the IRT parameters.

2.5 D-optimality Criterion

Having noted that IRT models are special types of regression models, as described in Section 2.4.1, statistical criteria from optimal design theory can be used for optimizing the selection of IRT parameters in ATA. This approach works similarly to the manner in which parameters are estimated in experiments that have multiple parameters and can be modeled as regression models. In prior IRT studies that involved the optimal design of tests and samples, some function of Fisher's information matrix was maximized (Berger, 1991; de Gruijter, 1985, 1988; Stocking, 1990; Thissen & Wainer, 1982; Vale, 1986; Wingersky & Lord, 1984). When the determinant of Fisher's information matrix is maximized for the purpose of optimizing the design of a test, this criterion is called D-optimality.
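To make the criterion concrete before reviewing its properties, the following is a minimal numpy sketch of the Ackerman information matrix of Equation (2.9) for the M2PLM at a single ability point, together with its determinant, the D-optimality value of Equation (2.14) below. Function names and the three example items are illustrative only, and any scaling constant in the model is omitted.

import numpy as np

def m2pl_prob(a, d, theta):
    # Probability of a correct response under the compensatory M2PLM.
    return 1.0 / (1.0 + np.exp(-(np.dot(a, theta) + d)))

def information_matrix(a_items, d_items, theta):
    # Ackerman information matrix (Eq. 2.9): sum of P*Q*a*a' over items.
    info = np.zeros((2, 2))
    for a, d in zip(a_items, d_items):
        p = m2pl_prob(a, d, theta)
        info += p * (1.0 - p) * np.outer(a, a)
    return info

def d_optimality(a_items, d_items, theta):
    # D-optimality (Eq. 2.14): determinant of the information matrix.
    return np.linalg.det(information_matrix(a_items, d_items, theta))

# Hypothetical three-item test evaluated at the origin of the ability space.
a_items = np.array([[1.2, 0.3], [0.4, 1.1], [0.8, 0.8]])
d_items = np.array([0.1, -0.2, 0.0])
print(d_optimality(a_items, d_items, np.zeros(2)))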
D-optimality is the preferred optimality criterion in IRT for the following reasons: a) it can be used to formulate a confidence interval around the parameter estimates and hence has an easy and natural interpretation, b) it is invariant to linear transformations of the logit scale, and c) it is equivalent to other criteria (e.g., G-optimality). However, D-optimality has some disadvantages: i) it is generally not sensitive to misspecifications of the model, and ii) models with a different number of parameters cannot be compared (as is the case with other optimality criteria) (Berger, 1991, 1992; Berger & Veerkamp, 1994). Overall, this optimality criterion has consistently given good results in prior studies (e.g., Veldkamp, 2002a).

From the explanation above, it follows that D-optimality can be used for the computation of optimal values in ATA. This criterion can be used to optimally select items for a test by maximizing the objective function associated with D-optimality and thus minimizing the generalized variance of the parameter estimates. Hence, it maximizes measurement precision when it is used to compute the objective function. In multidimensional ATA, D-optimality can be represented as

$$\text{Maximize } |I(\boldsymbol{\theta})|, \qquad (2.14)$$

where $|I(\boldsymbol{\theta})|$ is the determinant of Fisher's information matrix for the vector of abilities θ. As will be explained in greater detail in Section 2.6.1, the expression of Equation (2.14) is not only a function of x, but also a function of the two θs (θ1, θ2) (e.g., Veldkamp, 2002a). Therefore, in ATA, this objective function can be optimized for a grid of points instead of the entire θ-region. That is, the problem of maximizing the information function at certain θ-points can be applied to the multidimensional θ-space if the two-dimensional grid is defined by (s, t), where s = 1, ..., S and t = 1, ..., T. Hence, updating Equation (2.14) to reflect this addition, it becomes

$$\text{Maximize } |I(\theta_{st})|. \qquad (2.15)$$

2.6 0-1 Linear Programming

An important goal in educational and psychological assessment is to construct tests of minimal length that will yield scores with the necessary degree of reliability and validity for the intended uses (Berger, et al., 1994; Crocker & Algina, 1986). In the last few decades, a form of mathematical programming called 0-1 linear programming (LP) has been used to construct tests that best meet these requirements using various ATA approaches. A central assumption of LP is that all its functions, both objective and constraint, are linear (Hillier & Lieberman, 1995).

Examples of LP methods that have been used for optimal test assembly include the branch-and-bound method (Adema, 1992); 0-1 linear programming (Adema, Boekkooi-Timminga & van der Linden, 1991); a maximin approach (van der Linden & Boekkooi-Timminga, 1989; Veldkamp, 2002a); and binary programming (Theunissen, 1985). In these examples, LP facilitated optimal test assembly so that test construction goals and specifications such as maximum reliability, target information, minimal length, and maximum item-total correlation were met to the best degree possible.

The standard format of the LP problem includes only equality constraints. It is set up as either a minimization or maximization problem of the following form (Venkataraman, 2002):

$$\text{Maximize/Minimize } f(\mathbf{x}) = \mathbf{c}^T\mathbf{x} \qquad (2.16)$$

$$\text{Subject to } g(\mathbf{x}): A\mathbf{x} = \mathbf{b} \qquad (2.17)$$

$$\text{Decision variables } x_i \in \{0, 1\}, \quad i = 1, \ldots, I \qquad (2.18)$$

where x represents a column of decision variables, $[x_1, x_2, \ldots, x_n]^T$; c is a column vector of utility (cost) coefficients,
$[c_1, c_2, \ldots, c_n]^T$; b is a column vector of constraint limits, $[b_1, b_2, \ldots, b_m]^T$; and A is an m × n matrix of constraint coefficients, which can alternatively be represented as:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}; \quad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix}; \quad \mathbf{c} = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}; \quad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$

Research on ATA using MIRT has been quite limited. The earliest suggestion of automated multidimensional test assembly (AMTA) is by van der Linden (1996). Veldkamp has done most of the documented research on AMTA, using both LP methods (Veldkamp, 2002a) and Lagrangian relaxation techniques (Veldkamp, 1998). While both these methods are interesting, the former appears to be the most promising because the majority of other ATA work uses LP or closely related models. A detailed description of Veldkamp's approach is given in Section 2.6.1.

2.6.1 Veldkamp's Automated Multidimensional Test Assembly Approach

In his approach to AMTA, Veldkamp (2002a) imposed a linear approximation on the objective function obtained using D-optimality through a Taylor series approximation. Note that solving the expression in (2.15) for the M2PLM gives:

$$f(\boldsymbol{\theta}, \mathbf{x}) = \sum_{i=1}^{N} a_{1i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i \sum_{i=1}^{N} a_{2i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i - \left[\sum_{i=1}^{N} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i\right]^2 \qquad (2.19)$$

Hence, when computing a linear approximation for Equation (2.19), a grid of points (s, t) in the θ-space, (θ1, θ2) ∈ {-3, 3} × {-3, 3}, is chosen so the objective function is maximized. It follows that a linear approximation of Equation (2.19) can be expressed in terms of x, y, and z so that it becomes:

$$\text{Maximize } xy - z^2, \qquad (2.20)$$

where $x = \sum_{i=1}^{N} a_{1i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i$, $y = \sum_{i=1}^{N} a_{2i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i$, and $z = \sum_{i=1}^{N} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i$.

Thus, for any given ability point (s, t) and a given test, the functions x, y, z can be calculated for a result denoted $(\bar{x}, \bar{y}, \bar{z})$. In turn, (2.20) can be rewritten as:

$$\text{Maximize } f(x, y, z), \qquad (2.21)$$

with a linear approximation of the objective function at the point $(\bar{x}, \bar{y}, \bar{z})$ being equal to

$$\text{Maximize } \frac{\partial f}{\partial x}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) \cdot x + \frac{\partial f}{\partial y}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) \cdot y + \frac{\partial f}{\partial z}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) \cdot z + c \qquad (2.22)$$

where $c = f(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) - \nabla f(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st})^T (\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st})$ and the partial derivatives that define the linear approximation are:

$$k_{1st} = \frac{\partial f}{\partial x}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) = \bar{y}_{st}, \quad k_{2st} = \frac{\partial f}{\partial y}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) = \bar{x}_{st}, \quad k_{3st} = \frac{\partial f}{\partial z}(\bar{x}_{st}, \bar{y}_{st}, \bar{z}_{st}) = -2\bar{z}_{st} \qquad (2.23)$$

Note that the coefficients $k_{jst}$ represent the partial derivatives taken at the point (s, t) for a given test, thus completing the linearization and giving the complete AMTA model as follows:

$$\text{Maximize } k_{1st} \sum_{i=1}^{N} a_{1i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i + k_{2st} \sum_{i=1}^{N} a_{2i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i + k_{3st} \sum_{i=1}^{N} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i \qquad (2.24)$$

subject to

$$\sum_{i \in C} x_i \le n_C \ \text{(categorical constraints)}; \quad \sum_{i \in Q} q_i x_i \le n_Q \ \text{(quantitative constraints)}; \quad \sum_{i \in E} x_i \le 1 \ \text{(enemy sets)}; \quad \sum_{i \in I} x_i = n \ \text{(test length)} \qquad (2.25)$$

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, I \quad \text{(decision variables)} \qquad (2.26)$$

Essentially, Veldkamp expressed the objective function in Equation (2.19) as a linear function of the decision variables $x_i$ and was able to use a linear programming algorithm to solve an AMTA problem.

2.7 Quadratic Knapsack Programming

The knapsack model of mathematical programming has been studied for several years as the simplest prototype of a maximization problem. This model invokes the image of a backpacker who is constrained by a fixed-size knapsack. The backpacker must fill it only with the most useful combination of items from a list of possible items, so that the value of the items packed into the knapsack is maximized (Gallo, et al., 1980; Hoffman & Padberg, 1996; Kellerer, et al., 2004).
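As a toy illustration of the basic 0-1 knapsack model behind this metaphor, the following sketch solves a tiny instance by dynamic programming; the values, weights, and capacity are invented for the example.

def knapsack_max_value(values, weights, capacity):
    # Classic 0-1 knapsack solved by dynamic programming over capacities.
    best = [0] * (capacity + 1)
    for v, w in zip(values, weights):
        # Traverse capacities in reverse so each item is packed at most once.
        for c in range(capacity, w - 1, -1):
            best[c] = max(best[c], best[c - w] + v)
    return best[capacity]

# Three items, weight limit 5: packing the second and third gives value 22.
print(knapsack_max_value(values=[6, 10, 12], weights=[1, 2, 3], capacity=5))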
This model belongs to the combinatorial optimization family described in Chapter 1. In the literature, the knapsack model was first used for test construction by Feuerman and Weiss (1973). Research on the knapsack model was motivated by the facts that a) Veldkamp's (2002a) AMTA approach may introduce mathematical and computational error, and b) the non-linearized form of the objective function has a quadratic structure. Moreover, the method introduced here seemed intuitively suitable. Therefore, this study introduces quadratic knapsack programming (QKP) as an alternative to the use of LP in AMTA.

QKP extends the knapsack problem by assigning values not only to individual objects, but to pairs of objects. Additionally, QKP is a generalization of the knapsack problem obtained when the objective function is allowed to be quadratic. In other words, extending the QKP metaphor to AMTA where the goal is to maximize MINF, the knapsack is a two-dimensional test and only items that make the test most informative in both dimensions can be selected. Note that here a two-dimensional test is defined based on the item response matrix. That is, a two-dimensional test is an item response matrix that requires two dimensions to accurately model the data when item responses are dichotomously scored (i.e., correct responses are coded as one and incorrect responses are coded as zero).

As noted previously, the purpose of optimal test assembly is to choose the best sets of test items so that the measurement precision of a test is maximized. Specifically, when using QKP this corresponds to the problem of choosing the best n test items out of N test items. Moreover, in choosing these items, it is natural to assume that the utility of this choice should reflect how well the items fit together. Therefore, assume that n items are given, with item j having a positive integer weight $w_j$. A limit on the total weight chosen is given by a positive integer knapsack capacity c. Additionally, the model has an (n × n) nonnegative integer symmetric profit matrix

$$S = (s_{ij}), \qquad (2.27)$$

where $s_{jj}$ is the profit achieved if item j is selected and $s_{ij} + s_{ji}$ is the profit achieved if both items i and j are selected. Profit simply refers to the utility with which selected items function in unison. Hence, the QKP model has the following mathematical formulation:

$$\text{Maximize } f(\mathbf{x}) = \mathbf{x}^T S \mathbf{x} = \sum_{i=1}^{N} \sum_{j=1}^{N} x_i x_j s_{ij} \qquad (2.28)$$

$$\text{Subject to } \sum_{j=1}^{N} w_j x_j \le c \qquad (2.29)$$

$$x_j \in \{0, 1\}, \quad j \in N, \qquad (2.30)$$

where $\mathbf{x}^T S \mathbf{x}$ is the quadratic form of the QKP model (Hillier & Lieberman, 1995; Kellerer, et al., 2004).

In mathematical programming terminology, Equation (2.28) is called the objective function. Note that in AMTA the objective function determines the measurement precision of the test. The latter is formulated using Fisher's information matrix and describes the information structure of the test. This information structure is quantified as the test information function (Birnbaum, 1968), as described in Section 2.4.1. The knapsack constraints which govern the selection of items are shown in (2.29). (2.30) denotes the decision variables of the QKP model, such that $x_i = 1$ if the item is included in the test and $x_i = 0$ if the item is excluded from the test. Additionally, in the test assembly context, N is the number of items in the test bank (or item pool), and all test specifications have to be expressed in terms of the $x_i$s (Caprara, et al., 2000; Kellerer, et al., 2004).
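A brief sketch may help fix the QKP notation of Equations (2.28)-(2.30): evaluating the quadratic objective and checking the weight constraint for one candidate selection. The 4-item profit matrix and weights below are invented for the example.

import numpy as np

def qkp_objective(S, x):
    # Quadratic knapsack objective of Equation (2.28): f(x) = x'Sx for 0-1 x.
    return x @ S @ x

def feasible(weights, x, capacity):
    # Knapsack constraint of Equation (2.29): total weight within capacity.
    return weights @ x <= capacity

# Diagonal entries are single-item profits; off-diagonals are pairwise profits.
S = np.array([[3, 1, 0, 2],
              [1, 4, 1, 0],
              [0, 1, 2, 1],
              [2, 0, 1, 5]])
x = np.array([1, 0, 1, 1])                      # select items 1, 3 and 4
w = np.array([2, 3, 1, 2])
print(qkp_objective(S, x), feasible(w, x, 6))   # 16, True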
CHAPTER 3
METHODOLOGY

The purpose of this study was to develop and evaluate a new approach to AMTA when the goal of assembling the test is to maximize information. Development involved mathematical derivation of the objective function and formulation of a heuristic algorithm for solving the problem with a computer. The new approach was evaluated through a series of computational experiments.

When evaluating computer algorithms or computational techniques, it is common to use computational experiments (e.g., Caprara et al., 1999; Luecht, 1998; Swanson & Stocking, 1993). These typically entail the comparison of a focal method to similar methods of interest so that conclusions about efficiency and performance can be drawn. Through the series of computational experiments, the new approach was evaluated under the following conditions: 1) item pool quality, 2) item pool size, 3) test length, and 4) AMTA method. Additionally, in this study, computational outcomes (i.e., computations of D-optimality) were linked to psychometric criteria and indices that inform test assemblers and psychometricians about the quality of assembled tests.

The first two factors were manipulated with the intent to vary the item pools available for assembly. The first factor was quality of the item pool, as measured by mean MDISC levels, and the second factor was item pool size. Crossing these factors (three levels of item pool quality and two levels of item pool size) produced six item pool variations. Varying levels of these two factors allowed the reliability and classification accuracy of tests to be altered. The third factor varied was test length. Lengths of 21 to 99 items were considered, since test length is a primary factor that affects reliability and decision consistency. Finally, to evaluate the new method by comparison to other assembly methods, the AMTA method was varied. The three AMTA methods considered were QKP, LP, and random item selection (RND). While QKP and LP were used for effectiveness comparison, RND was used as a baseline measure to evaluate and compare the performance of the other two methods.

The remainder of the chapter is divided into several sections. Two sections cover modeling the test assembly problem, deriving the QKP objective function, and describing the QKP heuristic. In addition, sections describing the varying factors, such as the AMTA method, item pools, test length, simulees, the test assembly model, UIRT simulations, data analyses, and evaluation criteria are included. These sections explain the manipulation of each factor, data generation, and the evaluation criteria used in the study.

3.1 Modeling the Test Assembly Problem

For the three AMTA methods considered, the following process of modeling the test assembly problem was followed (van der Linden, 2005):

1. Identify the decision variables. Items are coded as 1 when included in the assembled test and 0 when excluded. This qualifies the assembly problem as an integer programming problem.

2. Model the constraints. Sets of item content clusters are specified to serve as assembly constraints.

3. Model the objective function. An objective function that can be used to maximize MINF was formulated based on a derivation of Fisher's information matrix using D-optimality for the M2PLM.

4. Solve the model for an optimal solution. For each method, a unique computer algorithm was used for solving the test assembly problem.

3.2 The QKP Objective Function

Recall that D-optimality maximizes the determinant of Fisher's information matrix.
Veldkamp (2002a) derived this function for the two-dimensional case. As described in the previous chapter, the objective function to be maximized over θ from Equation (2.15) becomes:

$$f(\boldsymbol{\theta}, \mathbf{x}) = \sum_{i=1}^{N} a_{1i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i \sum_{i=1}^{N} a_{2i}^2 P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i - \left[\sum_{i=1}^{N} a_{1i} a_{2i} P_i(\boldsymbol{\theta}) Q_i(\boldsymbol{\theta}) x_i\right]^2 \qquad (3.1)$$

This becomes a question of maximizing the integral of f(θ, x) with respect to θ, as follows:

$$\text{Max}\left(\int f(\boldsymbol{\theta}, \mathbf{x}) \, d\boldsymbol{\theta}\right) \qquad (3.2)$$

Since no explicit maximizing solution can be found, the integral should first be approximated. The integral in (3.2) can be approximated numerically. Computationally, this is done by approximating the continuous distribution of θ with a discrete point-mass distribution with support at θ = t1, t2, ..., tT, such that $\sum_{j=1}^{T} P(\boldsymbol{\theta} = t_j) = 1$. Then the function f(x) to be maximized is:

$$f(\mathbf{x}) = \sum_{j} P(\boldsymbol{\theta} = t_j) f(t_j, \mathbf{x}) \qquad (3.3)$$

where θ = tj is a generic quadrature point and P(θ = tj) is the associated weight. It follows that the summation over j distributes over the summations over i, so that (3.1) becomes:

$$f(\mathbf{x}) = \sum_{i=1}^{N} a_{1i}^2 R_i x_i \sum_{i=1}^{N} a_{2i}^2 R_i x_i - \left[\sum_{i=1}^{N} a_{1i} a_{2i} R_i x_i\right]^2 \qquad (3.4)$$

where $R_i = \sum_{j=1}^{T} P(\boldsymbol{\theta} = t_j) P_i(t_j) Q_i(t_j)$.

The N-dimensional vectors u, v, and w can be defined as follows: $u_i = a_{1i}^2 R_i$, $v_i = a_{2i}^2 R_i$, $w_i = a_{1i} a_{2i} R_i$. Subsequently, f(x) can be written in vector notation (remembering that $\mathbf{x}^T\mathbf{y} = \mathbf{y}^T\mathbf{x}$), so that Equation (3.4) becomes:

$$f(\mathbf{x}) = (\mathbf{u}^T\mathbf{x})(\mathbf{v}^T\mathbf{x}) - (\mathbf{w}^T\mathbf{x})^2 = \mathbf{x}^T\mathbf{u}\mathbf{v}^T\mathbf{x} - \mathbf{x}^T\mathbf{w}\mathbf{w}^T\mathbf{x} = \frac{\mathbf{x}^T\mathbf{u}\mathbf{v}^T\mathbf{x} + \mathbf{x}^T\mathbf{v}\mathbf{u}^T\mathbf{x}}{2} - \mathbf{x}^T\mathbf{w}\mathbf{w}^T\mathbf{x} = \mathbf{x}^T S \mathbf{x} \qquad (3.5)$$

where $S = \dfrac{\mathbf{u}\mathbf{v}^T + \mathbf{v}\mathbf{u}^T}{2} - \mathbf{w}\mathbf{w}^T$ is a symmetric N × N matrix. The third equality in (3.5) above follows from the fact that $\mathbf{x}^T\mathbf{u}\mathbf{v}^T\mathbf{x} = (\mathbf{u}^T\mathbf{x})(\mathbf{v}^T\mathbf{x}) = (\mathbf{v}^T\mathbf{x})(\mathbf{u}^T\mathbf{x}) = \mathbf{x}^T\mathbf{v}\mathbf{u}^T\mathbf{x}$. Now the unconstrained objective function $f(\mathbf{x}) = \mathbf{x}^T S \mathbf{x}$ has to be maximized. However, in a practical testing situation, it would have to be maximized while taking into account relevant test constraints.

3.3 The Quadratic Knapsack Programming Heuristic

As noted previously, to solve an AMTA problem for an optimal solution, a suitable computer algorithm has to be used. For this express purpose, the QKP heuristic algorithm is introduced. It belongs to a class of heuristics that are defined as sequential because they assemble a test by selecting one item at a time (van der Linden, 2005). This heuristic uses the mathematical structure of the QKP model as the basis for item selection. The heuristic is a simplified version of the QUADNAP heuristic algorithm (Caprara, Pisinger & Toth, 1999) and is formulated as

$$f(\mathbf{x}) = \mathbf{x}^T S \mathbf{x}, \qquad (3.6)$$

where S is a real-valued, symmetric, non-negative N × N matrix, x is a binary N-dimensional column vector, and f(x) needs to be optimized over all x's that have n 1's. This corresponds to the problem of choosing the best n questions out of N questions such that:

$$f(\mathbf{x}) = \sum_{i=1}^{N} \sum_{j=1}^{N} x_i x_j s_{ij} = \sum_{i: x_i = 1} \ \sum_{j: x_j = 1} s_{ij} \qquad (3.7)$$

Therefore, f(x) is the sum of all entries $s_{ij}$ where $x_i$ and $x_j$ each equal 1. Put another way, if A is the set of indices (a subset of {1, 2, 3, ..., N}) such that those components of x are 1, then A defines a submatrix $S_A$, and f(x) is the sum of the entries of that submatrix. Additionally, note that the ij-th entry of $S_A$ is the $(A_i, A_j)$-th entry of S, where $A_i$ is the i-th element of set A. Thus, the QKP heuristic works in a series of iterative steps whereby it keeps trying to improve the choice of A, as will be described in detail in Section 3.3.2.

3.3.1 Solving Nonlinear Problems

Nonlinear problems are generally solved through numerical analysis. Using computer code, numerical analysis can be converted into numerical techniques.
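Before turning to those techniques, the construction in Equation (3.5) can be made concrete. The following is a sketch of assembling S from M2PLM item parameters and a discrete quadrature over the two-dimensional ability space; the function and variable names are illustrative, and the uniform 7 × 7 grid is an assumption for the example rather than the dissertation's actual quadrature.

import numpy as np

def build_S(a, d, points, weights):
    # Assemble the N x N QKP matrix S of Equation (3.5) from M2PLM item
    # parameters and a quadrature over the two-dimensional theta grid.
    R = np.zeros(len(a))
    for t, p_t in zip(points, weights):
        prob = 1.0 / (1.0 + np.exp(-(a @ t + d)))   # P_i(t_j) for every item
        R += p_t * prob * (1.0 - prob)              # R_i of Equation (3.4)
    u = a[:, 0] ** 2 * R
    v = a[:, 1] ** 2 * R
    w = a[:, 0] * a[:, 1] * R
    return (np.outer(u, v) + np.outer(v, u)) / 2.0 - np.outer(w, w)

# Illustrative use: a uniform 7 x 7 grid of quadrature points on [-3, 3]^2.
grid = np.array([[s, t] for s in np.linspace(-3, 3, 7)
                        for t in np.linspace(-3, 3, 7)])
wts = np.full(len(grid), 1.0 / len(grid))
# S = build_S(a_params, d_params, grid, wts)   # a_params: (N, 2); d_params: (N,)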
The set of methods and techniques used for solving optimization problems are called search methods. In their implementation, several progressively better attempts are required before a solution can be obtained. For this reason, they are called iterative techniques. This implies that each attempt or search is conducted in a well-formulated and consistent manner, so that information from previous iterations is used to update current computations until an acceptable solution is found. The process by which this search is conducted is the algorithm. Another meaning of the term "algorithm" is the translation of an ordered sequence of search procedures for finding a solution into a set of specific actions that can be executed on a computer (Venkataraman, 2002). The development of an algorithm for solving the conceptualization of the AMTA problem as a QKP problem is presented in Section 3.3.2. This algorithm is called the QKP heuristic.

3.3.2 Phases of the QKP Heuristic

The algorithm used to implement the QKP heuristic consists of three phases. In the first phase, the algorithm is initialized by summing all the columns (or rows, since the matrix is symmetric) of S (an N × N matrix), as described in Section 3.3. In the second phase, the n columns (or rows) with the highest sums are selected for a submatrix of S called A. Here n corresponds to the target test length. In the third phase, elements outside A that do a better job of estimating D-optimality are selected and replace elements of A one by one.

Based on the phases described above, the QKP heuristic is defined as a greedy algorithm, since local improvements are made each time. Essentially, it is called "greedy" because it selects items in decreasing order of the efficiency with which the items under consideration are added to the "knapsack," provided the capacity constraints are not violated (Kellerer, et al., 2004). This is a common approach to formulating algorithms. For other uses of such algorithms, see WDM (Swanson & Stocking, 1988; 1993) and NWADH (Luecht, 1998).

3.3.3 Details of the QKP Heuristic

In the initialization phase, the following steps are performed (a code sketch of the full heuristic follows the step list):

Step 1. Sum all columns (or rows, since the matrix is symmetric) of S.
Step 2. Create a submatrix of S called A that has the n indices (the number of required test items) with the highest values of the sums obtained in Step 1.
Step 3. Check that the submatrix A contains the specified number of elements, n.
Step 4. If A has fewer than n elements, then select the best element from those that remain in S.
Step 5. Repeat Steps 2, 3 and 4 until the correct number of test items is contained in submatrix A.

The submatrix A is improved in two heuristic steps, the second of which is more important. In both cases a variable called 'CONTINUE' is used in order to ensure the algorithm stops when no improvements are made to A. (The technical term for a variable like CONTINUE is a 'flag'.)

Step 6. Initialize A by summing columns with indices in the existing set A.
Step 7. Stop summing when no improvements can be made in A.

The second heuristic replaces elements of A one by one with elements outside A that do a better job of maximizing the measurement precision of the test based on the objective function.

Step 8. For each index i in A, find the best candidate j (not in A) to replace i.
Step 9. See if the best candidate is actually better than i.
Step 10. If it is better, replace i with it and repeat Steps 3 and 4.
Step 11. Continue Steps 8, 9 and 10 until there are no improvements in A.
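The following is a compact sketch of the unconstrained heuristic described in Steps 1-11 above. It is a simplified reading of the steps, not the dissertation's exact implementation: the initialization of Steps 1-5 reduces to taking the n largest column sums, n < N is assumed, and all names are illustrative.

import numpy as np

def qkp_heuristic(S, n):
    # Greedy QKP heuristic following Steps 1-11 (unconstrained version).
    N = S.shape[0]
    def f(idx):
        ids = list(idx)
        return S[np.ix_(ids, ids)].sum()          # submatrix sum, Eq. (3.7)
    A = set(np.argsort(S.sum(axis=0))[-n:])       # Steps 1-2: best column sums
    improved = True                               # the CONTINUE flag
    while improved:                               # Steps 8-11: one-for-one swaps
        improved = False
        for i in list(A):
            outside = [j for j in range(N) if j not in A]
            # Step 8: best candidate j outside A to replace i.
            best_val, best_j = max((f(A - {i} | {j}), j) for j in outside)
            if best_val > f(A):                   # Step 9: is the swap better?
                A = A - {i} | {best_j}            # Step 10: replace i with j
                improved = True
    return sorted(A)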
The basic logic of the heuristic can be maintained for different types of constraints (i.e., quantitative, categorical, and item set constraints); however, computational alterations may be required to ensure that desired outcomes are obtained. The main advantage of this heuristic is that its logic can be easily extended to different types of problems, as long as they can be formulated in the form of the QKP model.

3.4 Random Item Selection Method

Apart from the QKP and LP methods, a random item selection method (RND) was formulated in order to serve as a baseline measure of the performance of the two main methods being compared. This algorithm uses the QKP formulation described in Section 3.3.3 as the basis for how it functions. Specifically, RND selects the set of items that comprise the best possible test that can be constructed from 100 guesses using computations of D-optimality. The algorithm is explained in greater detail as follows (a code sketch is given below):

Step 1. Sum all columns (or rows, since the matrix is symmetric) of S.
Step 2. Randomly select n indices (the number of required test items) to create a submatrix of S called A.
Step 3. Compute D-optimality for these n indices and set this value to be the largest value of D-optimality.
Step 4. Randomly select another n indices and compute D-optimality.
Step 5. If the value of D-optimality obtained in Step 4 is larger than that obtained in Step 3, then update it to be the new largest value. Otherwise, keep the value in Step 3 as the largest.
Step 6. Repeat Steps 4 and 5 100 times, then select the n items which give the largest D-optimality from these repetitions to be the best test.

From the description above, it is clear that RND is not truly random; its selections are guided by the computation of D-optimality at each iterative step. Therefore, RND is comparable to LP and QKP and provides a good way of estimating how much better the other two methods are than an unstructured method of selecting test items. Also note that RND only works for unconstrained problems.

3.5 Item Pool Simulations

Item pools were simulated for the purpose of providing a "proof of concept". They also formed a basis for the computational experiments. The expression "proof of concept" can be thought of as an incomplete realization (or synopsis) of a method or idea in order to demonstrate its feasibility. The proof of concept is commonly considered a milestone on the way to a fully functional prototype. Hence, the simulations are simply used as a way of showing that the core ideas are workable and feasible, and they will help establish viability (Wikipedia, n.d.).

Two factors were manipulated to create six item pools with characteristics typical of MIRT item pools. The first factor was mean item pool MDISC and the second was item pool size. Item pools were simulated assuming only simple structure, because all ability dimensions were intentional and it would be counterintuitive to impose a more complex dimensional structure. Recall that simple structure implies that all items are pure because they measure knowledge or skill in only one content strand (Kim, 2003; Roussos, et al., 1998; Zhang, 2005). Typical item pools are about 400 items in length. Therefore, the item pool size factor had two levels: a) normal, with 300 items, and b) large, with 900 items. Item quality was determined by varying values of mean MDISC.
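Returning briefly to the RND baseline of Section 3.4, the following is a minimal sketch of Steps 2-6, reusing the d_optimality routine sketched in the Section 2.5 discussion; n_guesses and all other names are illustrative assumptions.

import numpy as np

def rnd_baseline(a_items, d_items, n, theta_grid, n_guesses=100, seed=None):
    # RND baseline: keep the best of n_guesses random n-item tests,
    # scored by D-optimality accumulated over a grid of ability points.
    rng = np.random.default_rng(seed)
    best_value, best_items = -np.inf, None
    for _ in range(n_guesses):                     # Steps 4-6
        items = rng.choice(len(a_items), size=n, replace=False)
        value = sum(d_optimality(a_items[items], d_items[items], t)
                    for t in theta_grid)
        if value > best_value:                     # Step 5: keep the larger value
            best_value, best_items = value, items
    return best_items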
Mean MDISC is an important factor to consider since higher levels of MDISC are generally associated with increased reliability and classification accuracy, as is the case with a-parameters in UIRT. Using the M2PLM of Equation (2.1), three levels of item pool quality were simulated. Low, Moderate, and High discrimination item pools were simulated with MDISC and MDIFF generated from lognormal and normal distributions, respectively. Standard deviations and correlations among these indices were held constant, so the only difference among the item pools was the value of mean MDISC, as shown in Table 3.1.

The conceptual framework of linear regression was used for the simulation of item parameters in each item pool. Specifically, from examining several MIRT studies, it was found that the linear relationships among parameters were quite similar. These similarities included correlations between MDISC and MDIFF, as well as variances and means of MDISC and MDIFF. The aforementioned explorations were conducted on MIRT parameters provided by Kim (2001), Martineau, et al. (2005), Min (2003), Reckase (1997), and Reckase (2006). Although the specifications used in the item pool simulations were somewhat arbitrary, they did closely resemble item characteristics from the aforementioned papers.

Recall that when there is a dependent variable Y and an independent variable X, the linear relationship between these two variables is given by the following formula:

$$\hat{Y}_i = a + bX_i \qquad (3.8)$$

In Equation (3.8), a is the intercept of the line and b is the slope of the line. It follows that the least squares solution for the slope b is given by:

$$b = \frac{\text{cov}(X, Y)}{\text{var}(X)} = r\left(\frac{s_y}{s_x}\right) \qquad (3.9)$$

where r is the correlation between X and Y, $s_y$ is the standard deviation of Y, and $s_x$ is the standard deviation of X. Once the slope has been computed, the least squares solution for a is:

$$a = \bar{Y} - b\bar{X} \qquad (3.10)$$

where $\bar{Y}$ is the mean of Y and $\bar{X}$ is the mean of X.

Using the linear regression framework described above, MDIFF and MDISC values were obtained by first simulating MDIFF parameters as random vectors from a multivariate normal distribution. After this was accomplished, MDISC parameters were regressed on MDIFF parameters and the variance of MDISC was computed. Additional variance needed for MDISC was then simulated based on these prior values, with values drawn from a lognormal distribution with mean and variance as specified in Table 3.1. Item difficulties were computed using the relationship d = -MDIFF × MDISC, and direction cosines based on Equation (2.4) were simulated for the associated clusters and dimensions to which each item belonged. A description of the item clusters is provided below. Lastly, discrimination parameters were computed using the relationship a = (direction cosine × MDISC) for each item. Hence, the resulting discrimination parameters (a1 and a2) had a lognormal distribution and the difficulty parameters (d) had a normal distribution.

As mentioned above, comprehensive sets of parameter specifications were used in simulating the item pools. Table 3.1 provides these values. Specifically, MDISC and MDIFF means and standard deviations were specified, as were correlations between MDISC and MDIFF values. All these values are similar to those found from analyzing the previous research papers mentioned above. In Table 3.1, SD(MDISC) and SD(MDIFF) are the standard deviations of MDISC and MDIFF, respectively. Additionally, recall that item pool quality was defined by the mean value of MDISC.
That is, when an item pool was defined as being of Low quality, it had a small mean MDISC value of 0.40; when an item pool was of Moderate quality, it had a medium mean MDISC value of 1.40; and when an item pool was of High quality, it had a high mean MDISC value of 2.40. Lastly, the correlations between MDISC and MDIFF values allowed the linear regression formulas of Equations (3.8) to (3.10) to be used. The correlation between MDISC and MDIFF was r = .29.

Table 3.1. Item characteristics for Low, Moderate and High Discrimination item pools

                          Item Pool Quality
Item Characteristic     Low    Moderate   High
Mean(MDISC)             0.40     1.40     2.40
SD(MDISC)               0.30     0.30     0.30
Mean(MDIFF)             0.00     0.00     0.00
SD(MDIFF)               0.50     0.50     0.50

From the discussion of item content clusters in Section 2.3, it was ascertained that some multidimensional tests consist of clusters with common content. These are called item content clusters. A realistic example of item content clusters was provided from one of the NAEP assessments. Therefore, item clusters can be considered constraints in an assessment, and such an example is illustrated here by simulating content clusters in each of the aforementioned item pools.

For simplicity, only three item content clusters were simulated for the two-dimensional item pools. Each of the three content clusters contained an equal number of items. Hence, there were 100 and 300 items per cluster in the 300- and 900-length item pools, respectively. Several studies showed that it is reasonable to have such a small number of content clusters (e.g., Martineau et al., 2006; Miller & Hirsch, 1992; Reckase, 1997). In simulating these content clusters, each of the three was systematically aligned with the θ1- and θ2-axes, as illustrated in Figure 2.2. That is, a matrix of direction cosines of item direction with each dimension's axis (i.e., the θ1- and θ2-axes) in the multidimensional space was defined. These direction cosines for items in each content cluster were defined as follows:

i. C0 = cosine of no alignment with the specified dimension (i.e., none of the item variance is attributable to the specified dimension)
ii. C1 = cosine of perfect alignment with the specified dimension (all of the item variance is attributable to the specified dimension)
iii. C2 = cosine of equal alignment with both dimensions (i.e., half of the item variance is attributable to the specified dimension)

The clusters were created by adding random noise to the direction cosines to keep all items in the same cluster from pointing in exactly the same direction in the multidimensional space. As noted above, content clusters were aligned to the two dimensions systematically. That is, cosines of alignment were defined as C0 = 0, C1 = 1, C2 = √0.5 and arranged to cover the possible combinations as shown in the following matrix:

             θ1      θ2
Cluster 1     1       0
Cluster 2   √0.5    √0.5
Cluster 3     0       1

In other words, C0 = 0, C1 = 1, and C2 = √0.5 represent 90°, 0°, and 45° alignment of the direction cosines with each of the two dimensions, respectively. Therefore, these are direction cosines for the two axes (i.e., the θ1- and θ2-axes), where for each row (i.e., cluster) the sum of the squares of the direction cosines must equal one.

This is a realistic simulation of item content clusters because these situations actually exist, although in a more complex form, in an actual test. Recall that in an actual test, these item content clusters could represent subject matter content based on a pre-specified test framework. For example, Martineau et al.
(2006) verified content clusters in a Grade 4 mathematics assessment and were able to successfully identify meaningful content clusters which were clearly related to the test framework.

3.6 Test Length

Test length is one of the primary factors that affect test reliability and decision accuracy. Longer tests are more reliable and have higher decision accuracy (Hambleton & Swaminathan, 1984). Nevertheless, in testing, two important factors limit the length of tests. First, a test has to be of reasonable length so that it can be completed in a reasonable amount of time. Longer tests also increase the cost of administration, and these costs are invariably passed along to candidates. Second, longer tests require more items to be developed, and item development is one of the primary costs for testing programs. Therefore, test length can be limited by economic constraints on a testing program.

Typical test lengths were chosen for these evaluations. Test lengths of 21 to 100 items were considered. However, in some cases, for brevity, results for only 21- and 63-item tests were reported. These values were chosen since they most closely approximate sections of certification and licensure tests (e.g., the Graduate Management Admission Test and the United States Medical Licensing Examination).

3.7 Simulees

In MIRT calibrations, 2,000 or more examinees are usually recommended (Ackerman, 1994; Reckase, 1997). Hence, to ensure the stability of calibrated item parameters and simulated ability parameters, a single sample of 5,000 simulees was selected. These abilities were randomly selected from a multivariate normal distribution, MVN(μ, Σ), where

$$\Sigma = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}$$

is the population variance-covariance matrix and $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$ is the mean vector. The specified distribution of the two abilities thus had a correlation of 0.6. This was the only type of ability distribution assumed, because it produces the least estimation error and was a good way to keep comparisons among the assembled tests meaningful and manageable. It is also the ability distribution assumed in MIRT calibration software (e.g., NOHARM and TESTFACT).

3.8 Test Assembly Model

As noted above, the QKP method uses D-optimality for its objective function, but does not invoke a Taylor approximation. Recall that LP uses D-optimality with a Taylor series approximation, which may introduce mathematical and computational error. Hence, in the QKP and LP conditions, test specifications were set so that one-third of the items come from each content cluster. No content constraints or objective function can be used for the RND method, because of the way it was formulated.

Only equality constraints were used in this test assembly model. It should be noted that even though equality constraints are convenient and easy to handle, they require more numerical effort in order to be satisfied. They have the disadvantage that they are restrictive of the model design and limit the region from which the solution can be obtained (Venkataraman, 2002). Hence, the test assembly model to be solved has the following form:

$$\text{Maximize } f(\boldsymbol{\theta}, \mathbf{x}) \qquad (3.11)$$

subject to

$$\sum_{i \in C_1} x_i = n_{\text{cluster}1} \ \text{(Content Cluster 1)}; \quad \sum_{i \in C_2} x_i = n_{\text{cluster}2} \ \text{(Content Cluster 2)}; \quad \sum_{i \in C_3} x_i = n_{\text{cluster}3} \ \text{(Content Cluster 3)} \qquad (3.12)$$

$$x_i \in \{0, 1\}, \quad i = 1, \ldots, I \quad \text{(decision variables)} \qquad (3.13)$$

where f(θ, x) can represent either the objective function for LP in Equation (2.21) or the one for QKP in Equation (3.5).

Formulation of this AMTA model entails the specification of content clusters as constraints.
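One way the equality constraints of (3.12) might enter the heuristic's initialization phase is sketched below; the swap phase would then be restricted to within-cluster exchanges so the quotas stay satisfied. This is a sketch under that assumption, not the dissertation's exact implementation, and all names are illustrative.

import numpy as np

def initialize_with_quotas(S, cluster_ids, quotas):
    # Within each content cluster, take the required number of items with
    # the largest column sums of S (quotas: cluster label -> items required).
    col_sums = S.sum(axis=0)
    selected = []
    for cluster, n_c in quotas.items():
        members = np.where(cluster_ids == cluster)[0]
        ranked = members[np.argsort(col_sums[members])]
        selected.extend(ranked[-n_c:].tolist())   # best n_c items in the cluster
    return sorted(selected)

# e.g., a 21-item test with 7 items from each of three clusters:
# items = initialize_with_quotas(S, cluster_ids, {1: 7, 2: 7, 3: 7})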
Tests using this model were also assembled over the typical range of abilities, such that (θ1, θ2) ∈ {-3, 3} × {-3, 3}. Although MIRT is rarely used for item banking or the assembly of tests, the model presented here is a realistic approach to AMTA, since studies have confirmed that tests consist of identifiable content clusters (e.g., Martineau, et al., 2006).

3.9 Unidimensional Item Response Theory Simulations

Using MIRT parameters from the assembled tests, a secondary simulation was conducted. This simulation draws on previous research, which has shown that MIRT items that measure more than one trait can be used to construct tests that meet UIRT assumptions (Reckase, et al., 1988). This situation may arise in the construction of decision-making tests (e.g., certification, achievement, or criterion-referenced tests) because, although the item pool consists of items that are sensitive to more than one ability, the items may be required to produce only a single score (e.g., selection tests where a pass/fail decision needs to be made).

There is another way to view this simulation study. Suppose that a test reports examinee performance on a single score scale. If a test has the property of being unidimensional (i.e., the data that come from the interaction of persons with the test can be modeled using a unidimensional model), then the latent trait underlying this score scale will be a unidimensional construct. However, if a test has the property of being multidimensional (i.e., the data that come from the interaction of persons with the test can be modeled using a multidimensional model), the trait underlying the score scale will be a unidimensional composite of the multiple constructs being measured by the test. Mathematically, this composite ability is realized as (van der Linden, 2005):

$$\lambda\theta_1 + (1 - \lambda)\theta_2, \quad 0 \le \lambda \le 1 \qquad (3.14)$$

where λ is the weight assigned to each ability. Therefore, the test would be scored using an estimate of this combination.

Prior research has shown that analyzing multidimensional data with unidimensional IRT models gives meaningful results (Ackerman, 1987; Bogan & Yen, 1983; Reckase, 1979, 1985; Reckase, et al., 1988). Additionally, research has shown that unidimensional ability estimates are comparable to the average of the multidimensional traits (Ansley & Forsyth, 1985) and have different interpretations at different points on the unidimensional scale (Reckase, et al., 1986). Therefore, to evaluate the measurement precision of a single composite score that may be created from a weighted linear combination of the multidimensional scores, the multidimensional parameters for each assembled test were used to simulate dichotomous item responses. For each examinee's two ability scores, simulated item response data were generated using Equation (1.1). The following standard IRT method (see Li & Schafer, 2003; Zhang, 2005) was used (a code sketch of these steps appears below):

1) Given ability score θj = (θ1j, θ2j), calculate the probability of answering item i correctly by examinee j, $p_{ij} = P_i(\theta_j)$, using the item parameter estimates from the assembled test.
2) Generate a random number $r_{ij}$ from the (0, 1) uniform distribution.
3) If $r_{ij} < p_{ij}$, then a correct response is obtained for examinee j on item i and assigned a value of 1; otherwise, an incorrect response is obtained and assigned a value of 0.

BILOG-MG 3 (Zimowski, et al., 2003) was then used to fit a 3PLM model to each set of dichotomous data.
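A compact sketch of the three response-generation steps above, assuming the compensatory M2PLM probabilities (a lower-asymptote term would be added under the full 3PLM of Equation (1.1)); the function and variable names are illustrative.

import numpy as np

def simulate_responses(a_items, d_items, thetas, seed=None):
    # Steps 1-3 above, vectorized: P_ij under the M2PLM, then score 1
    # whenever the uniform draw r_ij falls below P_ij.
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-(thetas @ a_items.T + d_items)))  # examinees x items
    return (rng.uniform(size=p.shape) < p).astype(int)

# e.g., responses = simulate_responses(a_items, d_items, thetas)  # thetas: (5000, 2)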
In conducting the BILOG-MG calibrations, the marginal maximum likelihood method was chosen, and the program's default prior distributions for the discrimination, difficulty, and lower asymptote were used, along with a convergence criterion of 0.0001. The 3PLM was selected because of its frequent use in similar studies that showed the relationship between unidimensional and multidimensional IRT analyses (e.g., McKinley & Reckase, 1983; Reckase, 1979; Reckase & Ackerman, 1986). Finally, the parameters obtained from this calibration were used to compute unidimensional test information using Equation (2.6). Functionally, this created the best unidimensional approximation to the multidimensional space spanned by the test items.

3.10 Data Analyses and Evaluation Criteria

The ultimate goal of ATA is to produce tests of high psychometric quality with the most efficient computer implementation. Hence, the computational experiment conditions were used to make comparisons among QKP-, LP-, and RND-assembled tests for the purpose of determining which methods produce tests of higher quality. Additionally, the computational performance of the AMTA methods was compared using the following evaluation criteria: a) Real- and CPU-time performance, b) D-optimality computation, c) maximum information, d) MINF across the ability continua, and e) relative efficiency of assembled tests. These criteria were chosen because they help to compare the efficiency of QKP and provide an important connection to both MIRT and UIRT.

3.10.1 Computational Efficiency

An important goal in developing and studying knapsack problems is to determine their computational behavior (Kellerer, et al., 2004). Knapsack problems are commonly evaluated by computing their Real- and CPU-time performance. CPU-time is the amount of time the CPU is actually processing instructions during the execution of a computer program. Recall that the CPU (central processing unit) is the brains of the computer, where most calculations take place. Real-time refers to events occurring in the computer at the same rate as they occur in real life. In this study, the CPU- and Real-time performance of QKP, LP, and RND were compared in order to gauge each procedure's efficiency. Therefore, these analyses give an indication of the speed and processing capability of each procedure.

3.10.2 Maximum Information

In order to compare the three procedures, maximum information (MAXINFO) was used as an additional criterion for assessing the quality of assembled tests. MAXINFO is the maximum amount of information that is given by an item or a test when checking the direction of the composite in the multidimensional space. This composite exists because, for any set of multidimensional items, MIRT statistical information is obtained for different linear composites of abilities in the multidimensional space. Based on Equation (2.13), MAXINFO is computed as follows:

$$\text{MAXINFO} = \max_{\alpha}\left[I_\alpha(\boldsymbol{\theta})\right] \qquad (3.15)$$

where α is the vector of angles with the coordinate axes in the θ-space and αi is an element of that vector. Equation (3.15) simply indicates that, depending on the direction of maximum discrimination of the particular items, one composite of abilities is measured better than the rest. The formula follows from the formula for computing MINF in Equation (2.13). In general, this measurement of composite abilities occurs because of the manner in which the angles of the composite are aligned with the coordinate axes.
MAXINFO tells how much information is provided for that composite.

3.10.3 Pairwise Comparisons Using Clamshell Plots

Reckase and McKinley (1991) developed a way to graphically display multidimensional information (MINF) using clamshell plots (so named because of their shape). These give the amount of information in different directions. Clamshell plots measure the amount of information provided by a test (or test item) for 10 different composites, or measurement directions, from 0° to 90° in 10° increments. They are created by computing the amount of information at (θ1, θ2) points along each corresponding ability continuum, represented by vectors originating from the points where the two θ-points intersect (Ackerman, 1996; Reckase & McKinley, 1991).

The measurement precision of assembled tests was compared using clamshell plots. Comparisons were made through a pairwise comparison of areas in the multidimensional space where tests assembled with different methods provided the most information. These comparisons provide information not only about measurement efficiency, but also about the regions of the multidimensional ability space where each method is superior to the other in providing MINF. Therefore, these comparisons can be considered a form of multidimensional relative efficiency, similar to the unidimensional relative efficiency described below.

3.10.4 Relative Efficiency

The ratio of the information functions for two tests can be computed when the measurement efficiency of two tests is being compared. This ratio is called relative efficiency (Lord, 1980) and, as noted above, when measuring multidimensional tests in a unidimensional manner, it amounts to providing a measure of the relative efficiency of the composite test scores created with each method. Therefore, when comparing the relative efficiency (RE) of LP- and QKP-assembled tests, the following formula is used:

$$RE(\text{LP}, \text{QKP}) = \frac{I(\theta, \text{LP})}{I(\theta, \text{QKP})} \qquad (3.16)$$

Similarly, the comparison of RE for RND- and QKP-assembled tests becomes:

$$RE(\text{RND}, \text{QKP}) = \frac{I(\theta, \text{RND})}{I(\theta, \text{QKP})} \qquad (3.17)$$

RE gives the relative merits of the amount of information for each test and facilitates making decisions about the quality of tests. It is represented graphically as a comparison of percentiles of measurement efficiency based on UIRT test information. Hence, a test that provides the most information in the region of the ability scale of interest is preferred. As noted above, this evaluation criterion serves as a relative measure of test quality.

CHAPTER 4
RESULTS

This chapter presents results from the study. Four factors were examined: 1) item pool quality, measured by mean item pool MDISC, 2) item pool size, 3) length of assembled test, and 4) AMTA method. The first two factors were completely crossed to form six item pools: Low Discrimination with 300 items, Moderate Discrimination with 300 items, High Discrimination with 300 items, Low Discrimination with 900 items, Moderate Discrimination with 900 items, and High Discrimination with 900 items. The remaining factors were incompletely crossed, as permitted by the item pool available for test assembly.

Section 4.1 reports results of the item pool and ability simulations. Sections 4.2 and 4.3 summarize findings from the computation of D-optimality and the characteristics of assembled tests. Section 4.4 presents findings on computational efficiency, and Section 4.5 presents comparisons of MAXINFO for the computational experiment conditions.
Additional MIRT comparisons are presented in Section 4.6 using pairwise comparisons of clamshell plots. Section 4.7 reports findings on the relative efficiency of assembled tests.

4.1 Simulated Data

Descriptive statistics of the simulated item pools and ability parameters are presented in this section. Recall that three levels of item pool quality, based on mean MDISC values, and two levels of item pool size were two conditions in this study. These two conditions were fully crossed to produce six item pools that were used for assembling tests of different test lengths.

Table 4.1 summarizes properties of the 300-length item pools. All the discrimination parameters are positive, as is required both practically and for the mathematical derivations to work. Generally, means of the a1 parameters were higher than means of the a2 parameters, as expected in a compensatory model. These mean discrimination parameters are also consistent with the MDISC levels stipulated in the simulation. Mean difficulty was centered at approximately zero for all three item pools. Standard deviations for all item parameters were also generally reasonable and showed a good range of variability. Moreover, MDISC and MDIFF values were consistent with the proposed definitions of item pool quality. Therefore, the lower the mean MDISC value for an item pool, the smaller the number of highly discriminating items it contained.

Table 4.1. Descriptive Statistics of Parameters for 300-Length Item Pools

Item Pool      Statistical     a1       a2       d      MDISC    MDIFF
Low            Mean          0.3737   0.1474  -0.0441   0.4062   0.0117
               SD            0.2602   0.0966   0.2623   0.2662   0.5162
Moderate       Mean          1.0659   0.6750  -0.0411   1.4096  -0.0021
               SD            0.4851   0.5108   0.7089   0.3158   0.4974
High           Mean          2.2612   0.8104  -0.0103   2.4109  -0.0141
               SD            0.2863   0.2447   0.0688   0.3144   0.5052

Extended descriptive statistics for the 300-length item pools are provided in Table 4.2. Here, a further breakdown of item parameter descriptive statistics allowed their reasonableness to be determined. Descriptive statistics were computed for M2PLM parameters and further analyzed by content cluster for each item pool. The three content clusters were used to specify AMTA model constraints.

Overall, parameter values for the 300-length item pools are reasonable. The means and standard deviations indicate that an appreciable amount of variation existed within each cluster. Additionally, minimum and maximum values are within a meaningful range of typical MIRT parameters. However, in a few cases the mean a-parameters were quite small, with values of 0.04 and 0.01 for Cluster 2 of the High Discrimination pool and Cluster 2 of the Low Discrimination pool, respectively. This did not pose a major concern, since the mean MDISC values for the entire pool were reasonable, as reported in Table 4.1.

Similar summary statistics were computed for the 900-length item pools. Table 4.3 summarizes properties for the Low, Moderate, and High Discrimination item pools. Notably, mean MDISC values were consistent with those stipulated in Chapter 3 and were also comparable to those of the 300-length item pools. Interestingly, d values for the Low Discrimination pool were lower than those for the Moderate and High Discrimination pools. This implies that Low Discrimination item pools generally consist of harder items. As was the case with the summary statistics reported in Table 4.1, mean MDIFF values for the Moderate and High Discrimination item pools tended to be lower than the stipulated value of 0.0.
Although this is a deviation from the values originally stipulated for mean MDIFF, it did not pose a problem for the intended comparisons, since the resulting 300- and 900-length item pools had almost-identical statistical properties.

[Table 4.2. Descriptive Statistics of Clusters for Low, Moderate and High Discrimination Item Pools of Length 300.]

Table 4.3. Descriptive Statistics of Parameters for 900-Length Item Pools

Item Pool      Statistical     a1       a2       d      MDISC    MDIFF
Low            Mean          0.4063   0.0254  -0.0536   0.4075   0.0210
               SD            0.2974   0.0298   0.2573   0.2984   0.4999
Moderate       Mean          1.3779   0.1106  -0.0604   1.3868   0.0162
               SD            0.2882   0.1168   0.6875   0.2903   0.4878
High           Mean          2.3867   0.1516  -0.0590   2.3959   0.0044
               SD            0.3061   0.1512   1.2457   0.3090   0.5115

An extended summary of item pool statistical properties stratified by cluster for the 900-length item pools is given in Table 4.4. Similar to the findings of Table 4.2, a-parameter values are small in a few of the content clusters. Namely, values of 0.01 and 0.04 were found in Cluster 1 of the Low Discrimination pool and Cluster 2 of the High Discrimination pool, respectively. Again, these small means are not a concern, since the overall MDISC values displayed in Table 4.3 are reasonable.

[Table 4.4. Descriptive Statistics of Clusters for Low, Moderate and High Discrimination Item Pools of Length 900.]

Lastly, recall that a sample of ability parameters was simulated. This sample was assumed to have been randomly selected from a multivariate normal distribution, MVN(μ, Σ), with

$$\Sigma = \begin{pmatrix} 1 & 0.6 \\ 0.6 & 1 \end{pmatrix}, \quad \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

Analysis of the simulated data produced the following estimated ability distribution:

$$\hat{\Sigma} = \begin{pmatrix} 1.008 & 0.604 \\ 0.604 & 1.004 \end{pmatrix}, \quad \hat{\mu} = \begin{pmatrix} 0.0087 \\ -0.0001 \end{pmatrix}.$$

This sample had statistical characteristics very close to those that were stipulated.

4.2 D-optimality

The multidimensional optimality of each AMTA procedure was compared using D-optimality. Recall that D-optimality may be considered a summary index of the MINF, or measurement precision, provided by each test. It is computed from the determinant of Fisher's information matrix and is a single numeric value for each assembled test. Therefore, the higher the value of D-optimality computed, the higher the resulting MINF for that test. An AMTA procedure which selects items that add the most utility to D-optimality is preferred.

Using the six item pools described in the previous section, D-optimality was compared for each assembled test. Tests varied from short tests of length 21 to long tests of length 100. Additionally, these tests were assembled while accounting for the content constraints described previously. Remember that the goal of assembling each test was to maximize MINF.
D-optimality results from assembling these tests are presented in Figures 4.1 to 4.6. For example, Figure 4.1 displays D-optimality results for tests that ranged from 21 to 100 items. This figure shows that at every test length, the QKP method assembled tests with higher D-optimality values than the LP method. Additionally, RND was used as a baseline measure of the effectiveness of LP and QKP. Therefore, since the values for LP and QKP are distinctly higher than those for RND, it can be concluded that using optimization procedures is a valuable endeavor. Moreover, since the values of D-optimality for QKP were higher than those for LP, this indicates that for a 300-length item pool of Low Discrimination, QKP assembles tests with higher measurement precision.

[Figure 4.1. D-Optimality Criterion Results for the 300-Length Low Discrimination Item Pool]
[Figure 4.2. D-Optimality Criterion Results for the 300-Length Moderate Discrimination Item Pool]
[Figure 4.3. D-Optimality Criterion Results for the 300-Length High Discrimination Item Pool]
[Figure 4.4. D-Optimality Criterion Results for the 900-Length Low Discrimination Item Pool]
[Figure 4.5. D-Optimality Criterion Results for the 900-Length Moderate Discrimination Item Pool]
[Figure 4.6. D-Optimality Criterion Results for the 900-Length High Discrimination Item Pool]

It is also noteworthy that QKP D-optimality values begin to diverge from those of LP as the length of the assembled tests increases. This means that QKP is more efficient at assembling longer tests than LP.

For item pools of varying discrimination, there are noticeable differences in D-optimality values. In Figures 4.1 and 4.4 the differences in D-optimality values are distinctly greater than those displayed in Figures 4.2, 4.3, 4.5, and 4.6. The difference lies in the item pools that were used. The first two figures display values for tests that were assembled using Low Discrimination item pools, while the latter four figures display results from Moderate and High Discrimination item pools. This implies that in comparison to LP, the QKP procedure has less of an advantage as the items in the pool become increasingly discriminating.

Overall, these findings imply that for a wide range of test lengths and varying levels of item pool quality, QKP assembles tests of higher measurement precision. Unfortunately, D-optimality has no direct links to common psychometric measures of test quality, and its evaluation will have to continue in the succeeding sections. However, before linking D-optimality to any psychometric indices, an example and comparison of assembled tests will be provided in the next section.

4.3 Comparison of Constructed Tests

Recall that D-optimality was derived mathematically and used in the objective function to facilitate maximizing the multidimensional information of the assembled test.
Therefore, test items that maximize the value of D-optimality are selected from the item pools using each of the three procedures. For illustrative purposes, this section compares assembled tests. Only 21-item tests assembled from Low Discrimination item pools of length 300 and 900 will be described here.

Tables 4.5 and 4.6 summarize the properties of these tests based on items selected from each cluster, type of AMTA method used, and M2PLM parameters. Additionally, the numbers of items selected in common between QKP and the other two methods are identified. In Table 4.5, of the 21 items selected, 16 items were the same between QKP and LP and 6 were the same between QKP and RND. Similarly, in Table 4.6, of the 21 items selected, 18 items were common between QKP and LP and 4 items were common between QKP and RND. Clearly, QKP and LP are similar in the way they select items from the pool. However, a closer look reveals that QKP tends to select items that have higher a-parameter values, as analyzed in greater detail below.

The superiority of QKP can be determined by comparing the multidimensional parameters of items selected by each AMTA procedure. For example, in Table 4.5, among the items not common between LP and QKP, the mean values of a1 and a2 for LP are 1.13 and 0.28, while those for QKP are 2.04 and 0.51. Similarly, in Table 4.6 these mean values for LP are 0.97 and 0.46, while those for QKP are 1.54 and 0.66. Consequently, higher measurement precision would result from QKP-assembled tests since items of higher discrimination contribute more to the computation of MINF based on Equation (2.13). Although this breakdown of findings is only reported for two tests, the same conclusions would be true for other comparisons of QKP- and LP-assembled tests since computations of D-optimality (as shown in Figures 4.1 to 4.6) are consistently higher for QKP-assembled tests.

[Table 4.5. Items Selected by QKP, LP and RND for Tests Assembled Using the 300-Length Low Discrimination Item Pool]

[Table 4.6. Items Selected by QKP, LP and RND for Tests Assembled Using the 900-Length Low Discrimination Item Pool]
4.4 Computational Efficiency

As noted previously, the CPU- and Real-time performance of computer programs helps in the evaluation of their efficiency. Although today's computers have much faster processing speeds than ten years ago, when CPU- and Real-time performance were of greater concern, in this study it is still important to compare the computational speeds of QKP and LP as a way of evaluating the efficiency of both methods. If QKP were mathematically more accurate than LP, but significantly slower at computing a solution, then the mathematical advantage would be less important. This is because in operational settings, the speed at which a computer program computes a solution is far more important than minute deviations in accuracy.

Figures 4.7 to 4.18 present results of the CPU- and Real-time computations. These results were computed for 21- to 100-length assembled tests using each of the six item pools. Recall that item pools were varied in terms of their length and mean MDISC. Generally, LP assembles tests using the least amount of Real and CPU-time. However, the difference between LP and QKP is small enough to conclude that the two procedures perform comparably.

[Figure 4.7. Real Time Performance of Each Assembly Method for the 300-Length Low Discrimination Item Pool]
[Figure 4.8. CPU Time Performance of Each Assembly Method for the 300-Length Low Discrimination Item Pool]
[Figure 4.9. Real Time Performance of Each Assembly Method for the 300-Length Moderate Discrimination Item Pool]
[Figure 4.10. CPU Time Performance of Each Assembly Method for the 300-Length Moderate Discrimination Item Pool]
[Figure 4.11. Real Time Performance of Each Assembly Method for the 300-Length High Discrimination Item Pool]
[Figure 4.12. CPU Time Performance of Each Assembly Method for the 300-Length High Discrimination Item Pool]
[Figure 4.13. Real Time Performance of Each Assembly Method for the 900-Length Low Discrimination Item Pool]
[Figure 4.14. CPU Time Performance of Each Assembly Method for the 900-Length Low Discrimination Item Pool]
[Figure 4.15. Real Time Performance of Each Assembly Method for the 900-Length Moderate Discrimination Item Pool]
[Figure 4.16. CPU Time Performance of Each Assembly Method for the 900-Length Moderate Discrimination Item Pool]
[Figure 4.17. Real Time Performance of Each Assembly Method for the 900-Length High Discrimination Item Pool]
[Figure 4.18. CPU Time Performance of Each Assembly Method for the 900-Length High Discrimination Item Pool]
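The two clocks reported in these figures can be distinguished with a short sketch (illustrative only; the stand-in workload below replaces an actual assembly run): perf_counter measures elapsed wall-clock ("Real") time, while process_time accumulates only the CPU time charged to the process.

# Illustrative sketch of the Real- vs CPU-time distinction; not the
# benchmarking code used in this study.
import time

def timed(assemble, *args):
    """Run an assembly routine and return (result, real_seconds, cpu_seconds)."""
    t0_real, t0_cpu = time.perf_counter(), time.process_time()
    result = assemble(*args)
    real = time.perf_counter() - t0_real
    cpu = time.process_time() - t0_cpu
    return result, real, cpu

# Example with a stand-in workload in place of a real QKP or LP solver:
_, real_s, cpu_s = timed(lambda n: sum(i * i for i in range(n)), 1_000_000)
print(f"Real: {real_s:.4f}s  CPU: {cpu_s:.4f}s")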
4.5 Maximum Information

MAXINFO is a multidimensional index which measures the composite of abilities in the direction of maximum information for a particular test or set of items. This index facilitates the comparison of assembled tests since it was shown previously that the AMTA procedures do not select identical items from the item pool.

Comparisons of MAXINFO are presented graphically in Figures 4.19 to 4.24. These results show that the differences in MAXINFO between QKP and LP are uniform across tests of length 21 to 99, favoring QKP. If these lines are extrapolated, QKP would always assemble tests with higher MAXINFO than LP. Additionally, it appears this would hold true for item pools of any length. Moreover, both QKP and LP have higher values of MAXINFO than RND. This implies it is indeed worth the effort to use optimization methods for assembling multidimensional tests.

It is interesting to note, however, that the higher the item quality, the smaller the difference in MAXINFO. Recall that item quality was imposed on the different item pools by specifying mean MDISC values, with Low Discrimination = 1.20, Moderate Discrimination = 2.20, and High Discrimination = 3.80. What this finding suggests is that QKP is less effective at assembling tests when the multidimensional discrimination of items is higher. However, this conclusion may not be entirely correct.
A more logical explanation is that differences between AMTA procedures are nullified when the coefficient of variation becomes smaller. The coefficient of variation represents the ratio of the standard deviation to the mean, and it is a useful statistic for comparing the degree of variation from one data set to another, even if the means are drastically different from each other. The formula for calculating the coefficient of variation is:

CV = SD / Mean                                                    (4.1)

where SD is the standard deviation. CV computations for the MDISC values in Table 4.1 gave the following values: a) Low Discrimination = 0.66, b) Moderate Discrimination = 0.22, and c) High Discrimination = 0.15. Similarly, from Table 4.3, the CVs have the following values: a) Low Discrimination = 0.73, b) Moderate Discrimination = 0.21, and c) High Discrimination = 0.13. Hence, the efficiency of QKP is lessened when the coefficient of variation becomes smaller.

[Figure 4.19. MAXINFO Results for a 300-Length Low Discrimination Item Pool]
[Figure 4.20. MAXINFO Results for a 300-Length Moderate Discrimination Item Pool]
[Figure 4.21. MAXINFO Results for a 300-Length High Discrimination Item Pool]
[Figure 4.22. MAXINFO Results for a 900-Length Low Discrimination Item Pool]
[Figure 4.23. MAXINFO Results for a 900-Length Moderate Discrimination Item Pool]
[Figure 4.24. MAXINFO Results for a 900-Length High Discrimination Item Pool]

4.6 Pairwise Comparisons Using Clamshell Plots

Pairwise comparisons of MINF were conducted using clamshell plots. The comparisons are presented pairwise because every assembled test was compared for QKP versus LP and for QKP versus RND. Specifically, the comparisons focused on differentiating the measurement precision of QKP in comparison to LP, and of QKP in comparison to RND. The purpose of the first comparison was to determine which procedure was better. The second comparison was used as a baseline measure to assess the magnitude of the difference in the QKP and LP comparisons.

In interpreting these diagrams, blue is positive (i.e., where QKP gives more MINF than the method to which it was being compared) and red is negative (i.e., where QKP gives less MINF than the method to which it was being compared). Additionally, the red and blue clamshells are plotted so that red is inverted, while blue is upright. The density of color and length of the clamshells are also related to the magnitude of MINF represented at those grid points. These comparisons are made by visually inspecting the areas of red and blue clamshell plots while the corresponding item pool quality and test length plots are juxtaposed.

Out of all the tests assembled, only two test lengths were selected for illustrative purposes. The 21-length test is commonly considered a short test, while a 63-length test can be regarded as a long test. Recall that, generally, test length is an important factor in test assembly because the longer the test, the more reliable it becomes, but at significant cost in terms of other resources such as item exposure and size of item pools. Therefore, a 63-item test is expected to be more reliable than a 21-item test.
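As a rough sketch of the quantity each clamshell encodes (my own illustration, not the plotting code used here), the signed MINF difference between two assembled tests can be evaluated over a grid of ability points in a fixed direction, assuming Reckase-style directional information I(θ, u) = Σ_i (a_i'u)² P_i(1 - P_i) for the M2PLM.

# Minimal sketch of the signed MINF difference behind a clamshell comparison;
# the item parameters below are invented for illustration.
import numpy as np

def minf(a, d, theta, angle_deg):
    u = np.array([np.cos(np.radians(angle_deg)), np.sin(np.radians(angle_deg))])
    p = 1.0 / (1.0 + np.exp(-(a @ theta + d)))
    return np.sum((a @ u) ** 2 * p * (1.0 - p))

def minf_difference_grid(a1, d1, a2, d2, angle_deg=45.0):
    grid = np.arange(-4.0, 4.01, 0.4)          # theta1 x theta2 grid as plotted
    diff = np.empty((grid.size, grid.size))
    for i, t1 in enumerate(grid):
        for j, t2 in enumerate(grid):
            theta = np.array([t1, t2])
            # positive -> test 1 (e.g., QKP) gives more MINF here ("blue")
            diff[i, j] = minf(a1, d1, theta, angle_deg) - minf(a2, d2, theta, angle_deg)
    return grid, diff

# Usage with two invented 21-item tests:
rng = np.random.default_rng(3)
a_qkp, d_qkp = rng.uniform(0.1, 0.7, (21, 2)), rng.normal(0, 0.5, 21)
a_lp, d_lp = rng.uniform(0.1, 0.6, (21, 2)), rng.normal(0, 0.5, 21)
grid, diff = minf_difference_grid(a_qkp, d_qkp, a_lp, d_lp)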
The findings from these pairwise comparisons are presented in Figures 4.25 to 4.48. At each test length, the level of MINF was compared. Looking at Figures 4.25 and 4.26, for example, there is a difference between the clamshell plots. Figure 4.25 compares QKP and RND, while Figure 4.26 compares QKP and LP. The length and darkness of the clamshells are slightly less distinct in the former than in the latter. This means that QKP-assembled tests give higher MINF when compared to RND-assembled tests than when compared to LP-assembled tests. This was the expected outcome based on the findings from the D-optimality and MAXINFO comparisons. In these figures, the presence of red, inverted clamshells indicates the areas of the two-dimensional space where the comparison method gives higher MINF than QKP. However, since these areas of red inverted clamshells are few and indistinct, LP and RND do not give that much more MINF in these regions of the multidimensional space than QKP.

Additional conclusions can be drawn by looking at the remaining pairwise comparison plots. QKP assembles tests with higher MINF than LP and RND when the item pool consists of items with low discrimination. Evidence for this conclusion is corroborated by Figures 4.25 to 4.28 and 4.37 to 4.40. Furthermore, QKP is more efficient for shorter tests. Evidence for the latter observation comes from the pairwise clamshell plot comparisons for the 21- and 63-length tests. Another conclusion highlighted by these findings is that item pools of high multidimensional discrimination do not provide much difference in the computation of MINF among all three methods. Support for this observation comes from looking at tests assembled from High Discrimination item pools: these plots do not show much difference among the three methods. However, item pool length does not seem to have an impact on the measurement precision of assembled tests.

Some overall conclusions can be drawn from these pairwise comparisons. Generally, these plots indicate that QKP provides better measurement precision in the center of the two-dimensional space, that is, where θ1 and θ2 values are either negatively related or one value is considerably larger than the other. These conclusions are consistent with the use of the M2PLM, which is compensatory in the sense that a low value on one dimension is compensated for by a high value on the corresponding dimension and vice versa. Lastly, LP generally provides more MINF in the extreme portions of the θ1- and θ2-space.

[Figure 4.25. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Low Discrimination 300-Length Item Pool]
[Figure 4.26. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Low Discrimination 300-Length Item Pool]
[Figure 4.27. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Low Discrimination 300-Length Item Pool]
[Figure 4.28. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Low Discrimination 300-Length Item Pool]
[Figure 4.29. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool]
[Figure 4.30. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool]
[Figure 4.31. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool]
[Figure 4.32. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 300-Length Item Pool]
[Figure 4.33. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the High Discrimination 300-Length Item Pool]
[Figure 4.34. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the High Discrimination 300-Length Item Pool]
[Figure 4.35. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the High Discrimination 300-Length Item Pool]
[Figure 4.36. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the High Discrimination 300-Length Item Pool]
[Figure 4.37. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Low Discrimination 900-Length Item Pool]
[Figure 4.38. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Low Discrimination 900-Length Item Pool]
[Figure 4.39. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Low Discrimination 900-Length Item Pool]
[Figure 4.40. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Low Discrimination 900-Length Item Pool]
[Figure 4.41. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool]
[Figure 4.42. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool]
[Figure 4.43. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool]
[Figure 4.44. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the Moderate Discrimination 900-Length Item Pool]
[Figure 4.45. Pairwise Comparison of MINF for 21-Item QKP- and RND-Assembled Tests Using the High Discrimination 900-Length Item Pool]
[Figure 4.46. Pairwise Comparison of MINF for 21-Item QKP- and LP-Assembled Tests Using the High Discrimination 900-Length Item Pool]
[Figure 4.47. Pairwise Comparison of MINF for 63-Item QKP- and RND-Assembled Tests Using the High Discrimination 900-Length Item Pool]
[Figure 4.48. Pairwise Comparison of MINF for 63-Item QKP- and LP-Assembled Tests Using the High Discrimination 900-Length Item Pool]

4.7 Relative Efficiency

Recall that Relative Efficiency (RE) is a method used for comparing the measurement efficiency of two tests. Specifically, this is accomplished by comparing their information functions graphically. In this section, the RE of QKP-, LP-, and RND-assembled tests will be compared. Remember that after tests were constructed using MIRT parameters, dichotomous responses were simulated for each assembled test using a fixed sample of 5,000 simulated abilities. These dichotomous responses were then calibrated unidimensionally for the 21- and 63-length tests using BILOG-MG. Essentially, this section repeats the comparison conducted with the pairwise clamshell plots. Therefore, RE and UIRT information comparisons were conducted for item pools of varying discrimination and length using the three AMTA procedures for 21- and 63-length tests. Results from computations of RE and UIRT information are presented in Figures 4.49 to 4.72.
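As a sketch of how such RE curves arise (illustrative; the 3PL parameters below are invented stand-ins for a BILOG-MG calibration, and the usual D = 1.7 scaling is assumed), test information is computed for each calibrated test and the pointwise ratio is taken.

# Hedged sketch of the RE computation, not the study's code.
import numpy as np

def info_3pl(a, b, c, theta):
    """Unidimensional 3PL test information at each theta value."""
    theta = np.asarray(theta)[:, None]                  # column of abilities
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    # Standard 3PL item information, summed over items:
    item_info = (1.7 * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2
    return item_info.sum(axis=1)

theta = np.linspace(-4, 4, 161)
# Invented parameters standing in for two calibrated 21-item tests:
rng = np.random.default_rng(1)
a_qkp, a_lp = rng.uniform(1.2, 2.2, 21), rng.uniform(0.8, 1.8, 21)
b, c = rng.normal(0, 1, 21), np.full(21, 0.2)
re_lp_vs_qkp = info_3pl(a_lp, b, c, theta) / info_3pl(a_qkp, b, c, theta)
# RE < 1 means the LP-assembled test is less efficient than QKP at that theta.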
For example, Figure 4.49 shows that QKP-assembled tests provide the most information at about 0 on the ability scale. As expected, LP-assembled tests provide the second highest amount of information. The findings from Figure 4.49 are then translated to RE in Figure 4.50. The horizontal line at RE = 1 indicates the position where tests assembled by all three methods are equally efficient at providing UIRT information. Around the 50th percentile, LP-assembled tests are about 0.7 times as efficient as QKP-assembled tests. Similarly, around the 50th percentile, RND-assembled tests are about 0.3 times as efficient as QKP-assembled tests at estimating ability.

Additionally, the relative merits of each assembled test can be compared in terms of efficiency along the ability continuum. Figure 4.50 shows that QKP-assembled tests are more efficient between the 35th and 72nd percentiles, while LP-assembled tests are more efficient over the remainder of the continuum. This supports the conclusions drawn from the clamshell plot pairwise comparisons that LP-assembled tests are more efficient at extreme portions of the ability continuum.

The remaining plots of UIRT information show that QKP-assembled tests consistently provide the highest information. This finding holds true regardless of the length of the item pool or its level of discrimination. The RE plots are not as easy to interpret. They seem reasonable for the tests assembled using the Low Discrimination item pools, but become erratic and uneven for tests assembled from the High Discrimination item pools. One reason could be that the differential effect of content clusters is becoming realized in the modeling. However, given that tests assembled from High Discrimination item pools had no recognizable difference in the clamshell plot comparisons, these RE findings may not be reliable or meaningful. Therefore, these findings may simply be a case of modeling noise.

[Figure 4.49. Test Information Functions for 21-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.50. Efficiency Functions for 21-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.51. Test Information Functions for 63-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.52. Efficiency Functions for 63-Item Tests Assembled from the 300-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.53. Test Information Functions for 21-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.54. Efficiency Functions for 21-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.55. Test Information Functions for 63-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.56. Efficiency Functions for 63-Item Tests Assembled from the 300-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.57. Test Information Functions for 21-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.58. Efficiency Functions for 21-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.59. Test Information Functions for 63-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.60. Efficiency Functions for 63-Item Tests Assembled from the 300-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.61. Test Information Functions for 21-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.62. Efficiency Functions for 21-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.63. Test Information Functions for 63-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.64. Efficiency Functions for 63-Item Tests Assembled from the 900-Length, Low Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.65. Test Information Functions for 21-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.66. Efficiency Functions for 21-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.67. Test Information Functions for 63-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.68. Efficiency Functions for 63-Item Tests Assembled from the 900-Length, Moderate Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.69. Test Information Functions for 21-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.70. Efficiency Functions for 21-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.71. Test Information Functions for 63-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]
[Figure 4.72. Efficiency Functions for 63-Item Tests Assembled from the 900-Length, High Discrimination Item Pool with the QKP, LP, and RND Methods]

CHAPTER 5
DISCUSSION AND CONCLUSIONS

The first purpose of this study was to formulate a new method for maximizing multidimensional information in automated multidimensional test assembly. To accomplish this purpose, QKP was adapted to AMTA in conjunction with a new computer algorithm called the QKP heuristic. The second purpose was to evaluate the QKP procedure. This was accomplished through a series of computational experiments based on realistic test scenarios. Comparisons of QKP to LP were conducted to determine whether the mathematical formulation reduced mathematical error. Another reason for this comparison was to determine the computational efficiency of QKP. A baseline measure was used to determine whether it was at all meaningful to use QKP and LP. This baseline measure used a random item selection method comparable to the other two methods and was called RND.

Specifically, four conditions were manipulated in the computational experiments. There were three levels of item pool quality, two levels of item pool size, several levels of test length (although in appropriate cases only two levels were reported), and three AMTA methods. These conditions were all evaluated for QKP in comparison to the LP and RND methods. Evaluations were mainly of two types. The first type used multidimensional data and indices while the second type used unidimensional data and indices. Hence, both multidimensional and unidimensional data were simulated. The multidimensional data were simulated using the M2PLM for creating the various item pools. The QKP, LP and RND procedures were then used to assemble tests that ranged from 21 to 99 items in length. After assembling these multidimensional tests, a fixed sample of 5,000 simulated abilities was then used to simulate dichotomous responses for each assembled test. These dichotomous responses were then calibrated with BILOG-MG using a 3PLM. Evaluations of the multidimensional and unidimensional tests compared QKP-, LP-, and RND-assembled tests for all conditions.

5.1 Unique Contributions

This dissertation made several new contributions. This was the first time quadratic knapsack programming has been used in AMTA. Prior to this dissertation, authors have simply mentioned the possibility of using it (Veldkamp, 2002a) or have used it in non-IRT contexts (Feuerman & Weiss, 1973). The successful use of QKP in MIRT allows numerous extensions to be made to the model proposed herein. The QKP heuristic was also another new contribution to the psychometric literature.
This heuristic is flexible enough that it can be extended to assemble tests with: a) more complex constraints such as inequalities, b) multiple test forms, c) complex multiple-section or multi-stage designs, and d) fitting target information functions.

The last contribution has more to do with MIRT than AMTA. Specifically, a new MIRT index called MAXINFO was formulated. This index can be used to measure the largest composite of information in a particular direction for any multidimensional test. Hence, this index can be used to assess the overall quality of assembled tests and is applicable to multidimensional tests in general.

5.2 Computational Experiments and Evaluation Criteria

Overall, findings from the computational experiments indicate that QKP is a better method for assembling multidimensional tests. Recall that the purpose of assembling these tests was to maximize multidimensional information. Therefore, QKP assembles tests with higher measurement precision than LP. Moreover, the incorporation of RND into these computational experiments seems to indicate that it is worthwhile to use optimization methods (i.e., QKP and LP) for assembling tests. Hence, QKP seems to be an improvement on LP. The advantage of QKP may be due to its ability to reduce the error produced by linearization. Another advantage may be due to the manner in which the QKP heuristic selects items. Discussion of the computational experiments and evaluation criteria used in this study follows in the succeeding sections.

The random method performs comparably only when it comes to the MAXINFO index. However, from looking at the other five evaluation criteria (i.e., D-optimality, CPU-time, Real-time, pairwise clamshell plots and relative efficiency), we see that the random method is far less efficient. Also recall that the random method is actually "pseudo-random" because it is based on the D-optimality criterion and uses 100 iterations in order to select items from the item pool. This design of the random method was chosen so that it was a more realistic comparison to QKP and LP.

5.2.1 Item Pool and Ability Simulations

The simulated item pools fit the pre-specified conditions fairly well. Summary statistics given in Tables 4.1 and 4.3 showed that target mean MDISC and MDIFF levels were close to pre-specified values. Additionally, standard deviations were similar to those stipulated for the simulation. A breakdown of item pool statistics by content cluster in Tables 4.2 and 4.4 also showed that simulated M2PLM values were reasonable and had appreciable levels of variability.

The mean and covariance of ability parameters for the simulated abilities closely followed a multivariate normal distribution, as stipulated. The correlation between the two simulated ability traits was 0.604, which was close to the specified correlation of 0.60. Hence, the distribution of simulee abilities had a moderate correlation and the default ability distribution (standardized bivariate normal distribution), which produces the least estimation error and is assumed in standard calibration software such as NOHARM and TESTFACT.

Generally, the variances of the a2 values were much smaller than those of the a1 values. This finding, along with the low means for the a2 values, suggests that the simulated item pools were mainly sensitive to differences on the first dimension. This was not necessarily the intended outcome of the simulation. However, all methods were evaluated uniformly using the same item pools, and this provided an adequate comparison of the three methods.
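A check of the kind described above for the ability simulation can be sketched as follows (a minimal illustration, not the study's code): draw 5,000 abilities from the stipulated bivariate normal distribution and verify that the sample mean vector and correlation are close to their targets of 0 and 0.60.

# Illustrative ability-simulation check.
import numpy as np

rng = np.random.default_rng(42)
mu = np.zeros(2)
sigma = np.array([[1.00, 0.60],
                  [0.60, 1.00]])
theta = rng.multivariate_normal(mu, sigma, size=5000)

print("sample means:", theta.mean(axis=0))                 # expected near (0, 0)
print("sample correlation:", np.corrcoef(theta.T)[0, 1])   # expected near 0.60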
Additionally, the assembled 21-item tests in Tables 4.5 and 4.6 indicate that all three methods generally chose the highly discriminating items that were available in each of the item pools. Therefore, even though mean values generally indicated anomalies in the item pools, this did not have an adverse impact on the veracity of the computational experiments.

Most of the information provided is found along the θ1 dimension. This finding is consistent with the finding that the simulated item pools were mainly sensitive to differences on the first dimension, as noted above. Although this was not the intended purpose of the simulations, since the assembly methods were compared using the same item pools, the computational experiments proved to be adequate for evaluating the three methods.

5.2.2 Computations of D-optimality and Characteristics of Assembled Tests

D-optimality may be considered a summary index of the multidimensional information or measurement precision provided by each test. Computations of D-optimality were provided for each of the six item pools for test lengths ranging from 21 to 99 items. These results are presented in Figures 4.1 to 4.6. The diagrams showed that, in general, QKP-assembled tests produced the highest values of D-optimality, indicating they had the highest measurement precision among tests assembled with the three assembly methods (i.e., QKP, LP, and RND). Additionally, QKP-assembled tests had higher measurement precision than LP- and RND-assembled tests for the low-discrimination item pools. This finding is reflected in Figures 4.1 and 4.4.

The smallest difference in D-optimality among the three assembly methods existed for high-discrimination item pools. This finding seems to indicate that QKP is most efficient when item pools have low discrimination, but less efficient for item pools of high discrimination. As extended discussions will reinforce below, measurement precision was not measured at a single point, but was compared for different regions of the multidimensional and unidimensional space. Further simulation studies would have to be conducted in order to determine the reason for this occurrence.

Sample test characteristics for assembled tests are provided in Tables 4.5 and 4.6. Only 21-item tests assembled from Low Discrimination item pools of length 300 and 900 were shown, for illustrative purposes. The properties of these tests were summarized based on items selected from each cluster, type of AMTA method used, and M2PLM parameters. Additionally, the numbers of items selected in common between QKP and the other two methods are identified. As expected, there were higher numbers of items selected in common between QKP- and LP-assembled tests than between QKP- and RND-assembled tests. In Table 4.5, of the 21 items selected, 16 were the same between QKP and LP and 6 items were the same between QKP and RND. Similarly, in Table 4.6, of the 21 items selected, 12 were common between QKP and LP and 4 items were common between QKP and RND.

QKP and LP performed similarly in the way they select items from the pool. However, a closer look revealed that QKP tended to select items with higher a-parameter values. In Table 4.5, among the items not common between LP and QKP, the mean values of a1 and a2 for LP are 1.13 and 0.28, while those for QKP are 2.04 and 0.51.
Similarly, in Table 4.6 these mean values for LP are 0.97 and 0.46, while those for QKP are 1.54 and 0.66. Consequently, higher measurement precision would result from QKP-assembled tests since items of higher discrimination contribute more to the computation of MINF based on Equation (2.13).

Generally, a2 parameters in the simulated item pools had means near zero. This was an unanticipated outcome of the simulations and would imply that the item pools were generally most sensitive to one dimension. However, this outcome did not appear to affect the evaluations that were done using the computational experiments because assembled tests rarely contained items with a2 parameter values that were very close to zero. Hence, the optimization methods closely followed the stipulation of the objective function to select items with a1 and a2 parameters that were as highly discriminating as possible so that D-optimality was maximized. This finding seems to imply that even when poor item pools exist, optimization methods are still a very good way of assembling tests that meet both content and statistical specifications.

5.2.3 Computational Efficiency

An analysis of the CPU- and Real-time performance of QKP, LP and RND helped with the evaluation of each method's efficiency in formulating a solution. Figures 4.7 to 4.18 presented results of the CPU- and Real-time computations. These results were computed for 21- to 99-length assembled tests using each of the six item pools. Recall that item pools were varied in terms of their length and mean MDISC. Generally, LP assembled tests using the least amount of Real and CPU-time. However, the difference between LP and QKP is small enough to conclude that the two procedures performed equally well.

5.2.4 MAXINFO and Pairwise Comparisons with Clamshell Plots

It was shown in Tables 4.5 and 4.6 that QKP, LP and RND do not select identical items from the item pool. A multidimensional index called MAXINFO was formulated to measure the composite of abilities in the direction of maximum information for each assembled test. Comparisons of MAXINFO were presented graphically in Figures 4.19 to 4.24. These results showed that the differences in MAXINFO between QKP and LP were uniform for tests of length 21 to 99, favoring QKP. QKP and LP had higher values of MAXINFO than RND. This implied that it is indeed worth the effort to use optimization methods for assembling multidimensional tests.

Comparisons of MINF were conducted using clamshell plots. These comparisons were presented pairwise because every assembled test was compared for QKP versus LP and for QKP versus RND. The comparisons focused on differentiating the measurement precision of QKP in comparison to LP, and of QKP in comparison to RND. The purpose of the first comparison was to determine which procedure was better. The second comparison was used as a baseline measure to assess the magnitude of the difference in the QKP and LP comparisons. In interpreting these diagrams, blue is positive (i.e., where QKP gives more MINF than the method to which it was being compared) and red is negative (i.e., where QKP gives less MINF than the method to which it was being compared). Additionally, the red and blue clamshells are plotted so that red is inverted, while blue is upright. The density of color and length of the clamshells are also related to the magnitude of MINF represented at those grid points.
These comparisons are made by visually inspecting the areas of red and blue clamshell plots while the corresponding item pool quality and test length plots are juxtaposed. Only two test lengths were selected for illustrative purposes. Recall that, generally, test length is an important factor in test assembly because the longer the test, the more reliable it becomes, but at significant cost in terms of other resources such as item exposure and size of item pools. The 21-length test was considered a short test while a 63-length test was considered a long test. Hence, a 63-item test is expected to be more reliable than a 21-item test.

Findings from these pairwise comparisons were presented in Figures 4.25 to 4.48. QKP-assembled tests give higher MINF when compared to RND-assembled tests than when compared to LP-assembled tests. This outcome was expected based on the D-optimality and MAXINFO findings. Overall, the pairwise comparisons showed that QKP-assembled tests were most efficient in the middle portions of the ability continuum. Furthermore, findings showed that QKP-assembled tests had higher MINF than LP and RND when the item pool consisted of items with low discrimination. Evidence for this conclusion is corroborated by Figures 4.25 to 4.28 and 4.37 to 4.40. QKP also proved to be more efficient for assembling shorter tests, as evidenced by comparisons using pairwise clamshell plots of the 21- and 63-length tests.

This conclusion was supported by the observation that item pools of high multidimensional discrimination do not assemble tests which show much difference in the computation of MINF when QKP-, LP- and RND-assembled tests were compared. Recall that the quality of the item pool (low, medium and high discrimination) was defined by mean MDISC for each of the simulated item pools. Figures 4.33 to 4.36 are pairwise comparisons of QKP to the linear (LP) and random (RND) methods. These comparisons indicate that there is virtually no distinction between QKP and the other two methods for 21- and 63-length assembled tests. Similar results were found for Figures 4.45 to 4.48. What these sets of figures have in common is the use of the high discrimination item pools. That is, when item pools contain highly discriminating test items, the QKP method does not assemble tests of higher multidimensional information.

Additional conclusions can be drawn from these pairwise comparisons. These plots indicated that QKP provides better measurement precision in the center of the two-dimensional space, that is, where θ1 and θ2 values are either negatively related or one value is considerably larger than the other. These conclusions are consistent with the use of the M2PLM, which is compensatory in the sense that a low value on one dimension is compensated for by a high value on the corresponding dimension and vice versa. The last conclusion drawn was that LP generally provided more MINF in the extreme portions of the two-dimensional space.

5.2.5 Unidimensional Relative Efficiency of Assembled Tests

Additionally, in order to evaluate the measurement precision of a single composite score that may be created from a weighted linear combination of the multidimensional scores, the multidimensional parameters for each assembled test were used to simulate dichotomous item responses. BILOG-MG was used to fit a three-parameter logistic model to each set of data, and the parameters were then used to compute unidimensional information.
Functionally, this procedure created the best unidimensional approximation to the multidimensional space spanned by the test items. The ratio of the information functions was then computed to provide a measure of the relative efficiency of composite test scores created with each method. Consequently, this evaluation criterion serves as a relative measure of the measurement precision that might be expected in composite scores, which may be more meaningful to practitioners and psychometricians less familiar with multidimensional IRT indices.

Results showed that when assembling tests to report a single score from item pools composed of multidimensional items, QKP-assembled tests provide the most UIRT information. The latter conclusion is supported by findings from the computations of UIRT information, which showed that QKP-assembled tests consistently gave the highest amount of information. The information of each test was also evaluated as relative efficiency by computing the ratio of information for tests assembled with the LP and RND methods to that of tests assembled with the QKP method. Computations of relative efficiency showed that at the targeted ability level of θ = 0, QKP-assembled tests were the most efficient. Although the highest amount of UIRT information was generally obtained at θ = 0 on the ability scale, the plots of information did not make it clear which tests gave the most information at other points on the ability scale. Recall that the 5,000 simulees had a mean ability of 0; hence we expected the highest information to occur at this point. Plots of relative efficiency (RE) showed that QKP did not assemble the most efficient test over the entire ability continuum. For example, in Figure 4.50, QKP is only more efficient than LP between the 30th and 80th percentiles of ability, while LP is more efficient than QKP on the remainder of the ability distribution. Also, the overall level of discrimination of the item pool had an influence on the RE of assembled tests. QKP-assembled tests were more efficient than both LP- and RND-assembled tests when they were assembled from Low Discrimination item pools. The RE advantage of QKP-assembled tests was less distinct for Moderate and almost imperceptible for High Discrimination item pools.
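For clarity, the relative-efficiency criterion amounts to a pointwise ratio of test information functions. The following is a minimal sketch, assuming the standard three-parameter logistic item information; the (a, b, c) triples would come from the BILOG-MG calibrations, which are not reproduced here, and the scaling constant D is an assumption.

```python
import numpy as np

D = 1.702  # normal-metric scaling constant (an assumption; use 1.0 for the logistic metric)

def info_3pl(theta, a, b, c):
    # Fisher information of a single 3PL item at ability theta
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def test_info(theta, items):
    # items: iterable of (a, b, c) triples from a unidimensional calibration
    return sum(info_3pl(theta, a, b, c) for a, b, c in items)

def relative_efficiency(theta, items_other, items_qkp):
    # RE(theta) < 1 means the comparison test yields less information
    # at theta than the QKP-assembled test
    return test_info(theta, items_other) / test_info(theta, items_qkp)
```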
5.3 Limitations

As with any simulation study, there were questions about the generalizability of findings to real testing situations. In this study, care was taken to ensure that the simulated conditions were realistic. The IRT model used to generate and analyze data is commonly used in the literature for item calibration and scoring. The chosen item parameters were congruent with those typically used and reported in many studies dealing with real and simulated data. However, no matter how realistic the generated data may be, they can never substitute for real data. Therefore, the absence of real data should be considered a major limitation of this study. The fact that simulated data generally behave too well should not weaken the value of the findings obtained, however, since the focus of this study was mainly methodological. Nonetheless, test developers should expect some decrease in performance between the development phase, where simulations are conducted, and the operational phase, where testing is implemented.

In practice, certification tests are constructed in order to make pass/fail decisions at a specific point on the ability continuum. However, the evaluation conducted herein did not assemble tests to achieve such a purpose. Rather, this evaluation was conducted simply to show which of the three AMTA methods assembled tests most efficiently and accurately. Therefore, until further research is conducted on using QKP for creating cut-scores, this AMTA method remains severely limited in its practical application to real testing contexts. Moreover, the conclusions drawn about the performance of QKP in comparison to LP may not be completely accurate due to the simplicity of the test specifications used here. It is expected that as more constraints are included in the QKP model, its applicability and efficiency may differ from the findings presented in the current study.

5.4 Future Research

The success of optimization algorithms hinges on the choice of a good initial solution. In many cases a finite limit on the number of iterations has to be imposed so that the algorithm does not enter an infinite loop. In this study this was not a problem, since the computational experiments were carefully formulated and hence the QKP heuristic was fairly well behaved. However, with the extension of the QKP model to solve larger and more complex AMTA problems, these issues may become a concern. Additionally, there are times when an algorithm can give different solutions even when applied repeatedly to the same problem. Several factors may cause these inconsistencies: the degree and type of nonlinearity, the number of design variables, the number of constraints, and the accuracy of the initial guess (Venkataraman, 2002). Therefore, it is important for further evaluations of QKP to include and vary these conditions.

Future studies that consider more variations in test length, item pool characteristics, and test specifications will be important to enable the study's results to be further generalized. Moreover, additional variations in simulees' characteristics will also allow better evaluations of the QKP method in comparison to the other methods. The most important extension of this work, however, will be to the kinds of problems currently handled by LP approaches (such as content constraints, enemy sets, etc.). The process of ATA is quite complex, and AMTA is even more complex. Issues of dimensionality and computation made the development of this method quite difficult. The solution presented here pertains only to the two-dimensional case, and extensions to higher dimensions require more derivations. Therefore, more efficient and suitable mathematical and computational extensions are required for this procedure to be applicable to a larger number of testing situations.

Solving combinatorial optimization problems like QKP can be a difficult task. This difficulty arises because, unlike in LP, where the feasible region is a convex set, in combinatorial problems a discrete set of feasible points must be searched in order to find an optimal solution. Hence, in LP, any local solution is a global optimum. In the case of QKP, however, there can be situations in which many local optima exist. Therefore, finding a global optimum requires one to prove that a particular solution dominates all feasible points by arguments other than the calculus-based derivative approaches of convex programming (Hoffman & Padberg, 1996). Hence, it is important for some form of quality control to be developed for the QKP procedure.
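To make the combinatorial structure concrete, the following is a minimal sketch of a generic greedy construction for the QKP (maximize x'Qx subject to a knapsack constraint on binary x). It illustrates why such procedures can stop at local optima; it is not the heuristic evaluated in this study, and the profit matrix Q, weights w and capacity are placeholders.

```python
import numpy as np

def qkp_greedy(Q, w, capacity):
    # Greedy construction for: maximize x'Qx subject to w'x <= capacity, x binary.
    # Q is a symmetric, nonnegative profit matrix; adding item j to the current
    # selection S changes the objective by Q[j, j] + 2 * sum(Q[j, i] for i in S).
    n = len(w)
    x = np.zeros(n, dtype=bool)
    used = 0.0
    while True:
        best, best_gain = None, 0.0
        for j in np.flatnonzero(~x):
            if used + w[j] > capacity:
                continue
            gain = Q[j, j] + 2.0 * Q[j, x].sum()  # marginal change in x'Qx
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:  # no feasible item improves the objective
            return x
        x[best] = True
        used += w[best]

# Tiny placeholder instance
rng = np.random.default_rng(2)
Q = rng.uniform(0.0, 1.0, (10, 10))
Q = (Q + Q.T) / 2.0
print(qkp_greedy(Q, w=np.ones(10), capacity=4))
```

Because each step commits to the locally best item, two runs that break ties differently, or start from different partial solutions, can end at different local optima. This is precisely the kind of behavior a quality-control procedure would need to flag.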
One such quality control would be a method for determining situations in which the QKP-derived solution is infeasible. Generally, an ATA or AMTA solution is considered infeasible when there is no solution that satisfies all constraints. Methods for analyzing (Huitzing et al., 2005) and solving (Timminga, 1998) infeasibility problems in test assembly exist. However, thus far these have only been developed for unidimensional tests. Therefore, they should also be adapted to the AMTA context.

The adaptation of QKP to commercially available software such as CPLEX would provide a standard form for sharing this software with other researchers. Not only is software like CPLEX easier to use, but it is already commonly used for ATA and would provide some standardization of the software used. Another advantage of using CPLEX is that it has built-in sensitivity analyses. Sensitivity analysis refers to determining the range of parameters for which the optimal solution still has the same variables in the basis, even though values at the solution may change (Venkataraman, 2002). Another likely sensitivity test would use the likelihood ratio test to compute regions of arbitrary confidence based on the maximum likelihood estimates that arise from using D-optimality (see Neyer, 1994).

REFERENCES

Ackerman, T. A. (1987). The use of unidimensional item parameter estimates of multidimensional items in adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.

Ackerman, T. A. (1994). Creating a test information profile for a two-dimensional latent space. Applied Psychological Measurement, 18(3), 257-275.

Ackerman, T. A. (1996). Graphical representation of multidimensional item response theory analyses. Applied Psychological Measurement, 20(4), 248-258.

Ackerman, T. A., Gierl, M. J., & Walker, C. (2003). Using multidimensional item response theory to evaluate educational and psychological tests. Educational Measurement: Issues and Practice, 22(3), 37-53.

Adema, J. J. (1992). Implementation of the branch-and-bound method for test construction. Methodika, 6, 99-117.

Anderson, T. W. (1984). An introduction to multivariate statistical analysis (2nd ed.). New York: Wiley.

Ansley, T. N. & Forsyth, R. A. (1985). An examination of the characteristics of unidimensional IRT parameter estimates derived from two-dimensional data. Applied Psychological Measurement, 9, 37-48.

Baumol, W. J. & Bushnell, R. C. (1967). Error produced by linearization in mathematical programming. Econometrica, 35, 447-471.

Berger, M. P. F. (1991). On the efficiency of IRT models when applied to different sampling designs. Applied Psychological Measurement, 15, 293-306.

Berger, M. P. F. (1992). Sequential sampling designs for the two-parameter item response theory model. Psychometrika, 57, 521-538.

Berger, M. P. F. & Veerkamp, W. J. J. (1994). A review of selection methods for optimal test design (Report No. RR-94-4). Enschede, Netherlands: Twente University, Faculty of Educational Science and Technology.

Berger, M. P. F. (1998). Optimal design of tests with dichotomous and polytomous items. Applied Psychological Measurement, 22(3), 248-258.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley.

Bogan, E. D. & Yen, W. M. (1983). Detecting multidimensionality and examining its effects on vertical equating with the three-parameter logistic model. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.
Burden, R. L. & Faires, J. D. (1997). Numerical analysis (6th ed.). New York: Brooks/Cole.

Caprara, A., Pisinger, D., & Toth, P. (1999). Exact solution of the quadratic knapsack problem. INFORMS Journal on Computing, 11, 125-137.

Caprara, A., Kellerer, H., Pferschy, U., & Pisinger, D. (2000). Approximation algorithms for knapsack problems with cardinality constraints. European Journal of Operational Research, 123, 333-345.

Crocker, L. & Algina, J. (1986). Introduction to classical and modern test theory. Fort Worth, TX: Harcourt Brace Jovanovich.

Davey, T. (2005). Improving ETS's psychometric infrastructure. R & D seminar presentation, Educational Testing Service, Princeton, NJ.

de Gruijter, D. N. M. (1985). A note on the asymptotic variance-covariance matrix of item parameter estimates in the Rasch model. Psychometrika, 50, 247-249.

de Gruijter, D. N. M. (1988). Standard errors of item parameter estimates in incomplete designs. Applied Psychological Measurement, 12, 109-116.

Federation of State Medical Boards & National Board of Medical Examiners. (1996). General instructions, content description, and sample items. Philadelphia, PA: National Board of Medical Examiners.

Feuerman, F. & Weiss, H. (1973). A mathematical programming model for test construction and scoring. Management Science, 19, 961-966.

Gallo, G., Hammer, P. L., & Simeone, B. (1980). Quadratic knapsack problems. Mathematical Programming, 12, 132-149.

Glover, F. & Woolsey, R. E. (1974). Converting the 0-1 polynomial programming problem to a 0-1 linear program. Operations Research, 22, 180-182.

Glover, F. (1975). Improved linear integer formulations of the nonlinear integer problems. Management Science, 23, 445-460.

Green, B. F. (1990). Notes on the item information function in the multidimensional compensatory IRT model (Report No. 88-10). Baltimore, MD: Johns Hopkins University, Psychometric Laboratory.

Hambleton, R. & Swaminathan, H. (1984). Item response theory: Principles and applications. New York: Springer-Verlag.

Hillier, F. S. & Lieberman, G. J. (1995). Introduction to operations research. New York: McGraw-Hill.

Hoffman, K. & Padberg, M. (1996). Integer and combinatorial programming. In S. I. Gass & C. M. Harris (Eds.), Encyclopedia of operations research and management science. Boston: Kluwer Academic.

Huitzing, H. A., Veldkamp, B. P., & Verschoor, A. J. (2005). Infeasibility in automated test assembly models: A comparison study of different methods. Journal of Educational Measurement, 42(3), 223-243.

Kellerer, H., Pferschy, U., & Pisinger, D. (2004). Knapsack problems. New York: Springer.

Kendall, M. G. & Stuart, A. (1967). The advanced theory of statistics (Vol. 2, 2nd ed.). New York: Hafner.

Kim, J.-P. (2001). Proximity measures and cluster analyses in multidimensional item response theory. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.

Li, Y. H. & Schafer, W. D. (2003). The effect of item selection methods on the accuracy of CATs' ability estimates when item parameters are contaminated with measurement errors. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Luecht, R. M. (1998). Computer-assisted test assembly using optimization heuristics. Applied Psychological Measurement, 22, 224-236.

Luecht, R. M. (1996). Multidimensional computerized adaptive testing in a certification or licensure context. Applied Psychological Measurement, 20(4), 389-404.

Martineau, J. A., Mapuranga, R., & Ward, K. (2006, April). Confirming content structure in standardized state assessments using multidimensional item response theory. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

McKinley, R. L. & Reckase, M. D. (1983). The use of IRT analysis on dichotomous data from multidimensional tests. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Quebec.

Michigan Educational Assessment Program. (2003). Design and validity of the test. Retrieved March 2004 from http://www.meap.org/.

Miller, T. R. & Hirsch, T. M. (1992). Cluster analysis of angular data in applications of multidimensional item response theory. Applied Measurement in Education, 5, 193-212.

Min, K. (2003). The impact of scale dilation on the quality of the linking of multidimensional item response theory calibrations. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.

Muraki, E. & Engelhard, G. (1985). Full-information item factor analysis: Applications to EAP scores. Applied Psychological Measurement, 9, 417-430.

National Assessment Governing Board, U.S. Department of Education. (2005). Economics framework for the 2006 National Assessment of Educational Progress. Retrieved December 2006 from http://www.nagb.org/pubs/economics_06.pdf.

Nemhauser, G. L. & Wolsey, L. A. (1988). Integer and combinatorial optimization. New York: John Wiley.

Neyer, B. T. (1994). A D-optimality-based sensitivity test. Technometrics, 36(1), 61-70.

Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4(3), 207-230.

Reckase, M. D. (1985). The difficulty of test items that measure more than one dimension. Applied Psychological Measurement, 9, 401-412.

Reckase, M. D. & Ackerman, T. A. (1986). Building a test using items that require more than one skill to determine a correct answer. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.

Reckase, M. D., Carlson, J., Ackerman, T. A., & Spray, J. A. (1986). The interpretation of unidimensional IRT parameters when estimated from multidimensional data. Paper presented at the annual meeting of the Psychometric Society, Toronto, Canada.

Reckase, M., Ackerman, T., & Carlson, J. (1988). Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25(3), 193-203.

Reckase, M. D. & McKinley, R. L. (1991). The discriminating power of test items that measure more than one dimension. Applied Psychological Measurement, 15, 361-373.

Reckase, M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York: Springer-Verlag.

Reckase, M. D., Martineau, J. A., & Kim, J.-P. (2000, June). A vector approach to determining the dimensionality of a data set. Paper presented at the annual meeting of the Psychometric Society, Seattle, WA.

Reckase, M. D. (2006). Multidimensional item response theory. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26). Amsterdam: North-Holland/Elsevier.
Roussos, L. A., Stout, W. F., & Marden, J. I. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35, 1-30.

Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331-354.

Stocking, M. L. (1990). Specifying optimum examinees for item parameter estimation in item response theory. Psychometrika, 55, 461-475.

Stocking, M. L. & Swanson, L. (1998). Severely constrained adaptive testing with extensions to item pool design. Applied Psychological Measurement, 22, 271-279.

Swanson, L. & Stocking, M. L. (1993). A model and heuristic for solving very large item selection problems. Applied Psychological Measurement, 17(2), 151-166.

Theunissen, T. J. J. M. (1985). Binary programming and test design. Psychometrika, 50, 411-420.

Thissen, D. & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika, 47, 397-412.

Timminga, E. (1998). Solving infeasibility problems in computerized test assembly. Applied Psychological Measurement, 22(3), 280-291.

Vale, C. D. (1986). Linking item parameters onto a common scale. Applied Psychological Measurement, 10, 333-344.

van der Linden, W. J. & Boekkooi-Timminga, E. (1989). A maximin model for test design with practical constraints. Psychometrika, 54, 237-247.

van der Linden, W. J. (1994). Optimum design in item response theory. In G. H. Fischer & D. Laming (Eds.), Contributions to mathematical psychology, psychometrics, and methodology (pp. 305-318). New York: Springer-Verlag.

van der Linden, W. J. (1996). Assembling tests for the measurement of multiple traits. Applied Psychological Measurement, 20, 373-388.

van der Linden, W. J. (1998). Optimal assembly of psychological and educational tests. Applied Psychological Measurement, 22, 195-211.

van der Linden, W. J. (2005). Linear models for optimal test design. New York: Springer-Verlag.

Veldkamp, B. P. (1998). Multidimensional test assembly based on Lagrangian relaxation techniques (Report No. RR-98-08). Enschede, Netherlands: Twente University, Faculty of Educational Science and Technology.

Veldkamp, B. P. (2002a). Constrained multidimensional test assembly. Applied Psychological Measurement, 26(2), 133-145.

Veldkamp, B. P. (2002b). Optimal test construction (Report No. RR-02-08). Enschede, Netherlands: Twente University, Faculty of Educational Science and Technology.

Venkataraman, P. (2002). Applied optimization with MATLAB programming. New York: Wiley.

Votaw, D. F. (1952). Methods of solving some personnel classification problems. Psychometrika, 17, 255-266.

Wald, A. (1943). On the efficient design of statistical investigations. Annals of Mathematical Statistics, 14, 134-140.

Walters, L. G. (1967). Reduction of integer polynomial problems to zero-one linear programming. Operations Research, 15, 1171-1174.

Whittaker, E. T. & Watson, G. N. (1990). Forms of the remainder in Taylor's series. In A course in modern analysis (4th ed.). Cambridge, England: Cambridge University Press.

Wikipedia: The Free Encyclopedia. (n.d.). Proof of concept. Retrieved December 2006 from http://en.wikipedia.org/wiki/Proof_of_concept

Wightman, L. F. (1998). Practical issues in computerized test assembly. Applied Psychological Measurement, 22, 292-302.

Yen, W. M. (1983). Use of the three-parameter model in the development of standardized achievement tests. In R. K. Hambleton (Ed.), Applications of item response theory. Vancouver: Educational Research Institute of British Columbia.
Zangwill, W. I. (1965). Media selection by decision programming. Journal of Advertising Research, 5, 23-27.

Zhang, J. (2005). Estimating multidimensional item response models with mixed structure (ETS Research Report RR-05-04). Princeton, NJ: ETS.

Zimowski, M., Muraki, E., Mislevy, R., & Bock, D. (2003). BILOG-MG (Version 3) [Computer software and manual]. Lincolnwood, IL: Scientific Software International.