fit?" ‘7":‘5 , :Jx: ‘ 2 J -3: ”‘3 2M2;- - 3;- Jh-Jh. 1g ‘9'“ ' 2. ..u. lfi'ém-kyj Wyn. v 2'.’ ‘ :3 '1' u m . 4“ A". UV ustwfi w ’51 4 - V 5 1 ~ mfg-kin ‘5‘ s ‘2: .7 F, 1 ‘ :11 5:541:35“! 3.x! ~: - L fl 6"? 11‘ gm irrit- ‘ .~I:‘.-;-I.'.“. 3. LIBRARY Michigan State University This is to certify that the dissertation entitled THE IMPACT OF 'SCALE DILATION ON THE QUALITY OF THE LINKING OF MULTIDIMENSIONAL ITEM RESPONSE THEORY CALIBRATIONS presented by Kyung-Seok Min has been accepted towards fulfillment of the requirements for Ph. D. degree in Counseling, Educational Psychology & Special Education Major professor Date May 23, 2003 M5 U is an Affirmau'w Action/Equal Opportunity Institution 0-12771 PLACE IN RETURN Box to remove this checkout from your record. To AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE I DATE DUE u 3 ,. 0. [£03 3 1 2005 th "014 220% 5&2 91 Q 390? ,, A 'flBH-irEmr- 000 20 2005 MAY 1 6 2007 _j40208 092109 A000 .2; 2m 6/01 cJCIRC/DateDuepGS—sz THE IMPACT OF SCALE DILATION ON THE QUALITY OF THE LINKING OF MULTIDIMENSIONAL ITEM RESPONSE THEORY CALIBRATIONS BY KYUNG-SEOK MIN A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Counseling, Educational Psychology, and Special Education 2003 ABSTRACT THE IMPACT OF SCALE DILATION ON THE QUALITY OF THE LINKING OF MULTIDIMENSIONAL ITEM RESPONSE THEORY CALIBRATIONS BY KYUNG-SEOK MIN This study compares and evaluates multidimensional item response theory (MIRT) linking methods in terms of the accuracy and stability of metric transformations across several testing conditions. Most psychological and educational tests are sensitive to multiple traits. This suggests that the application of MIRT is needed. One factor that limits the application of MIRT in practice is difficulty in establishing equivalent scores on multiple ability dimensions. While several MIRT linking methods have been developed to solve the problem, each of them has unique properties in terms of statistical characteristics and optimization criteria, and it is not yet known whether different MIRT linking methods lead to the same/similar metric transformations. Both simulation and real data are analyzed to compare different MIRT linking methods. In addition, a new way of MIRT linking is suggested based on orthogonal Procrustes solutions using a diagonal dilation matrix. ACKNOWLEDGEMENTS My special appreciation goes to Dr. Mark Reckase, my dissertation director. His insight, encouragement, and friendship have been invaluable for me to finish this manuscript. I wish to send my true thanks to Dr. Ken Frank, my academic advisor, who has been encouraging and monitoring my academic works from the beginning to the end of my doctoral study. I also appreciate Drs, Richard Houang, James Stapleton, and Edward Wolfe, who spent their precious times to provide suggestions and comments on my work. I wish to send my recognition to Drs. Jong—Sung Lee and Sang-Jin Kang at Yonsei University, who supported me to study abroad and kept guiding me in various issues for a long time. I also wish to thank my parents and Miran, who are lifelong supporters and friends whatever happens around me. Having you as a family is more valuable than what any word can express in the world. TABLE OF CONTENTS LIST OF TABLES vi LIST OF FIGURES vii CHAPTER 1 INTRODUCTION 1 1.1 Invariance and Indeterminacy in Item Response Theory ...... 
1.2 Equating and Dimensionality
1.3 Purpose of the Study

CHAPTER 2 MULTIDIMENSIONAL IRT MODEL AND LINKING
2.1 Multidimensional IRT Models
2.2 Multidimensional IRT Linking Methods
2.2.1 Oshima, Davey and Lee's Method
2.2.2 Li and Lissitz' Method
2.3 Extension of the LL Method with a Diagonal Dilation Matrix
2.3.1 Example: LL Method and M Method
2.4 Other MIRT Linking Methods
2.4.1 Hirsch's Method
2.4.2 Thompson, Nering and Davey's Method
2.5 Evaluation Criteria

CHAPTER 3 METHODS
3.1 Simulation Study
3.1.1 Equating Design and Specification of MIRT Model
3.1.2 Generation of Item Parameters and Response Patterns
3.1.3 Simulation Factors
3.1.4 Evaluation Criteria and Data Analysis
3.2 Real Data Analysis
3.2.1 Evaluation Criteria and Data Analysis

CHAPTER 4 RESULTS
4.1 Simulation Study
4.1.1 Results of Repeated Measures Analysis of Variance
4.1.2 Comparison of the Three Linking Methods
4.2 Real Data Analysis
4.2.1 Item Estimates Comparison
4.2.2 True Score Comparison

CHAPTER 5 SUMMARY, DISCUSSION, AND CONCLUSION
5.1 Simulation Study
5.2 Real Data Analysis
5.3 Discussion
5.3.1 Rotation and Optimization Criteria
5.3.2 Evaluation Criteria
5.3.3 Relative Efficiency of Linking Methods
5.3.4 Test Response Surface and Ability Levels
5.4 Conclusion

APPENDIX A
APPENDIX B
REFERENCES

LIST OF TABLES

Table 2.1. Two Sets of Item Estimates and Rotated Estimates
Table 2.2. Comparison of Transformed Results with a Dilation Constant and with a Diagonal Dilation Matrix
Table 3.1. Five MIRT Discrimination and Difficulty Levels
Table 3.2. Item Parameters for Twenty Common Items
Table 3.3. Ability Distributions for Five Examinee Groups
Table 3.4. Composition of Two Test Forms
Table 3.5. Item Difficulties of Two Test Forms
Table 3.6. Item Parameter Estimates of Common Items
Table 4.1. Test Statistics (F), Degrees of Freedom (DF), and Effect Sizes (η²) of Biases from Repeated Measures ANOVA
Table 4.2. Test Statistics (F), Degrees of Freedom (DF), and Effect Sizes (η²) of RMSEs from Repeated Measures ANOVA

LIST OF FIGURES

Figure 2.1. Item Response Surface with a1 = 1, a2 = 1, c = .2, and d = 1
Figure 2.2. UIRT and MIRT Linking Components
Figure 3.1. Two Dimensional Structures in Simulation Data
Figure 3.2. Item Vectors, Approximate Simple Structure
Figure 3.3. Item Vectors, Mixed Structure
Figure 4.1. Bias (a1, n = 1000)
Figure 4.2. Bias (a2, n = 1000)
Figure 4.3. Bias (d, n = 1000)
Figure 4.4. RMSE (a1, n = 1000)
Figure 4.5. RMSE (a2, n = 1000)
Figure 4.6. RMSE (d, n = 1000)
Figure 4.7. Bias (a1, n = 2000)
Figure 4.8. Bias (a2, n = 2000)
Figure 4.9. Bias (d, n = 2000)
Figure 4.10. RMSE (a1, n = 2000)
Figure 4.11. RMSE (a2, n = 2000)
Figure 4.12. RMSE (d, n = 2000)
Figure 4.13. Mean Differences of Five Sets of Samples
Figure 4.14. Difference Variations of Five Sets of Samples
Figure 4.15. Differences of Transformed Test Scores and Estimated True Scores on the Base Test Form

CHAPTER 1
INTRODUCTION

The main role of standardized tests is to provide fair, reliable, and objective information regarding the examinees' ability or skill that the tests are developed to measure. Since test scores are often used to make individual, institutional, and national decisions, in conjunction with other available information, equitability and comparability of test scores have been important issues in the testing arena (Cook & Eignor, 1991).
Various situations exist that make test users concerned about different measures of allegedly the same thing (Kolen, 2001). For example, because of the security of test items, most testing programs develop multiple test forms of the same test (e.g., ACT and SAT). By the very nature of the test design, the recently popular computer adaptive tests (CAT) use different sets of items for different examinees (Dorans, 2000). Moreover, a comparable test scale provides meaningful interpretations over test questions matched to different grades (e.g., grade equivalents), types of tests (e.g., paper-and-pencil and CAT versions of the Armed Services Vocational Aptitude Battery [ASVAB], Maier, 1993), or somewhat different constructs (e.g., state and national achievement tests, Feuer, Holland, Green, Bertenthal, & Hemphill, 1999).

In this chapter, as an introduction, some background information on IRT models and the need for metric transformations under certain test administration conditions is provided. In addition, the purpose of the study and the research questions are described.

1.1 Invariance and Indeterminacy in Item Response Theory

One of the important and useful features of item response theory (IRT) compared with classical test theory is the invariance of parameters. That is, item characteristics denoted as item parameters do not depend on the distribution of examinees' abilities (Lord, 1980). Examinee ability is also invariant across different item sets. However, it is well known that there can be an infinite number of legitimate translations of IRT parameters. This is true because of scale indeterminacy similar to that of traditional factor analysis, the so-called identification problem (Baker, 1992; Lord, 1980). Therefore, the invariance property of IRT parameters holds only after a certain common metric is set across examinee samples or sets of items. In practice, IRT calibration programs (e.g., BILOG3 [Mislevy & Bock, 1990] and LOGIST [Wingersky, Barton, & Lord, 1982]) solve scale indeterminacy by setting features of the ability or difficulty distributions to specific values. Specifying these values makes the model identified and allows unique estimation of the model parameters.

1.2 Equating and Dimensionality

Equating is defined as those processes establishing equivalent scores on different instruments or subject groups (Crocker & Algina, 1986; Embretson & Reise, 2000; Hambleton & Swaminathan, 1985; Lord, 1980). Equating, as a way to build a common metric, has been used in various testing situations, such as administration of multiple test forms, detection of differential item functioning (DIF), development of an item bank, computerized adaptive testing (CAT), and more.

Traditionally, IRT models have been developed with the assumption of unidimensionality: the item-person interaction is modeled with a single latent trait. However, the mechanisms and cognitive processes that an examinee uses to respond to test items do not seem so simple (Reckase, 1985, 1995), and many psychological and educational researchers agree that multidimensional abilities/traits come into play in test performances (Ackerman, 1991; Traub, 1983). For example, a mathematical item with a verbal description may require reading ability to transform the text into a mathematical formula as well as mathematical knowledge to find a solution from the formula. Moreover, achievement tests are likely to be sensitive to multiple dimensions because several content areas are included in the particular subject matter.
Research has shown that when test responses known to be multidimensional are modeled under the unidimensional assumption, they violate the local independence assumption and result in increased measurement errors and incorrect inferences (Ackerman, 1994; Baker, 1992; Reckase, 1985).

Due to their popularity and simplicity, most IRT equating methods have been based on unidimensional item response theory (UIRT) models. These methods make adjustments for different scales (i.e., the origin and unit of the scale) (Cook & Eignor, 1991; Kolen & Brennan, 1995; Lord, 1980). But when the goal is to establish comparable scores on tests that seem to be affected by more than one dimension, the directions of the dimensions also need to be adjusted to obtain equitable meaning of the reference/coordinate system. That is, multidimensional item response theory (MIRT) models are directionally indeterminant as well as scale indeterminant. Therefore, MIRT equating requires a composite transformation of rotation and scaling to derive comparable scores (refer to Green, 1976; Lissitz, Schonemann, & Lingoes, 1976; Schonemann, 1966; Schonemann & Carroll, 1970).

1.3 Purpose of the Study

Even though most psychological and educational tests are sensitive to multiple traits/skills, implying the need for MIRT, the application of MIRT is limited in practice by difficulties in establishing equivalent scores on multiple ability dimensions. While several MIRT linking methods have already been developed to solve the problem of comparability (Davey, Oshima, & Lee, 1996; Hirsch, 1988, 1989; Li, 1997; Li & Lissitz, 2000; Oshima & Davey, 1994; Oshima, Davey, & Lee, 2000; Thompson, Nering, & Davey, 1997), each of them has unique properties in terms of statistical characteristics and optimization criteria (i.e., what is minimized or maximized). Moreover, it is not yet known whether different MIRT linking methods lead to the same/similar metric transformations. If it is found that there are significant differences across MIRT linking methods, it might be possible that one method is more appropriate than others in certain testing conditions. Then, careful consideration should be taken to apply any specific linking technique according to the properties of each method and the specific goal of test equating.

The purposes of this study are to compare and evaluate two leading linking procedures[1] for MIRT equating (i.e., Li & Lissitz, 2000; Oshima et al., 2000) in terms of the accuracy and stability of metric transformations across various testing conditions (e.g., sample sizes, structures of dimensionality, and distributional shapes of true ability) and to develop and verify a new linking method that provides a more desirable multidimensional metric transformation, especially in the dilation/contraction of a scale. One of the leading MIRT linking methods, developed by Li and Lissitz (2000), includes only a single dilation constant for multiple dimensions based on traditional factor analysis techniques (i.e., orthogonal Procrustes solutions). However, more desirable transformations might be expected when linking allows a unique dilation/contraction for each dimension. A new MIRT linking method that incorporates a diagonal dilation matrix into orthogonal Procrustes solutions was developed and compared with the two previous linking methods. Both simulation and real data were used to investigate the statistical characteristics and practical implications of the comparisons.

[1] Linking is an essential part of the equating processes that generate comparable test scores. Throughout the remainder of the study, linking is defined as the specific statistical procedures used to determine the relationship among sets of item/ability parameters.
More specifically, in the simulation study, the question is which linking procedure is better for placing item parameter estimates on the same scale. In the real data analysis, the concern is which linking procedure is better at transforming item estimates on one test form into item estimates on the other test form. Beyond the item-level comparison, it was also explored whether or not there were distinguishable linking error patterns on the ability space by investigating differences in test response surfaces (i.e., estimated true test scores on the ability space) across the different linking procedures.

CHAPTER 2
MULTIDIMENSIONAL IRT MODEL AND LINKING

In this chapter, MIRT models and linking methods are described based on a review of the literature. Two leading MIRT linking methods are examined in detail. A new linking method with a diagonal dilation matrix is proposed and demonstrated with an example.

2.1 Multidimensional IRT Models

When real phenomena turn out to be complicated, statistical models have to allow for complexity in order to explain and reflect realities as they are. In test situations, MIRT has been developed to explain the effects of multiple ability dimensions on test performances. When conducting linking in the IRT framework, it is assumed that the latent space is sufficiently represented in the model such that item responses are independent after controlling for ability levels (i.e., local independence). Even though most IRT linking research has focused on unidimensional models, modeling more than one ability dimension may better satisfy the requirements for model fit, specifically the local independence assumption. Previous studies (Ackerman, 1994; Baker, 1992; Reckase, 1985) indicated that UIRT models might violate the invariance property when a test is sensitive to more than one dimension and examinees' abilities vary on those multiple dimensions. Moreover, Reckase and Hirsch (1991) claimed that the number of dimensions is often underestimated and that overestimating the number of dimensions does little harm.

In general, unidimensional interactions between persons and items are sufficient to model test data under the following conditions: (a) both examinee ability and test item characteristics vary on one dimension, as assumed in the model; (b) examinee ability varies on only one ability dimension even though test items are sensitive to more than one ability, and vice versa; or (c) examinee abilities differ on multiple ability dimensions but all items are sensitive to the same composite of abilities (Li, 1997; Reckase, 1990). Therefore, a UIRT model is easily expressed in the form of an MIRT model by making one set of item parameters or ability parameters constant. Put another way, a UIRT model can be treated as a special case of an MIRT model.

Two types of MIRT models have been developed, i.e., compensatory and noncompensatory models. These differ with respect to the relationships among the ability dimensions that determine the probabilities of a person's item responses. In compensatory models (Lord & Novick, 1968; McDonald, 1967; Reckase, 1985, 1995), the proficiencies are additive in the logit, such that low ability on one trait can be compensated by high ability on other trait(s).
In noncompensatory (partially compensatory) models (Embretson, 1984; Sympson, 1978), the probabilities of getting each component of an item right are multiplied, so low ability on one trait is only partially compensated by high ability on other trait(s). In fact, the lowest probability for an item component sets the upper limit on the probability of a correct response for a noncompensatory model. Since most research on MIRT equating has been done using compensatory models (partly because of estimation difficulties for noncompensatory models), and the fit of the two types of MIRT models appears indistinguishable from a practical point of view (Spray, Davey, Reckase, Ackerman, & Carlson, 1990), the compensatory model is considered in this study.

The compensatory multidimensional extension of the three-parameter logistic model with m dimensions[2] is (McKinley & Reckase, 1983; Reckase, 1985, 1995)

P(u_{ij} = 1 \mid \mathbf{a}_i, c_i, d_i, \boldsymbol{\theta}_j) = c_i + (1 - c_i) \frac{\exp(\mathbf{a}_i' \boldsymbol{\theta}_j + d_i)}{1 + \exp(\mathbf{a}_i' \boldsymbol{\theta}_j + d_i)},   (1)

where P(u_{ij} = 1 | a_i, c_i, d_i, θ_j) is the probability of a correct response for examinee j on test item i in an m-dimensional space, u_{ij} is the item response for person j on item i (1 correct; 0 wrong), a_i is a vector of discrimination parameters for item i, c_i is the lower asymptote (or guessing parameter), the probability of a correct answer when an examinee's ability is very low, d_i is a parameter related to the difficulty of item i, and θ_j is a vector of the jth examinee's abilities.

[2] The corresponding noncompensatory MIRT model is

P(u_{ij} = 1 \mid \mathbf{a}_i, \mathbf{b}_i, c_i, \boldsymbol{\theta}_j) = c_i + (1 - c_i) \prod_{k=1}^{m} \frac{\exp(a_{ik}\theta_{jk} - b_{ik})}{1 + \exp(a_{ik}\theta_{jk} - b_{ik})},

where m is the number of dimensions, and a_{ik} and b_{ik} are the discrimination and difficulty parameters, respectively, for item i and dimension k.

This model implies that the probability of a correct item response is a monotonically increasing function bounded between 0 and 1. A two-dimensional three-parameter item response surface (analogous to the item characteristic curve in UIRT) with a1 = 1, a2 = 1, c = .2, and d = 1 is provided in Figure 2.1. The height represents the probability of a correct response corresponding to a pair of abilities (θ1 and θ2). The probability spans .2 to 1 because of the lower asymptote (.2). This item measures the two ability dimensions equally because it has the same item discrimination on both dimensions.

[Figure 2.1. Item Response Surface with a1 = 1, a2 = 1, c = .2, and d = 1]
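To make Equation (1) concrete, a minimal sketch (arbitrary, illustrative values only; not items from this study) evaluates the compensatory model for the Figure 2.1 item and shows the compensation property:

    import numpy as np

    def mirt_prob(a, theta, c, d):
        # Compensatory MIRT probability, Equation (1):
        # P = c + (1 - c) * logistic(a'theta + d)
        z = np.dot(a, theta) + d
        return c + (1.0 - c) / (1.0 + np.exp(-z))

    a = np.array([1.0, 1.0])   # discrimination vector of the Figure 2.1 item
    c, d = 0.2, 1.0

    # Low ability on dimension 1 is fully offset by high ability on dimension 2:
    print(mirt_prob(a, np.array([-2.0, 2.0]), c, d))   # 0.785
    print(mirt_prob(a, np.array([0.0, 0.0]), c, d))    # 0.785, identical
    # Very low ability on both dimensions approaches the lower asymptote c = .2:
    print(mirt_prob(a, np.array([-4.0, -4.0]), c, d))  # about 0.201

Under the noncompensatory model in footnote [2], the same (-2, 2) ability pair would yield a much lower probability, because the weakest component bounds the product of the component probabilities.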
Unlike in UIRT models, the multidimensional item-discrimination and person-ability parameters are vectors rather than scalars, and the difficulty-related parameter is a composite of item difficulty and discrimination on each dimension. The interpretation of the MIRT discrimination parameters is analogous to that of the UIRT parameter, but each element of the vector is related to a direction in the dimensional space. The meaning of the MIRT difficulty parameter is not directly equivalent to that of the unidimensional difficulty parameter because of the different parameterization. Two statistics (i.e., MDISC and MDIFF) were developed to capture multidimensional item characteristics corresponding to UIRT item discrimination and difficulty.

The discrimination power of a multidimensional item in the dimensional space can be defined as a function of the item discrimination parameters (Reckase, 1985, 1995; Reckase & McKinley, 1991; for graphic representations see also Ackerman, 1994, 1996) as

MDISC_i = \left( \sum_{k=1}^{m} a_{ik}^2 \right)^{1/2},   (2)

where MDISC_i denotes the ith item's multidimensional discrimination as a function of the slope at the steepest point, and a_{ik} is the ith item's discrimination on the kth dimension. The multidimensional item difficulty equivalent to unidimensional difficulty is

MDIFF_i = \frac{-d_i}{MDISC_i},   (3)

where MDIFF_i is the distance between the origin and the point of the steepest slope of the item response surface. In addition, the direction of the greatest discrimination in the dimensional space is given by

\alpha_{ik} = \arccos \frac{a_{ik}}{MDISC_i} \quad \left( \text{or } \cos \alpha_{ik} = \frac{a_{ik}}{MDISC_i} \right),   (4)

where α_{ik} is the angle from the kth dimension.
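A quick numeric check of Equations (2) to (4) may help, using a single hypothetical two-dimensional item (the values are illustrative, not items from this study):

    import numpy as np

    def mdisc(a):
        # Equation (2): multidimensional discrimination
        return np.sqrt(np.sum(np.asarray(a, dtype=float) ** 2))

    def mdiff(a, d):
        # Equation (3): multidimensional difficulty
        return -d / mdisc(a)

    def angles_deg(a):
        # Equation (4): direction of steepest slope, measured from each axis
        a = np.asarray(a, dtype=float)
        return np.degrees(np.arccos(a / mdisc(a)))

    a, d = [1.2, 0.5], -0.8     # hypothetical item parameters
    print(mdisc(a))             # 1.30
    print(mdiff(a, d))          # about 0.62: MDIFF > 0, a relatively hard item
    print(angles_deg(a))        # about [22.6, 67.4]: mostly measures dimension 1

In two dimensions the two angles sum to 90 degrees, so a single angle from the first dimension fixes an item's direction; the item vectors in Chapter 3 are generated this way.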
As is shown in Equation (1), the probability of a correct answer is a linear function of the item (a_i, d_i) and ability (θ_j) parameters in the exponent. Therefore, any linear transformation of the ability scale results in the same value of the exponent for a given response pattern if the item and ability parameters are transformed in a consistent way. The probability that an examinee gets an item right is identical when the IRT ability scale and item parameters are transformed properly. This property is referred to as invariance (Kolen & Brennan, 1995; Lord, 1980). While scale indeterminacy (the unspecified location of the origin and unit of the scale) is a concern when finding a proper transformation in UIRT equating, a rotation to determine a comparable reference system, as well as the scale alteration, has to be considered in MIRT due to multidimensionality (see Figure 2.2).

[Figure 2.2. UIRT and MIRT Linking Components. Panel (a) shows UIRT linking (translation and dilation between Scale E and Scale B); panel (b) shows MIRT linking in two dimensions (rotation, translation, and dilation). O is the location of the origin, U is the length of the unit; the subscript E denotes the metric to be transformed and B the target metric. The MIRT panel is a modification of Figure II-5 in Li's study (1997, p. 37).]

The rotation is required to line up one coordinate system with the other. Scaling procedures are about matching the origin (O, translation) and unit length (U, dilation) between the two scales.

2.2 Multidimensional IRT Linking Methods

Even though modeling more than one dimension often improves model fit, the use of MIRT models is limited in testing practice (Gosz & Walker, 2002; Reckase, 1997). One of the reasons might be the difficulty of finding comparable multidimensional scales across different test forms or examinee groups (Oshima et al., 2000). So far, several multidimensional linking methods have been proposed (Hirsch, 1989; Li & Lissitz, 2000; Oshima et al., 2000; Thompson et al., 1997). These methods use a two-dimensional compensatory model and consist of some or all of three linking components: a rotation matrix deals with directional indeterminacy to establish a common reference/coordinate system, and a translation vector and a dilation constant remove scale indeterminacy by finding a comparable origin and unit.

Even though Hirsch's study (1988, 1989) can be valued as the first attempt to deal with multidimensionality in IRT linking, his linking method is similar to later methods, especially Li and Lissitz' method. The method of Thompson et al. (1997) has strong potential, but it is still experimental. The focus of this study is on the two more recent linking methods (Li & Lissitz, 2000; Oshima et al., 2000).

2.2.1 Oshima, Davey, and Lee's Method

Oshima and her colleagues' linking method (2000), hereafter called the ODL method, is based on the anchor item design: a set of common items is included in multiple test forms to define a common scale. Transformations of the parameters of the compensatory multidimensional model with the exponent a_i'θ_j + d_i are conducted through the following set of linking equations:

\mathbf{a}_i^* = (A^{-1})' \mathbf{a}_i,   (5)

d_i^* = d_i - \mathbf{a}_i' A^{-1} \boldsymbol{\beta},   (6)

\boldsymbol{\theta}_j^* = A \boldsymbol{\theta}_j + \boldsymbol{\beta},   (7)

where A (m×m, m is the number of dimensions) is a rotation matrix, β (m×1) is a translation vector, and the asterisk (*) indicates transformed parameters. Here, the rotation matrix A has two functions: (a) to rotate to a proper dimensional orientation, and (b) to adjust the variances of the ability dimensions. The translation vector β is used to shift to a compatible origin by altering the origin of the scale. The equality of the transformed exponent and the original exponent can be illustrated by

\mathbf{a}_i^{*\prime} \boldsymbol{\theta}_j^* + d_i^* = (\mathbf{a}_i' A^{-1})(A \boldsymbol{\theta}_j + \boldsymbol{\beta}) + (d_i - \mathbf{a}_i' A^{-1} \boldsymbol{\beta}) = \mathbf{a}_i' \boldsymbol{\theta}_j + d_i.   (8)

As a result, the transformed components of the exponent are mathematically alternate ways to express the initial relationship without changing the probabilities of correct responses.

Oshima and her colleagues compared four MIRT linking procedures according to different evaluative criteria and suggested that the test characteristic function (TCF) and item characteristic function (ICF) methods are more stable than the other procedures (i.e., the direct method and the equated function method). The two favorable methods are distinguishable in terms of what is minimized. The TCF method is a multidimensional extension of Stocking and Lord's method (1983) that minimizes the squared differences between the two test response surfaces (i.e., the sums of the item response surfaces) of the common items, while the ICF method minimizes the sum of the squared differences of the item response surfaces. Finally, they concluded that the TCF method was best at estimating the rotation matrix over the other sub-methods and was also relatively good at estimating the translation vector. The minimization function for the ODL TCF method is

T(\boldsymbol{\theta}) = \sum_{i=1}^{n} P_i(\boldsymbol{\theta}), \quad \text{and} \quad F = \sum_{j} w_j \left[ T_B(\boldsymbol{\theta}_j) - T_E^*(\boldsymbol{\theta}_j) \right]^2,   (9)

where T_B and T_E^* indicate the expected number-correct scores on the common items for the examinee on the base test and the transformed equated test, respectively, n is the number of items, and w_j is a weighting value which allows some regions of the ability space of θ to be more important than others. If all weights are equal for all regions, the result is unweighted estimation.

The ODL method is unique in that it estimates the rotation matrix and the translation vector simultaneously, but the orthogonality of the rotation is not constrained. This means that the relative distances among items (e.g., the directional item vectors in the dimensional space; see Figures 3.2 and 3.3 in the next chapter) may not be the same before and after the rotation is conducted.
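The invariance identity in Equation (8) is easy to verify numerically. The following sketch uses arbitrary illustrative values; note that A is non-singular but deliberately not orthogonal, since the ODL method does not constrain it to be:

    import numpy as np

    A = np.array([[1.1, 0.3],     # nonorthogonal "rotation" matrix: it adjusts
                  [0.2, 0.9]])    # orientation and dimension variances at once
    beta = np.array([0.5, -0.2])  # translation vector

    a = np.array([1.2, 0.5])      # hypothetical item discrimination vector
    d = -0.8                      # difficulty-related parameter
    theta = np.array([0.4, -1.3]) # one examinee's ability vector

    A_inv = np.linalg.inv(A)
    a_star = A_inv.T @ a                  # Equation (5)
    d_star = d - a @ A_inv @ beta         # Equation (6)
    theta_star = A @ theta + beta         # Equation (7)

    # Equation (8): the exponent, hence P(u = 1), is unchanged.
    print(a @ theta + d)
    print(a_star @ theta_star + d_star)   # identical

In an actual application, A and beta are of course not chosen freely but estimated by minimizing Equation (9) over a set of ability points.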
2.2.2 Li and Lissitz' Method

Li and Lissitz (2000; see also Li, 1997), whose method is hereafter called the LL method, developed four different linking procedures based on the anchor item design and claimed that the best procedure was a composite transformation with three linking components: a rotation matrix from the orthogonal Procrustes solutions, a translation vector obtained by a least-squares method of minimizing differences between initial difficulty parameters and transformed parameters, and a central dilation constant obtained by the trace method of minimizing the sum of squared errors. The LL method uses the following set of linking equations to transform the model parameters in the exponent a_i'θ_j + d_i:

\mathbf{a}_i^{*\prime} = k \mathbf{a}_i' T,   (10)

d_i^* = d_i - \mathbf{a}_i' T \mathbf{m},   (11)

\boldsymbol{\theta}_j^* = (1/k)(T^{-1} \boldsymbol{\theta}_j + \mathbf{m}),   (12)

where T (m×m) is an orthogonal rotation matrix (change in orientation), m (m×1) is a translation vector (change in location), and k (1×1) is a central dilation constant (change in unit). Then the equality of the exponent terms before and after transformation is established by

\mathbf{a}_i^{*\prime} \boldsymbol{\theta}_j^* + d_i^* = (k \mathbf{a}_i' T)(1/k)(T^{-1} \boldsymbol{\theta}_j + \mathbf{m}) + (d_i - \mathbf{a}_i' T \mathbf{m}) = \mathbf{a}_i' \boldsymbol{\theta}_j + d_i.   (13)

Note that Equations (10) to (12) are mathematically the same as Equations (5) to (7) except for the pre-multiplication or post-multiplication of the rotation matrix, and the dilation constant.

Li and Lissitz provided a fair amount of information on multidimensional linking procedures by explaining the three linking components, i.e., rotation, translation, and central dilation (refer to Schonemann, 1966; Schonemann & Carroll, 1970). Their approach is straightforward in that the three components provide useful information on specific stages of the multidimensional metric transformation. While the ODL method deals with dimensional direction and unit change at once in the nonorthogonal rotation matrix, Li and Lissitz split these two components into the orthogonal rotation matrix and the central dilation constant. Here, the term "central" means that unit changes are assumed to be similar/constant across dimensions, such that one scalar (k) can account for all unit changes. They justified the central dilation constant by its mathematical tractability and relatively reasonable accuracy. The best linking procedure of the LL method is obtained by minimizing the following functions for the rotation matrix T, the dilation constant k, and the translation vector m:

E_1 = k A_E T - A_B,   (14)

tr(E_1' E_1) = tr\left[ (k A_E T - A_B)'(k A_E T - A_B) \right],   (15)

Q = \sum_{i=1}^{n} (d_{Ei}^* - d_{Bi})^2.   (16)

In Equations (14) to (16), tr is the matrix operator of the sum of the diagonal elements (trace), A and d are the item discriminations and difficulties, the subscripts B and E denote the base test and equated test, and the asterisk (*) indicates values transformed onto the dimensions of the base test. Note that there are two equations to minimize for the LL method (Equation [15] for rotation and dilation, and Equation [16] for translation), while there is one minimization equation for all linking components in the ODL method (Equation [9]).

2.3 Extension of the LL Method with a Diagonal Dilation Matrix

From the review of the ODL and LL methods, there is a clear difference between the two methods in terms of the structure of their linking components: the rotation matrix of the ODL method is supposed to adjust for the orientation of the reference system and the length of the unit simultaneously, while there are two linking components in the LL method, a rotation matrix and a central dilation scalar.
Further, the LL method uses an orthogonal rotation matrix that maintains the relative distances among item vectors before and after rotation, but the ODL method uses an oblique rotation that optimizes the minimization criterion.

The two MIRT linking methods take different positions on whether changes in the unit lengths are constant across dimensions. Li and Lissitz (2000) clearly indicated that they assumed a constant change of unit length across the multiple dimensions, such that one dilation constant was enough to cover the overall unit length adjustment. They provided two reasons for a dilation constant: mathematical tractability and reasonable accuracy. In the ODL method, this issue was not clearly stated because there is no specific component for dilation. Even though the simulation examples in the paper (Oshima et al., 2000, Table 2 on p. 364) showed that their main concern was a constant unit change across dimensions, there are no statistical constraints on a constant unit change in the ODL method, so it can allow a unique dilation for each dimension.

Returning to the previous example of a mathematics test measuring mathematical knowledge and reading skills: if examinee group A shows less variation in mathematical knowledge than examinee group B, should group A show less variation in reading skills as well? The answer is maybe yes, or maybe no. A more extensive answer could be found from theories on the relationship between mathematical knowledge and reading ability in general, and from specific characteristics of a given test, such as how extensive the reading skills needed to determine the mathematical solution are. One reasonable argument for a constant overall dilation of multiple dimensions may be that the dimensions measured by a test are strongly related: the change in one dimension goes along with the other dimension(s) at the same dilation/contraction rate. However, this may not be typical for the various constructs measured by educational and psychological tests. In addition, from a methodological perspective, the dilation constant can be treated as a special case of multiple dilation constants.

In order to model different unit changes along with an orthogonal rotation in MIRT linking, the dilation constant adopted in the LL method is replaced with a diagonal dilation matrix; the result is hereafter called the M method. The transformation equations of the M method are

\mathbf{a}_i^{*\prime} = \mathbf{a}_i' T K,   (17)

d_i^* = d_i - \mathbf{a}_i' T \mathbf{m},   (18)

\boldsymbol{\theta}_j^* = K^{-1}(T^{-1} \boldsymbol{\theta}_j + \mathbf{m}),   (19)

where K is a diagonal dilation matrix and the other terms are defined as before. For the two-dimensional case, K is defined as

K = \begin{bmatrix} k_1 & 0 \\ 0 & k_2 \end{bmatrix}.

Here, k_1 in the matrix K indicates the dilation component for the first dimension, and k_2 is that for the second dimension. The off-diagonal elements of K are set to zero because the relationship/direction between the two dimensions is defined not by K but only by the orthogonal rotation matrix, T. Then, the equality of the exponent terms before and after transformation is established by

\mathbf{a}_i^{*\prime} \boldsymbol{\theta}_j^* + d_i^* = (\mathbf{a}_i' T K)(K^{-1})(T^{-1} \boldsymbol{\theta}_j + \mathbf{m}) + (d_i - \mathbf{a}_i' T \mathbf{m}) = \mathbf{a}_i' \boldsymbol{\theta}_j + d_i.   (20)

Two points should be mentioned. First, Equations (17) to (19) are the same as Equations (10) to (12) except for including a diagonal dilation matrix rather than a dilation constant. Second, when k_1 is equal to k_2 in the dilation matrix, Equation (20) becomes the same as the LL method (Equation [13]).
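The identities in Equations (13) and (20) can be checked numerically in the same way as Equation (8). In the sketch below (arbitrary illustrative values; T is a proper rotation, so T^-1 = T'), setting k1 = k2 collapses the M method to the LL method:

    import numpy as np

    phi = 0.4                                    # rotation angle, illustrative
    T = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])  # orthogonal rotation matrix
    m = np.array([0.3, -0.1])                    # translation vector
    K = np.diag([1.16, 1.19])                    # diagonal dilation matrix (M method)

    a = np.array([1.5, 0.4])                     # hypothetical item parameters
    d = 0.6
    theta = np.array([-0.7, 1.1])                # one examinee's ability vector

    a_star = (a @ T) @ K                               # Equation (17)
    d_star = d - a @ T @ m                             # Equation (18)
    theta_star = np.linalg.inv(K) @ (T.T @ theta + m)  # Equation (19)

    # Equation (20): the exponent is reproduced exactly.
    print(a @ theta + d)
    print(a_star @ theta_star + d_star)   # identical

Replacing K with k * np.eye(2) for a scalar k gives the LL transformation of Equations (10) to (12).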
The proposed linking method in Equations (17) to (19) differs from the ODL method by splitting the rotation matrix and the dilation matrix and by using an orthogonal rotation. It also differs from the LL method by allowing a unique unit change for each dimension rather than a constant change for all dimensions. The same minimization criteria as in Equations (15) and (16) are used for each linking component of the M method. However, because the dilation component K is a diagonal matrix rather than a scalar, the solution is somewhat different from that of the LL method; further details are provided in Appendix A. In addition, a simple example is given in the next section to emphasize the similarity and difference between the LL method and the new method with a diagonal dilation matrix.

2.3.1 Example: LL Method and M Method

Suppose we have two sets of two-dimensional item parameter estimates for twenty anchor items, as in Table 2.1.

Table 2.1. Two Sets of Item Estimates and Rotated Estimates[3]

         Base Test      Equated Test    Rotated Equated Test
Item     a1     a2      a1     a2       a1     a2
1        1.81   0.86    0.33   1.55     1.44   0.66
2        1.22   0.07    0.59   0.88     1.06   0.05
3        1.57   0.36    0.56   1.29     1.37   0.32
4        0.71   0.53   -0.08   0.66     0.49   0.46
5        0.86   0.19    0.36   0.59     0.69   0.06
6        1.72   0.18    0.84   1.28     1.53   0.08
7        1.86   0.29    0.82   1.52     1.71   0.24
8        1.33   0.34    0.43   1.06     1.11   0.28
9        1.19   1.57   -0.45   1.31     0.79   1.14
10       2.00   0.00    1.04   1.30     1.66  -0.07
11       0.87   0.00    0.45   0.61     0.75   0.00
12       2.00   0.98    0.29   1.93     1.72   0.92
13       1.00   0.89   -0.20   1.21     0.85   0.89
14       1.22   0.14    0.65   0.92     1.12   0.03
15       1.27   0.47    0.33   1.10     1.08   0.39
16       1.35   1.15   -0.24   1.36     0.95   1.00
17       1.06   0.45    0.23   0.99     0.94   0.41
18       1.92   0.00    1.15   1.42     1.83  -0.07
19       0.96   0.22    0.35   0.79     0.84   0.18
20       1.20   0.12    0.57   0.93     1.08   0.09
Sqrt SS  6.32   2.73    2.55   5.29     5.41   2.28

[3] The two sets of item estimates were taken from Li's study (1997, Table II-1 on p. 57).

The estimates from the equated test are to be linked to the estimates from the base test. Based on the orthogonal Procrustes rotation solutions, the rotation matrix is

T = \begin{bmatrix} .59 & -.80 \\ .80 & .59 \end{bmatrix}.

By using this rotation matrix, the item estimates of the equated test are rotated to the orientation of the estimates of the base test (see columns 6 and 7 in Table 2.1). The mathematical measure of the length of a vector is the square root of the sum of the squared elements of the vector (Sqrt SS). The bottom row of Table 2.1 gives the lengths of the vectors, and it shows that the rotated item estimates are shrunken compared with the estimates of the base test. The lengths also show that the first dimension of the rotated matrix is less shrunken (6.32/5.41 = 1.17) than the second dimension (2.73/2.28 = 1.20), although the difference is not large in this example.

There are two ways to deal with the discrepancy in vector lengths. First, by assuming that the change of unit lengths is constant across dimensions and that the differences in dilation/contraction rates are sampling/estimation errors which can be ignored, a single dilation constant can be calibrated to account for all changes of unit lengths (the dilation constant of the LL method). Second, if the difference in dilation rates across dimensions is real, unique dilation components can be included to adjust the unit length for each dimension. In the second case, a diagonal dilation matrix is used for the adjustment of each dimensional unit change.

After modeling both a dilation constant and a diagonal dilation matrix, the final transformed matrices for the discrimination estimates can be obtained (see Table 2.2).
Table 2.2. Comparison of Transformed Results with a Dilation Constant and with a Diagonal Dilation Matrix

         Dilated matrix after rotation       Difference from base matrix
         With a constant  With a matrix      With a constant  With a matrix
Item     a1     a2        a1     a2          a1     a2        a1     a2
1        1.55   0.71      1.67   0.78       -0.26  -0.15     -0.14  -0.08
2        1.13   0.05      1.23   0.06       -0.09  -0.02      0.01  -0.01
3        1.47   0.34      1.60   0.38       -0.10  -0.02      0.03   0.02
4        0.52   0.49      0.57   0.54       -0.19  -0.04     -0.14   0.01
5        0.74   0.07      0.80   0.07       -0.12  -0.12     -0.06  -0.12
6        1.64   0.09      1.78   0.10       -0.08  -0.09      0.06  -0.08
7        1.83   0.26      1.99   0.29       -0.03  -0.03      0.13   0.00
8        1.19   0.31      1.29   0.34       -0.14  -0.03     -0.04   0.00
9        0.85   1.23      0.92   1.36       -0.34  -0.34     -0.27  -0.21
10       1.78  -0.07      1.93  -0.08       -0.22  -0.07     -0.07  -0.08
11       0.81   0.00      0.88   0.00       -0.06   0.00      0.01   0.00
12       1.85   0.98      2.00   1.09       -0.15   0.00      0.00   0.11
13       0.92   0.95      0.99   1.05       -0.08   0.06     -0.01   0.16
14       1.20   0.03      1.30   0.03       -0.02  -0.11      0.08  -0.11
15       1.16   0.42      1.26   0.46       -0.11  -0.05     -0.01  -0.01
16       1.02   1.08      1.10   1.19       -0.33  -0.08     -0.25   0.04
17       1.00   0.44      1.09   0.48       -0.06  -0.01      0.03   0.03
18       1.96  -0.08      2.12  -0.09        0.04  -0.08      0.20  -0.09
19       0.91   0.20      0.98   0.22       -0.05  -0.02      0.02   0.00
20       1.16   0.10      1.26   0.11       -0.04  -0.02      0.06  -0.01
Sqrt SS  5.81   2.45      6.30   2.71
Sum                                         -2.43  -1.22     -0.36  -0.42

For the example data, the estimated value of the dilation constant is k = 1.074, and the diagonal dilation matrix is estimated as

K = \begin{bmatrix} 1.163 & 0 \\ 0 & 1.187 \end{bmatrix}.

Note that the value of the dilation constant is smaller than either of the two diagonal components of the dilation matrix. That is, for this example data, the matrix transformed with the dilation matrix was more dilated in both dimensions than that transformed with the dilation constant. Finally, Table 2.2 shows that the dilated discrimination matrix with the diagonal dilation matrix is more similar to the base matrix than that with the dilation constant, in terms of both the lengths of the vectors (the row headed Sqrt SS) and the overall differences for the 20 items (the row headed Sum).

Because the dilation matrix always has more parameters than the dilation constant, the M method is expected to be better than the LL method at transforming one scale to the other. A way to compare the overall goodness of the two linking methods is to calculate the sum of squared differences (SS) between the base matrix (say, matrix B) and the transformed matrix (say, matrix A). The SS is calculated by tr[(A - B)'(A - B)], and the ratio of SSs can indicate the proportion of linking errors. For the present example data, the error SS with the constant is 4.14, and the SS with the diagonal matrix is .40. The ratio of the two SSs is then .10 (= .40/4.14). This means that modeling the diagonal dilation matrix reduces the linking errors, as measured by SS, by 90% compared with modeling the constant.[4]

[4] The statistically elegant way to compare two different statistical models is to take the ratio of likelihoods. However, it does not apply to the present situation. The reasons are that (a) the distribution of the MIRT discrimination parameter is not yet known, and (b) the estimation procedure of the LL method is different from that of the new method, such that the former is not nested within the latter.

2.4 Other MIRT Linking Methods

Two MIRT linking methods have been reviewed, and a new method with a diagonal dilation matrix and orthogonal Procrustes solutions has been suggested. In addition to these, there are two other MIRT linking methods, proposed by Hirsch (1988, 1989) and Thompson et al. (1997).

2.4.1 Hirsch's Method

Hirsch (1988, 1989) developed an MIRT linking method based on a common-examinee design, and it is composed of three transformation steps. First, common orthogonal basis vectors are found. Second, the orthogonal Procrustes rotation matrix is sought to align the reference systems between the two examinee groups. Third, after conducting the two rotational transformations, scaling indeterminacy is handled by the linear methods of UIRT linking (e.g., the mean and sigma method, or Stocking and Lord's method [1983]).
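The orthogonal Procrustes rotation used in the second step, and by the LL and M methods above, has a closed-form solution (Schonemann, 1966). The sketch below shows the standard construction from a singular value decomposition, applied to hypothetical discrimination matrices:

    import numpy as np

    def procrustes_rotation(A_E, A_B):
        # Orthogonal T minimizing ||A_E T - A_B||_F:
        # T = U V' from the SVD A_E' A_B = U S V' (Schonemann, 1966).
        U, _, Vt = np.linalg.svd(A_E.T @ A_B)
        return U @ Vt

    rng = np.random.default_rng(1)
    A_B = rng.uniform(0, 2, size=(20, 2))          # hypothetical base estimates
    phi = 0.9                                      # true rotation, for checking
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    A_E = A_B @ R + rng.normal(0, 0.02, (20, 2))   # rotated plus a little noise

    T = procrustes_rotation(A_E, A_B)
    print(np.round(T.T @ T, 6))          # identity matrix: T is orthogonal
    print(np.abs(A_E @ T - A_B).max())   # small: the rotation is recovered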
Hirsch's method can be valued as the first attempt to deal with multidimensionality in IRT linking. However, it takes multiple and complicated stages to find a final rotation matrix, mainly because of the different basis vectors between the base group discrimination matrix and the equated group discrimination matrix. Recently popular MIRT calibration programs (e.g., NOHARM [Fraser, undated] and TESTFACT [Wilson, Wood, & Gibbons, 1991]) provide orthogonal ability dimensions as a default option, so there is no need to find common basis vectors for multiple test forms or examinee groups. The remaining two steps of this method are very similar to the later developed MIRT linking methods, especially Li and Lissitz' method.

2.4.2 Thompson, Nering and Davey's Method

Thompson and his colleagues (1997) developed linking procedures for multiple test forms when there are neither common items nor common examinees. They argued that linking information for different test forms could be obtained from the assumption of randomly equivalent examinee groups and the specification of common item content cluster(s). Even though this method could have important implications in practice, because it relaxes the equating conditions by not requiring common items/examinees, its assumptions are still in question. The assumption of equivalent examinee groups is hard to verify without large sample sizes and random sampling. Moreover, identifying similar item clusters is also critical and may require large item sets. As Li and Lissitz (2000) mentioned, this method is still at an experimental stage and needs more technical justification for its procedures.

2.5 Evaluation Criteria

Different linking methods may be expected to produce different results because the statistical or optimization criteria differ from one to another (Bolt, 1999; Harris & Crouse, 1993), but each method should be good at what it is supposed to minimize/maximize. Researchers who have pursued multidimensional equating methods usually provide various evaluation criteria that support their own linking methods. However, when different linking methods are adopted, different sources of linking errors play roles that make the final equating/linking results different. Therefore, the evaluation of linking methods depends on what criteria are employed for comparison and evaluation. In practice, the selection of criteria could be determined according to the purpose and practical conditions of the linking. For example, if test scores are reported on true score scales, it would be better to use a linking method which could minimize the differences between test response surfaces.

The popular evaluation criterion for MIRT linking of simulation data is to compare the transformation parameters (i.e., the rotation matrix and scaling vector/constants) with their estimates. Even though the comparison of estimates of linking components with their parameters is reasonable for evaluating different linking techniques, it is limited to the situation in which all evaluated techniques have the same structures for transforming one set of estimates into another.
For instance, the ODL method consists of two linking components (the rotation matrix and the translation vector), while the LL method is composed of three stages (the rotation matrix, the translation vector, and the central dilation constant). Further, the rotation matrix of the ODL method functions equivalently to both the rotation matrix and the dilation constant of the LL method. Therefore, these two linking methods cannot be evaluated using the comparison of the parameters and estimates of the linking components. In order to compare different linking frameworks, criteria which work for all methods need to be developed.

A way to develop evaluation criteria is to go back to the goal of equating. For example, in the anchor item design, information obtained from common items is used to estimate a transformation which makes the independent calibrations of the common items as similar as possible. In this study, the anchor item design is considered, and the three linking methods described in this chapter are evaluated in terms of how similar the transformed item estimates are to the estimates on the base test or to the item parameters. More details on the evaluation criteria will be provided in the method chapter.

CHAPTER 3
METHODS

In this chapter, the statistical procedures for the simulation and real data analyses are described. Evaluation criteria for the comparisons of the three linking methods are also provided.

3.1 Simulation Study

Simulation data are sometimes recommended rather than real data to evaluate equating methods in order to separate the effects of model misfit and equating errors (Bolt, 1999; Davey, Nering, & Thompson, 1997). Because the true parameters are known in a simulation study, it is easier to compare true parameters with transformed estimates. The purpose of the present simulation study is to evaluate the three linking methods described in the previous chapter by quantifying linking errors, the discrepancies between item parameters and transformed estimates.

3.1.1 Equating Design and Specification of MIRT Model

Two test forms which share a set of common items were considered, the so-called anchor item design. Suppose there are a base test form and another form, the equated test, and each form includes common items and unique items. The equated test scores need to be converted onto the metric of the base test scores. The common item set consisted of twenty items for both tests, and these items were used to find a comparable test scale. Because parameters are known in simulated data, a set of known item parameters was treated as the estimates for the base test, and item parameter estimates from various simulation conditions were used as the equated test estimates. The number of common items was set to twenty for all simulation conditions. Although there is no absolute agreement about the length of the common/anchor test, the most frequently cited rule of thumb is no fewer than 20 items or 20% of the total test items (Angoff, 1968, 1971; also see Budescu, 1985).

In order to obtain item parameter estimates, a compensatory two-dimensional two-parameter logistic model was used, as in Equation (1) with c = 0. The two-dimensional case is the simplest situation of multidimensionality, but the same linking procedures for each method can easily be expanded to higher dimensions. Note that lower asymptote parameters were not considered for the present simulation, mainly because the lower asymptote parameters are on the probability metric, so they do not directly relate to the linking processes.
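A minimal sketch of the response-generation step under this model follows (the full procedure is described in Section 3.1.2; the item and examinee values here are illustrative, not the study's actual parameters):

    import numpy as np

    rng = np.random.default_rng(42)

    def simulate_responses(a, d, theta):
        # Dichotomous responses under the compensatory 2-D 2PL
        # (Equation [1] with c = 0): score 1 when P exceeds a uniform draw.
        z = theta @ a.T + d                  # examinees x items exponents
        p = 1.0 / (1.0 + np.exp(-z))
        return (p > rng.uniform(size=p.shape)).astype(int)

    a = np.array([[1.2, 0.2],                # hypothetical discrimination vectors
                  [0.4, 1.0],
                  [0.8, 0.8]])
    d = np.array([0.0, -0.8, 0.6])           # difficulty-related parameters
    theta = rng.multivariate_normal([0, 0], np.eye(2), size=5)
    print(simulate_responses(a, d, theta))   # 5 examinees x 3 items of 0/1 scores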
3.1.2 Generation of Item Parameters and Response Patterns

Item parameters were specified by the selection of item dimensional structures. Two types of dimensional structures were examined: approximate simple structure (APSS) and mixed structure (MS) (Roussos, Stout, & Marden, 1998; see also Kim, 1994; Kim, 2001). APSS means that each item has relatively higher loadings on one dimension than on the other dimensions. In other words, a set of items (e.g., an item cluster) has high discrimination on the same dimension and low discrimination on the other dimension. However, in reality, test items may measure composites of dimensions as well as relatively pure dimensions. MS refers to a test that measures both relatively pure trait dimensions and their composites.

For the present simulation with twenty common items and two dimensions, APSS was represented by two sets of items, ten items for each dimension. One set of items loaded mainly on the first dimension and the other set on the second dimension. In MS, there were four sets of items, five items each. Two of these sets loaded highly on either of the two dimensions, and the remaining two sets were sensitive to composites of the two dimensions. These two dimensional structures for the twenty common items are illustrated in Figure 3.1.

[Figure 3.1. Two Dimensional Structures in Simulation Data. Panel (a), Approximate Simple Structure (APSS): items 1 to 10 (0°-15° from dimension 1) and items 11 to 20 (75°-90°). Panel (b), Mixed Structure (MS): items 1 to 5 (0°-15°), items 6 to 10 (25°-40°), items 11 to 15 (50°-65°), and items 16 to 20 (75°-90°).]

Note: All angles are defined from the first dimension. For APSS, two clusters of 10 items loaded highly on either of the two dimensions. For MS, two sets of five items clearly measured each of the two dimensions, and the remaining two sets measured composites of the dimensions. Of the two sets of composite-measuring items, one was slightly more sensitive to the first dimension and the other set to the second dimension.

To construct the dimensional structures, angles (α, see Equation [4]) between the item vectors and the first dimension were randomly drawn from a uniform distribution within the ranges of the item clusters defined in Figure 3.1. In order to define the item parameters, the fixed values of MDISC and MDIFF in Table 3.1, generated by Roussos et al. (1998), were used. These two sets of MIRT item characteristics were selected because they are realistic, cover item features usually found on a test, and do not relate dimensionality to item difficulty levels. The average value of MDISC is 1.2, and the average value of MDIFF is zero.

Table 3.1. Five MIRT Discrimination and Difficulty Levels

Level   MDISC   MDIFF
1       0.4     -1.5
2       0.8      1.0
3       1.2      0.0
4       1.6     -1.0
5       2.0      1.5
Mean    1.2      0.0

This pattern of MDISCs and MDIFFs was repeated four times for the twenty common items. Then, the discrimination and difficulty-related parameters were determined by Equations (2), (3), and (4) with the given angles, MDISCs, and MDIFFs.
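This construction simply inverts Equations (2) to (4): a1 = MDISC·cos(α), a2 = MDISC·sin(α), and, from Equation (3), d = -MDIFF·MDISC. A short sketch follows (the random draws are illustrative and will not reproduce the study's exact parameters):

    import numpy as np

    rng = np.random.default_rng(7)

    def make_item(mdisc, mdiff, angle_range_deg):
        # Build (a1, a2, d) from MDISC, MDIFF, and a direction drawn
        # uniformly within an item cluster's angle range.
        alpha = np.radians(rng.uniform(*angle_range_deg))
        a = mdisc * np.array([np.cos(alpha), np.sin(alpha)])
        d = -mdiff * mdisc          # from MDIFF = -d / MDISC
        return a, d

    # One item per MDISC/MDIFF level of Table 3.1, in the 0-15 degree cluster:
    for mdisc, mdiff in [(0.4, -1.5), (0.8, 1.0), (1.2, 0.0),
                         (1.6, -1.0), (2.0, 1.5)]:
        a, d = make_item(mdisc, mdiff, (0, 15))
        print(np.round(a, 2), round(d, 2))

The d values produced this way (0.6, -0.8, 0.0, 1.6, -3.0) match the d column of Table 3.2, which repeats with every block of five items.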
A set of item parameters that was used for the present simulation conditions is given in Table 3.2.

Table 3.2. Item Parameters for Twenty Common Items

            APSS            MS
Item    a1      a2      a1      a2      d
1       0.40    0.03    0.40    0.03    0.60
2       0.80    0.07    0.78    0.17   -0.80
3       1.19    0.16    1.20    0.07    0.00
4       1.56    0.34    1.60    0.10    1.60
5       2.00    0.04    1.98    0.29   -3.00
6       0.40    0.05    0.34    0.21    0.60
7       0.78    0.17    0.71    0.36   -0.80
8       1.20    0.06    1.01    0.64    0.00
9       1.60    0.11    1.25    1.00    1.60
10      2.00    0.09    1.68    1.08   -3.00
11      0.04    0.40    0.25    0.31    0.60
12      0.15    0.79    0.47    0.65   -0.80
13      0.09    1.20    0.64    1.01    0.00
14      0.16    1.59    0.75    1.41    1.60
15      0.47    1.94    1.03    1.71   -3.00
16      0.08    0.39    0.03    0.40    0.60
17      0.04    0.80    0.10    0.79   -0.80
18      0.30    1.16    0.14    1.19    0.00
19      0.37    1.56    0.34    1.56    1.60
20      0.23    1.99    0.21    1.99   -3.00
Mean    0.69    0.65    0.75    0.75   -0.32
SD      0.66    0.68    0.57    0.60    1.59

The average angle from the first dimension for the first ten items with APSS was 6.00°, and that of the remaining ten items was 81.04°. The average angles of the four sets of five items with MS were 6.19°, 32.39°, 56.87°, and 82.83°, respectively. The directional vectors of the twenty common items are illustrated in Figures 3.2 (APSS) and 3.3 (MS). The length of an item vector indicates the degree of discrimination (MDISC), and the distance between the origin and the starting point of the vector (the arrow point of the vector in the third quadrant) is the item difficulty (MDIFF). All vectors are extended through the origin, and they are located in the first and third quadrants because of the positive discrimination parameters (a's) (Ackerman, 1996; Reckase & McKinley, 1991).

[Figure 3.2. Item Vectors, Approximate Simple Structure]

[Figure 3.3. Item Vectors, Mixed Structure]

The probability of getting an item right was computed by the compensatory two-dimensional two-parameter IRT model, Equation (1) with c = 0. Given the MIRT item parameters, the response probability P_ij was computed for each examinee. Then P_ij was compared to a uniform random value P*, where 0 ≤ P* ≤ 1. A binary item score of x_ij = 1 (correct response) was assigned when P_ij > P*; otherwise, a score of x_ij = 0 (incorrect response) was assigned.

3.1.3 Simulation Factors

(1) Ability Distributions

Five bivariate normal distributions with different means and variances/covariances were considered for the examinees' true abilities. Different ability distributions mean that the test forms are administered to somewhat different populations. For example, considering vertical equating, examinee groups may have different ability levels, represented by different means of the distributions. It might also be possible that there are different relationships between the dimensions when examinee groups have different ability levels. Table 3.3 shows the mean vectors (μ), variance/covariance matrices (Σ), and correlation coefficients (ρ) of the two traits for the five examinee groups.

Table 3.3. Ability Distributions for Five Examinee Groups

           Group 1      Group 2      Group 3      Group 4      Group 5
μ'         (0, 0)       (0, 0)       (.5, .5)     (0, 0)       (0, 0)
Σ          1   0        1   .5       1   .5       .8  .4       1.2  .5
           0   1        .5  1        .5  1        .4  .8       .5   .8
ρ          .00          .50          .50          .50          .51

The distribution of Group 1 is the default ability distribution (standardized independent bivariate normal distribution) that is assumed in MIRT calibration programs (e.g., NOHARM [Fraser, undated] and TESTFACT [Wilson, Wood, & Gibbons, 1991]). Therefore, the least estimation errors, as well as the least linking errors, are expected for Group 1. From Group 2 on, the true abilities are assumed to have a moderate correlation (ρ = 0.5). Because MIRT calibration programs generate independent ability dimensions by default, the dimensional orientations of Groups 2 to 5 become arbitrary.
However, the item discrimination estimates reflect the dimensional dependency when correlated ability dimensions are transformed to be independent. In Group 3, a mean vector different from the default zero means on both dimensions was used. For the last two groups, differences in the variances of the abilities were implemented, but the correlations of the two dimensions were still maintained at about 0.5. Note that Group 5 shows expansion for the first dimension (1.2) and shrinkage for the second dimension (.8), while the rates of shrinkage for Group 4 are the same for the two dimensions (.8). One may consider more variation in ability distributions than those discussed so far. However, the five distributional conditions in Table 3.3 cover the essential types of distributional changes (e.g., means, variances/covariances) and still retain enough simplicity to make the comparison of the three linking methods clear.

(2) Number of Examinees

Usually 2000 or more examinees are suggested for MIRT calibration (Ackerman, 1994; Reckase, 1995). In order to evaluate the stability of the linking results under less desirable conditions, a relatively small sample size (1000) was considered along with the recommended size of 2000.

(3) Dimensional Structures

As was described in Section 3.1.2, two sets of twenty common items were specified according to the two dimensional structures (i.e., APSS and MS).

(4) Linking Methods

The three linking methods described in Chapter 2 were compared based on how closely the item estimates for the twenty common items were transformed into the item parameters, i.e., the degree of parameter recovery through the metric transformations. For each method, there are several sub-procedures which result in slightly different transformations. One relatively better, or best, sub-procedure for each of the ODL and LL methods was selected for the comparison: the test characteristic function (TCF) procedure for the ODL method, and the composite procedure of orthogonal Procrustes solutions for the LL method. The new method with a dilation matrix followed the same criteria as the LL method (see also Appendix A).

(5) Procedures and Computer Programs

Given the ability distributions (5), sample sizes (2), and dimensional structures (2), there were twenty combinations of simulation conditions. Fifty test response patterns were generated for each combination, as was described in Section 3.1.2. Even though there is no clear guideline for the number of replications needed for reliable results, at least 25 replications have been recommended in IRT-based research (Harwell, Stone, Hsu, & Kirisci, 1996).

There were three steps to conducting the simulation study: generation of test response patterns, MIRT calibration, and linking. First, dichotomous response data were generated based on the MIRT model (Equation [1] with c = 0), the item parameters (Table 3.2), and the characteristics of the ability distributions (Table 3.3). This step was completed by using GENDATS, developed by Thompson (undated). Second, item parameters were estimated from the item response patterns generated in the first step. For MIRT calibration, a modified version of NOHARM (Normal Ogive Harmonic Analysis Robust Method; Fraser, undated; Thompson, 1996) was used
While in practice the item parameter estimates of one test are equated to the estimates of another, the item parameters themselves were used as the base test parameters for the present simulation. IPLINK (Lee & Oshima, 1996) was used for the ODL method with the following options: scaling constant of 1.702, #2 parameterization, no weighting, and the TCF method. MDEQUATE (Li, 1996) was run to implement the LL method. For the expansion of the LL method with a diagonal dilation matrix, a new linking program was written using MATLAB (The MathWorks, 1995) (see Appendices A and B).

3.1.4 Evaluation Criteria and Data Analysis

Although the three linking methods easily apply to other equating designs, they were originally developed for the anchor item design. In the IRT framework, one of the evaluation criteria for linking methods with anchor item equating is based on the size of the differences between the base estimates and the transformed estimates. Adopting the statistical concepts of accuracy and stability for metric transformation, two summary statistics were used as evaluation criteria: how far the transformed estimates depart from the initial item parameters on average (linking bias), and how much the differences fluctuate across the common items (root mean square error, RMSE). Bias and RMSE were computed by

\mathrm{Bias}(a_1) = \frac{1}{I}\sum_{i=1}^{I}\left(\hat{a}^{*}_{1i} - a_{1i}\right),   (21)

\mathrm{RMSE}(a_1) = \left[\frac{1}{I}\sum_{i=1}^{I}\left(\hat{a}^{*}_{1i} - a_{1i}\right)^{2}\right]^{1/2},   (22)

respectively, where a_1i is the discrimination parameter on the first dimension of item i, â*_1i is the transformed discrimination estimate on the first dimension of item i, and I is the number of common items, twenty for the present simulation. These two summary statistics represent the quality of linking for the first-dimension discrimination. Overall patterns were then examined with the mean bias and mean RMSE over the 50 replications. The same formulas were applied to the other item characteristics: the item discrimination on the second dimension (a2) and the difficulty-related parameter (d). As each item had three parameters and three transformed values, there were three sets of bias and RMSE for each replication.
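These two statistics are simple to compute once the transformed estimates are in hand; the following is a minimal sketch with assumed variable names, for the first-dimension discrimination.

% Minimal sketch of Equations (21) and (22); a1 holds the I = 20 true
% parameters and a1star the transformed estimates (assumed names).
diffs = a1star - a1;
bias  = mean(diffs);             % Equation (21)
rmse  = sqrt(mean(diffs.^2));    % Equation (22)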
Because the three linking methods were applied to the same test response patterns, a repeated measures analysis of variance (ANOVA) model was used to detect the effects of the simulation conditions (between-factors) and the linking methods (within-factor) on bias and RMSE.^5 The repeated measures ANOVA model is

\mathrm{Bias}(a_1)_{li(ngs)} = \mu + \beta_n + \gamma_g + \lambda_s + \gamma\lambda_{gs} + \pi_{i(ngs)} + \alpha_l + \alpha\beta_{ln} + \alpha\gamma_{lg} + \alpha\lambda_{ls} + \alpha\gamma\lambda_{lgs} + e_{li(ngs)},   (23)

where
Bias(a1)_li(ngs): bias of the first-dimension discrimination for the lth linking method, ith iteration, nth sample size, gth group, and sth structure;
μ: overall mean in the population;
β_n: effect of the nth sample size (1,000 and 2,000);
γ_g: effect of the gth distributional group (Groups 1 to 5);
λ_s: effect of the sth dimensional structure (APSS and MS);
γλ_gs: interaction effect of group and structure;
π_i(ngs): effect of the ith iteration within the nth sample size, gth group, and sth structure (iterations 1 to 50);
α_l: effect of the lth linking method (three linking methods);
αβ_ln: interaction effect of linking method and sample size;
αγ_lg: interaction effect of linking method and group;
αλ_ls: interaction effect of linking method and dimensional structure;
αγλ_lgs: interaction effect of linking method, group, and dimensional structure; and
e_li(ngs): error for the lth linking method and ith iteration within the nth sample size, gth group, and sth structure.^6

In Equation (23), there are three between-factors: sample size, distributional shape of the group, and dimensional structure. The interaction term of the between-factors (group by structure) was selected based on initial examinations of the full model results. There is also one within-factor, linking method, and there are several interaction terms for between- by within-factors. Equation (23) is the model for the bias of the first-dimension discrimination. The same model applies to the bias and the log-transformed RMSE for all three item parameters. Inference statistics from this model tested whether the simulation conditions and linking methods had statistically significant effects on linking bias and RMSE, and descriptive statistics of the two summary statistics were then examined to provide more detailed patterns of the linking errors and to compare the three linking methods.

^5 In order to obtain a more desirable distribution (i.e., normality), a natural logarithm was taken of the RMSEs.
^6 Statistical tests of the repeated measures analysis of variance model are based on the symmetry conditions: 1) the variance-covariance matrix of the transformed variables used to test effects has covariances of zero and equal variances (sphericity); 2) the variance-covariance matrix must be equal across all levels of the between-subjects factors (homogeneity).

3.2 Real Data Analysis

Simulation data have the advantage that they can clarify which linking method and testing condition(s) lead to favorable metric transformations, because the true model and parameters are known. However, simulation is not reality itself, and the overall meaning of a simulation depends on how realistic the assumed conditions and the resulting data are. One way to scrutinize a simulation study is to compare its results with real data and see whether they lead to consistent conclusions. For this purpose, scoring outcomes of a statewide mathematics test for 7th-grade students were analyzed. The test consisted of three sections with 115 multiple-choice items (four alternatives each), and each item was classified into one of six content areas. More than 130,000 students took the test, and the valid sample size was 124,481.

In order to evaluate the three linking methods under real testing conditions, two test forms that shared common items (a base test form and an equated test form) were artificially assembled as follows. First, twenty common items were randomly selected from the 115 items based on the original test structure (three sections) and the content areas (six areas) in order to represent the whole test. The number of common items was decided by following Angoff's (1971) rule of thumb. Second, the remaining 95 items were randomly assigned as unique items to one of the two test forms, again based on test structure and content areas, in order to make the two sets of unique items as similar as possible; a sketch of this assignment step appears after Table 3.6. As a result, the base test form was assembled with the twenty common items and 47 unique items, and the equated test form was composed of the same twenty common items and 48 unique items. Table 3.4 shows the composition of the two test forms.
Third, in order to replicate the real data comparisons, five non-overlapping groups of 2,000 examinees each for the base test and the equated test were randomly sampled from the overall pool of examinees. The number of examinees was decided based on the recommendations of the related literature on MIRT models (Ackerman, 1994; Reckase, 1995). Descriptive statistics of classical item difficulties (the proportion of examinees who answered an item correctly) for the common and unique items are provided in Table 3.5.

Table 3.4. Composition of Two Test Forms*

Content Area   Section 1   Section 2   Section 3    Total
1              1, 2, 1     1, 1, 3     1, 0, 2      3, 3, 6
2              0, 1, 0     0, 1, 0     2, 4, 5      2, 6, 5
3              0, 0, 0     0, 0, 0     4, 10, 8     4, 10, 8
4              0, 1, 0     0, 0, 2     1, 1, 3      1, 2, 5
5              2, 5, 4     2, 4, 3     3, 7, 9      7, 16, 16
6              0, 0, 0     0, 0, 0     3, 10, 8     3, 10, 8
Total          3, 9, 5     3, 6, 8     14, 32, 35   20, 47, 48

* The first number in each cell indicates the number of common items. The second and third numbers are the numbers of unique items in the base form and in the equated form, respectively.

Table 3.5. Item Difficulties of Two Test Forms

                      Common items (20)   Unique items on      Unique items on
                                          base test (47)       equated test (48)
                      Mean     SD         Mean     SD          Mean     SD
Base      Group 1     .57      .15        .57      .16         .58      .15
          Group 2     .57      .15        .57      .16         .59      .15
          Group 3     .56      .15        .57      .17         .59      .15
          Group 4     .57      .16        .57      .17         .59      .15
          Group 5     .57      .16        .57      .17         .58      .15
Equated   Group 1     .58      .16        .58      .17         .59      .15
          Group 2     .58      .15        .58      .16         .59      .15
          Group 3     .57      .16        .57      .17         .59      .16
          Group 4     .57      .16        .58      .17         .59      .16
          Group 5     .57      .15        .57      .16         .59      .15

From Tables 3.4 and 3.5 one can see that the twenty common items were similar to the unique items in terms of test content and item difficulty, and that the unique items on the base form were similar to those on the equated form. Fourth, the two-dimensional three-parameter logistic model (Equation [1]) was applied to each data set, and the three linking methods were applied to the five pairs of random examinee groups (e.g., base group 1 and equated group 1, and so forth). Because NOHARM does not provide estimates of the lower asymptotes, these were estimated with BILOG-3 and used as input data for the MIRT calibration. The item parameter estimates of the twenty common items for the first pair of samples, after a varimax rotation, are provided in Table 3.6. Even though the test dimensions were not a main focus of this study, the first dimension could be interpreted as an elementary algebra dimension (e.g., whole numbers and basic computation) and the second as a logical reasoning dimension (e.g., problem solving and statistics).

Table 3.6. Item Parameter Estimates of Common Items

        Base test form (group 1)        Equated test form (group 1)
Item     a1     a2      d      c         a1     a2      d      c
  1     0.88   0.42   -0.58   0.21      1.01   0.47   -0.87   0.24
  2     0.69   0.25   -0.31   0.14      0.66   0.18   -0.27   0.15
  3     1.14   0.66    0.04   0.10      1.33   0.45    0.14   0.10
  4     1.05   0.48    0.23   0.12      1.23   0.51    0.20   0.16
  5     0.34   0.39    0.56   0.15      0.43   0.35    0.70   0.15
  6     0.54   0.42    0.43   0.25      0.52   0.56    0.70   0.15
  7     1.32   0.71   -0.76   0.08      1.37   0.65   -0.77   0.12
  8     0.72   0.37   -0.05   0.18      0.74   0.43   -0.10   0.24
  9     1.01   0.40   -1.11   0.15      0.98   0.41   -1.26   0.20
 10     0.17   0.31   -0.10   0.18      0.27   0.23   -0.17   0.21
 11     0.83   0.62   -0.91   0.16      0.58   1.56   -1.16   0.14
 12     0.63   0.76    0.24   0.16      0.69   0.60    0.36   0.16
 13     0.50   0.55   -0.39   0.23      0.43   0.38   -0.24   0.16
 14     0.31   1.19    1.58   0.16      0.43   0.87    1.41   0.13
 15     0.41   0.38   -0.25   0.26      0.32   0.34   -0.02   0.14
 16     0.61   0.58   -1.73   0.26      0.45   0.88   -1.67   0.22
 17     0.54   0.80    0.23   0.13      0.68   0.74    0.35   0.11
 18     0.60   0.69    0.87   0.12      0.61   0.54    0.89   0.12
 19     0.39   0.63   -0.60   0.25      0.38   0.43   -0.35   0.16
 20     0.57   1.21    0.77   0.11      0.85   0.85    0.77   0.15
Mean    0.66   0.59   -0.09   0.17      0.70   0.57   -0.07   0.16
SD      0.30   0.26    0.76   0.06      0.33   0.31    0.79   0.04
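The random assignment described in the second step can be sketched as below; all variable names are hypothetical, and the study's actual assignment may have balanced the cells differently.

% Hypothetical sketch of the stratified random split of the 115 items.
% cellID (115 x 1) labels each item's section-by-content cell and
% nCommon(k) is the number of common items drawn from cell k.
common = []; baseU = []; equatedU = [];
for k = 1:max(cellID)
    idx = find(cellID == k);
    idx = idx(randperm(numel(idx)));           % shuffle within the cell
    common = [common; idx(1:nCommon(k))];      % common items for this cell
    rest   = idx(nCommon(k)+1:end);
    baseU     = [baseU;     rest(1:2:end)];    % alternate the remainder
    equatedU  = [equatedU;  rest(2:2:end)];    % between the two forms
end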
3.2.1 Evaluation Criteria and Data Analysis

For the real data analyses, the three linking methods were evaluated in two ways: by item parameter estimates and by true test score estimates for the twenty common items. In order to compare the transformed item estimates of the equated test form with the estimates of the base test form, Equations (21) and (22) were used. Note that in the real data analysis the true item parameters were unknown, but there were two sets of item estimates for the common items. So the item estimates (â_1i) on the base form were used in place of the item parameters (a_1i) in both Equations (21) and (22); the resulting statistics are a mean difference and a difference variation rather than a bias and an RMSE.

In addition to the item-level comparisons, differences in the estimated true scores on the common items, computed from test response surfaces, were evaluated for the three linking methods. By summing Equation (1) over the twenty common items for the base group and the equated group, two test response surfaces can be estimated (see also the first part of Equation [9]). Practical information is expected to be obtained by exploring where the most and least discrepancy between the base test response surface and the transformed surface occurs for each linking method.

CHAPTER 4
RESULTS

Based on the study designs described in the previous chapter, the main results of the simulation study and the real data analysis are provided along with initial interpretations.

4.1 Simulation Study

4.1.1 Results of Repeated Measures Analysis of Variance

Linking errors for each replication with the twenty common items were calculated using two statistics: the mean and the standard deviation of the differences between the transformed estimates and the item parameters (bias and RMSE; see Equations [21] and [22]). These summary statistics were considered indicators of the quality of linking for each replication. After finding significant multivariate results for the model given by Equation (23), univariate test results for the six dependent variables are provided in Tables 4.1 and 4.2. In each cell there are three numbers: the F value, the degrees of freedom, and eta square (the proportion of explained variance to overall variance). It should be noted that the first degree of freedom regarding linking method for the difficulty parameter is 1 rather than 2. The reason is that only the ODL and LL (or M) methods were compared for the difficulty parameters, because the LL and M methods resulted in exactly the same transformations of the difficulty parameters.

Table 4.1. Test Statistics (F), Degrees of Freedom (DF), and Effect Sizes (η²) of Biases from Repeated Measures ANOVA^7

Source                                  Bias, a1       Bias, a2       Bias, d
Between factors
  Sample size (β_n)              F  =   72.86**        58.67**        8.08**
                                 DF =   (1, 989)       (1, 989)       (1, 989)
                                 η² =   .07            .06            .01
  Distributional group (γ_g)            175.04**       167.27**       .12
                                        (4, 989)       (4, 989)       (4, 989)
                                        .41            .40            .00
  Dimensional structure (λ_s)           353.82**       239.62**       28.75**
                                        (1, 989)       (1, 989)       (1, 989)
                                        .26            .20            .03
  Group x Structure (γλ_gs)             29.32**        22.27**        .40
                                        (4, 989)       (4, 989)       (4, 989)
                                        .11            .08            .00
Within factor
  Linking method (α_l)                  179.95**       101.36**       393.32**
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .15            .09            .29
  Link x Size (αβ_ln)                   58.05**        53.44**        2.74
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .06            .05            .00
  Link x Group (αγ_lg)                  67.16**        75.76**        1.07
                                        (8, 1978)      (8, 1978)      (4, 1978)
                                        .21            .24            .00
  Link x Structure (αλ_ls)              130.85**       127.24**       8.35**
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .12            .11            .01
  Link x Group x Structure (αγλ_lgs)    16.20**        16.16**        3.20*
                                        (8, 1978)      (8, 1978)      (4, 1978)
                                        .06            .05            .01

** p < .01, * p < .05

^7 When the sphericity assumption is violated, as in the present simulation, adjustments of the test statistics can be made with the Greenhouse-Geisser epsilon. Because the initial F values and the adjusted statistics are not very different, the initial statistics under the sphericity assumption are reported.
Table 4.2. Test Statistics (F), Degrees of Freedom (DF), and Effect Sizes (η²) of RMSEs from Repeated Measures ANOVA

Source                                  LN RMSE, a1    LN RMSE, a2    LN RMSE, d
Between factors
  Sample size (β_n)              F  =   184.23**       201.42**       132.38**
                                 DF =   (1, 989)       (1, 989)       (1, 989)
                                 η² =   .16            .17            .12
  Distributional group (γ_g)            165.15**       188.77**       6.77**
                                        (4, 989)       (4, 989)       (4, 989)
                                        .40            .43            .03
  Dimensional structure (λ_s)           8.47**         0.21           0.45
                                        (1, 989)       (1, 989)       (1, 989)
                                        .01            .00            .00
  Group x Structure (γλ_gs)             2.93*          1.53           6.08**
                                        (4, 989)       (4, 989)       (4, 989)
                                        .01            .01            .02
Within factor
  Linking method (α_l)                  536.22**       614.73**       1486.73**
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .35            .38            .50
  Link x Size (αβ_ln)                   49.89**        87.26**        7.65**
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .05            .08            .01
  Link x Group (αγ_lg)                  87.99**        99.68**        0.79
                                        (8, 1978)      (8, 1978)      (4, 1978)
                                        .26            .29            .00
  Link x Structure (αλ_ls)              27.53**        45.36**        5.76*
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .03            .04            .01
  Link x Group x Structure (αγλ_lgs)    4.62**         7.49**         6.31**
                                        (8, 1978)      (8, 1978)      (4, 1978)
                                        .02            .03            .03

** p < .01, * p < .05

The statistical test results indicated that the effects of the linking methods depended on the simulation conditions (i.e., the interactions of between- and within-factors were statistically significant). In addition to these interactions, the three main factors of the simulation conditions had significant effects on linking bias and log-transformed RMSE. An interesting finding is that the discriminations (especially the first-dimension discriminations) were more sensitive than the difficulty estimates to the simulation conditions. This pattern was clearer in the effect sizes denoted by the eta squares. That is, distributional group, dimensional structure, and linking method accounted for large portions of the bias variation of the discriminations, but only the linking method was an important factor for the bias variation of the difficulties. For the log-transformed RMSEs, sample size, distributional group, and linking method were important for the discriminations, and sample size and linking method were important for the difficulties. In sum, the results of the repeated measures ANOVA showed that the type of linking method had significant effects on all six dependent variables, and the soundness of the linking results (i.e., how close the transformed estimates were to the true parameters) depended on various test conditions, the linking methods, and their interactions.

4.1.2 Comparison of the Three Linking Methods

In order to directly compare the behaviors of the three MIRT linking methods across simulation conditions, the linking errors (i.e., bias and RMSE) are illustrated in Figures 4.1 to 4.12. The bounded vertical lines represent the upper and lower limits of one standard deviation for the bias and RMSE of the fifty replications on the twenty common items, and the marked middle points of the lines are the mean values of bias and RMSE. Note that the horizontal axis represents the combinations of the five distributional shapes and the two dimensional structures. For example, AP1 indicates distributional Group 1 with APSS items.
Figure 4.1. Bias (a1, n=1000)

Figure 4.2. Bias (a2, n=1000)

Figure 4.3. Bias (d, n=1000)

Figure 4.4. RMSE (a1, n=1000)

Figure 4.5. RMSE (a2, n=1000)

Figure 4.6. RMSE (d, n=1000)

Figure 4.7. Bias (a1, n=2000)

Figure 4.8. Bias (a2, n=2000)

Figure 4.9. Bias (d, n=2000)

Figure 4.10. RMSE (a1, n=2000)

Figure 4.11. RMSE (a2, n=2000)

Figure 4.12. RMSE (d, n=2000)

In general, one can see that the M method and the ODL method were less biased and more stable than the LL method for the three item parameters. More specific points follow.

1) As the sample size became larger and the ability distribution was closer to the default conditions (i.e., an orthogonal standard bivariate normal distribution), the metric transformations were more accurate (less variation in bias) and more stable (smaller RMSE).

2) For the discrimination estimates, transformations of MS items were less biased than those of APSS items, especially with the LL method; Figures 4.1, 4.2, 4.7, and 4.8.

3) Transformations of the difficulty estimates were less stable than those of the discriminations, especially with the ODL method; Figures 4.6 and 4.12.

4) The LL method showed relatively larger biases of the discriminations compared with the two other methods.

5) Compared with the M method, the ODL method was less biased for discriminations of APSS items, but more biased for discriminations of MS items; Figures 4.1, 4.2, 4.7, and 4.8.

6) While the transformations of the ODL method were relatively stable among the different ability distributions, the two orthogonal Procrustes based methods showed drastic changes in the bias and RMSE of the discrimination parameters between Group 1 and Group 2 (i.e., whether or not the two dimensions were correlated).

7) The M method and the LL method showed very similar RMSEs for the discrimination parameters.

8) The two orthogonal Procrustes based methods over-transformed the difficulty estimates, while the ODL method under-transformed them; Figures 4.3 and 4.9.

9) The ODL method had relatively small RMSEs for the two discriminations, but it showed more RMSE variation over the 50 replications than the other methods (Figures 4.4, 4.5, 4.10, and 4.11). Compared with the M method and the LL method, the ODL method showed less accurate and less stable transformations of the difficulty-related parameters; Figures 4.3, 4.6, 4.9, and 4.12.
10) The LL method and the M method showed exactly the same transformation results for the difficulty parameter because they estimated the same rotation matrix and translation vector; Figures 4.3, 4.6, 4.9, and 4.12.

In sum, the inference statistics from the repeated measures ANOVA model indicated that the linking methods and the three testing conditions significantly affected linking bias and RMSE. From the mean bias and RMSE of the 50 replications in each simulation condition, one can see that the ODL method and the M method provided less biased metric transformations of the discrimination estimates than the LL method, and that the M method with the diagonal matrix made more stable transformations of the difficulty-related parameters than the ODL method.

4.2 Real Data Analysis

As a real data example, the original 115 test items were artificially assigned to two parallel test forms sharing twenty common items, made as similar as possible in terms of test structure, content areas, and item difficulties. The item composition of the two test forms and the common items was shown in Table 3.4. Five pairs of sub-samples for the base and equated examinee groups were randomly sampled from the 124,481 valid examinees. For the purpose of the simplest demonstration of a multidimensional analysis, a two-dimensional three-parameter logistic model was used. The twenty common-item parameter estimates with a varimax rotation for the first pair of samples were provided in Table 3.6.

4.2.1 Item Estimates Comparison

After the item estimates of the equated form were transformed onto the scale of the base test by the three linking methods, the transformed estimates were compared with the item estimates of the base form. This procedure was replicated for the five pairs of sub-samples in order to find consistent patterns. Differences between the transformed estimates and the base form estimates are illustrated in Figures 4.13 and 4.14. The bounded vertical lines indicate the ranges of maximum and minimum values, and the markers on the lines are the mean values of the five replications. The LL method showed the largest mean differences for the two discrimination parameters, and the two orthogonal Procrustes solution based methods resulted in less biased difficulty transformations than the ODL method. For difference variation, the LL method showed the most stable results, and the ODL method was the worst, though not by much. These findings for the real data generally confirmed the results of the simulation study.

Figure 4.13. Mean Differences of Five Sets of Samples

Figure 4.14. Difference Variations of Five Sets of Samples

4.2.2 True Score Comparison

In addition to the item-level comparison, overall true score comparisons for the first pair of sub-samples were conducted to evaluate the linking methods from a different aspect. When the three linking methods were compared on the estimated true score scale (the test response surface for the twenty common items), differences between the two sets of estimated true scores (transformed true scores minus base true scores) were calculated at a limited set of points (49 points, 7 by 7) in the ability space.
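A minimal sketch of this surface comparison follows, assuming the three-parameter logistic form of Equation (1) with scaling constant 1.702 and hypothetical variable names: aB, dB, cB hold the base-form estimates and aT, dT, cT the transformed equated-form estimates for the 20 common items.

% Minimal sketch (assumed form and names): estimated true scores on a
% 7 x 7 grid of (theta1, theta2) points and their difference surface.
pts = -3:1:3;
[T1, T2] = meshgrid(pts, pts);
TRSb = zeros(size(T1)); TRSt = zeros(size(T1));
for i = 1:20
    Pb = cB(i) + (1 - cB(i)) ./ (1 + exp(-1.702*(aB(i,1)*T1 + aB(i,2)*T2 + dB(i))));
    Pt = cT(i) + (1 - cT(i)) ./ (1 + exp(-1.702*(aT(i,1)*T1 + aT(i,2)*T2 + dT(i))));
    TRSb = TRSb + Pb;            % base test response surface
    TRSt = TRSt + Pt;            % transformed surface
end
diffTRS = TRSt - TRSb;           % transformed minus base, as in Figure 4.15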
The difference scores, along with the true score estimates of the base test form, are presented in Figure 4.15.

(a) ODL method

θ2 \ θ1    -3      -2      -1       0       1       2       3
  3       0.44    0.32    0.01    0.01   -0.01   -0.02   -0.01
  2       0.04    0.27    0.01   -0.09   -0.03    0.00    0.00
  1      -0.30   -0.08   -0.05   -0.25   -0.17   -0.04   -0.01
  0      -0.14   -0.09   -0.12   -0.38   -0.45   -0.23   -0.08
 -1       0.04    0.13    0.17   -0.02   -0.35   -0.39   -0.25
 -2       0.05    0.13    0.23    0.24    0.07    0.03    0.15
 -3       0.02    0.06    0.11    0.13   -0.01   -0.03    0.47

(b) LL method (individual values not reproduced)

(c) M method

θ2 \ θ1    -3      -2      -1       0       1       2       3
  3       0.25    0.17   -0.09   -0.15   -0.15   -0.12   -0.08
  2      -0.05    0.19    0.09   -0.07   -0.15   -0.14   -0.11
  1      -0.20   -0.02    0.09    0.01   -0.15   -0.21   -0.20
  0       0.02    0.15    0.26    0.23   -0.09   -0.33   -0.38
 -1       0.18    0.35    0.58    0.81    0.65    0.09   -0.34
 -2       0.16    0.32    0.58    0.98     *       *      0.61
 -3       0.10    0.20    0.37    0.70     *       *       *

(d) True score estimates on the base form

θ2 \ θ1    -3      -2      -1       0       1       2       3
  3      10.68   12.69   15.47   17.84   19.10   19.62   19.82
  2       9.01   10.80   13.41   16.36   18.32   19.25   19.64
  1       7.31    8.84   11.12   14.26   17.02   18.56   19.28
  0       5.89    6.93    8.72   11.62   15.01   17.37   18.60
 -1       4.94    5.49    6.58    8.71   12.12   15.39   17.39
 -2       4.50    4.75    5.28    6.49    8.99   12.45   15.25
 -3       4.35    4.46    4.72    5.35    6.87    9.67   12.67

Figure 4.15. Differences of Transformed Test Scores and Estimated True Scores on the Base Test Form*

* Cells marked with an asterisk (dark cells in the original figure) indicate absolute differences larger than 1.0.

The results show that the three methods had different patterns of linking errors in terms of the estimated true test scores for the 20 common items. The LL method showed the largest mismatch between the two test response surfaces, and the differences were located above and below the diagonal. This means that the LL method had relatively large discrepancies when an examinee had high or low scores. The M method improved the true score transformations over the LL method by modeling unique unit changes, but there were relatively large gaps when an examinee's ability was high on the first dimension and low on the second dimension. The ODL method showed the most favorable result among the three linking methods, and the discrepancies for all ability values were less than 0.5. The better performance of the ODL method was expected because its minimization criterion is based on the test response surface (Equation [9]).

CHAPTER 5
SUMMARY, DISCUSSION, AND CONCLUSION

In this chapter, the overall results are summarized, related issues are discussed, and conclusions and suggestions for further study are provided.

5.1 Simulation Study

Using simulated data, three MIRT linking methods based on the compensatory two-dimensional two-parameter logistic model were evaluated for the anchor item equating design. In order to emulate real test conditions, several simulation factors were incorporated: sample sizes, dimensional structures, and ability distributions. The amounts of linking error were quantified by bias and RMSE, based on the basic statistical concepts of accuracy and stability. For the statistical tests of the effects of the simulation factors and linking methods, a repeated measures design was applied, because each simulated test response pattern was transformed into three sets of metrics according to the three linking methods. Comparisons of the three linking methods were then conducted using the mean bias and mean RMSE of the 50 replications.

The results of the repeated measures ANOVA showed that the choice of linking method had statistically significant impacts on the linking errors, both bias and log-transformed RMSE, for all three item parameters. Further, the linking methods had significant interaction effects with the simulation conditions.
That is, the soundness of the metric transformations depends on the type of linking method as well as on the test administration conditions. When the degree of parameter recovery was quantified in terms of bias and RMSE, the new linking method, based on orthogonal Procrustes solutions with a diagonal dilation matrix, reduced the linking biases, especially for the discrimination parameters, compared with the LL method. The ODL method and the new method were relatively good at obtaining less biased metric transformations of the discriminations, and the new method outperformed the ODL method in terms of stable transformations of the difficulty-related parameters.

5.2 Real Data Analysis

For the real data analysis, a statewide mathematics test was used. Because there was originally one test form, two test forms with twenty common items were artificially assembled based on the initial test structure, content areas, and item difficulties. In order to examine the patterns of the linking results, five pairs of sub-samples of 2,000 examinees each (base and equated groups) were randomly sampled from the more than 120,000 original examinees, giving five replications of the metric transformations. The evaluation of the behaviors of the three linking methods was done in the same way as in the simulation study, comparing linking bias (mean differences) and RMSE (difference variation). The transformed test response surfaces were also evaluated based on their similarity to the base test response surface.

The new method and the ODL method outperformed the LL method in terms of how close the transformed discrimination estimates were to the estimates on the base test. However, the two orthogonal Procrustes based methods resulted in more favorable linking for the difficulty-related parameters than the ODL method. The linking results with the real test data were generally consistent with the simulation study results.

The comparison of the test response surfaces (i.e., estimated true scores at 49 ability points) revealed different error regions for the three linking methods. While the ODL method resulted in the closest agreement, a large region of linking error was found for the LL method when examinees had low or high scores. The new method with a dilation matrix generated more acceptable agreement than the LL method, but was a little less favorable than the ODL method.

5.3 Discussion

5.3.1 Rotation and Optimization Criteria

Statistically, the main differences between the two types of linking methods evaluated in this study lie in their linking components and optimization criteria. The ODL method consists of two linking components, the rotation matrix and the translation vector, while the two orthogonal Procrustes solution based methods include a dilation factor (a constant or a diagonal matrix) in addition to those two components. In some sense, the rotation matrix of the ODL method can be considered a composite of a rotation matrix and a dilation factor, because it alters both the variances and the covariances of the initial discrimination matrix.

A more noticeable difference between the ODL method and the two Procrustes based methods lies in the type of rotation. The rotation matrix of the ODL method follows general rotation procedures, i.e., oblique rotation, because no constraint is put on the rotation matrix, while the two Procrustes methods constrain the rotation matrix to an orthogonal structure.
One concern when using an oblique rotation in factor analysis techniques (Harman, 1976) is that the meaning of the reference axes can change after rotation, because the angles among the axes (correlations/covariances) are changed in finding the optimal rotation, while an orthogonal rotation maintains the initial structure of the reference system. In the MIRT model context, the orthogonal rotation of the two Procrustes solution based methods keeps the relative distances among the item vectors before and after the metric transformation, while the structure of the item vectors can change somewhat under the oblique rotation of the ODL method. However, it is not clear whether the item vector structure of the equated test needs to be maintained through a MIRT metric transformation, or to what degree the oblique rotation of the ODL method changes the vector structures. Further study is needed on this issue.

Another distinguishable difference between the two types of methods is the optimization criterion for estimating the linking components. The TRS ODL method is an expansion of the UIRT equating framework that minimizes the differences between two test response surfaces (Stocking & Lord, 1983), so the ODL method uses one equation (Equation [9]) to obtain both the rotation and translation components simultaneously. The orthogonal Procrustes based methods, however, adopt traditional factor analysis techniques and estimate the rotation and translation components separately (Equations [14] and [16], and Appendix A).
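To make the separate estimation concrete: for the rotation alone, Schönemann's (1966) solution can be computed from a single SVD, as in the following minimal sketch (an equivalent, more compact form than the two eigendecompositions used in the Appendix B listing). Here A and B are the n x m discrimination matrices of the equated and base forms.

% Minimal sketch: orthogonal Procrustes rotation minimizing ||B - A*T||
% over orthogonal T (Schonemann, 1966), via one SVD of A'B.
[W, Sig, V] = svd(A' * B);   % A'B = W * Sig * V'
T = W * V';                  % optimal orthogonal rotation
% The translation vector and the dilation component are then estimated
% in separate steps (Equations [14] and [16], and Appendix A).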
Therefore, comparisons of transformed ability estimates on each dimension would provide more meaningful information on MIRT linking methods than comparing overall true scores. For this purpose, TESTFACT should be used for MIRT calibrations because NOHARM does not provide ability estimates (Miller, 1991). 5.3.3 Relative Efficiency of Linking Methods From a statistical point of view, as a model includes more parameters, estimation results get better in terms of model fit, but the cost of a more complex model is the loss of degrees of freedom and model parsimony. 82 As was mentioned before, three linking methods include different number of linking components. The ODL method includes two linking components while the orthogonal Procrustes solution based methods adopt three. Further, the new linking method estimates more dilation components than the LL method by allowing different dilation rates for different ability dimensions. Then the question is whether including more linking parameters reduces linking errors significantly. A traditional way to evaluate efficiency of different statistical models is to compare likelihoods or amounts of error. The LL method and the new method were compared by calculating amounts of linking errors (in chapter 3) but a statistical test was not conducted. In order to test different models, several conditions should be satisfied such as distributional assumptions (e.g., normality) or consistent error terms (e.g., nested relationship among competing models), which were not examined in the study. Research on statistical comparison of linking methods needs to be pursued in order to provide statistically persuasive evidences about different behavior of various linking methods. 83 5.3.4 Test Response Surface and Ability Levels Because reporting one overall test score rather than multiple scores is the most popular way of giving test results, comparisons of transformed test response surfaces were conducted as part of real data analysis. As a result, the three linking methods contained different amounts of linking errors on various regions of the ability space, and the ODL method and the new method produced transformed test response surfaces closer to the base one than the LL method. This result confirms previous evaluation using linking bias and RMSE for both simulated and real data. However for practical applications, we need to notice different disagreement patterns, although the ODL method showed the best results in the present demonstration. For example, if any critical decision is made at low or high test scores (above or under the diagonal in Figure 4.15) for an examinee who takes the equated form, it would be better to adopt the ODL method or the new method than the LL method. On the other hand, at moderate test scores (around the diagonal) all linking methods do not much differ. So, it can be said that the selection/decision of which linking method is to be adopted depends on the purpose of the equating, such as low equating error for all ranges of test scores or for a certain decision point score (e.g., cutoff 84 score). That is, the selection of the linking method is a situational specific decision in real application. In general, an equating procedure requires individual judgments that are made by the individuals who are doing equating. The judgment should be informed on practical testing issues and statistical characteristics of equating techniques. 
5.4 Conclusion

The results of this study indicate that modeling a unique dilation rate for each ability dimension improves the orthogonal Procrustes metric transformation initially modeled in the LL method. The ODL method and the new method provide more favorable linking of the discriminations than the LL method, but the orthogonal Procrustes solution based methods produce better translation vectors than the ODL method. These differences among the three linking methods can be explained by the types of rotation and the numbers of linking components. The oblique rotation of the ODL method may provide closer agreement of the dimensional orientation, and unique dilation components for each dimension improve the metric transformations compared with a single dilation constant.

Two academic camps, the IRT framework and traditional factor analysis, are frequently referred to in dealing with dichotomous variables, and MIRT linking methods are not free from these theoretical origins. Therefore, the optimization criteria and statistical behaviors of the linking components need to be explored further by revisiting the theoretical/statistical origin of each method. It should also be noted that, even though the purpose of this study was to evaluate statistical methods for obtaining comparable test scores from different test forms, the focus was set on the linking of MIRT scales rather than on the whole equating procedure that finally provides conversion tables for different test forms. This means that overall equating results with common and unique items might differ somewhat from the comparison of common items discussed in this study. Moreover, only the anchor item design was discussed, so it is not known whether the presented simulation and real data results would hold under other equating designs (e.g., the common/equivalent group design). In addition, further research is needed on various issues of evaluation criteria, theoretical estimation errors, and other testing-related factors not dealt with in this study, such as the number of common items and non-normal ability distributions.

APPENDIX A

The purpose of this appendix is to derive linking components that allow a unique dilation for each dimension based on orthogonal Procrustes solutions. The essential concepts of the orthogonal Procrustes problem were well explained by Schönemann (1966), and the extension with a translation vector and a dilation constant was presented by Schönemann and Carroll (1970). The procedures of Schönemann and Carroll (1970) are followed here to derive the solution for the case with a diagonal dilation matrix. The main difference from the previous procedures is that the solutions for the rotation and dilation components are obtained from two different problem equations, while the original methods derived the solutions for both components from one problem equation. More details are discussed at the corresponding steps of the derivation.

The orthogonal Procrustes problem is defined as

B = AT + E_1,   (A-1)

T'T = TT' = I,   (A-2)

minimizing tr(E_1' E_1),   (A-3)

where A and B are known n x m matrices, T is the orthogonal m x m rotation matrix, E_1 is the n x m residual matrix (i.e., E_1 = B - AT), and I is the m x m identity matrix.
Considering anchor item equating conditions, A and B would be treated as item discrimination matrices for the equated test and the base test, respectively, n is the number of common items, m is the number of dimensions, and T indicates the rotation matrix. To obtain the solution for T in Equation (A—l) with (A-2) and (A-3), set f=f|+f2I (A-4) where f1 = tr(E’1E1) = tr(B'B — B'AT — T'A'B + T'A’AT) , and f2 = tr(L[T’T - I]) . The anlnatrix L of f; is unknown Lagrange multipliers with 88 respect to T. By taking a partial derivative of Equation (A—4) regarding T and setting it equal to zero, we obtain af/8T=A’AT—A'B+T(L+L’)=0. (A—S) By applying eigenvalue and eigenvector techniques (singular value decomposition) to Equation (A—S) the solution for T can be obtained (refer to Schonemann, 1966). After obtaining the rotation matrix, the concern is about unit lengths between the base matrix B and the rotated matrix AT. By considering a diagonal dilation matrix the second problem equation is B=ATK+E2, (A-6) where K is the diagonal anldilation matrix. Under the same minimization criteria as in Equation (A- 3), minimization of UKEQEZ) requires the partial derivative with regard to the diagonal dilation matrix K. atr(E'2E2)/ 3K = diagonal elements of (KT'A'AT - B'AT) = 0 . (A— 7) 89 Then, the solution for K is diag [KT'A'AT - B'AT] = 0 => K{diag [T’A'AT] } = diag[B'AT] => K = diag[B'AT]x (diag [T'A'ATD'l , (A- 8) where the matrix operator diag means that off-diagonal elements equal zero. Note that the traditional orthogonal Procrustes problem with a dilation constant, on which the LL method is based, needs only one equation such as Equation (A-6) (see Equation [15]) by using a constant, k instead of a matrix, K. Because of mathematical intractability, two equations are used, (A- l) for the rotation matrix and (A—6) for the dilation matrix. As a result, the LL method and the new method with the dilation matrix provide the exactly same solutions for T and m.(a translation vector, see Equation [16]), such that difference of linking results of the two methods come only from the dilation component, a scalar or a matrix. The MATLAB (The MathWorks, 1995) program used in the new linking method is provided in APPENDIX B. 90 APPENDIX B The program for the M method for the two dimensional case is mainly based on MDEQUATE developed by Li (1996). The only difference from MDEQUATE is the procedures of estimating the diagonal dilation matrix rather than the dilation constant. 
MATLAB Program for the M Method

base = x;                        % base form file: [id a1 a2 d]
global A1 A2 d A1o A2o do D;
d  = base(:,4);
A1 = base(:,2);
A2 = base(:,3);
BASE = base(:,2:3);

equated = y;                     % equated form file: [id a1 a2 d]
do  = equated(:,4);
TA1 = equated(:,2);
TA2 = equated(:,3);
EQUATED_1 = equated(:,2:3);

dM = [d do];
cordM = corrcoef(dM);
BASEEQUATED_1 = [BASE EQUATED_1];
COR_obs = corrcoef(BASEEQUATED_1);

disp('Orthogonal Procrustes Rotation, Schonemann, 1966');
S   = EQUATED_1' * BASE;
STS = S' * S;
SST = S * S';
[U1, D1, V1] = svd(STS);         % eigenvectors of S'S
[U2, D2, W1] = svd(SST);         % eigenvectors of SS'
ESTRM_1 = W1 * V1';              % rotation matrix T
ESTRM = ESTRM_1;
EQUATEDrot = EQUATED_1 * ESTRM_1;
disp('Rotation Matrix, T');
ESTRM
A1o = EQUATEDrot(:,1);
A2o = EQUATEDrot(:,2);

disp('Diagonal Dilation Matrix, K');
LEFT  = ESTRM' * (EQUATED_1' * EQUATED_1) * ESTRM;
RIGHT = BASE' * EQUATED_1 * ESTRM;
DEN = inv(diag(diag(LEFT)));
NUM = diag(diag(RIGHT));
K = NUM * DEN                    % Equation (A-8)

disp('Start Value for m using the Least Squares Procedure');
% Requires two function files, func_m1 and func_m2 (by Li, 1996).
SUMdB = sum(d);
SUMdE = sum(do);
SUMa1 = sum(A1o);
SUMa2 = sum(A2o);
Est_m = (SUMdB - SUMdE) / (SUMa1 + SUMa2);
m1 = Est_m;
m2 = Est_m;
D = 1.702;
dm1 = 0.0001;
dm2 = 0.0001;
for Iteration = 1:99
    m1p = m1 + dm1;
    m2p = m2 + dm2;
    % Numerical Jacobian of the two criterion functions
    J(1,1) = (func_m1(m1p,m2) - func_m1(m1,m2)) / dm1;
    J(1,2) = (func_m1(m1,m2p) - func_m1(m1,m2)) / dm2;
    J(2,1) = (func_m2(m1p,m2) - func_m2(m1,m2)) / dm1;
    J(2,2) = (func_m2(m1,m2p) - func_m2(m1,m2)) / dm2;
    f(1) = func_m1(m1,m2);
    f(2) = func_m2(m1,m2);
    ds = -J \ f';                % Newton step
    m1 = m1 + ds(1);
    m2 = m2 + ds(2);
    fprintf('Iteration=%2.0f, m1=%7.4f, m2=%7.4f ', Iteration, m1, m2)
    fprintf('f(1)=%8.4f, f(2)=%8.4f\n', f(1), f(2))
    if (abs(f(1)) < 0.00001 & abs(f(2)) < 0.00001), break; end
end
disp('Translation vector, m1 and m2');
m1
m2
disp('Transformed discriminations and difficulties');
Est_d = do + m1*A1o + m2*A2o;
TEMP = [A1o A2o];
M_FINAL_A = TEMP * K;
final = [M_FINAL_A Est_d]

REFERENCES

Ackerman, T. A. (1991). The use of unidimensional parameter estimates of multidimensional items in adaptive testing. Applied Psychological Measurement, 15, 12-24.

Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and tests are measuring. Applied Measurement in Education, 18, 255-278.

Ackerman, T. A. (1996). Graphical representation of multidimensional item response theory analyses. Applied Psychological Measurement, 20, 311-329.

Angoff, W. H. (1968). How we calibrate College Board scores. College Board Review, 68, 11-14.

Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.

Baker, F. B. (1992). Item Response Theory: Parameter Estimation Techniques. New York: Marcel Dekker.

Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12, 383-407.

Budescu, D. (1985). Efficiency of linear equating as a function of the length of the anchor test. Journal of Educational Measurement, 22, 13-20.

Cook, L. L., and Eignor, D. R. (1991). An NCME instructional module on IRT equating methods. Educational Measurement: Issues and Practice, 10, 37-45.

Crocker, L., and Algina, J. (1986). Introduction to Classical and Modern Test Theory. New York: Holt, Rinehart and Winston.

Davey, T., Nering, M. L., and Thompson, T. (1997). Realistic simulation of item response data. ACT Research Report Series ONR 97-4. Iowa City, IA: ACT, Inc.
Davey, T., Oshima, T. C., and Lee, K. (1996). Linking multidimensional item calibrations. Applied Psychological Measurement, 20, 405-416.

Dorans, N. J. (2000). Scaling and equating. In H. Wainer (Ed.), Computerized Adaptive Testing: A Primer (2nd ed., pp. 135-158). New Jersey: Lawrence Erlbaum Associates.

Embretson, S. E. (1984). A general latent trait model for response processes. Psychometrika, 40, 175-186.

Embretson, S. E., and Reise, S. P. (2000). Item Response Theory for Psychologists. New Jersey: Lawrence Erlbaum Associates.

Fraser, C. (undated). NOHARM: A computer program for fitting both unidimensional and multidimensional normal ogive models of latent trait theory.

Feuer, M. J., Holland, P. W., Green, B. F., Bertenthal, M. W., and Hemphill, F. C. (Eds.) (1999). Uncommon Measures: Equivalence and Linkage among Educational Tests. Washington, DC: National Academy Press.

Green, P. E. (1976). Mathematical Tools for Applied Multivariate Analysis. New York: Academic Press.

Gosz, J. K., and Walker, C. M. (2002). An empirical comparison of multidimensional item response data using TESTFACT and NOHARM. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Hambleton, R. K., and Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston: Kluwer.

Harman, H. (1976). Modern Factor Analysis (3rd ed.). Chicago: University of Chicago Press.

Harris, D. J., and Crouse, J. D. (1993). A study of criteria used in equating. Applied Measurement in Education, 6, 195-240.

Harwell, M. R., Stone, C. A., Hsu, T., and Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20, 101-125.

Hirsch, T. M. (1988). Multidimensional equating. Unpublished doctoral dissertation, Florida State University.

Hirsch, T. M. (1989). Multidimensional equating. Journal of Educational Measurement, 26, 337-349.

Kim, H. (1994). New techniques for the dimensionality assessment of standardized test data. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.

Kim, J.-P. (2001). Proximity measures and cluster analyses in multidimensional item response theory. Unpublished doctoral dissertation, Michigan State University.

Kolen, M. J. (2001). Linking assessments effectively: Purpose and design. Educational Measurement: Issues and Practice, 20, 5-9.

Kolen, M. J., and Brennan, R. L. (1995). Test Equating: Methods and Practices. New York: Springer.

Lee, K., and Oshima, T. C. (1996). IPLINK: Multidimensional and unidimensional item parameter linking in item response theory. Applied Psychological Measurement, 20, 230.

Li, Y. H. (1996). MDEQUATE [Computer software]. Upper Marlboro, MD: Author.

Li, Y. H. (1997). An evaluation of multidimensional IRT equating methods by assessing the accuracy of transforming parameters onto a target test metric. Unpublished doctoral dissertation, University of Maryland.

Li, Y. H., and Lissitz, R. W. (2000). An evaluation of the accuracy of multidimensional IRT linking. Applied Psychological Measurement, 24, 115-138.

Lissitz, R. W., Schönemann, P. H., and Lingoes, J. C. (1976). A solution to the weighted Procrustes problem in which the transformation is in agreement with the loss function. Psychometrika, 41, 547-550.

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. New Jersey: Lawrence Erlbaum Associates.

Lord, F. M., and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

Maier, M. H. (1993). Military aptitude testing: The past fifty years (DMDC Technical Report 93-007). Monterey, CA: Defense Manpower Data Center.
McDonald, R. P. (1967). Nonlinear factor analysis (Psychometric Monographs, No. 15). Iowa City: Psychometric Society.

McKinley, R. L., and Reckase, M. D. (1983). An extension of the two-parameter logistic model to the multidimensional latent space. ACT Research Report Series ONR 83-2. Iowa City, IA: ACT, Inc.

Mislevy, R. J., and Bock, R. D. (1990). BILOG-3: Item Analysis and Test Scoring with Binary Logistic Models [Computer software]. Mooresville, IN: Scientific Software.

Miller, T. R. (1991). Empirical estimation of standard errors of compensatory MIRT model parameters obtained from the NOHARM estimation program. ACT Research Report Series ONR 91-2. Iowa City, IA: ACT, Inc.

Oshima, T. C., and Davey, T. C. (1994). Evaluation of procedures for linking multidimensional item calibrations. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Oshima, T. C., Davey, T. C., and Lee, K. (2000). Multidimensional linking: Four practical approaches. Journal of Educational Measurement, 37, 357-373.

Reckase, M. D. (1985). The difficulty of items that measure more than one ability. Applied Psychological Measurement, 9, 401-412.

Reckase, M. D. (1990). Unidimensional data from multidimensional tests and multidimensional data from unidimensional tests. Paper presented at the annual meeting of the American Educational Research Association, Boston, MA.

Reckase, M. D. (1995). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden and R. K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 271-286). New York: Springer.

Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-36.

Reckase, M. D., and Hirsch, T. M. (1991). Interpretation of number-correct scores when the true number of dimensions assessed by a test is greater than two. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Reckase, M. D., and McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 14, 361-373.

Roussos, L. A., Stout, W. F., and Marden, J. I. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35, 1-30.

Schönemann, P. H. (1966). A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31, 1-10.

Schönemann, P. H., and Carroll, R. M. (1970). Fitting one matrix to another under choice of a central dilation and a rigid motion. Psychometrika, 35, 245-255.

Spray, J. A., Davey, T. C., Reckase, M. D., Ackerman, T. A., and Carlson, J. E. (1990). Comparison of two logistic multidimensional item response theory models. ACT Research Report Series ONR 90-8. Iowa City, IA: ACT, Inc.

Stocking, M. L., and Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.

Sympson, J. B. (1978). A model for testing with multidimensional items. In D. J. Weiss (Ed.), Proceedings of the 1977 Computerized Adaptive Testing Conference (pp. 82-98). Minneapolis: University of Minnesota.

The MathWorks. (1995). MATLAB: The Ultimate Computing Environment for Technical Education. Englewood Cliffs, NJ: Prentice-Hall.

Thompson, T. (undated). GENDAT5: A computer program for generating multidimensional item response data.
Thompson, T. (1996). NOHARM21: NOHARM (C. Fraser, undated) converted to Windows.

Thompson, T., Nering, M., and Davey, T. (1997). Multidimensional IRT scale linking without common items or common examinees. Paper presented at the annual meeting of the Psychometric Society, Gatlinburg, TN.

Traub, R. E. (1983). A priori considerations in choosing an item response model. In R. K. Hambleton (Ed.), Applications of Item Response Theory (pp. 57-70). Vancouver: Educational Research Institute of British Columbia.

Wilson, D., Wood, R., and Gibbons, R. D. (1991). TESTFACT: Test scoring, item statistics, and item factor analysis [Computer software]. Mooresville, IN: Scientific Software.

Wingersky, M. S., Barton, M. A., and Lord, F. M. (1982). LOGIST V User's Guide. Princeton, NJ: Educational Testing Service.