MICHIGAN STATE UNIVERSITY

This is to certify that the dissertation entitled

USING A PROJECTION METHOD TO ESTIMATE SUBSCORES FROM TESTS WITH MULTIDIMENSIONAL STRUCTURES

presented by

YU FANG

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Measurement and Quantitative Methods.

USING A PROJECTION METHOD TO ESTIMATE SUBSCORES FROM TESTS WITH MULTIDIMENSIONAL STRUCTURES

By

Yu Fang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

2008

ABSTRACT

USING A PROJECTION METHOD TO ESTIMATE SUBSCORES FROM TESTS WITH MULTIDIMENSIONAL STRUCTURES

By Yu Fang

A long-standing problem in factor analysis is the rotational indeterminacy of solutions. The same problem exists for multidimensional item response theory (MIRT) model analyses. Commonly, the widely used mathematical criteria from factor analysis, such as the Varimax and Promax methods, are adopted for the MIRT model to match the item discrimination matrix to a simple structure for better interpretation of item characteristics. However, no substantial steps have been taken to provide a better explanation of, and solution to, the estimation of correlated proficiencies from MIRT calibration procedures; these steps are ignored in some popular software, which provides only uncorrelated proficiency estimates. This study uses the MIRT item and person parameter estimates obtained when the proficiencies are assumed to be uncorrelated, and projects the uncorrelated proficiency estimates onto the most discriminating directions of the item clusters to obtain subscore estimates. This solution provides correlated construct scores related to different item clusters, explains the relationship and difference between the model dimensionality and the number of item clusters, and is useful for subscore reporting.

Dedicated to my beloved wife: Yang Lu

ACKNOWLEDGMENTS

I would like to express my sincere thanks to my dissertation advisor, Dr. Mark Reckase, for his continued wisdom and advice through my dissertation writing. The constructive conversations with him greatly inspire my interest in the measurement field and in MIRT. He is always readily available for questions, and without his timely and insightful feedback, I would not have understood the essence of MIRT or been so efficient with my analysis. I am especially grateful to Dr. Kimberly Maier, my academic advisor and chair of my guidance committee, who always trusts me, encourages me, and guides me through my graduate study. Special thanks also go to Dr. Ken Frank, who teaches me how one can enjoy research, and guides me to think about questions from the perspectives of social networks and sensitivity analysis. In addition, I appreciate the support of Dr. Connie Page, who helps me with funding opportunities for most of my graduate study. The statistical consulting work really enhances my skill in explaining complicated things in a simple way. I am very fortunate to work with her.
Thanks again to these four kind and knowledgeable committee members for their suggestions and encouragement on my dissertation and program of study, as well as my career. They are always teachers and friends to me. I also gratefully acknowledge Dr. Joseph Martineau and Steve Viger, both of whom work at the Michigan Department of Education, for their generous help with the MEAP data for my analysis. Finally, I owe a huge debt of gratitude to my parents, my wife and my friends, for their unfailing and patient support in all aspects. Without them, my life would not be so interesting and rewarding.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

1 Introduction to MIRT
  1.1 Dimensionality
  1.2 Importance of Subscores
  1.3 Compensatory MIRT Model
  1.4 Indeterminacies in MIRT Models
  1.5 Solution to Rotational Indeterminacy in Factor Analysis
  1.6 Interdependency between Proficiency Correlation and Item Discrimination

2 Construct Estimation Using Projection
  2.1 Transformation in the Orthogonal Coordinate System
  2.2 Construct Estimation
    2.2.1 Step 1: Calibration
    2.2.2 Step 2: Cluster Analysis
    2.2.3 Step 3: Reference Composite
    2.2.4 Step 4: Projection
  2.3 Dimensionality and Number of Clusters
  2.4 Characteristics of Construct Estimates

3 Simulation Study
  3.1 Methods
    3.1.1 Parameter Simulation
    3.1.2 Simulation Design
    3.1.3 Calibration and Projection
    3.1.4 Evaluation Criteria
  3.2 Results
    3.2.1 Parameter Estimation
    3.2.2 Item and Person Parameter Recovery
    3.2.3 Coordinate Axes Recovery
    3.2.4 Item Grouping
    3.2.5 Stability of Construct Estimates
    3.2.6 Relationship with NC Subscores and Unidimensional Estimates
    3.2.7 Accuracy of Construct Estimates
    3.2.8 Correlation Recovery

4 Real Data Applications
  4.1 Test Information and Analysis Procedure
  4.2 Dimensionality and Cluster Detection
  4.3 MIRT Calibration and Subscore Reporting

5 Conclusions, Implications, Limitations and Future Research
  5.1 Conclusions and Implications
  5.2 Limitations and Future Research

REFERENCES

LIST OF TABLES

3.1 MIRT Item Parameters for Design 1
3.2 MIRT Item Parameters for Design 2
3.3 Descriptive Statistics of MIRT Person Parameters
3.4 Recovery of Item Parameters
3.5 Recovery of Person Parameters
3.6 True and Estimated Proficiency Correlation Matrix
3.7 Reference Composite Vectors from Parameters and Replication 1 Estimates
3.8 Summary of Standard Deviations of Raw Proficiency and Construct Estimates
3.9 Average Correlation Among Construct Estimates, Unidimensional Estimates and NC Subscores
3.10 Construct Recovery from Unidimensional Estimates and Construct Estimates
3.11 Hypothesis Testing on the Difference of Correlations for Construct Recovery by Construct Estimates and Unidimensional Estimates
3.12 Correlation Recovery by Unidimensional Estimates, the Promax Method and Construct Estimates
4.1 Strands and Domains for Michigan Mathematics Content Expectation
4.2 Item Content of the Fall 2007 Grade 7 MEAP Mathematics Test
4.3 Strand and Domain Percentage in the Test
4.4 Summary Statistics for the Population Data
4.5 Results from the DIMTEST Software
4.6 Eigenvalues for the Population Data and Ten Random Datasets
4.7 Model Fit Indices from the NOHARM Software
4.8 Item Cluster Content of the Fall 2007 Grade 7 MEAP Mathematics Test
4.9 Varimax Transformed NOHARM Item Estimates for the Population Data
4.10 Summary Statistics for Item Parameter Estimates
4.11 Reference Composite Vectors from the Population Data Estimation
4.12 Eigenvalues of the A_l′A_l Matrix
4.13 Correlations among NC Subscores and Construct Estimates

LIST OF FIGURES

1.1 Representation of the Characteristics of 40 Items in a Two-Dimensional Space
1.2 Representation of the Characteristics of 40 Items in a Two-Dimensional Space when the Proficiency Correlation is 0.0, 0.2, 0.4 and 0.6
3.1 Plots of Item Vectors and Reference Composite Vectors
3.2 Plots of Item Parameters versus Bias
3.3 Plots of Proficiency Parameters versus Bias
3.4 Plots of Reference Composite Vectors from Parameters and Replication 1 Estimates
3.5 Dendrogram for Replication 1 in Design 1
3.6 Plots of Construct Parameters versus Estimated Standard Deviations
3.7 Plots of Average Construct Estimates versus Average Unidimensional Estimates across all Replications
3.8 Plots of Average Construct Estimates versus Average NC Subscores across all Replications
3.9 Plots of Recovery Bias by Construct Estimates and Unidimensional Estimates in Design 1
3.10 Plots of Recovery Bias by Construct Estimates and Unidimensional Estimates in Design 2
4.1 Plot of Eigenvalues for the Population Data and Ten Random Datasets
4.2 Dendrogram for the Population Data Calibrated with 8 Dimensions
4.3 Dendrogram for the Population Data Calibrated with 4 Dimensions
4.4 Dendrogram for the 5000 Sample Data Calibrated with 4 Dimensions
4.5 Dendrogram for the 10000 Sample Data Calibrated with 4 Dimensions
4.6 Plots of NC Subscores versus Construct Estimates for the Population Data Calibrated with 4 Dimensions
4.7 Plots of NC Subscores versus Raw Proficiency Estimates for the Population Data Calibrated with 4 Dimensions

CHAPTER 1

Introduction to MIRT

1.1 Dimensionality

The unidimensional IRT model has become increasingly important, generally accepted and widely used in the educational measurement field, especially in test construction, equating and person proficiency estimation.
The underlying assumption, which most model users seldom question, is that all the test items measure the unidimensional, or vaguely general, proficiency of a person that the test developers intend to measure. For this reason, the person proficiency and item difficulty can be put on the same continuum, and the probability of correctly answering each item is determined by the difference between the values of these two indices.

However, one test can contain items measuring knowledge of different subareas. For example, there can be algebra and geometry items in one mathematics test. Furthermore, Reckase (1985) pointed out that, for most test items, more than one hypothetical construct is needed for people to get the answer correct. His well-known argument to support this is that verbal, mathematical computation and problem solving skills are all indispensable for getting a mathematical story-problem item correct. Hence, if the person proficiency is to be defined for these items, it should be related to the proficiencies on these three skills within that person, perhaps a weighted composite of them. Tests including any of the above items are called multidimensional tests, and the multidimensionality in these two cases was later distinguished as between-item multidimensionality and within-item multidimensionality (W.-C. Wang et al., 1997).

Generally speaking, for any testing situation, people are assumed to have one set of proficiencies, denoted as the θ_person vector, where correlation can be allowed between any two vector elements. The whole test requires another set of proficiencies from people, denoted as the θ_test vector, and each item is most discriminating along one proficiency direction, or along the direction of a composite of several proficiencies. Since the responses are interactions between person proficiencies and item characteristics, the proficiency set required for the responses, the θ_person,test vector, should be the intersection of θ_person and θ_test. That is to say, the responses are determined by the proficiencies that are not only possessed by persons but also measured by items.

Although θ_person,test is required for the test, this does not mean that all these proficiencies are intended to be measured by the test developers. For example, when test developers want to evaluate high school students' performance in mathematics, they sometimes cannot avoid descriptive sentences that require students' reading skills. They will try to make the words and sentences as simple as possible to make sure all the students can understand the questions. In other words, these tests are made "insensitive" to students' reading skills, which are needed but not intended for the test. Therefore, it is actually the θ_person,test,sensitive vector that determines the dimensionality of the response matrix generated by the person-item interaction. Another way to explain the difference between θ_person,test and θ_person,test,sensitive is that the first is defined from the psychological view to include all the required proficiencies, while the second is determined from the statistical view to model the proficiencies that not only vary among persons but also can be discriminated by items. The formal definition of dimensionality is the total number of proficiencies required to meet the local independence assumption of the IRT model (Lord & Novick, 1968).
The reasoning behind θ_person,test,sensitive ensures that it is the critical factor explaining the variation in the data matrix. The estimation and interpretation of the dimensions of θ_person,test,sensitive are based on empirical data analysis and expert judgement from the perspectives of psychometrics and psychology. After the idea of multidimensional tests first emerged, many efforts were made to justify approximating the test that requires more than one hypothetical construct with the widely used unidimensional model (Reckase, 1979; Drasgow & Parsons, 1983; M. Wang, 1985, 1986; Reckase et al., 1988; Ackerman, 1989; Luecht & Miller, 1992), while at the same time some research offered cautions and warnings about this approximation (Reckase et al., 1986; Ackerman, 1991).

As the MIRT model became more popular and acceptable, determining its dimensionality turned out to be an interesting topic. A large body of literature tries to justify the number of dimensions using different model fit indices and criteria (Hambleton & Rovinelli, 1986; Stout, 1987; Roznowski et al., 1991; Gessaroli & Champlain, 1996; Stone & Yeh, 2006; Kao, 2007), to infer it by analyzing the number of item clusters from the MIRT calibration result (Miller & Hirsch, 1992; Roussos et al., 1998), or to use parallel analysis to compare the observed eigenvalues with those from randomly simulated data (Ledesma & Valero-Mora, 2007).

Furthermore, the determination of dimensionality is complicated by the fact that the data can always be fit better by a more complex model. Hirsch and Miller (1991) showed that overfactoring does not lead to serious negative consequences. Reckase and Hirsch (1991) pointed out that overfactoring is very useful for avoiding projection problems, which occur when fewer dimensions are used to analyze a response matrix that actually involves more dimensions. The suggestion of overfactoring points to one important difference between factor analysis and MIRT analysis: as a data reduction procedure, factor analysis tries to use the fewest factors possible to explain the relationships among items or tests; MIRT analysis, however, needs to identify from the test data all the dimensions that differentiate persons, or even a subgroup of persons (e.g., a high-proficiency group) (Reckase et al., 1986; Reckase, 2009). The disadvantage of overfactoring is that extracting too many factors may cause serious estimation errors, since more parameters must be estimated from the same dataset. Thus, choosing a suitable number of dimensions for the MIRT model can also be regarded as finding a good balance between the number of dimensions and the quality of parameter estimation.

All in all, it is still by no means certain how best to determine the number of elements in θ_person,test,sensitive or to choose the number of dimensions for the MIRT model. The difficulty lies largely in the definition of model-data fit, sampling and estimation errors, and complex situations in the real world.

1.2 Importance of Subscores

It is common for testing programs to purposely design items measuring different constructs in one test; accordingly, there is an increasing demand to extract more information from the test, namely, to report subscores in addition to, or in place of, the overall score.
In this way, test takers or policy makers can learn the strengths and weaknesses of proficiencies in specific areas beyond the ambiguous general proficiency, and obtain more detailed diagnostic information for remediation.

As is well known, for multidimensional tests where more than one hypothetical construct is required, the unidimensional model can be used to calibrate the whole response matrix only when all the test items measure the same weighted composite of proficiencies or all the proficiencies are highly correlated (Reckase et al., 1988; Yao & Boughton, 2007). Only in these situations is the use of one overall general proficiency score justified and subscore estimation unnecessary. Otherwise, there is no reason to assume the proficiencies measured by different test items are exactly the same.

The biggest challenge in reporting subscores is that they are less reliable than the total score; therefore, some research urges caution in reporting unreliable subscores and doubts whether they add value over the total score (Sinharay et al., 2007). However, the same authors also pointed out that, on the positive side, a subscore seems very useful when there is a reasonably large number of items in each subcategory to ensure reliability, and when there is moderate but not high correlation between the subscore and the total score to ensure added value.

For the purpose of subscore reporting, either the commonly used Number-Correct (NC) subscore or a unidimensional estimate can be calculated separately for each item cluster, where all items in the same cluster are assumed to measure the same construct proficiency, which can be placed on the same continuum. These item clusters are either predefined through careful item selection and content scrutiny by item developers and content experts, or postdefined by empirical data analysis, such as cluster analysis based on item estimates (Luecht & Miller, 1992). If there is no or low correlation between the person proficiencies for different item clusters, separate analyses of the items in each cluster seem to be all that is needed. When there is moderate correlation between different proficiencies, the subscore estimate can be adjusted post hoc, in order to increase its reliability, by borrowing information from the total score or from the estimates of other subscores; some research has already shown this with classical test theory (Yen, 1987; Wainer et al., 2000; Haberman, 2008).

However, by allowing simultaneous estimation of parameters for all dimensions, the MIRT model is a growing methodology for the calibration of multidimensional test data. Under the MIRT framework, the proficiency can be generalized to a weighted composite, a linear combination of proficiencies on several hypothetical constructs. Items are free to have different discrimination power for each proficiency dimension, and even items in the same cluster need not discriminate most along exactly the same direction. After the dimensionality of the MIRT model has been chosen empirically and theoretically, the model calibration can provide a proficiency vector for each person, which is useful for subscore reporting.
1.3 Compensatory MIRT Model

The formula for the commonly used compensatory MIRT model is

P(u_ij = 1 | θ_j, a_i, d_i, c_i) = c_i + (1 - c_i) · exp[1.7(a_i′θ_j + d_i)] / {1 + exp[1.7(a_i′θ_j + d_i)]}    (1.1)

where u_ij is the response of the jth person to the ith item, θ_j is a column vector of the jth person's proficiency coordinates in an m-dimensional space, a_i is a column vector specifying the discrimination power of the ith item for each of the m dimensions, d_i is a scalar parameter related to the item difficulty, c_i is a scalar parameter for guessing, and the constant 1.7 is used to approximate the logistic function to the normal ogive with input a_i′θ_j + d_i.

This compensatory model follows the logic of factor analysis and assumes that the probability of the person-to-item response is related to a linear combination of several proficiencies. Accordingly, this similarity allows the compensatory MIRT model to borrow some analysis and estimation methods from factor analysis. There is also a noncompensatory version of the MIRT model (Sympson, 1978):

P(u_ij = 1 | θ_j, a_i, d_i, c_i) = c_i + (1 - c_i) · ∏_{k=1}^{m} exp[1.7a_ik(θ_jk - d_ik)] / {1 + exp[1.7a_ik(θ_jk - d_ik)]}    (1.2)

where a_ik, θ_jk and d_ik indicate the item discrimination, person proficiency, and item difficulty for the kth dimension. With this model, Sympson (1978) argued that the probability of getting an item correct cannot exceed the probability of getting each dimension correct. When a person with low reading skill cannot understand the story problem, he has no chance of getting the item correct; how could the deficit be compensated by high mathematical skill? From this argument, the noncompensatory model seems more realistic, since the probability in this version of the MIRT model is related to the product of the probabilities for each dimension. However, several problems have enormously hindered the development and use of this model, such as more parameters to estimate, inefficient algorithms for parameter estimation, and the proficiency rescaling issue when the number of dimensions increases (Bolt & Lall, 2003; Reckase, 2009). Comparing the two versions, the study by Spray et al. (1990) concluded that the difference between the compensatory model and the noncompensatory model can be considered unimportant in practice, especially when the proficiencies are correlated. Besides the above reasons, the compensatory model is preferred because of its comparably easy estimation procedure and interpretation.

In order to find analogous counterparts in MIRT for familiar IRT indices, Reckase (1985) and Reckase and McKinley (1991) defined the multidimensional generalized discrimination and the direction cosines for each item. In this way, the Cartesian coordinates of each discrimination vector are converted to polar coordinates within the Cartesian coordinate system:

MDISC_i = (a_i′a_i)^{1/2} = (Σ_{k=1}^{m} a_ik²)^{1/2}    (1.3)

cos α_i = (cos α_i1, ⋯, cos α_im)′ = (a_i1/(Σ_{k=1}^{m} a_ik²)^{1/2}, ⋯, a_im/(Σ_{k=1}^{m} a_ik²)^{1/2})′    (1.4)

The MIRT generalized discrimination index MDISC_i is the length of the discrimination vector, an overall measure of the capacity of an item to distinguish persons in the multidimensional space. The direction cosines vector satisfies the constraint (cos α_i)′(cos α_i) = 1; therefore, this vector can also be regarded as a normalized version of the discrimination vector.
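To make Equations 1.1, 1.3 and 1.4 concrete, here is a minimal numpy sketch (not code from the dissertation; the function names are illustrative) that evaluates the compensatory model probability and converts a discrimination vector to its polar form:

```python
import numpy as np

def compensatory_prob(theta, a, d, c=0.0):
    """Equation 1.1: compensatory MIRT probability of a correct response."""
    z = 1.7 * (a @ theta + d)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

def mdisc(a):
    """Equation 1.3: generalized discrimination, the length of a."""
    return np.sqrt(a @ a)

def direction_cosines(a):
    """Equation 1.4: the normalized discrimination vector."""
    return a / mdisc(a)

# An item discriminating mostly along the first of two dimensions:
a = np.array([1.2, 0.3])
theta = np.array([0.5, -0.4])
print(compensatory_prob(theta, a, d=-0.2))   # response probability
print(mdisc(a), direction_cosines(a))        # polar form of a
```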
The direction cosines vector plays a very important role in MIRT models: it determines both the most discriminating and the most non-discriminating direction for all items sharing that vector. On the one hand, these items can best differentiate people whose positions vary along the direction cosines line. On the other hand, these items give no discrimination power for people positioned in a plane orthogonal to the direction cosines line, because all points in that plane yield the same value of the linear combination a_i′θ, which is sufficient to determine the probability of a person's response to these items. The line or plane of the form a_i′θ = constant in the θ space is defined as the contour line corresponding to some specific probability.

The multidimensional difficulty is a generalized version of the unidimensional IRT difficulty, and its definition is given by the following formula:

B_i = -d_i/(a_i′a_i)^{1/2} = -d_i/MDISC_i    (1.5)

The generalized discrimination, generalized difficulty and direction cosines can be represented by arrowed lines in the multidimensional θ space. The length of the arrowed line stands for MDISC_i, the distance from the origin to the base of the arrowed line is B_i, and the direction of the arrowed line is given by the direction cosines. One example of the item vector plot is shown in Figure 1.1.

[Figure 1.1. Representation of the Characteristics of 40 Items in a Two-Dimensional Space]

Another detailed index for determining the discrimination power of an item along a certain direction in the θ space is the information function. For each item, the value of the information varies with the values of θ and β, the latter of which defines a certain direction in the multidimensional θ space (Reckase & McKinley, 1991):

I_iβ(θ) = 1.7² P_i(θ) Q_i(θ) (Σ_{k=1}^{m} a_ik cos β_k)²    (1.6)

For a fixed item and person, the largest information is obtained when β = α_i, since a_ik = MDISC_i cos α_ik implies

(Σ_{k=1}^{m} a_ik cos β_k)² = (Σ_{k=1}^{m} a_ik²)(Σ_{k=1}^{m} cos α_ik cos β_k)² ≤ Σ_{k=1}^{m} a_ik²

where Σ_{k=1}^{m} cos α_ik cos β_k is the cosine of the angle between the two directions and cannot exceed 1. The information along this direction simplifies to

I_iα_i(θ) = 1.7² P_i(θ) Q_i(θ) Σ_{k=1}^{m} a_ik²    (1.7)

Thus, α_i is often called the most discriminating direction of the item. The above formula is further maximized when P_i(θ) = Q_i(θ) = 0.5. The maximum value is 0.85² Σ_{k=1}^{m} a_ik², which is just a generalized version of the maximum information provided by the unidimensional IRT model. In words, the information for each item is largest for the people positioned on the 0.5 probability contour line, with the differentiating direction provided by α_i. The test information along a certain direction is the sum of all the item information along that direction:

I_β(θ) = Σ_{i=1}^{I} I_iβ(θ)    (1.8)
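The directional information of Equations 1.6-1.8 can be sketched directly from these definitions (a hypothetical illustration with a no-guessing item, not code from the dissertation); the example shows how information peaks along the item's own direction α_i, as Equation 1.7 predicts:

```python
import numpy as np

def item_information(theta, a, d, beta):
    """Equation 1.6: information of one item at theta along the direction
    whose angle vector is beta (so its direction cosines are cos(beta))."""
    z = 1.7 * (a @ theta + d)
    p = 1.0 / (1.0 + np.exp(-z))           # Equation 1.1 with c = 0
    return 1.7**2 * p * (1 - p) * (a @ np.cos(beta))**2

def test_information(theta, A, d, beta):
    """Equation 1.8: test information is the sum over items."""
    return sum(item_information(theta, A[i], d[i], beta)
               for i in range(len(d)))

a, d, theta = np.array([1.2, 0.3]), -0.2, np.zeros(2)
alpha = np.arccos(a / np.sqrt(a @ a))       # the item's own direction
off = np.array([np.pi / 2, 0.0])            # a direction along axis 2
print(item_information(theta, a, d, alpha)) # largest, per Equation 1.7
print(item_information(theta, a, d, off))   # smaller off the item direction
```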
1.4 Indeterminacies in MIRT Models

The compensatory MIRT model is a combination of the IRT model and the factor analysis model; as a result, it suffers not only from the origin and unit indeterminacies present in both models, but also from the rotational indeterminacy that especially plagues factor analysis. In terms of the model formula, the MIRT indeterminacies come from the fact that there are infinitely many new parameter sets with

θ* = KTθ + o,  (a*)′ = a′T⁻¹K⁻¹,  d* = d - a′T⁻¹K⁻¹o = d - (a*)′o

which satisfy

(a*)′θ* + d* = (a′T⁻¹K⁻¹)(KTθ + o) + d - a′T⁻¹K⁻¹o = a′θ + a′T⁻¹K⁻¹o + d - a′T⁻¹K⁻¹o = a′θ + d

Here K_{m×m} is a diagonal matrix used for adjusting the unit length of each dimension, T_{m×m} is a rotation matrix whose uniqueness is defined by row normalization, and o_{m×1} is a column vector for the origin change. For these new sets, the transformation is described in the order of rotation, unit change, and finally origin change. If this order is changed, the formula for the new sets changes too; however, it can still transform between the same old and new sets. A single matrix, the product of K and T, can perform both the rotation and the unit change; however, the K matrix is often separated out as the rescaling matrix applied after the rotation. This matrix will not be emphasized here, since this study mostly focuses on the rotational indeterminacy, namely the T_{m×m} matrix.

Finding a suitable rotation matrix is as important in MIRT as it is in factor analysis. Common factor analysis describes the rotation as searching for the best simple loading structure to explain the intercorrelations among items or tests; in MIRT modeling, this process is also important for finding an interpretable coordinate system for person proficiency estimates.

The common way to partly resolve the indeterminacies is to place assumption constraints on the θ vector: E(θ) = 0_{m×1} and cov(θ) = I_{m×m}. These constraints greatly simplify the parameter estimation procedures of the commonly used NOHARM and TESTFACT software for MIRT calibration (Fraser, 1988; Bock et al., 2003). However, the zero correlation assumption can easily be violated, since most proficiencies are correlated in reality. For example, most people believe a person's algebraic skill is highly correlated with his geometric skill; some may even doubt the zero correlation assumption between reading skill and mathematical skill. Hence, proficiency estimates obtained with these constraints enforced cannot be directly used for the interpretation of person proficiencies.

1.5 Solution to Rotational Indeterminacy in Factor Analysis

Varimax and Promax are two widely used methods for solving the rotational indeterminacy in factor analysis. The first is an orthogonal rotation, while the second is an oblique rotation that builds on the result of the Varimax rotation. Both rotations try to rotate the item discrimination matrix to a simple structure matrix (Thurstone, 1947). Suppose the item discrimination matrix is A_{I×m}, with I items and m dimensions. The Varimax method searches for the orthogonal rotation that maximizes the sum of the squared loading variances across all factors, with adjustments from the item communalities. Mathematically, it results in the matrix A_Varimax = AT, with the constraint T′T = I_{m×m} and the following criterion maximized for A_Varimax (Kaiser, 1958):

V = Σ_{k=1}^{m} [ (1/I) Σ_{i=1}^{I} (a_ik/h_i)⁴ - ((1/I) Σ_{i=1}^{I} (a_ik/h_i)²)² ]    (1.9)

where a_ik is the loading of the ith item on the kth dimension and h_i² = Σ_{k=1}^{m} a_ik² is the communality of the ith item.
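For reference, the following is a sketch of the standard SVD-based iterative algorithm commonly used to maximize the Varimax criterion, with the communality normalization implied by the a_ik/h_i terms of Equation 1.9. It is a generic illustration under those assumptions, not the implementation used by NOHARM or TESTFACT:

```python
import numpy as np

def varimax(A, max_iter=500, tol=1e-8):
    """Rotate the items x dimensions loading matrix A to maximize the
    Varimax criterion (Equation 1.9).  Rows are scaled by h_i before
    rotation (Kaiser normalization) and rescaled afterwards."""
    h = np.sqrt((A**2).sum(axis=1))            # h_i, sqrt of communalities
    An = A / h[:, None]
    n, m = An.shape
    T = np.eye(m)
    obj = 0.0
    for _ in range(max_iter):
        L = An @ T
        # Gradient of the criterion with respect to the rotation
        G = An.T @ (L**3 - L @ np.diag((L**2).sum(axis=0)) / n)
        u, s, vt = np.linalg.svd(G)
        T = u @ vt                              # nearest orthogonal matrix
        if s.sum() < obj * (1.0 + tol):         # converged
            break
        obj = s.sum()
    return (An @ T) * h[:, None], T             # rotated loadings and T
```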
The Promax method is based on the result of the Varimax method, and the new axes are free to take any position in the multidimensional space (Hendrickson & White, 1964). First, a target matrix P is defined using the tth power of each element of A_Varimax, where t is commonly from two to four:

P_ik = |a_ik^{t+1}| / a_ik    (1.10)

Then the least-squares fit of A_Varimax to the target matrix is found by the following formula, which is similar to regression coefficient estimation:

T = (A_Varimax′ A_Varimax)⁻¹ A_Varimax′ P    (1.11)

Finally, after the columns of T are normalized, A_Promax = A_Varimax T.

Finch (2006) pointed out that both methods are effective in identifying which item is associated with which factor; however, the Promax rotation performs better at matching the simple structure. Both methods are popular in practice, and both are purely mathematical criteria for matching the rotated loading matrix to the simple structure. If these methods are applied to the MIRT model, the disadvantage is that the results of these fixed procedures can seldom be adjusted by evidence from item content. Furthermore, the rotated loadings may serve to identify the grouping of different items, but there is no guarantee that the loading matrix after rotation recovers the true item discrimination power for the MIRT model.
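A minimal sketch of Equations 1.10-1.11, reusing the varimax function from the previous sketch; the sign-preserving power for the target and the column normalization follow the description above, though published Promax implementations differ in normalization details:

```python
import numpy as np

def promax(A, t=3):
    """Oblique Promax rotation built on the Varimax result."""
    Av, _ = varimax(A)                          # from the sketch above
    # Equation 1.10: P_ik = |a_ik^(t+1)| / a_ik, i.e. |a|^t with sign kept
    P = np.abs(Av)**t * np.sign(Av)
    # Equation 1.11: least-squares fit of A_Varimax to the target P
    T = np.linalg.solve(Av.T @ Av, Av.T @ P)
    T = T / np.sqrt((T**2).sum(axis=0))         # normalize columns of T
    return Av @ T, T
```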
1.6 Interdependency between Proficiency Correlation and Item Discrimination

M. Wang (1986) pointed out that, in the general case, a new interpretable θ* = Lθ + μ can be used such that E(θ*) = μ_{m×1} and cov(θ*) = Σ_{m×m}, provided LL′ = Σ; one possible choice of L is obtained by the Cholesky decomposition. Accordingly, (a*)′ = a′L⁻¹ and d* = d - (a*)′μ. For a test calibration where there is no reference group, it is reasonable to set zero mean and unit variance for the proficiency on each dimension, just like the origin and unit solution for the unidimensional model. In this situation, the constraints for θ* should actually be E(θ*) = 0_{m×1} and cov(θ*) = R_{m×m}, where R_{m×m} is the correlation matrix among the proficiencies.

Now the biggest problem is that there is no presumed value for the R_{m×m} matrix, which is unknown most of the time. Obviously, if the item and person parameter estimation is based on the convenient constraints E(θ) = 0_{m×1} and cov(θ) = I_{m×m}, the estimates are essentially already adjusted for the correlation matrix, since a′ = (a*)′L and θ = L⁻¹θ* satisfy those constraints. As pointed out by Reckase (1997), in this case "the observed correlations among the item scores will be accounted for solely by the a-parameters" (p. 275). In situations where R_{m×m} is unknown, it is even harder to obtain the a* and θ* estimates.

Researchers should therefore be cautious when interpreting the a vector as the item discrimination power for the proficiencies, since a is paired with θ, not with the realistically correlated construct estimate θ*, which is what interests test developers. In particular, it is very common to use a vectors to define item clusters for the inference of dimensionality and the computation of composite scores (Miller & Hirsch, 1992; Luecht & Miller, 1992). Since the correlation-adjusted a depends on both a* and L, we can never be certain whether the item clustering inferred from a is due to similar item direction cosines or to highly correlated proficiencies in the person population. Figure 1.2 gives the item vector plots in a two-dimensional space when the proficiency correlation is at different levels. In order to preserve the product a′θ, namely the invariance property of the MIRT model, when the proficiency coordinates are transformed, so are the item vectors.

[Figure 1.2. Representation of the Characteristics of 40 Items in a Two-Dimensional Space when the Proficiency Correlation is 0.0, 0.2, 0.4 and 0.6]

From the figure, it is also easy to see that the angle between item clusters increases as the proficiency correlation increases.

Commonly, two assumptions can be adopted to interpret and use the information from correlation-adjusted a vectors. The first is to assume there is no correlation between the multidimensional proficiencies. Under this assumption, the a vector may be the same as a*, which can be interpreted as the weights of the orthogonal proficiencies and used for composite score calculation. The other is to assume a simple structure for a*, and to try to find the correlation matrix among the elements of θ*. The use of these two assumptions is very similar to their use in the Varimax and Promax methods in factor analysis, except that θ* may only be interpreted as the primary factor score in factor analysis, while the more general view in MIRT is to interpret θ* as a construct score, which may be a weighted composite of several raw proficiencies. Without these two useful assumptions, it is hard to separate the proficiency correlation from the item discrimination matrix, because both are unknown but interdependent during the calibration of the person-by-item response matrix. As is well known, the first assumption is very useful in data generation and parameter estimation, while the second better serves the purpose of interpretation and score reporting. This raises an interesting research question: how can the construct estimate θ* be obtained for better interpretation when the proficiency correlation matrix is unknown? Can it be converted from the uncorrelated θ estimate?
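Before turning to existing approaches, a small numerical check (with hypothetical values for R and a*) makes the interdependency concrete: with L the Cholesky factor of a chosen correlation matrix, the pairs (a, θ) and (a*, θ*) produce identical values of a′θ, so the response data alone cannot separate them:

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])                     # assumed true correlation
L = np.linalg.cholesky(R)                      # LL' = R
theta_star = rng.multivariate_normal(np.zeros(2), R, size=1000)
a_star = np.array([0.9, 0.4])                  # assumed "true" discrimination

theta = theta_star @ np.linalg.inv(L).T        # uncorrelated coordinates
a = L.T @ a_star                               # correlation-adjusted a

print(np.allclose(theta_star @ a_star, theta @ a))  # True: a'theta invariant
print(np.corrcoef(theta.T).round(2))           # approximately the identity
```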
The sim- ulation study was conducted to detect the effect of balance/ unbalanced item design and sampling errors on this 0* estimation. Finally, the empirical analysis using the Michigan Educational Assessment Program (MEAP) test data are shown. 17 CHAPTER 2 Construct Estimation Using Projection 2.1 Transformation in the Orthogonal Coordinate System Commonly, the orthogonal Cartesian coordinate system is used for positioning each person’s proficiency vector in the m-dimensional space. Although the axes in the coordinate system are orthogonal to each other, correlation can be allowed among the coordinates; therefore, besides the uncorrelated proficiencies, the correlated ones can also be fully represented in this orthogonal system. That is to say, there is no need to turn to the oblique system, where axes are not restricted to be orthogonal to each other. Linear transformation can be performed for points represented by this coordinate system. In the proficiency context, it is defined as the transformation, which can be denoted as the T matrix, from 0 coordinates to 0* coordinates, and the transformation is only involved with the elements of 0 that are of the first degree. If the number of elements in 0* is the same as in 0, this transformation can also be regarded as the rotation of the coordinate system. Linear transformation and the rotation of the coordinate system function for the same purpose, and the difference is whether the rotation is made on points relative to fixed axes or on axes relative to fixed points. For this reason, the new proficiency coordinates, defined as 0* 2 mem0, can be 18 interpreted as the linear transformations of points in the old system, or as the fixed points represented by the new system. If cov(0) = I is assumed, cov(0*) = Tcov(0)T’ = TT’ (2.1) The diagonal entry in TT’ matrix determines the variance for each element of 0*, while the off-diagonal entry provides information for the covariance between elements of 0*. If TT’ = K and K is a diagonal matrix, this transformation is called an or- thogonal transformation, and it retains the zero correlation between elements of 0* in the above situation. Moreover, when TT’ 2 I , it is called an orthonormal transforma- tion. The advantage for this transformation is that it is an isomorphic transformation, which preserves the structure between vectors or points. More specifically, the dis— tance of points to the origin does not change, and the angle or scalar product between vectors remains the same, because the configuration is not altered by applying this transformation matrix (Thurstone, 1947). The existence of this transformation leads to the fact that, besides the zero mean and identity variance-covariance matrix con- straints on person proficiency estimates, additional constraint is necessary in order for the parameters to be uniquely identified. Some examples are the QR decomposi- tion for the NOHARM software (Stoer & Bulirsch, 2002) and the Varimax criterion to match the simple structure. Both of them are constraints forced on the item discrimination matrix. The orthogonal projection to vectors is another concept that is very useful for defining the transformation matrix in this study. Suppose there are two vectors 'u. 
2.2 Construct Estimation

2.2.1 Step 1: Calibration

With the local independence assumption of MIRT, the joint probability of the person-by-item data matrix is obtained by multiplying the probabilities of the person-by-item interactions across all items and persons. Therefore, the likelihood function for the response data under the model is

L(U | A, d, c, θ) = ∏_{j=1}^{J} ∏_{i=1}^{I} P(u_ij = 1 | θ_j)^{u_ij} [1 - P(u_ij = 1 | θ_j)]^{1-u_ij}    (2.4)

where J stands for the number of persons, I for the number of items, and P(u_ij = 1 | θ_j) is defined by Equation 1.1. According to maximum likelihood theory, the estimate of (A, d, c, θ) is the set that maximizes the likelihood function in Equation 2.4. However, in order to avoid heavy computation, variations of the estimation procedure are implemented in the two commonly used MIRT programs, NOHARM and TESTFACT. Both programs use the constraint E(θ) = 0_{m×1} and the normal ogive model configuration, and both need the c vector as input if guessing is assumed. The guessing vector can be estimated with the BILOG software (Zimowski et al., 2003) through a unidimensional IRT calibration of the complete multidimensional data (Bock et al., 2003). Both programs give the unrotated item discrimination matrix by default, and both can also provide the Varimax solution for the orthogonal proficiency structure and the Promax solution for the oblique proficiency structure. These three versions of the item discrimination matrix are all linear transformations of one another. There are also some differences in estimation constraints, estimation procedures and initial coordinate system setup between the two programs (Fraser & McDonald, 1988; Bock et al., 1988; Reckase, 2009).

In the NOHARM software, the default setting for cov(θ) is I_{m×m}; however, it can be changed to other configurations. This software uses polynomials to approximate the normal ogive values, and applies an unweighted least-squares criterion with a quasi-Newton algorithm to find the best match between the observed and model-predicted population estimates of the joint probability of correctly answering item pairs. It constrains the estimated item discrimination matrix of the first m items to a lower triangular structure, through which the coordinate system for the proficiencies is constructed. The resulting discrimination matrix is similar to the R matrix obtained when the QR decomposition method is applied to the original item discrimination matrix (Stoer & Bulirsch, 2002). The disadvantage is that the item discrimination matrix estimate is not very stable or accurate when some of the first m items measure the same construct, and this software does not provide any θ estimates.
The disadvantage 21 is that the item discrimination matrix estimate is not very stable or accurate when some of the first m items seem to measure the same construct, and this software does not provide any 0 estimate. The TESTFACT software directly uses the constraint of cov(0) = I mxm and simplifies the estimation procedures by applying the EM algorithm to maximize the marginal probability (Beck et al., 1988). J L(,U|A d) = H/L (uj|e)g( 6)d6 (2.5) where L() is the likelihood function, uj is the response vector by jth person, and 9(0) is the distribution of 0, which is usually assumed to be the multivariate standard normal distribution. It is common that many people may have the same response strings, so the likeli- hood can also be written as the multinomial form: L(UlA,d) = 1",! ![/ L(ull0)g(0)d0]r1 [/ L(u3|0)g(0)d0]rs (2.6) rllrg. - - -r3 where N is the total number of persons, 3 is the number of unique response strings, and r1, - - - ,rs stand for the observed frequency for each unique response string. Clearly, the person proficiency 0 is integrated out in Equation 2.6, and only item discrimination and difficulty are the unknown parameters that influence the value of the likelihood function. The EM algorithm is applied to maximize the log version of the likelihood and the starting values are obtained from the principal compo- nent analysis for the guessing-adjusted tetrachoric correlation matrix of responses. The integration in Equation 2.6 can only be approximated numerically, and different numbers of quadrature points lead to different degrees of accuracy for the integral value, which will result in different parameter estimation errors. The disadvantage of this software is that the result is much influenced by the quality of starting values, and the EM algorithm takes some time to converge. For example, the computer for 22 the later simulation study was equipped with the Intel Pentium D processor of CPU 3.49 GHZ speed and 1.99 GB RAM, and it took more than half an hour for each calibration run. After the item parameters are estimated, they are regarded as fixed and the per- son estimates are calculated under the Bayesian framework. Two score options are available in the TESTFACT software: The MAP (Maximum A Posteriori) score is calculated as the mode of L(uj; 0) 9(0), and the EAP (Expected A Posteriori) score, which will be used in this study, is calculated as 5,- : feflemjde _ f9L(u';9)9(9)d9 ‘ Maj-Immeda (2’7) The EAP score is much preferred because it only needs easy computation, incor- porates the prior information, and avoids infinite scores when response strings are all 0’s, all 1’s, or inconsistent with the model. The EAP score is also much more reliable than the maximum likelihood estimate (Muraki & Engelhard, 1985). The characteristic of the EAP score is that the prior normal density information gives large weights to the 0 values close to the center, so the EAP score is biased toward the mean when the number of items are finite (Muraki & Engelhard, 1985; Li & Lissitz, 2000). This leads to the fact that the mean of the estimated scores is approximately the same as the mean of the proficiency distribution; however, the standard deviation of the estimated scores underestimates the standard deviation of the proficiency distribution (DeMars, 2006). Although the EAP score is not com- parable to the person parameter as the widely used unbiased maximum likelihood estimation (MLE) score, these two scores give roughly the same rank ordering for people. 
These EAP scores are treated as the θ estimates for the later projection step. It should be noted that, in this data calibration step, it is assumed that the model fits the data well and that the number of dimensions has been confirmed by data analysis and expert judgement on item content.

2.2.2 Step 2: Cluster Analysis

This step identifies item clusters based on the item discrimination estimates from the MIRT calibration (Miller & Hirsch, 1992). The purpose of cluster analysis is to allocate all the items into different clusters, with the within-cluster variation minimized and the between-cluster variation maximized. Items are clustered together because they are assumed to measure the same construct. Levine and Drasgow (1982) suggested that items in one test can be analyzed as interrelated blocks for appropriateness measurement. Miller and Hirsch (1992) also pointed out that "each item set can be treated as a different unidimensional composite of the abilities represented in the space, and the amount of spread among the vectors in the same set reflects the degree to which unidimensionality holds for that set". The study by Reckase et al. (1988) showed that items having the same angles with the coordinate axes meet the unidimensionality assumption. In this item cluster context, the orientations of item vectors in the same cluster should be very similar in the multidimensional space, so that these items can be assumed to measure the same construct; accordingly, items in different clusters are supposed to measure different constructs. The cluster analysis in this step groups items according to the similarity and dissimilarity among the item vectors.

For the cluster analysis, either parametric (Miller & Hirsch, 1992) or nonparametric (Roussos et al., 1998) proximity measures can be used to construct the dissimilarity matrix that serves as the input to the cluster analysis. The difference is that the first uses the angles between item vectors, while the second is based on contingency tables for item pairs after people have been partitioned into groups of equal proficiency. After the dissimilarity matrix is obtained, the distance between hypothetical clusters also needs to be defined as a criterion for grouping similar elements. Kim (2001) found that Ward's method, which uses the minimum variance as the distance, combined with the parametric proximity measure yields stable classifications under various conditions, as opposed to other methods or nonparametric proximity measures. For this reason, Ward's method and the parametric proximity measure are specified for the cluster analysis.

The cosine of the angle between a pair of item vectors is calculated by

cos α_{ii′} = a_i′a_{i′} / (|a_i| |a_{i′}|) = (cos α_i)′(cos α_{i′})    (2.8)

and then converted to a degree angle. If cos α_{ii′} = 1, the two items measure exactly the same construct; if cos α_{ii′} is close to 0, the discrimination vectors of the two items are orthogonal to each other and they measure two completely different constructs. The angular distances for all item pairs form the dissimilarity matrix used as the input matrix, and the cluster analysis then produces a dendrogram showing the hierarchical tree relationship among the items, from which the number of clusters and the item grouping are determined by "eyeballing" the hierarchy and linkage of the items.
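This step maps directly onto standard tooling; below is a sketch using scipy's hierarchical clustering with Ward's method on the angular distance matrix of Equation 2.8. It is a generic illustration with assumed function names, not the software used in the dissertation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_items(A, n_clusters):
    """Ward clustering of items on angular distances (Equation 2.8).
    A: items x m matrix of discrimination estimates."""
    C = A / np.linalg.norm(A, axis=1, keepdims=True)   # direction cosines
    cos = np.clip(C @ C.T, -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))                # angular distances
    iu = np.triu_indices_from(angles, k=1)
    Z = linkage(angles[iu], method='ward')             # condensed input
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    return labels, Z     # Z can be passed to scipy's dendrogram()
```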
Because of this subjectivity in the cluster analysis and the inevitable sampling errors, the true item grouping is sometimes hard to recover, especially when it is obtained only empirically. Therefore, after the cluster analysis, expert judgement on the substantive meaning of the clusters can be considered in order to adjust the items in each cluster. After this step, all the items are allocated mutually exclusively and exhaustively into different clusters, and the items in each group are regarded as measuring the same construct.

It should be noted that, theoretically, the cluster analysis should be conducted on the actual item discrimination vectors a*, which are hard to obtain from the MIRT calibration. However, the item clustering pattern is often still obvious when the correlation-adjusted item discrimination vectors a are used in the analysis, in which case all the items appear more clustered together than they would with the actual a*.

2.2.3 Step 3: Reference Composite

This step calculates the reference composite vector, also called the "centroid" vector, for the items within each cluster (M. Wang, 1985, 1986). This vector has the minimum average distance to all the item vectors in the cluster and is regarded as representing the most discriminating direction for these items. It has been applied to explain the unidimensional model approximation to the multidimensional model for whole-test data; however, little research has emphasized its use for the items in the same cluster.

In previous studies, the reference composite vector is the eigenvector associated with the largest eigenvalue of the A′A matrix, where A is the discrimination matrix for all items in the test. Similarly, the reference composite vector for the lth cluster, denoted here as w_l, is defined as the eigenvector associated with the largest eigenvalue of the A_l′A_l matrix, where A_l is the discrimination matrix for all items in the lth cluster. This reference composite vector is supposed to point in the most discriminating direction for the items in the cluster. Since the eigenvector is already normalized, it can also be treated as the direction cosines of the reference composite vector:

cos w_l = w_l = (w_l1, ⋯, w_lm)′    (2.9)

2.2.4 Step 4: Projection

This step projects the uncorrelated θ estimate onto the reference composite vector of each cluster to get the correlated θ* solution. According to Equation 2.3, the construct estimate based on the reference composite for the lth cluster is calculated by

θ*_l = w_l′ θ    (2.10)

Suppose the cluster analysis in Step 2 results in m* clusters. All m* elements of the θ* solution can then be obtained as linear transformations of the θ estimate. The transformation matrix is the eigenvector matrix, or direction cosines matrix, with dimensions m* × m; stacking the w_l′ as rows,

θ* = [ w_1′ ; ⋯ ; w_m*′ ] θ = [ (cos w_1)′ ; ⋯ ; (cos w_m*)′ ] θ    (2.11)

Therefore, the variance-covariance matrix of θ* is given by

cov(θ*) = [ w_1′ ; ⋯ ; w_m*′ ] cov(θ) (w_1, ⋯, w_m*)    (2.12)

If cov(θ) = I_{m×m} as assumed,

cov(θ*) = (w_i′w_j)_{ij} = ((cos w_i)′(cos w_j))_{ij}    (2.13)

Ideally, the diagonal elements of the cov(θ*) matrix are all 1's, and the covariances are determined only by the closeness of the reference composite vectors. However, due to sampling errors and the biased EAP score, cov(θ) may not be the identity matrix assumed. Although the rescaling matrix K can possibly be applied to adjust the unit length of each θ dimension to one, there may still be a small amount of correlation between the elements of the θ estimate.
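A sketch of Steps 3 and 4 together (Equations 2.9-2.13); since eigenvectors are determined only up to sign, the sign is fixed here so the composite points in the positive direction, and the usage lines reference the hypothetical names from the clustering sketch above:

```python
import numpy as np

def reference_composite(A_l):
    """Step 3, Equation 2.9: eigenvector of A_l'A_l for the largest
    eigenvalue.  A_l: discrimination matrix of the items in cluster l."""
    vals, vecs = np.linalg.eigh(A_l.T @ A_l)   # eigenvalues ascending
    w = vecs[:, -1]                            # largest-eigenvalue vector
    return w if w.sum() >= 0 else -w           # fix the arbitrary sign

def construct_estimates(theta_hat, composites):
    """Step 4, Equations 2.10-2.13: project persons x m theta estimates
    onto each cluster's reference composite."""
    W = np.vstack(composites)                  # m* x m transformation matrix
    return theta_hat @ W.T, W @ W.T            # scores and implied cov

# Hypothetical usage with the clustering sketch above:
# labels, _ = cluster_items(A_hat, n_clusters=3)
# ws = [reference_composite(A_hat[labels == l]) for l in np.unique(labels)]
# theta_star, cov_star = construct_estimates(theta_hat, ws)
```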
2.3 Dimensionality and Number of Clusters

The number of item clusters can be different from the number of proficiency dimensions in the MIRT model (Miller & Hirsch, 1992; Reckase, 2009), which is why they are denoted m* and m separately. However, the empirical detection of item clusters and the empirical detection of proficiency dimensions are dependent on each other. On the one hand, the angular distance matrix for the cluster analysis is based on item discrimination estimates from a MIRT calibration with a certain number of dimensions assumed for the person proficiencies, which should at least result in a good fit between model and data. M. Wang (1985, 1986) showed that if the unidimensional model is used to analyze a multidimensional test, it actually estimates one composite score for the cluster consisting of all items in the test; the implicit assumption of this approximation is that only one cluster results from the cluster analysis based on the multidimensional calibration. On the other hand, Miller and Hirsch (1992) and Roussos et al. (1998) gave examples of using item cluster analysis to infer the dimensionality of the person proficiencies required by the test. Reckase (2009) points out that substantive meaning should be carefully scrutinized when these two numbers are interpreted, which means that expert judgment on item content is indispensable in the dimension and cluster determination process.

When m* < m, the test may, for example, require reading and mathematical computation skills while all the items measure a similar weighted composite of these two skills. In this situation, the high-dimensional proficiencies are projected into a lower-dimensional space, and there is certainly some loss of information, which Reckase and Hirsch (1991) warned against. However, if the direction cosines of the item vectors are very similar or the proficiencies are highly correlated, the use of a low-dimensional solution for the high-dimensional test data is justified.

When m* = m, some items in the test may measure the reading skill and the other items the computation skill. More generally, it can also be the case that some items measure one weighted composite of these two skills while the other items require the same skills but with different weights. The projection solution in this situation actually chooses a different but interpretable coordinate system for the construct estimates, and there is no information loss in the projection.

When m* > m, the first step is to check whether the dimensionality of the MIRT model needs to be increased. Since this study assumes a good model-data fit and assumes that experts, after scrutinizing the item contents, have confirmed that only these proficiency dimensions are required for the test, the possibility of increasing the number of proficiency dimensions is skipped here. The m* > m situation can then arise when more items are added to the test of the previous m* = m situation, and they measure a composite score with weights different from those of the other items. In this case, the elements of θ* are linearly dependent, and any element can be inferred from any other m elements of the θ* solution. Therefore, there is no information increase or loss in the projection. Moreover, each of the m* construct scores corresponds specifically to one item cluster.
Moreover, each of these m* construct scores corresponds specifically to one item cluster. In short, all three situations are possible and common in real test settings, and attention should be paid when the θ* estimate is calculated and interpreted.

2.4 Characteristics of Construct Estimates

Each element in θ* can be regarded as the subscore for the items in the same cluster and can be used for score reporting after the scaling process. There are several advantages of using this θ* solution.

First, θ* is invariant to any orthonormal rotation of the coordinate system representing the θ estimates. As mentioned previously, the relative distances between points or vectors are not altered by this transformation. In theory, any set of item and person estimates with the constraints E(θ) = 0_{m×1} and cov(θ) = I_{m×m} leads to the same θ* solution, no matter what orthonormal rotation the coordinate system takes. This is easily shown in Equation 2.14, where θ̃* indicates the values in the new coordinate system obtained with an orthonormal rotation matrix T, so that θ̃ = Tθ and w̃_l = Tw_l:

θ̃* = (w̃_1, ..., w̃_{m*})' θ̃ = (Tw_1, ..., Tw_{m*})' (Tθ) = (w_1, ..., w_{m*})' T'T θ = (w_1, ..., w_{m*})' θ = θ*    (2.14)

Therefore, there is no need to decide which rotation of the item discrimination matrix should be used for the MIRT calibration: the one with the special lower diagonal, the one after the Varimax rotation, or any other choice.

Second, θ* is a vector containing several scores with regard to subsets of items, and these subscores give meaningful orderings of people, with interpretations specific to the constructs measured by the different item clusters. These subscores are very important for people to know their strengths and weaknesses in each subarea, which cannot be achieved by a simple unidimensional calibration of the whole test data when there is more than one item cluster.

Third, θ* can be interpreted as a composite score, and it allows correlations among proficiencies. Due to the interdependency between proficiency correlation and item discrimination, it is hard to separate them to recover the true parameters. The solution here is to use the eigenvector matrix as a possible oblique transformation matrix and obtain the projected construct estimates in the most discriminating directions of the different item clusters. This solves the problem that the orthogonal θ estimates cannot be used directly for interpretation or score reporting, since it is difficult to give substantive meanings to these uncorrelated proficiencies.

Fourth, the elements in the θ* solution borrow information from each other, especially when the proficiencies are correlated. This should actually be credited to the advantage of MIRT over IRT, since MIRT estimates these proficiencies simultaneously rather than estimating them separately with unidimensional IRT calibrations. Hence, the θ* solution is more reliable than the estimates given by the unidimensional model.

Fifth, the θ* solution depends on the grouping of items, which can also incorporate expert judgement to reduce the effect of sampling errors on the item clustering. Therefore, this method is preferred to purely mathematical criteria, such as the Varimax and Promax, which have fixed procedures for obtaining the transformation matrix without any consideration of item contents.

Sixth, the transformation from the θ estimate to the θ* solution clearly explains the relationship and difference between the proficiency dimensions and the item clusters for multidimensional tests, especially when their numbers are different.
It also gives a rationale for when tests that are sensitive to differences on multiple dimensions can be fit by a unidimensional IRT model.

Finally, this projection method requires little extra work after the MIRT calibration. It applies the cluster analysis to the angular distance matrix based on the item estimates, and then obtains the transformation matrix by calculating the eigenvector corresponding to the largest eigenvalue of the A_l'A_l matrix for each item cluster.

Given all the above advantages, the θ* solution is easy to calculate and ready for interpretation and score reporting.

CHAPTER 3

Simulation Study

This simulation study is aimed at assessing the accuracy and stability of the θ* estimates and comparing them with the NC subscores and the unidimensional θ_u estimates described in Luecht and Miller (1992). Due to the interdependency between proficiency correlation and item discrimination, this simulation study simply assumed uncorrelated proficiency coordinates and set the direction cosines of the three reference composite vectors as close as possible to the directions (1, 0, 0), (0, 1, 0) and (1/√3, 1/√3, 1/√3). Therefore, the correlations among the θ* parameters are built in through the angles between these reference composite vectors instead of into the θ parameters.

3.1 Methods

3.1.1 Parameter Simulation

The item parameters were simulated from distributions commonly used in simulation studies. The generalized discrimination was assumed to follow a lognormal distribution where log(MDISC_i) had a mean of 0 and a standard deviation of 0.5. The within-cluster angular variation was assumed to be 15°, which was suggested by Roussos et al. (1998) for the approximate simple structure. Therefore, the first two angles for each item were assumed to vary uniformly within 0° to 15°, or equivalently -7.5° to 7.5° around the corresponding reference composite, with the constraint that the sum of the squared cosines of the first two angles be less than 1; the last angle was then determined by (cos α_i)'(cos α_i) = 1. Since all the discrimination parameters were assumed to take positive values, the directions of item vectors close to an axis were more restricted than those somewhat distant from all the axes. The generalized difficulty B_i was sampled from a normal distribution with mean 0 and standard deviation 0.75, and then d_i was calculated as -B_i × MDISC_i. All the c_i's were set to 0 in order to remove the guessing effect.

The person parameter matrix was simulated from the multivariate normal distribution with zero mean vector and identity variance-covariance matrix, and the same matrix was applied to all replications in both designs. The probability matrix for the person-item interactions was calculated by Equation 1.1 with c = 0, and then it was compared with an equal-size matrix whose elements were randomly generated from the standard uniform distribution on [0, 1]. If the simulated number was less than or equal to the probability, the corresponding response was assigned 1, and 0 otherwise. This step resulted in the dichotomous item score matrix.
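A compact sketch of this generating process is given below. It assumes the compensatory model of Equation 1.1 with c = 0, i.e. P(u_ij = 1) = 1 / (1 + exp(-(a_i'θ_j + d_i))); the variable names are illustrative, not taken from the dissertation's own code.

```python
import numpy as np

def simulate_responses(A, d, theta, seed=0):
    """Generate dichotomous responses under the compensatory MIRT model
    (Equation 1.1 with c = 0).

    A     : (n_items, m) discrimination matrix
    d     : (n_items,) intercept parameters
    theta : (n_persons, m) proficiencies
    """
    rng = np.random.default_rng(seed)
    logits = theta @ A.T + d                 # n_persons x n_items
    prob = 1.0 / (1.0 + np.exp(-logits))
    # Response is 1 if the uniform draw falls at or below the model
    # probability, 0 otherwise.
    return (rng.uniform(size=prob.shape) <= prob).astype(int)

# Usage with the study's layout (5,000 persons, 45 items, 3 dimensions):
# U = simulate_responses(A, d, theta)   # A: (45, 3), d: (45,), theta: (5000, 3)
```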
3.1.2 Simulation Design

Each dataset contained responses from 5,000 persons to 45 items; one example of a 45-item test is the Collegiate Assessment of Academic Proficiency from the ACT. In order to test the effect of balanced versus unbalanced item numbers across clusters, the numbers of items in the three clusters were set under two designs, balanced (15-15-15) and unbalanced (10-15-20), labeled Design 1 and Design 2 respectively. The unbalanced design treated 10 as the fewest items used for any cluster, following most of the simulation analyses in Reckase (2009). In order to reduce the effect of sampling errors, 50 replication datasets were created for each design.

3.1.3 Calibration and Projection

Under the MIRT framework, the TESTFACT software, with the Promax rotation option specified in the command, was used to calibrate each dataset. As mentioned in the previous chapter, the Promax rotation is based on the result of the Varimax rotation. Although the Promax option was requested, the software still provided the item and person estimates with the Varimax rotated loadings as the starting values; additionally, it provided the estimated proficiency correlation matrix obtained from the Promax rotation. The cluster analysis was then performed on the angular distance matrix to determine the grouping of items. Although the true clustering of items was known in the simulation study, the cluster analysis was still conducted for the sake of the integrity of the whole procedure. Based on the item grouping, the EAP estimates were projected onto the reference composite vector for each item cluster.

Following the Luecht and Miller (1992) method, the BILOG software was used to get a unidimensional estimate separately for each item cluster: the 45 items were divided into three subtests and analyzed separately. In order to make the unidimensional scores more comparable to the multidimensional projected EAP scores, EAP estimates rather than maximum likelihood estimates were requested for the unidimensional calibration. In both software packages, the convergence criterion was set to 0.005, and the number of quadrature points for both the EM algorithm and the EAP score estimation was left at the default.

3.1.4 Evaluation Criteria

The evaluation was mostly based on descriptive statistics, such as the mean, standard deviation, correlation, Bias, Root Mean Squared Error (RMSE), and scatter plots.

First, the recovery of the item and person parameters was analyzed by the average correlation across replications and the average Bias/RMSE across items and persons. The formulas for Bias and RMSE are given by Equations 3.1 and 3.2, where η is any item or person parameter, η̂_r is its estimate from the rth replication, and R is the total number of replications. Bias and RMSE are thus the average raw or squared difference between the estimated and true parameter across replications. Large deviations were expected in the person proficiency recovery, since EAP scores obtained under the Bayesian framework are known to be biased toward the mean.

Bias(η) = (1/R) Σ_{r=1}^{R} (η̂_r - η)    (3.1)

RMSE(η) = √( (1/R) Σ_{r=1}^{R} (η̂_r - η)² )    (3.2)

Second, the variation of the θ* estimates across replications was calculated to investigate the stability of these estimates.

Third, the correlations among the θ* estimates, the unidimensional θ_u estimates and the NC subscores were computed. All three scores were expected to be highly correlated and to give roughly the same rank order of people.

Fourth, the recovery of the true θ* by both the θ* estimates and the θ_u estimates was evaluated according to the same three criteria: correlation, Bias and RMSE. The recovery efficiencies of these two estimates were then compared separately for each cluster.

Finally, the correlation matrix among the elements of the θ* estimates was compared with those obtained from the Promax method and from the unidimensional θ_u estimates, with reference to the correlations among the θ* parameters.
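These two criteria are straightforward to compute; a small sketch follows, with illustrative array shapes rather than the dissertation's own code.

```python
import numpy as np

def bias_rmse(estimates, truth):
    """Bias and RMSE across replications (Equations 3.1 and 3.2).

    estimates : (R, ...) array of estimates, one slice per replication
    truth     : (...) array of true parameter values
    """
    err = estimates - truth              # broadcasts over replications
    bias = err.mean(axis=0)
    rmse = np.sqrt((err ** 2).mean(axis=0))
    return bias, rmse

# e.g. intercepts for 45 items over 50 replications:
# bias, rmse = bias_rmse(d_hat_all, d_true)   # d_hat_all: (50, 45)
```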
3.2 Results

3.2.1 Parameter Estimation

The simulated item discriminations, difficulties, generalized discriminations, generalized difficulties and directional angles (in degrees) with each axis are shown in Tables 3.1 and 3.2 for the balanced and unbalanced designs respectively. These parameters were only samples from the commonly used distributions, and they were regarded as the parameters for the following simulation study.

It is clear that, in both designs, items in cluster 1 mostly measure the proficiency on the first dimension, and items in cluster 2 mostly discriminate the proficiency on the second dimension. However, unlike the items in the first two clusters, the items in cluster 3 mostly discriminate a roughly equally-weighted composite of the proficiencies on all three dimensions. This is also obvious from the directional angles: for items in the first two clusters, the angle with one axis is close to zero; for items in the third cluster, none of the angles is close to zero. It should be noted that this projection method is at a preliminary stage; therefore, this simulation study assumed mixed structure items but uncorrelated proficiency coordinates, in order to investigate the efficiency of the method based on the item composite effect alone.

According to the method in Subsection 2.2.3, the direction cosines of the three reference composite vectors are (0.996, 0.029, 0.081), (0.101, 0.992, 0.081), (0.614, 0.614, 0.496) for Design 1 and (0.996, 0.040, 0.081), (0.083, 0.992, 0.097), (0.616, 0.559, 0.555) for Design 2. These reference composite vectors, together with the item vectors, are shown in Figure 3.1. The arrowed solid lines are the item vectors, and the arrowed dashed lines indicate the reference composite vectors, whose lengths are stretched for a better view. In the figure, the number of coordinate axes indicates the dimensionality of the MIRT model, while the number of reference composite vectors shows the number of subscores necessary for score reporting. Although it is not shown in this figure, it is worth noting that the dimensionality can be different from the number of item clusters.

Table 3.1. MIRT Item Parameters for Design 1

Cluster  Item   a1    a2    a3     d    MDISC    B     α1   α2   α3
   1       1   0.95  0.02  0.04   1.22   0.95  -1.28    3   89   87
   1       2   0.69  0.00  0.07   0.39   0.69  -0.57    6   90   84
   1       3   0.70  0.01  0.01   0.39   0.70  -0.57    1   89   89
   1       4   0.79  0.01  0.07   0.71   0.79  -0.89    5   90   85
   1       5   0.57  0.01  0.01   0.12   0.57  -0.21    1   89   89
   1       6   1.76  0.03  0.07   1.17   1.77  -0.66    3   89   88
   1       7   0.75  0.02  0.15   0.63   0.76  -0.82   11   88   79
   1       8   0.77  0.01  0.02  -0.11   0.77   0.14    2   89   88
   1       9   1.74  0.13  0.24  -0.11   1.76   0.06    9   86   82
   1      10   0.73  0.04  0.00   0.51   0.73  -0.70    3   87   90
   1      11   0.94  0.01  0.20   1.14   0.96  -1.19   12   89   78
   1      12   0.68  0.01  0.10   0.01   0.68  -0.02    8   90   82
   1      13   1.24  0.00  0.02  -1.51   1.24   1.22    1   90   89
   1      14   0.75  0.05  0.12  -0.83   0.76   1.09   10   86   81
   1      15   0.92  0.00  0.00   0.61   0.92  -0.66    0   90   90
   2      16   0.02  1.10  0.02  -1.31   1.10   1.19   89    2   89
   2      17   0.00  0.69  0.01   0.02   0.69  -0.03   90    1   89
   2      18   0.07  0.81  0.04  -1.45   0.81   1.79   85    6   87
   2      19   0.08  0.78  0.03  -0.24   0.79   0.30   84    6   88
   2      20   0.02  0.81  0.19   0.26   0.83  -0.32   88   13   77
   2      21   0.06  1.07  0.10  -0.99   1.08   0.92   87    6   85
   2      22   0.02  1.03  0.01   0.99   1.03  -0.96   89    1   89
   2      23   0.05  0.75  0.10  -0.01   0.76   0.02   87    9   82
   2      24   0.05  0.58  0.11   0.00   0.59  -0.00   85   12   79
   2      25   0.36  1.53  0.13  -0.39   1.58   0.25   77   14   85
   2      26   0.07  0.78  0.05   0.04   0.78  -0.05   85    7   86
   2      27   0.09  1.09  0.08   1.94   1.09  -1.77   85    6   86
   2      28   0.20  1.04  0.14   0.05   1.07  -0.05   79   13   82
   2      29   0.12  0.58  0.06   0.39   0.60  -0.65   78   13   84
   2      30   0.00  0.71  0.04  -1.13   0.71   1.59   90    3   87
   3      31   0.87  0.80  0.51  -0.14   1.29   0.11   47   52   67
   3      32   0.33  0.34  0.44   0.64   0.64  -1.00   60   59   47
   3      33   0.94  0.93  0.68  -0.20   1.49   0.13   51   51   63
   3      34   0.24  0.25  0.21  -0.36   0.40   0.91   54   52   59
   3      35   0.26  0.27  0.22  -0.05   0.43   0.12   53   52   60
   3      36   0.42  0.45  0.50   0.04   0.79  -0.05   58   56   51
   3      37   0.49  0.59  0.43  -0.74   0.88   0.84   56   48   61
   3      38   0.46  0.63  0.61   0.35   0.99  -0.36   62   50   52
   3      39   0.24  0.23  0.28   0.18   0.43  -0.42   57   58   49
   3      40   0.34  0.42  0.44  -0.35   0.70   0.51   61   53   51
   3      41   1.38  1.28  0.85   1.37   2.06  -0.67   48   52   66
   3      42   0.64  0.49  0.60  -0.05   1.00   0.05   51   61   53
   3      43   0.73  0.80  0.66   0.01   1.26  -0.01   55   51   59
   3      44   0.18  0.21  0.25   0.23   0.37  -0.62   62   55   48
   3      45   0.28  0.36  0.28  -0.16   0.54   0.30   58   48   59
Mean           0.51  0.48  0.20   0.07   0.91  -0.07
Std            0.45  0.42  0.22   0.73   0.37   0.77

Table 3.2. MIRT Item Parameters for Design 2

Cluster  Item   a1    a2    a3     d    MDISC    B     α1   α2   α3
   1       1   0.88  0.14  0.11   0.13   0.90  -0.14   11   81   83
   1       2   0.78  0.05  0.08   0.34   0.78  -0.44    7   86   84
   1       3   1.24  0.10  0.26   1.56   1.27  -1.23   13   86   78
   1       4   0.96  0.07  0.13   0.81   0.97  -0.83    9   86   82
   1       5   0.81  0.00  0.14  -0.38   0.82   0.46   10   90   80
   1       6   0.83  0.01  0.02   0.76   0.83  -0.91    1   89   89
   1       7   2.12  0.01  0.03  -1.35   2.12   0.64    1   90   89
   1       8   1.06  0.01  0.02  -0.57   1.06   0.53    1   90   89
   1       9   1.08  0.01  0.03   0.02   1.08  -0.01    1   90   89
   1      10   0.99  0.11  0.19  -0.45   1.01   0.45   12   84   79
   2      11   0.03  0.58  0.10  -0.28   0.59   0.47   87   11   80
   2      12   0.00  0.82  0.01   0.34   0.82  -0.41   90    1   90
   2      13   0.05  0.67  0.07  -0.05   0.68   0.07   86    7   84
   2      14   0.09  0.60  0.09  -0.07   0.62   0.12   81   12   82
   2      15   0.17  1.60  0.31  -1.11   1.63   0.68   84   13   79
   2      16   0.02  0.66  0.04   0.10   0.66  -0.15   88    4   86
   2      17   0.01  1.16  0.09  -0.81   1.16   0.69   89    4   86
   2      18   0.03  1.43  0.06   1.21   1.44  -0.85   89    3   88
   2      19   0.31  1.93  0.21  -0.07   1.96   0.04   81   11   84
   2      20   0.03  0.54  0.10   0.96   0.55  -1.74   87   11   79
   2      21   0.04  0.89  0.03   0.45   0.89  -0.50   87    3   88
   2      22   0.00  0.56  0.00  -0.47   0.56   0.84   90    0   90
   2      23   0.03  0.72  0.03  -0.95   0.72   1.32   88    3   88
   2      24   0.03  0.20  0.02   0.31   0.20  -1.55   82   11   83
   2      25   0.10  1.05  0.08   0.27   1.06  -0.25   85    7   86
   3      26   0.42  0.35  0.48   0.88   0.73  -1.20   55   61   49
   3      27   0.90  0.73  0.89  -1.90   1.46   1.30   52   60   53
   3      28   1.63  1.58  1.25  -2.62   2.59   1.01   51   52   61
   3      29   0.27  0.31  0.39  -0.18   0.57   0.32   62   57   47
   3      30   0.96  1.10  0.87   0.79   1.70  -0.46   55   50   59
   3      31   0.59  0.67  0.71  -0.94   1.14   0.82   59   54   52
   3      32   0.39  0.31  0.28   0.14   0.57  -0.25   47   57   61
   3      33   0.87  0.67  0.72   0.15   1.31  -0.11   49   59   57
   3      34   0.46  0.47  0.54  -0.38   0.85   0.45   57   56   51
   3      35   1.04  0.74  0.89  -0.25   1.56   0.16   48   62   55
   3      36   0.46  0.48  0.51   0.66   0.84  -0.78   57   55   53
   3      37   0.57  0.65  0.58  -0.34   1.05   0.33   57   51   56
   3      38   0.85  0.68  0.74  -1.34   1.31   1.02   50   59   56
   3      39   0.28  0.29  0.24  -0.16   0.47   0.34   53   52   59
   3      40   0.99  0.70  0.84  -0.77   1.47   0.53   48   62   55
   3      41   0.86  0.93  0.71   0.85   1.45  -0.58   54   50   61
   3      42   1.11  0.80  0.99  -0.39   1.69   0.23   49   62   54
   3      43   0.55  0.50  0.52  -1.50   0.91   1.65   53   56   55
   3      44   0.53  0.51  0.67   0.63   0.99  -0.63   58   59   48
   3      45   0.17  0.15  0.22  -0.22   0.31   0.69   58   61   46
Mean           0.57  0.59  0.34  -0.14   1.05   0.05
Std            0.49  0.46  0.34   0.84   0.49   0.77

Figure 3.1. Plots of Item Vectors and Reference Composite Vectors
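The derived quantities in these tables follow the standard MIRT definitions: MDISC_i is the length of the discrimination vector, B_i = -d_i / MDISC_i, and the directional angles are α_ik = arccos(a_ik / MDISC_i). A short sketch (illustrative naming):

```python
import numpy as np

def item_summaries(A, d):
    """Generalized discrimination, generalized difficulty and directional
    angles for each item, as reported in Tables 3.1 and 3.2.

    A : (n_items, m) discrimination matrix
    d : (n_items,) intercepts
    """
    mdisc = np.linalg.norm(A, axis=1)                   # MDISC_i = ||a_i||
    B = -d / mdisc                                      # signed distance to origin
    angles = np.degrees(np.arccos(A / mdisc[:, None]))  # alpha_ik with each axis
    return mdisc, B, angles
```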
The three-dimensional vectors of person proficiencies were simulated from the multivariate standard normal distribution. The means, standard deviations and correlation matrix for this sample are shown in Table 3.3.

Table 3.3. Descriptive Statistics of MIRT Person Parameters

        θ1       θ2       θ3
θ1    1.0000   0.0128  -0.0020
θ2    0.0128   1.0000   0.0086
θ3   -0.0020   0.0086   1.0000
Mean  0.0148   0.0364   0.0243
Std   1.0058   1.0061   1.0147

With these item and person parameters, the dichotomous response matrix was created by comparing the probability matrix with a random matrix generated from the uniform distribution. Then the TESTFACT software was used for the MIRT calibration. The computer for these TESTFACT runs was equipped with an Intel Pentium D processor with a CPU speed of 3.49 GHz and 1.99 GB of RAM. Each run took about 33 minutes of CPU time, and most estimations converged in around 25 iterations for the balanced design and somewhat more for the unbalanced design, e.g., 75 iterations.

TESTFACT is a data calibration program for the MIRT model, aimed at estimating item characteristic and person proficiency parameters. This software may not give the estimates for the several dimensions in the same order as in the data generation. For example, in Design 1, the estimates for the first two dimensions were switched in the TESTFACT output. Furthermore, the TESTFACT software may produce results in which almost all the discrimination estimates for a whole dimension are negative, especially when the Varimax or Promax rotation is specified. These changes are valid under the framework of factor analysis, and this "negative discrimination" phenomenon should not cause any problem for the later construct estimation if the proficiency coordinates for the corresponding dimension are also estimated as negated values by fixing these negative a values as item parameters. However, one principle of the TESTFACT calibration is that, in order to assign above-zero scores to above-average achievement, the software sometimes negates the proficiency estimates that are obtained by fixing the current item estimates. This is pointed out in the TESTFACT help file: "Factor scores are not unique in the sense that multiplication of any column of factor scores by -1 does not affect the validity of the estimates. It may therefore happen that negative scores are associated with above average percent responses and vice versa for below average responses. TESTFACT attempts to reverse the signs in such a way that scores above zero are usually assigned with above average achievement."

Several separate TESTFACT runs confirmed that, whether the item discrimination estimates for a whole dimension were kept the same or negated, the proficiency estimates given by the TESTFACT software never changed.

The solution used to remove this "negative discrimination" effect was to compute the correlation between the estimated and true item/person parameters. When the correlation was negative, the item/person estimates were replaced with the negated values of the estimates from the output. This yields one set of "correct" item and person estimates from the MIRT calibration and avoids situations where the item estimates are negated while the person estimates are not, or vice versa.
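In code, this correction is a simple correlation-based sign alignment; the sketch below assumes the true parameters are available, as they are in a simulation study.

```python
import numpy as np

def align_signs(est, true):
    """Negate whole columns (dimensions) of the estimates whenever they
    correlate negatively with the true values, removing the 'negative
    discrimination' indeterminacy. est, true: (n, m) arrays."""
    flipped = est.copy()
    for k in range(est.shape[1]):
        r = np.corrcoef(est[:, k], true[:, k])[0, 1]
        if r < 0:
            flipped[:, k] = -flipped[:, k]
    return flipped

# The same flip should be applied jointly to the item and person estimates
# of a dimension, so that a and theta are negated together.
```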
3.2.2 Item and Person Parameter Recovery

In the TESTFACT software, the item parameters are estimated using the EM algorithm with starting values from the Varimax transformed loadings, which come from the factor analysis of the tetrachoric correlation matrix.

Table 3.4 shows the average correlation across all replications and the average Bias/RMSE across all items. From the table, both the correlation and RMSE indices are roughly the same for the discrimination estimates of the first two dimensions in both designs. The recovery of a3 is a little worse than that of a1 and a2 in both designs, and it is much worse in Design 2, with an average Bias of -0.1549 and an RMSE of 0.1741.

Table 3.4. Recovery of Item Parameters

                         a1       a2       a3        d
Design 1  Correlation   0.9848   0.9732   0.9186   0.9989
          Bias         -0.0307   0.0064  -0.0384   0.0305
          RMSE          0.0642   0.0731   0.1219   0.0457
Design 2  Correlation   0.9906   0.9854   0.9359   0.9986
          Bias         -0.0394   0.0360  -0.1549   0.0406
          RMSE          0.0672   0.0795   0.1741   0.0534

The Bias in the table is averaged across all items, which may not be a good measure, since parameters at different value levels may have different degrees of estimation bias, and Bias values with different signs cancel out. It is therefore not surprising that the Bias values in the table look quite different. To check the discrimination parameter recovery for each specific item, scatterplots of the parameters against their Bias are shown in Figure 3.2. The plots in the first three rows illustrate the Bias for the item discrimination parameters. The ideal case is that all the points lie on the Bias = 0 line, which would mean there is no systematic estimation error for any item parameter in the long run. From the figure, the a1 and a2 parameters with large values are slightly underestimated, while the estimation of almost all the a3 parameters is negatively biased. This is more serious for items with a high MDISC index in Design 2. The underestimation of a3 may indicate that, with the Varimax rotation in the TESTFACT software, the estimation of the discrimination parameters for a dimension is negatively biased when no item vector lies close to that dimensional axis.

As is well known, due to the indeterminacy, it is extremely hard to recover the original coordinate system in factor analysis. The same holds here: the orientation of the coordinate axes for the discrimination estimates provided by the TESTFACT software may not be the same as that for the parameters, and this is especially problematic for an axis along which no item is most discriminating. Furthermore, the alignment of the coordinate axes is complicated by the sampling errors involved in each specific replication.

Figure 3.2. Plots of Item Parameters versus Bias
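One way to separate pure estimation error from this rotational misalignment, used later in Subsection 3.2.3, is to rotate the estimated discrimination matrix to best match the true one. A minimal sketch using the orthogonal Procrustes solution (an assumption about how such a matching rotation can be computed, not the dissertation's own code):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def match_rotation(A_hat, A_true):
    """Find the orthonormal T minimizing ||A_hat @ T - A_true||_F and
    return the rotated estimates; the remaining residual then reflects
    estimation error net of the rotational indeterminacy."""
    T, _ = orthogonal_procrustes(A_hat, A_true)
    return A_hat @ T, T
```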
The parameter d is a scalar whose estimation is not influenced by the rotational indeterminacy, so its recovery is much more satisfactory than that of the discrimination parameters. From Table 3.4, the correlations between the estimated and true d parameters are very close to 1. In addition, given the low values of Bias and RMSE, the recovery of the d parameters is acceptable. As observed in the last row of Figure 3.2, the d parameters with large values are underestimated, while those with small values are overestimated; the more extreme the true value, the larger the bias between the estimated and true parameter. The RMSE for d is 0.0511 for Design 1, which is less than the 0.0704 for Design 2. The reason may be related to the fact that there are more extreme d values in Design 2 than in Design 1.

The same three criterion indices for the recovery of the raw proficiency parameters are shown in Table 3.5. Since θ1 and θ2 are most influential in determining persons' responses to certain subsets of items, their recovery is much better than that of θ3. The correlations between the true and estimated proficiencies for the first two dimensions are around 0.9, which indicates that these estimates roughly retain the true ordering of people. The correlation for θ3 recovery is only around 0.6 for Design 1, and barely reaches 0.7 even though the third-dimension proficiency determines more responses in Design 2. The poor recovery of the person parameters for this dimension is not surprising, because the coordinate system, especially the orientation of the third axis, is not the same as that for the true parameters, which also results in the poor estimation of a3.

Table 3.5. Recovery of Person Parameters

                         θ1       θ2       θ3
Design 1  Correlation   0.9140   0.9082   0.6180
          Bias         -0.0284  -0.0303  -0.0262
          RMSE          0.3970   0.4113   0.7459
Design 2  Correlation   0.8893   0.9083   0.6982
          Bias         -0.0094  -0.0342  -0.0201
          RMSE          0.4417   0.4082   0.6823

In Table 3.5, every Bias value is averaged across all persons, and all are negative. This is mostly because the mean of the proficiency parameters is positive for all three dimensions, that is, there are more people with positive proficiency parameters on all three dimensions, and the estimation of positive parameters is known to be mostly negatively biased.

The RMSE criterion also shows a large deviation between the estimated EAP scores and the true proficiencies. The RMSE is much larger for θ3 recovery than for θ1 and θ2 recovery, which confirms again that the parameters related to the third coordinate axis are not well recovered, due to its incorrectly recovered orientation in the multidimensional space. The RMSE is 0.4417 for θ1 in Design 2, a little larger than the corresponding value in Design 1. This is reasonable, since θ1 influences the responses of fewer simple structure items in Design 2 than in Design 1. A similar reason explains the RMSE difference for θ3 recovery between the two designs.

Figure 3.3 gives the scatterplots of the proficiency parameters against their Bias for each dimension and each design, in order to check the Bias at different proficiency levels. From the figure, it is clear that all these proficiencies are underestimated for large values and overestimated for small values. This phenomenon is within expectation, since the EAP score is known to be biased toward the mean of the prior distribution, which is 0 for the default multivariate standard normal distribution in the TESTFACT software. Furthermore, it is easy to see that the Bias values for θ3 recovery change dramatically across proficiency levels, and the vertical spread of the θ1 Bias in Design 2 is larger than that of the other θ1 and θ2 Bias in both designs.

Figure 3.3. Plots of Proficiency Parameters versus Bias
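This shrinkage is easy to reproduce: the EAP score is a posterior mean over a quadrature grid, so extreme proficiencies are pulled toward the prior mean. A one-dimensional sketch under a 2PL model (illustrative, not the TESTFACT implementation):

```python
import numpy as np

def eap_1d(u, a, d, n_quad=49):
    """EAP proficiency estimate for one response vector u under a
    unidimensional 2PL with discriminations a and intercepts d,
    using a standard normal prior on a quadrature grid."""
    q = np.linspace(-4, 4, n_quad)                  # quadrature points
    prior = np.exp(-0.5 * q ** 2)                   # N(0,1) up to a constant
    p = 1 / (1 + np.exp(-(np.outer(q, a) + d)))     # n_quad x n_items
    like = np.prod(np.where(u == 1, p, 1 - p), axis=1)
    post = like * prior
    return np.sum(q * post) / np.sum(post)          # posterior mean
```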
Furthermore, it is easy to see that Bias values for 03 recovery dramatically change across different proficiency levels, and the vertical spread of 01 Bias in Design 2 is larger than that of the other 01 and 02 Bias in both designs. Table 3.6 gives the correlation matrix among raw proficiency parameters and also the average correlation matrix among proficiency estimates across replications. Com- pared with the roughly zero true correlation, there are small values of correlations among proficiency estimates. This is acceptable for the EAP score since this method shrinks the range of proficiency estimates. Table 3.6. True and Estimated Proficiency Correlation Matrix 01 02 03 True 01 1.0000 0.0128 -0.0020 02 0.0128 1.0000 0.0086 03 -0.0020 0.0086 1.0000 Recovery from Design 1 01 1.0000 0.0401 0.1713 02 0.0401 1.0000 0.1504 03 0.1713 0.1504 1.0000 Recovery from Design 2 01 1.0000 0.0674 0.1057 02 0.0674 1.0000 0.1656 03 0.1057 0.1656 1.0000 From all the above descriptive analysis for the parameter recovery, it is obvious that parameters for the third dimension recover much worse than for the first two dimensions, because there is no simple structure item to correctly orient the third coordinate axis. The item characteristic and person proficiency estimates seem mis- leading especially for this dimension. However, this set of estimates was obtained with the TESTFACT software and it is a valid one among the infinite solutions max- imizing the likelihood. The problem is that when there is no simple structure item to define the coordinate axis, it is very difficult to recover the original coordinate system even when the proficiencies are assumed uncorrelated. 47 3.2.3 Coordinate Axes Recovery The unsatisfactory recovery of both item and person parameters, especially for the third dimension, reveals the problem that besides the estimation error, the coordinate system may not be well recovered with reference to the original one, which is not surprising due to the indeterminacy of the coordinate system. Table 3.7 shows the direction cosines of the reference composite vectors from both the item parameters and their estimates from replication 1. One problem found in obtaining the most discriminating direction for the item cluster was how to get the correct direction for the eigenvector. As is well known, the negative of one eigenvector can be regarded as another valid eigenvector for the same matrix. The eigenvector direction is difficult to choose when elements in the eigenvector have mixed signs. In order to get the correctly positioned direction for the projection purpose, the eigenvector was obtained by forcing the vector element with the largest absolute value to be positive. Table 3.7. Reference Composite Vectors from Parameters and Replication 1 Estimates Design 1 Design 2 Cl CZ C3 Cl Cg C3 True 01 0.996 0.101 0.614 0.996 0.083 0.616 02 0.029 0.992 0.614 0.040 0.992 0.559 03 0.081 0.081 0.496 0.081 0.097 0.555 Repl 01 0.993 0.089 0.585 0.983 0.001 0.650 02 0.043 0.995 0.643 0.142 0.999 0.641 03 0.110 0.056 0.495 -0.115 0.046 0.407 Since 01 and 02 are fairly dominant for the reference composite directions of cluster 1 and cluster 2 respectively, as long as the weights for these proficiencies are still large, both reference composite vectors are well recovered. However, compared with the true reference composite vector for cluster 3 in Design 1, the estimate gives less weight to 01 while more weight to 02. And for cluster 3 in Design 2, 03 is less weighted, while both 01 and 02 are more weighted. 
Figure 3.4 shows the reference composite vectors graphically for both the true parameters and the replication 1 estimates. The dashed lines indicate the vectors calculated from the parameters, while the solid lines stand for those from the estimates. None of the true and estimated reference composite vectors overlap exactly. Despite the large deviation between the true and estimated reference composite vectors for the third cluster, this deviation is hard to observe from this viewing angle. Therefore, besides the estimation error, one partial explanation for the deviation is that the coordinate system consistently rotates to roughly the same deviated direction during the estimation, if the estimated reference composite vectors are regarded as fixed without too much influence from the estimation or sampling error. The final orientation of the coordinate system chosen by the software may depend on the parameters, the sampling error and the rotation criterion. One extra analysis was conducted to rotate the estimated item discriminations to match their parameters. Although the rotated version and the true parameters are not exactly the same, due to the nonignorable estimation and sampling errors, this rotated version matches the true parameters much better than the raw estimates given by the software. Based on the discrepancy between the true and estimated coordinate systems in this simulation, one finding is that it is hard to recover the axis orientation of a proficiency dimension without any simple structure item measuring that proficiency.

Figure 3.4. Plots of Reference Composite Vectors from Parameters and Replication 1 Estimates

3.2.4 Item Grouping

Although the true item grouping was known, the cluster analysis on the estimated item discrimination matrix for the first replication in Design 1 was still conducted to show how this analysis works.

Figure 3.5 shows the dendrogram of the linkages among all 45 items for that replication. The horizontal axis shows the Ward distance measure, while the vertical axis denotes the item number. The grouping pattern is very clear: items 1-15 belong to the first cluster, items 16-30 to the second cluster, and items 31-45 to the third cluster. This result is quite satisfactory; however, it should be kept in mind that this subjective eyeballing may lead to different item groupings when there are large sampling errors in the data or high correlations among the proficiencies. Even combined with evidence from expert judgement, it may still result in ambiguous item groupings because of the different content classification criteria used by the experts.

Figure 3.5. Dendrogram for Replication 1 in Design 1

3.2.5 Stability of Construct Estimates

The θ* estimates were calculated as the projection of the θ estimates onto the most discriminating direction of each item cluster. Although an orthonormal rotation changes the coordinates of the points represented in the coordinate system, their relative positions and distances are invariant under this transformation. Therefore, the values of the θ* estimates do not change whatever orthonormal rotation the coordinate system takes.
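This invariance (Equation 2.14) can be checked numerically: applying a random orthonormal rotation T to both the proficiency estimates and the cluster directions leaves θ* unchanged. A minimal sketch, reusing the W matrix notation of Section 2.2 with toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal((5, 3))                 # toy uncorrelated estimates
W = rng.standard_normal((3, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)       # rows: cluster directions

T, _ = np.linalg.qr(rng.standard_normal((3, 3)))    # random orthonormal T

theta_star = theta @ W.T                            # original coordinates
theta_star_rot = (theta @ T.T) @ (W @ T.T).T        # rotated coordinates
assert np.allclose(theta_star, theta_star_rot)      # Equation 2.14 holds
```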
Table 3.8 shows the mean, minimum and maximum of the standard deviations of the estimated θ and θ* across all persons. From the table, the θ* estimates seem to be more stable than the θ estimates, especially for the third element. Also, the variation of the third construct estimates is smaller than that of the other two construct estimates, while the opposite pattern occurs in the raw proficiency estimates. Besides the sampling and estimation error, the θ estimates are largely influenced by the orientation of the coordinate axes, whereas the values of the θ* estimates do not change under any orthonormal rotation of the coordinate system representing the θ estimates. This invariance is especially important in recovering a weighted composite, where more than one element of θ plays a significant role.

Table 3.8. Summary of Standard Deviations of Raw Proficiency and Construct Estimates

                   θ1      θ2      θ3      θ1*     θ2*     θ3*
Design 1  Mean   0.3500  0.3484  0.4975  0.3371  0.3400  0.3028
          Min    0.1764  0.2029  0.2868  0.1706  0.2121  0.1790
          Max    0.5371  0.5144  0.7430  0.4979  0.4937  0.4929
Design 2  Mean   0.3426  0.3312  0.4810  0.3378  0.3308  0.2634
          Min    0.1820  0.1760  0.1758  0.1417  0.1866  0.1424
          Max    0.5330  0.4870  0.6788  0.5200  0.4938  0.4635

Figure 3.6 illustrates the relationship between the true θ* and the variation of their estimates. All the shapes look roughly like the letter "M", which indicates that the variation for the extreme and middle values is slightly smaller than for the values in between. This is reasonable, since the EAP scoring method restricts the possible range of the proficiency estimates, and the middle-value proficiencies are better estimated. The reason for the latter is that the average test difficulty is also around the middle values, and the EAP estimates for these middle-value proficiencies are further stabilized by incorporating the prior standard normal density. It can also be observed from the figure that the points for the third construct are closer to the horizontal axis than those for the other constructs. This reveals that the variation of the θ3* estimates is slightly smaller than that of the θ1* and θ2* estimates, which can also be observed in Table 3.8.

Figure 3.6. Plots of Construct Parameters versus Estimated Standard Deviations

3.2.6 Relationship with NC Subscores and Unidimensional Estimates

As a new method to obtain subscores related to different item clusters, the construct estimates were expected to be highly correlated with the commonly used NC subscores and unidimensional estimates, especially when the variation of the directional angles for the items in the same cluster is very small.

Table 3.9 shows the average correlations among the θ* estimates, the θ_u estimates and the NC subscores across all replications. Obviously, all the correlations are very high, the lowest value being 0.9450. For the θ* estimates, all their correlations with the θ_u estimates are higher than the corresponding ones with the NC subscores, since both θ* and θ_u are estimated using models that take the item discrimination and difficulty into consideration. The θ_u estimates, however, seem to be more closely related to the NC subscores than to the θ* estimates, and the reason may be that both rest on the assumption of a single proficiency per item cluster.
Table 3.9. Average Correlations Among Construct Estimates, Unidimensional Estimates and NC Subscores

                          Design 1                     Design 2
                    θ*      θ_u     NC           θ*      θ_u     NC
Cluster 1   θ*    1.0000  0.9847  0.9726       1.0000  0.9783  0.9701
            θ_u   0.9847  1.0000  0.9852       0.9783  1.0000  0.9905
            NC    0.9726  0.9852  1.0000       0.9701  0.9905  1.0000
Cluster 2   θ*    1.0000  0.9796  0.9729       1.0000  0.9804  0.9638
            θ_u   0.9796  1.0000  0.9911       0.9804  1.0000  0.9801
            NC    0.9729  0.9911  1.0000       0.9638  0.9801  1.0000
Cluster 3   θ*    1.0000  0.9633  0.9450       1.0000  0.9806  0.9715
            θ_u   0.9633  1.0000  0.9799       0.9806  1.0000  0.9801
            NC    0.9450  0.9799  1.0000       0.9715  0.9801  1.0000

Figures 3.7 and 3.8 show the scatterplots of the θ* estimates against the θ_u estimates and against the NC subscores separately. Clearly there are lower and upper bounds for the NC subscores, namely 0 and the total number of items in the cluster. As observed from the figures, the ranges of the θ* estimates and θ_u estimates are restricted by the EAP scoring method, since none of the absolute proficiency estimates exceeds 3. It is also clear that people located at either tail of the proficiency distribution are more differentiated by the θ* estimates than by the NC subscores, and slightly more than by the θ_u estimates.

Figure 3.7. Plots of Average Construct Estimates versus Average Unidimensional Estimates across all Replications

Figure 3.8. Plots of Average Construct Estimates versus Average NC Subscores across all Replications

3.2.7 Accuracy of Construct Estimates

The θ* parameters were created by projecting the θ parameters onto the different reference composites for the three clusters. They were regarded as the reference for comparing the recovery efficiency of the θ* estimates with that of the unidimensional estimates.

Table 3.10 shows the average correlation across all replications and the average Bias/RMSE across all persons for the recovery of the θ* parameters by the estimated θ* and θ_u. As in the earlier discussion of the θ recovery, all the Bias values in this table are negative, due to the fact that more people have positive θ* values. Also, because of the EAP score shrinkage, all the RMSEs show large deviations between the estimated and true θ*.

Table 3.10. Construct Recovery from Unidimensional Estimates and Construct Estimates

                        Unidimensional θ_u         Multidimensional θ*
                       θ_u1     θ_u2     θ_u3      θ1*     θ2*     θ3*
Design 1  Correlation  0.9121   0.9103   0.9098   0.9243  0.9217  0.9385
          Bias        -0.0192  -0.0394  -0.0446  -0.0312 -0.0347 -0.0483
          RMSE         0.3988   0.4085   0.4041   0.3742  0.3833  0.3421
Design 2  Correlation  0.8984   0.9134   0.9385   0.9150  0.9251  0.9558
          Bias        -0.0169  -0.0394  -0.0438  -0.0130 -0.0373 -0.0362
          RMSE         0.4255   0.3973   0.3381   0.3926  0.3732  0.2909

From the table, all the scale-free correlation indices for the recovery by the θ* estimates are consistently higher than those for the corresponding recovery by the θ_u estimates, and their RMSEs are also lower. Both the correlation and the RMSE indicate that the θ* estimates perform better in recovering the θ* parameters. Compared with the earlier recovery of the θ parameters in Table 3.5, the recovery of θ* is much better by either the θ* estimates or the θ_u estimates, since the θ* values are not influenced by the orthonormal rotation of the coordinate system.

Table 3.11 gives the hypothesis tests on the difference of the correlations for the θ* parameter recovery by the θ* estimates versus the θ_u estimates. With each replication as one observation, there were 50 observations for each hypothesis test, and the tests were conducted separately for the different designs and clusters.
There was no testing for Bias or RMSE, since they were already averaged across replications. Paired tests were used because the same dataset was calibrated to give both the multidimensional and unidimensional estimates. Also, because of possible nonnormality, the nonparametric sign test and the commonly used Fisher r-to-z transformation in Equation 3.3 were applied in the testing:

z = (1/2) log((1 + r) / (1 - r))    (3.3)

Table 3.11. Hypothesis Testing on the Difference of Correlations for Construct Recovery by Construct Estimates and Unidimensional Estimates

Design  Cluster  Scale   Method     Mean    Std Error  Statistic  p-value
1       1        Raw     Paired t   0.0122  0.0002      71.6861   0.000
                 Fisher  Paired t   0.0778  0.0011      70.2359   0.000
                 Raw     Sign test                       6.9296   0.000
        2        Raw     Paired t   0.0114  0.0021       5.4395   0.000
                 Fisher  Paired t   0.0762  0.0091       8.3590   0.000
                 Raw     Sign test                       6.6468   0.000
        3        Raw     Paired t   0.0287  0.0021      13.8690   0.000
                 Fisher  Paired t   0.2073  0.0126      16.4260   0.000
                 Raw     Sign test                       6.0811   0.000
2       1        Raw     Paired t   0.0166  0.0003      47.7248   0.000
                 Fisher  Paired t   0.0938  0.0020      47.7994   0.000
                 Raw     Sign test                       6.9296   0.000
        2        Raw     Paired t   0.0117  0.0025       4.7332   0.000
                 Fisher  Paired t   0.0826  0.0106       7.8124   0.000
                 Raw     Sign test                       6.6468   0.000
        3        Raw     Paired t   0.0173  0.0001     131.6440   0.000
                 Fisher  Paired t   0.1694  0.0012     138.7443   0.000
                 Raw     Sign test                       6.9296   0.000

Each comparison therefore consists of three different tests: the raw-scale paired t-test, the transformed-scale paired t-test, and the raw-scale nonparametric sign test. From the table, the p-values in the last column are all 0.000, which indicates that all the tests are highly significant; therefore, the recovery by the θ* estimates performs significantly better than that by the θ_u estimates.

Figures 3.9 and 3.10 illustrate the Bias for the θ* recovery by the estimated θ* and θ_u for Designs 1 and 2 respectively. As before, parameters with large values are underestimated, and the pattern is the opposite for parameters with small values. The different vertical spreads of the recovery Bias from the two estimates indicate that there is more recovery variation for the θ* estimates than for the θ_u estimates, while the absolute Bias values from the θ* estimates are smaller for the parameters somewhat distant from the middle.

Figure 3.9. Plots of Recovery Bias by Construct Estimates and Unidimensional Estimates in Design 1

Figure 3.10. Plots of Recovery Bias by Construct Estimates and Unidimensional Estimates in Design 2
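These paired comparisons are simple to reproduce; the sketch below assumes per-replication correlation vectors `r_star` and `r_uni` of length 50 (hypothetical names) and uses SciPy's standard tests.

```python
import numpy as np
from scipy import stats

def compare_recovery(r_star, r_uni):
    """Paired tests on per-replication recovery correlations, on the raw
    scale and after Fisher's r-to-z transformation (Equation 3.3)."""
    t_raw = stats.ttest_rel(r_star, r_uni)
    t_fisher = stats.ttest_rel(np.arctanh(r_star), np.arctanh(r_uni))
    # Sign test: count replications where theta* recovers better.
    n_better = int(np.sum(r_star > r_uni))
    sign = stats.binomtest(n_better, n=len(r_star), p=0.5)
    return t_raw, t_fisher, sign
```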
3.2.8 Correlation Recovery

The correlations among the θ* parameters can be recovered by the θ* estimates, and in theory the correlation values depend on how close the reference composite vectors for the different item clusters are. Although the TESTFACT software uses the assumption cov θ = I_{m×m} for the proficiency estimation, it is likely that there are still small correlations among the θ estimates due to the sampling errors and the EAP scoring method, as already shown in Table 3.6.

Table 3.12 shows the correlation matrices recovered from the θ_u estimates, the Promax method and the θ* estimates, averaged across all replications. The Promax method is based on the factor loadings after the Varimax rotation, whose computation involves the eigenvector calculation in the TESTFACT software. As explained previously, an eigenvector is free to have its sign changed; in order to reduce the effect of this indeterminacy, all the correlations from the Promax method were forced to take positive values.

From the table, the correlations are underestimated by either the unidimensional EAP estimates θ_u or the Promax method, while they are overestimated by the EAP estimates of θ*. It is not surprising that the θ_u estimates give the lowest correlations involving θ3*, since the unidimensional calibration does not take into consideration that even items in the same cluster can be most discriminating along slightly different directions, and the angular variation in the third cluster is larger than in the other clusters. The correlation matrix given by the Promax method is obtained from the rotation matrix, which transforms the Varimax factor loadings toward a simple structure solution; evidently this method still cannot recover the correlations among the θ* parameters. The slight overestimation of the correlations by the θ* estimates is expected, since there are already small correlations among the θ estimates obtained with the EAP scoring method.

Table 3.12. Correlation Recovery by Unidimensional Estimates, the Promax Method and Construct Estimates

                 Design 1                     Design 2
            θ1*     θ2*     θ3*          θ1*     θ2*     θ3*
True      1.0000  0.1489  0.6714       1.0000  0.1423  0.6819
          0.1489  1.0000  0.7173       0.1423  1.0000  0.6659
          0.6714  0.7173  1.0000       0.6819  0.6659  1.0000
Uni       1.0000  0.1210  0.5466       1.0000  0.1191  0.5799
          0.1210  1.0000  0.6006       0.1191  1.0000  0.5685
          0.5466  0.6006  1.0000       0.5799  0.5685  1.0000
Promax    1.0000  0.0995  0.5715       1.0000  0.0737  0.6079
          0.0995  1.0000  0.6300       0.0737  1.0000  0.5960
          0.5715  0.6300  1.0000       0.6079  0.5960  1.0000
Multi     1.0000  0.1931  0.7126       1.0000  0.1991  0.7437
          0.1931  1.0000  0.7676       0.1991  1.0000  0.7206
          0.7126  0.7676  1.0000       0.7437  0.7206  1.0000

CHAPTER 4

Real Data Applications

4.1 Test Information and Analysis Procedure

In this study, the fall 2007 grade 7 mathematics test from the Michigan Educational Assessment Program (MEAP) was used for the real data analysis. To meet the requirements of the federal No Child Left Behind (NCLB) Act, the MEAP develops tests of English language arts, mathematics, science and social studies, the first two of which are the core subjects used to measure the Adequate Yearly Progress (AYP) of students in elementary and middle schools. The AYP is used to track year-to-year student achievement and, most importantly, after being combined with other indicators, e.g., attendance rate, it is applied to each school and district to evaluate whether their AYP goal has been achieved.

The MEAP mathematics test is administered at the beginning of each school year, and it actually covers the content of the preceding grade level. There are five strands in the mathematics tests, within which multiple domains are measured. The contents of these strands and domains are shown in Table 4.1, and the emphasis on them varies across grade levels because of different curriculum contents and standard expectations. Table 4.2 specifies the content classification of each item in this fall 2007 grade 7 mathematics test, and Table 4.3 summarizes the number and percentage of the items within each category.
This grade 7 mathematics test actually measures the students' learning in grade 6, and one general scaled score is assigned to each student, which is later categorized into four achievement levels: Not Proficient, Partially Proficient, Proficient and Advanced. Besides this score, NC subscores are also given according to the five strands specified in Table 4.1. It should be noted that all this MEAP test and item information was retrieved from the web pages of the Michigan Department of Education.

Table 4.1. Strands and Domains for Michigan Mathematics Content Expectations

Strand                    Domain
Numbers and Operations    ME  Meaning, notation, place value, and comparison
                          MR  Number relationships and meaning of operations
                          FL  Fluency with operations and estimation
Algebra                   PA  Patterns, relations, functions, and change
                          RP  Representation
                          FO  Formulas, expressions, equations, and inequalities
Measurement               UN  Units and systems of measurement
                          TE  Techniques and formulas for measurement
                          PS  Problem solving involving measurement
Geometry                  GS  Geometric shape, properties and mathematical arguments
                          LO  Locations and spatial relationships
                          SR  Spatial reasoning and geometric modeling
                          TR  Transformation and symmetry
Data and Probability      RE  Data representation
                          AN  Data interpretation and analysis
                          PR  Probability

Table 4.2. Item Content of the Fall 2007 Grade 7 MEAP Mathematics Test

Item  Classification    Item  Classification
  1   N-FL               31   D-PR
  2   N-FL               32   D-PR
  3   N-FL               33   D-PR
  4   N-FL               34   N-MR
  5   N-FL               35   N-MR
  6   N-FL               36   N-MR
  7   N-MR               37   N-FL
  8   N-MR               38   N-FL
  9   N-MR               39   N-FL
 10   N-FL               40   N-FL
 11   N-FL               41   N-FL
 12   N-FL               42   N-FL
 13   N-ME               43   N-ME
 14   N-ME               44   N-ME
 15   N-ME               45   N-ME
 16   A-PA               46   A-FO
 17   A-PA               47   A-FO
 18   A-PA               48   A-FO
 19   A-RP               49   A-FO
 20   A-RP               50   A-FO
 21   A-RP               51   A-FO
 22   A-FO               52   A-FO
 23   A-FO               53   A-FO
 24   A-FO               54   A-FO
 25   A-FO               55   M-UN
 26   A-FO               56   M-UN
 27   A-FO               57   M-UN
 28   G-GS               58   G-TR
 29   G-GS               59   G-TR
 30   G-GS               60   G-TR

Table 4.3. Strand and Domain Percentages in the Test

Strand  Count  Percentage    Domain  Count  Percentage
N       27     45%           ME       6     10%
                             MR       6     10%
                             FL      15     25%
A       21     35%           PA       3      5%
                             RP       3      5%
                             FO      15     25%
M        3      5%           UN       3      5%
G        6     10%           GS       3      5%
                             TR       3      5%
D        3      5%           PR       3      5%

Based on Table 4.3, N and A are the two main strands measured in this grade, and together they cover 80% of the items on the whole test. The other strands, M, G and D, contain only 3, 6 and 3 items respectively. Since there are only a few items in these three strands, their subscores are theoretically too unreliable to report if the item responses in each strand are calibrated separately from those in the other strands. This test was designed to measure different constructs related to these content strands or domains; therefore, it can be regarded as a multidimensional test and analyzed with the MIRT projection method to improve the subscore estimation.

For the real data, the total population size is 124,674 and the number of items on the core test is 60. After listwise deletion of the missing data, the population size reduced to 124,641. The classical p-value for the difficulty of each item, calculated as the percentage of people correctly answering the item, was scrutinized. Item 57 was found to be problematic based on this criterion, because only 8.18 percent of people answered it correctly. The question is as follows:

57. What is the total number of square inches in 5 square feet?
    A  25
    B  60
    C  300
    D  720

This is one of the only three measurement questions, and the correct answer is D. Clearly, this item was very difficult for the students. With the BILOG software, the item-test Pearson correlation for this item was 0.039, and the classical biserial correlation was 0.071. Both indices were too low, so this item was deleted from the analysis.
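Such screening is a one-liner per index; the sketch below computes classical p-values and point-biserial item-total correlations from a 0/1 response matrix. It is illustrative code, not the BILOG output, and it uses the corrected (rest-score) item-total correlation.

```python
import numpy as np

def classical_item_stats(U):
    """Classical p-values and corrected item-total correlations for a
    dichotomous response matrix U (n_persons x n_items)."""
    p = U.mean(axis=0)                    # proportion answering correctly
    total = U.sum(axis=1)
    r_it = np.array([np.corrcoef(U[:, j], total - U[:, j])[0, 1]
                     for j in range(U.shape[1])])
    return p, r_it
```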
Table 4.4 shows the descriptive statistics of the NC total scores and the difficulty p-values for the 124,641-by-59 response matrix, which was later used in the subscore estimation.

Table 4.4. Summary Statistics for the Population Data

                           Count     Mean     Std      Min     Max
Total Score (Person)       124,641   34.5403  11.6417  0       59
Difficulty p-value (Item)  59        0.5854   0.1518   0.2542  0.8691

In the first analysis of this test data, guessing parameters were estimated from a unidimensional calibration using the BILOG software, and these estimates were input into the NOHARM and TESTFACT software for the MIRT calibration. However, with these guessing values, both programs gave unusual results, for example, very high values of the item discrimination. Another common method, proposed by Lord (1980), was also applied to handle the guessing effect: all the guessing parameters were fixed to a reasonable value of 0.2 and input into both programs. This constant value of 0.2 was calculated as 1/(1 + n), where n is the number of choice options for each item, which is four for this MEAP data. Unfortunately, this method did not give any better results. The reason for these poor estimations may be that the guessing parameters were not estimated simultaneously with the other item parameters; instead, they were treated as fixed in the MIRT calibrations. Therefore, although examinees presumably had a chance to guess the answers, neither the guessing estimates from the unidimensional model nor the commonly used fixed guessing values fit the MIRT calibration for this MEAP data, and the guessing effect was ignored in the later analysis.

In the following analysis, two datasets were used. One was the population response matrix, with 124,641 rows and 59 columns. The other was a 5,000-by-59 sample matrix, where people were drawn randomly without replacement from the population. The sample data were used when the software had difficulty handling the large dataset, and they were also applied to confirm the item grouping structure obtained from the population data.

The big problem for the MIRT calibration is that the number of dimensions is unknown for this real data; therefore, several procedures were applied to determine the dimensionality of the response matrix or, more precisely and conservatively, to find a suitable number of dimensions for the MIRT calibration. First, the DIMTEST software was used on the sample data to test whether the fit of the unidimensional model to the data was rejected (Stout et al., 1999). After this hypothesis test was rejected, parallel analysis was used to compare the observed eigenvalues with those from randomly simulated data with the same size and difficulty p-values as the population response matrix (Ledesma & Valero-Mora, 2007). In their study, the number of dimensions is determined by the number of observed eigenvalues from the population data that are larger than the corresponding 95th percentile eigenvalues from a large number of random data samples.
In this study, the eigenvalues from all the random samples were fairly consistent; therefore, only ten random datasets were simulated, and the number of dimensions was determined by the number of observed eigenvalues that were larger than the corresponding ones from all the samples. After the MIRT calibration with this number as the dimensionality, the cluster analysis was performed on the angular dissimilarity matrix from the item discrimination estimates for further analysis of the dimensionality.

After the number of dimensions was finally decided, the calibration and cluster analysis were carried out step by step for the population data, where the NOHARM software was used for the MIRT calibration because of the huge dataset. Since the item grouping was supposed to be consistent across different samples, the same procedures were applied to the sample data to double-check the item clustering. The raw proficiency coordinates were estimated by feeding the item estimates from the NOHARM software into the TESTFACT software. Then the reference composite vectors and θ* estimates were calculated. Finally, the summary statistics for the θ* estimates were reported, and the relationship between them and the NC subscores was investigated.

4.2 Dimensionality and Cluster Detection

The DIMTEST software is commonly used for the nonparametric hypothesis test of whether the response data can be fitted by a unidimensional model (Stout, 1987; Stout et al., 1999; Stout, Froelich, & Gao, 2001). With this software, all items are divided into a partitioning subtest (PT) and an assessment subtest (AT), where the items are chosen carefully so that the two subtests are dimensionally distinct. The PT test provides the person proficiency estimate, which is used as the conditioning variable in the conditional covariance calculation for the items in the AT test. The analysis of the AT test gives a T_L statistic, which has been found to be positively biased. In order to correct the bias, T_G is calculated as the average from many simulated datasets that match the observed data. The difference between T_L and T_G is then standardized and compared with the normal distribution. A large value of this test statistic indicates the rejection of unidimensionality. From Table 4.5, the test statistic T is 21.2236, and the p-value is 0.0000. This led to the rejection of unidimensionality for the sample data, which also implicitly indicated multidimensionality for the population data.

Table 4.5. Results from the DIMTEST Software

  T_L       T_G       T        p-value
26.9574   5.6279   21.2236    0.0000

For the parallel analysis, the TESTFACT software was used to calculate the eigenvalues of the tetrachoric correlation matrix, which is more appropriate for dichotomous data than the Pearson correlation designed for continuous variables. Table 4.6 gives the eigenvalues for the population data and ten random datasets of the same size.

Table 4.6. Eigenvalues for the Population Data and Ten Random Datasets

Dataset       1       2      3      4      5      6      7      8      9      10
Population  17.064  2.903  2.286  1.666  1.448  1.332  1.147  1.058  1.040  0.987
Random 1     2.034  1.801  1.049  1.046  1.043  1.039  1.036  1.035  1.032  1.031
Random 2     2.204  1.882  1.052  1.047  1.040  1.037  1.032  1.030  1.026  1.025
Random 3     1.964  1.963  1.055  1.052  1.045  1.039  1.034  1.032  1.031  1.028
Random 4     2.072  1.329  1.062  1.058  1.055  1.049  1.046  1.044  1.042  1.041
Random 5     1.972  1.803  1.140  1.053  1.049  1.041  1.036  1.032  1.031  1.030
Random 6     2.007  1.354  1.060  1.056  1.051  1.048  1.046  1.045  1.043  1.041
Random 7     2.024  1.398  1.062  1.056  1.052  1.048  1.044  1.041  1.038  1.035
Random 8     2.125  1.556  1.059  1.052  1.049  1.046  1.043  1.040  1.034  1.032
Random 9     2.296  1.188  1.066  1.056  1.053  1.048  1.046  1.040  1.037  1.036
Random 10    2.571  1.650  1.245  1.048  1.043  1.036  1.034  1.028  1.023  1.022

Figure 4.1 shows the scree plot of these eigenvalues; the line marked with circles is for the population data. Clearly, the eigenvalues for the population data drop dramatically from the first factor to the second and much more slowly thereafter. For any random dataset, all the eigenvalues are roughly the same and lie on a horizontal straight line, especially from the third one onward.

Figure 4.1. Plot of Eigenvalues for the Population Data and Ten Random Datasets

With this plot, different conclusions about the dimensionality could be drawn based on different criteria. The commonly used eigenvalue-greater-than-one criterion gave nine dimensions, and the "elbow" rule may indicate two or four dimensions. With the criterion from the parallel analysis, the preliminary result for the dimensionality was eight.
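A sketch of this style of parallel analysis is given below. For simplicity it works with Pearson correlations of simulated dichotomous data rather than the tetrachoric matrix TESTFACT uses, so it only approximates the procedure described above; variable names are illustrative.

```python
import numpy as np

def parallel_analysis(U, n_random=10, seed=0):
    """Compare observed correlation-matrix eigenvalues with those from
    random dichotomous data matched on size and item p-values."""
    rng = np.random.default_rng(seed)
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(U.T)))[::-1]
    p = U.mean(axis=0)                      # item difficulty p-values
    rand = np.empty((n_random, U.shape[1]))
    for s in range(n_random):
        R = (rng.uniform(size=U.shape) < p).astype(int)   # independent items
        rand[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(R.T)))[::-1]
    # Retain dimensions whose observed eigenvalue exceeds every random one.
    return int(np.sum(obs > rand.max(axis=0))), obs, rand
```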
The commonly used eigenvalue-greater-than-one criterion gave nine dimensions, and the "elbow" rule may indicate two or four dimensions. With the criterion from the parallel analysis, the preliminary result for the dimensionality was eight.

With eight as the number of dimensions, the NOHARM software was used to calibrate the population data. The cluster analysis was then conducted on the angular dissimilarity matrix estimated from the Varimax transformed item discrimination matrix. The result of the item grouping is shown in Figure 4.2.

Table 4.6. Eigenvalues for the Population Data and Ten Random Datasets

  Eigenvalue      1      2      3      4      5      6      7      8      9     10
  Population 17.064  2.903  2.286  1.666  1.448  1.332  1.147  1.058  1.040  0.987
  Random 1    2.034  1.801  1.049  1.046  1.043  1.039  1.036  1.035  1.032  1.031
  Random 2    2.204  1.882  1.052  1.047  1.040  1.037  1.032  1.030  1.026  1.025
  Random 3    1.964  1.963  1.055  1.052  1.045  1.039  1.034  1.032  1.031  1.028
  Random 4    2.072  1.329  1.062  1.058  1.055  1.049  1.046  1.044  1.042  1.041
  Random 5    1.972  1.803  1.140  1.053  1.049  1.041  1.036  1.032  1.031  1.030
  Random 6    2.007  1.354  1.060  1.056  1.051  1.048  1.046  1.045  1.043  1.041
  Random 7    2.024  1.398  1.062  1.056  1.052  1.048  1.044  1.041  1.038  1.035
  Random 8    2.125  1.556  1.059  1.052  1.049  1.046  1.043  1.040  1.034  1.032
  Random 9    2.296  1.188  1.066  1.056  1.053  1.048  1.046  1.040  1.037  1.036
  Random 10   2.571  1.650  1.245  1.048  1.043  1.036  1.034  1.028  1.023  1.022

Figure 4.1. Plot of Eigenvalues for the Population Data and Ten Random Datasets

Figure 4.2. Dendrogram for the Population Data Calibrated with 8 Dimensions

From Figure 4.2, many small clusters are formed by neighboring items, which indicates that these item vectors point in roughly the same direction and measure the same construct. In particular, the items in the clusters labeled "C1", "C2", and "C3" are tightly clustered together and separated from the other items. As mentioned previously, although overfactoring does not lead to serious consequences, it does require more parameters to be estimated, which may result in more estimation errors. Since the dimensionality could also be inferred from the item clustering, based on this figure and on the "elbow" rule, the dimensionality was reset to four.

To check whether four is a reasonable number of dimensions, extra calibrations of the population data were conducted using the NOHARM software with the dimensionality assumed to range from two to five. Three fit indices for each choice of dimensionality are shown in Table 4.7. The first two indices are overall measures of model-data misfit from the perspective of the residual covariance matrix, and they take smaller values as the model-data fit improves. The last one is the Tanaka index, which indicates a good fit if its value is close to one (Tanaka, 1993).

Table 4.7. Model Fit Indices from the NOHARM Software

  Dimensionality                        2       3       4       5       8
  Sum of squares of residuals      0.0705  0.0490  0.0296  0.0188  0.0103
  Root mean square of residuals    0.0064  0.0053  0.0042  0.0033  0.0025
  Tanaka index of goodness of fit  0.9659  0.9763  0.9857  0.9909  0.9950

From this table, it is clear that higher dimensionality leads to better fit. Based on all three indices, there is consistent improvement up to four dimensions, slower improvement from the four-dimension to the five-dimension solution, and much slower improvement from there to the eight-dimension solution.
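For concreteness, the three measures in Table 4.7 can be computed from the residuals of the fitted matrix roughly as follows. NOHARM's exact definitions may differ in detail, and the input names s (sample covariance or correlation matrix) and sigma_hat (model-implied matrix) are hypothetical. Only unique off-diagonal elements enter, since the diagonal is not informative about fit here.

import numpy as np

def fit_indices(s, sigma_hat):
    mask = np.triu(np.ones_like(s, dtype=bool), k=1)   # unique off-diagonals
    resid = (s - sigma_hat)[mask]
    ss = float(np.sum(resid ** 2))                     # sum of squares of residuals
    rmsr = float(np.sqrt(np.mean(resid ** 2)))         # root mean square of residuals
    tanaka = 1.0 - ss / float(np.sum(s[mask] ** 2))    # one common form of the Tanaka index
    return ss, rmsr, tanaka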
McDonald (1997) suggested that a model is a sufficiently close approximation to the data if there is no more complex model that is identifiable and interpretable, and he also pointed out that inspection of the residual covariance matrix serves the same purpose as the goodness-of-fit indices, especially when criterion values for these indices are hard to set. Therefore, the estimates from the five-dimension solution were scrutinized; the item grouping from this solution resulted in five item clusters, but it is hard to give content meaning to two of these clusters. Furthermore, for the residual matrix from the four-dimension solution, the element values are acceptably small and there is no consistent pattern in any block of the residual covariance matrix. As a result, the dimensionality for these MEAP data was finally set to four, and the item estimates from this MIRT calibration were used in the cluster analysis to determine the item grouping.

Figure 4.3 shows the dendrogram based on the estimates from this calibration. The pattern of four item clusters is very clear, and the clusters in the figure are labeled according to the number of items within them. To confirm that this pattern was consistent across different samples, the same procedures were conducted on the 5,000-person sample data. The item clustering based on this sample is shown in Figure 4.4. The items in "C3" and "C4" are exactly the same as in the previous analysis. However, "N06", "A18", and "N39" are misclassified into "C2", and "A48" is missing from it. This misclassification is not surprising, since only a small portion of the people was included in the analysis. Another cluster analysis was done with the sample size doubled to 10,000. This time, as shown in Figure 4.5, the grouping of all items except item "A48" is exactly the same as that from the population data calibrated with four dimensions. Therefore, except for item "A48", the grouping pattern is very consistent across different samples, especially when the sample size is large enough. The grouping of item "A48" is somewhat inconsistent, since this item is included in cluster "C2" only for the population data calibrated with four dimensions; anyone seeking to allocate this item to the correct cluster may therefore need advice from experts on item content. For simplicity, this item was classified into cluster "C2" in the later analysis.

Figure 4.3. Dendrogram for the Population Data Calibrated with 4 Dimensions

Figure 4.4. Dendrogram for the 5,000 Sample Data Calibrated with 4 Dimensions

Figure 4.5. Dendrogram for the 10,000 Sample Data Calibrated with 4 Dimensions

Based on the analysis of the population data, the numbers of items within the four clusters are 41, 9, 6, and 3. All items for this test can be retrieved from the Michigan Department of Education website, and their contents were scrutinized, with the distinctions listed in Table 4.8.
Table 4.8. Item Cluster Content of the Fall 2007 Grade 7 MEAP Mathematics Test

  Cluster  Strand     Domain                   Content
  1        N, A, M,   FL, ME, PA, RP, FO,      mixed contents
           G, D       GS, PR, UN, TR
  2        A          PC                       equation representation and calculation
  3        N          FL, MR                   division with fraction numbers
  4        N          MR                       percentage in contextual problems

The first observation from the table is that this item clustering is determined neither solely by strands nor by domains. All the clusters except the first are related to some specific mathematical concept. Since the first cluster includes items from all strands and almost all the domains except "MR", it is difficult to interpret its construct. For simplicity, it is labeled "mixed contents" here. The items in this cluster are assumed to measure the same construct, where no distinction can be detected by the MIRT calibration with four dimensions.

This four-cluster grouping differs from the grouping defined by the five strands that these items were supposed to measure, and that five-strand grouping criterion was the one adopted for the reporting of subscores. However, it should be noted that item contents can be defined in different ways based on particular details of complicated cognitive processes, and different criteria could lead to different groupings of items; at the very least, it can be argued that item subsets can also be determined from domains, specific item descriptions, or a mixture of all these criteria. Most importantly, a grouping criterion based on item contents does not take the proficiency correlation into consideration, especially when there are already warnings against the reporting of highly correlated proficiencies (Sinharay et al., 2007).

The cluster analysis determines the item grouping from the empirical data and sorts similar and dissimilar items based on statistical criteria. This analysis is supposed to define the most statistically distinguishable constructs. An extra analysis was conducted to check the closeness of the reference composite vectors based on the item estimates for the population data calibrated with four dimensions. For the four-cluster solution, the off-diagonal entries of the (w_i'w_j) matrix of direction cosines between reference composite vectors vary from 0.324 to 0.635; however, for the item grouping based on the five-strand criterion, these values vary from 0.806 to 0.999. If the cut point for acceptable closeness is set at 0.8, the proficiencies resulting from the five-strand criterion are so highly correlated that there is no need to report all of them.

Therefore, although both expert judgment and the cluster analysis can give evidence for the item grouping, it can be argued that the cluster analysis is preferable, because the subscores that are of interest to the users of test results should be statistically distinct rather than defined by closely related constructs. Furthermore, a more reliable item grouping can be obtained if both pieces of evidence are taken into consideration.

4.3 MIRT Calibration and Subscore Reporting

The analysis in the previous section resulted in four dimensions and four item clusters. In this section, in order to obtain more stable item estimates from the NOHARM software, the coordinate system was reconstructed with items measuring different constructs; therefore, items "N02", "N34", and "A53", which were selected from different clusters, were constrained to have special discrimination vectors. The item estimates from the NOHARM software were then rotated with the Varimax criterion.
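For reference, the Varimax criterion (Kaiser, 1958) can be implemented compactly with an SVD-based iteration. The sketch below is a standard textbook version, shown only to make the rotation step concrete; it is not the routine inside NOHARM or TESTFACT, and the input name loadings (an items-by-dimensions matrix) is illustrative.

import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the Varimax criterion with respect to the rotation.
        target = rotated ** 3 - (gamma / p) * rotated @ np.diag(
            np.sum(rotated ** 2, axis=0))
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt
        new_criterion = float(np.sum(s))
        if new_criterion < criterion * (1.0 + tol):
            break                      # no further improvement
        criterion = new_criterion
    return loadings @ rotation, rotation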
Table 4.9 shows the resulting item estimates: the discrimination estimates a1-a4, the intercept d, the generalized discrimination MDISC, the generalized difficulty B, and the angle in degrees between each item vector and each coordinate axis (α1-α4). The descriptive statistics for these estimates are given in Table 4.10. These estimates were regarded as fixed and input into the TESTFACT software for the uncorrelated proficiency estimation.

Table 4.11 shows the directions of the four reference composite vectors in the four-dimensional proficiency space defined by the Varimax transformed item discrimination matrix. Almost all of the dominant elements are larger than 0.9, and each is associated with a different proficiency dimension. This indicates that the reference composite vectors are close to different coordinate axes, so that any construct estimate is mainly determined by one raw proficiency estimate, different from those for the other constructs.

Table 4.12 shows all four eigenvalues of the A_l'A_l matrix for each item cluster. For every cluster, the first eigenvalue, which corresponds to the reference composite in Table 4.11, is always the largest and dominant one. The ratio between the first two eigenvalues is also given in the last row. These ratios are large, which confirms again that the variation of this matrix is mainly determined by the first eigenvector.

Table 4.13 gives all the correlations among the NC subscores and construct estimates. The correlations between each NC subscore and its corresponding construct estimate (0.98, 0.89, 0.96, and 0.92) are almost all higher than 0.9. In addition, the correlation between any two construct estimates is higher than that between the corresponding NC subscores. This is expected, since the construct estimates are calculated under the MIRT model, which calibrates the parameters simultaneously and allows correlations among the estimates. Beyond this, the correlations among construct estimates are also slightly influenced by the EAP scoring method. From this table, the means and standard deviations of the construct estimates look reasonable. All the means are slightly above zero; the reason may be that the mean difficulty of the test is larger than 0.5 in terms of the classical p-values or, equivalently, a little below zero in terms of the MIRT difficulty, so the test is relatively easy. The standard deviations are smaller than one, owing to the EAP scoring method.

Table 4.9. Varimax Transformed NOHARM Item Estimates for the Population Data

  Cluster  Item    a1    a2    a3    a4      d  MDISC      B   α1  α2  α3  α4
     1       1   0.24  0.15  0.18  0.40   0.12   0.52  -0.23   63  73  69  40
     3       2   0.65  0.17  0.29  0.46   0.10   0.86  -0.11   42  78  70  58
     3       3   0.82  0.12  0.19  0.19  -0.10   0.87   0.12   20  82  77  77
     1       4   0.26  0.28  0.32  0.68   0.53   0.85  -0.62   72  71  68  36
     3       5   0.70  0.20  0.27  0.31  -0.08   0.84   0.10   33  76  71  68
     1       6   0.46  0.30  0.34  0.36  -0.67   0.74   0.90   52  66  63  61
     3       7   1.20  0.15  0.20  0.28  -0.07   1.26   0.05   17  83  81  77
     3       8   1.10  0.20  0.22  0.29   0.01   1.18  -0.01   21  80  79  76
     3       9   1.61  0.17  0.23  0.39   0.01   1.68   0.00   17  84  82  77
     1      10   0.21  0.07  0.16  0.79   0.90   0.84  -1.07   75  85  79  19
     1      11   0.19  0.08  0.15  0.78   0.80   0.82  -0.98   77  84  79  18
     1      12   0.10  0.06  0.10  0.33   0.08   0.36  -0.22   74  81  74  25
     1      13   0.22  0.21  0.24  0.53   0.33   0.66  -0.50   71  71  68  36
     1      14   0.33  0.19  0.27  0.75   0.93   0.88  -1.05   68  78  72  32
     1      15   0.27  0.21  0.31  0.67   0.26   0.81  -0.32   71  75  68  35
     1      16   0.17  0.27  0.23  0.73   0.57   0.83  -0.68   78  71  74  28
     1      17   0.29  0.47  0.32  0.77  -0.02   1.00   0.02   73  62  71  40
     1      18   0.31  0.40  0.29  0.44  -0.22   0.73   0.30   65  57  67  53
     1      19   0.17  0.07  0.13  0.77   1.41   0.81  -1.75   78  85  80  16
     1      20   0.21  0.15  0.23  0.58   0.59   0.67  -0.88   72  77  70  31
     1      21   0.13  0.07  0.12  0.40   0.54   0.44  -1.23   73  81  75  25
     1      22   0.16  0.06  0.13  1.00   1.61   1.02  -1.57   81  87  83  12
     1      23   0.13  0.11  0.15  0.80   0.87   0.84  -1.03   81  82  79  16
     1      24   0.17  0.21  0.29  0.55   0.34   0.68  -0.50   75  72  65  36
     1      25   0.18  0.15  0.22  0.78   0.71   0.84  -0.85   78  80  75  22
     1      26   0.16  0.15  0.24  0.80   0.65   0.86  -0.75   79  80  74  22
     2      27   0.16  0.08  0.31  0.32  -0.10   0.48   0.21   71  80  49  49
     1      28   0.13  0.07  0.14  0.66   0.75   0.69  -1.08   79  85  78  17
     1      29   0.17  0.09  0.19  0.81   1.12   0.86  -1.31   79  84  77  18
     1      30   0.13  0.07  0.15  0.55   0.64   0.59  -1.08   77  83  75  21
     1      31   0.19  0.20  0.25  0.91   0.84   0.98  -0.86   79  78  75  23
     1      32   0.13  0.07  0.17  0.57   0.70   0.61  -1.15   78  84  73  22
     1      33   0.17  0.14  0.24  0.50   0.11   0.59  -0.18   73  76  66  33
     4      34   0.12  0.65  0.13  0.30   0.32   0.73  -0.43   80  29  80  66
     4      35   0.28  1.24  0.27  0.26   0.26   1.32  -0.20   78  21  78  79
     4      36   0.40  2.13  0.36  0.24   0.09   2.22  -0.04   79  16  81  84
     1      37   0.21  0.28  0.23  0.58   0.27   0.72  -0.38   73  67  71  36
     1      38   0.21  0.27  0.24  0.61   0.25   0.74  -0.34   74  68  71  35
     1      39   0.10  0.17  0.21  0.31  -0.21   0.42   0.50   76  66  61  43
     1      40   0.19  0.16  0.21  0.97   1.27   1.03  -1.24   80  81  78  18
     1      41   0.17  0.18  0.19  0.83   1.28   0.89  -1.44   79  78  77  21
     1      42   0.21  0.18  0.26  1.17   1.47   1.23  -1.20   80  82  78  18
     1      43   0.16  0.11  0.20  0.72   1.33   0.77  -1.72   78  82  75  21
     1      44   0.19  0.21  0.30  0.66   0.69   0.78  -0.89   76  75  67  32
     1      45   0.19  0.24  0.26  0.39   0.16   0.56  -0.28   70  65  62  46
     1      46   0.12  0.17  0.29  0.50   0.10   0.62  -0.16   79  74  62  35
     2      47   0.24  0.20  0.46  0.38   0.02   0.67  -0.03   70  72  47  55
     2      48   0.15  0.17  0.34  0.46   0.00   0.61   0.00   75  74  56  42
     2      49   0.14  0.18  0.35  0.15  -0.72   0.44   1.64   71  66  38  70
     2      50   0.23  0.13  0.74  0.44   0.01   0.90  -0.01   75  82  35  61
     2      51   0.13  0.16  0.34  0.26  -0.01   0.48   0.02   74  71  44  57
     2      52   0.15  0.17  0.40  0.16  -0.45   0.48   0.93   72  69  34  71
     2      53   0.31  0.05  2.02  0.46  -0.34   2.09   0.16   82  89  15  77
     2      54   0.26  0.05  1.76  0.42  -0.15   1.83   0.08   82  88  16  77
     1      55   0.17  0.25  0.25  0.57  -0.20   0.69   0.28   76  69  68  35
     1      56   0.11  0.12  0.20  0.32  -0.03   0.41   0.08   74  72  61  39
     1      58   0.13  0.18  0.23  0.38  -0.11   0.50   0.22   75  69  62  40
     1      59   0.10  0.10  0.16  0.29  -0.21   0.36   0.59   73  74  63  37
     1      60   0.07  0.11  0.14  0.22  -0.65   0.29   2.26   76  68  61  41

Table 4.10. Summary Statistics for Item Parameter Estimates

            a1     a2     a3     a4      d  MDISC      B
  Mean    0.28   0.22   0.30   0.53   0.32   0.82  -0.34
  Std     0.28   0.31   0.32   0.23   0.55   0.38   0.77

Table 4.11. Reference Composite Vectors from the Population Data Estimation

          C1      C2      C3      C4
  θ1   0.253   0.183   0.932   0.189
  θ2   0.232   0.081   0.142   0.954
  θ3   0.294   0.920   0.194   0.174
  θ4   0.892   0.336   0.273   0.153

Table 4.12. Eigenvalues of the A_l'A_l Matrix

  Eigenvalue        C1      C2      C3       C4
  1             21.961   9.284   7.863    7.143
  2              0.802   1.112   0.147    0.054
  3              0.092   0.089   0.009    0.003
  4              0.052   0.003   0.001    0.000
  Ratio 1:2     27.369   8.351  53.675  133.257
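The quantities in Tables 4.11 and 4.12 can be reproduced schematically: for each item cluster l, the reference composite w_l is the unit eigenvector of A_l'A_l associated with its largest eigenvalue, and the construct estimate θ* is the projection of the raw proficiency coordinates onto w_l. A minimal sketch follows, in which a_cluster (items in the cluster by dimensions) and theta (people by dimensions) are hypothetical stand-ins for the calibration output, and the sign convention for w_l is an assumption.

import numpy as np

def reference_composite(a_cluster):
    values, vectors = np.linalg.eigh(a_cluster.T @ a_cluster)  # ascending order
    w = vectors[:, -1]                  # eigenvector of the largest eigenvalue
    if w.sum() < 0:                     # orient toward the positive directions
        w = -w
    ratio = values[-1] / values[-2]     # dominance check, as in Table 4.12
    return w, ratio

def construct_scores(theta, a_cluster):
    w, _ = reference_composite(a_cluster)
    return theta @ w                    # theta* estimates for this cluster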
Table 4.13. Correlations among NC Subscores and Construct Estimates

           NC1    NC2    NC3    NC4    θ1*    θ2*    θ3*    θ4*
  NC1     1.00   0.64   0.50   0.40   0.98   0.69   0.62   0.55
  NC2     0.64   1.00   0.42   0.31   0.70   0.89   0.52   0.45
  NC3     0.50   0.42   1.00   0.30   0.55   0.47   0.96   0.42
  NC4     0.40   0.31   0.30   1.00   0.44   0.32   0.36   0.92
  θ1*     0.98   0.70   0.55   0.44   1.00   0.75   0.67   0.59
  θ2*     0.69   0.89   0.47   0.32   0.75   1.00   0.58   0.46
  θ3*     0.62   0.52   0.96   0.36   0.67   0.58   1.00   0.52
  θ4*     0.55   0.45   0.42   0.92   0.59   0.46   0.52   1.00
  Mean   25.92   4.29   2.96   1.68   0.02   0.04   0.04   0.02
  Std     8.33   2.24   2.01   1.16   0.96   0.87   0.85   0.83
  Mean percent-correct
          0.63   0.48   0.49   0.56

Figure 4.6 shows the plots of NC subscores against construct estimates based on the analysis of the population data. Clearly, the continuous construct estimate is not in one-to-one correspondence with the discrete NC subscore: people with the same NC subscore may have different construct estimates. Although not shown here, the construct estimate also provides more distinct score values than the unidimensional estimate when a cluster contains only a few items. For example, there are only three items in cluster "C4". A unidimensional calibration of these items gives at most 2^3 = 8 different scores, whereas the construct estimate covers almost the whole (-2, 2) continuum. Thus, the construct estimate is preferable in multidimensional tests because it not only takes the item characteristics into consideration but also implicitly borrows information from the other, correlated construct estimates.

Figure 4.6. Plots of NC Subscores versus Construct Estimates for the Population Data Calibrated with 4 Dimensions

Figure 4.7 shows the plots of NC subscores against their dominant raw proficiency estimates. The raw proficiency estimates for the same NC subscore are much more spread out horizontally than the construct estimates. What is more, the difference between the raw proficiency estimates for neighboring NC subscores is not as clear as that between the construct estimates. These raw proficiency estimates appear unstable in the coordinate system set up by the Varimax transformed item discrimination matrix, which suggests that this rotation may not produce a good coordinate system for construct interpretation.

Figure 4.7. Plots of NC Subscores versus Raw Proficiency Estimates for the Population Data Calibrated with 4 Dimensions

In conclusion, the projection method for construct estimates clearly works for this MEAP test: it not only empirically identified four item clusters, which were assumed to measure different constructs, but also gave the subscore estimates under the MIRT framework.

CHAPTER 5

Conclusions, Implications, Limitations and Future Research

5.1 Conclusions and Implications

In order to report subscores for multidimensional tests, the NC subscore and the unidimensional estimate for the items in each cluster may sometimes work well, where the cluster is defined either by items measuring similar contents or by items with similar vector directions, as estimated using the MIRT models.
However, if the number of items within each cluster is small, the subscore is suspect because it lacks sufficient reliability. This problem may be remedied by incorporating information from other clusters, since most proficiencies measured within one test are assumed to be positively correlated. This goal is commonly achieved by post-hoc adjustments to the subscore estimates; however, such a two-stage estimation procedure is usually not preferred, since it does not take into account the measurement error in the first calibration step.

The MIRT model is very useful for multidimensional tests because it allows simultaneous estimation of the parameters from all dimensions. Currently, because of the coordinate indeterminacy in the MIRT calibration, several popular software packages simply constrain the proficiencies to have zero mean and an identity variance-covariance matrix for easy computation. However, this zero correlation among proficiencies is clearly not an assumption of the MIRT models.

Precisely speaking, these MIRT calibrations give only one set of possible item and person estimates with good model-data fit. Because of the discrepancy between the constraints in the software estimation and the actual MIRT models, these estimates may not be ready for interpretation before a suitable coordinate system for person proficiencies is set up. The commonly used Varimax and Promax methods, which are borrowed from factor analysis, are mostly used to explain item characteristics. They may be sufficient to identify item clusters, usually by grouping items with large loading values on the same dimension after the rotation. However, unlike the projection method, they are not designed for proficiency estimation under the MIRT framework. The simulation study shows that the Varimax rotation within the TESTFACT software is not adept at recovering the generating parameters, partly because of the incorrect alignment of the coordinate axes, especially for a dimension that no simple structure item measures. Commonly the Varimax rotation is conducted on the correlation-adjusted item estimates from the software, and it is most effective when these item vectors are orthogonal to each other. Therefore, the Varimax rotation is most suitable for situations with simple structure items and uncorrelated proficiencies.

The projection method illustrated in this study focuses on finding an interpretable coordinate system for person proficiency estimates after the MIRT calibration, which leads to its potential usage in subscore reporting, especially when the number of dimensions and the number of item clusters are different. In this method, item clusters are obtained from the analysis of the direction cosines of item vectors, instead of the element values of these vectors. The construct score is then calculated as the projection of the raw proficiency coordinates onto the most discriminating direction for each item cluster, and its value is invariant under any orthonormal rotation of the coordinate system for the raw proficiency estimates.

Researchers may be concerned with the stability of the item grouping obtained from cluster analysis, especially when there are sampling errors in the data matrix and estimation errors in the item discrimination matrix. However, both the simulation study and the empirical analysis show that the cluster analysis gives a fairly consistent item grouping pattern and is robust to the effects of estimation and sampling errors.
First, the angular distance matrix is invariant under any orthonormal rotation of the coordinate system representing the multidimensional space. In contrast, a method that judges the element values of the discrimination matrix may give inconsistent conclusions, because different orthonormal rotations result in different item discrimination matrices. Second, in the dendrogram for the first replication of Design 1, the between- and within-cluster pattern is as clear as expected; it seems that although the estimation errors and the incorrect alignment of the coordinate system can lead to poor recovery of item parameters, the item grouping from the cluster analysis is largely insensitive to these effects. Third, in the empirical data analysis, the cluster analysis was conducted on both the large population data and the two sample datasets in order to reduce the effect of sampling errors. The conclusion is that the item grouping pattern for the MEAP data is fairly consistent as long as the sample size is large. Finally, the grouping pattern was also confirmed by the results from MIRT calibrations with different dimensionalities. When the population data were calibrated with eight dimensions, all item clusters except the first were still obvious in this overfactoring situation, and, as expected, the first cluster was split into several small clusters.

The analysis from the simulation study shows that the construct estimate is highly correlated with the NC subscore and the unidimensional estimate, and that construct recovery from the projection estimate is better than from the unidimensional estimate. The empirical analysis of the MEAP data identified four item clusters, with the last three clearly related to specific mathematical content, which was not exactly as defined by strands or domains. Although the first cluster contains many items from different content areas, which are supposed to measure different proficiencies, it is clear that all these proficiencies are highly correlated in this analysis. The reason may be either that they use the same cognitive processes or that they are taught at the same time in school. The clustering of these items also conforms to the rule that when there is high correlation among proficiencies, there is no added value in reporting all of them.

When the dimensionality and the number of item clusters are the same, this projection method is in some sense similar to the Promax method, because both can be used to find an oblique rotation for an interpretable proficiency solution. Both can transform Varimax estimates into simple structure item discriminations and interpretable person proficiencies, although the emphasis differs slightly. The projection method focuses on the construct proficiency estimation, which may lead to simple structure item discrimination estimates, while the Promax method searches for the simple structure item discrimination matrix, which results in correlations among the proficiency estimates. Additionally, the projection method can use information from item contents and MIRT estimates to obtain the reference composite vectors for the projection, while the Promax method is a purely mathematical criterion for rotating the estimated item discrimination matrix. Finally, the projection method gives construct estimates that are invariant under orthonormal rotation of the coordinate system for the raw proficiency estimates; therefore, it can also use the unrotated estimates as the input for the calculation of construct scores.
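These invariance claims are easy to verify numerically. The snippet below checks, with random stand-ins for the item and person matrices, that the pairwise angles between item vectors, and the projected construct scores (up to a common sign), are unchanged under an arbitrary orthonormal rotation of the coordinate system.

import numpy as np

rng = np.random.default_rng(1)
a = rng.random((20, 4))                        # stand-in discrimination matrix
theta = rng.standard_normal((100, 4))          # stand-in proficiency coordinates
q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthonormal rotation

def angles(m):
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return np.arccos(np.clip(unit @ unit.T, -1.0, 1.0))

def composite(m):
    return np.linalg.eigh(m.T @ m)[1][:, -1]   # first reference composite

print(np.allclose(angles(a), angles(a @ q)))   # True: angles are invariant
scores = theta @ composite(a)
scores_rot = (theta @ q) @ composite(a @ q)
print(np.allclose(np.abs(scores), np.abs(scores_rot)))  # True, up to sign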
This projection method is not only important for defining the coordinate system in a single test administration but also has practical implications for linking and equating across different forms under the MIRT framework. In all IRT models, the response matrix is assumed to be influenced by person proficiencies and item characteristics, both of which are unknown. This inevitably means that some constraints or assumptions are required to obtain a suitable set of parameter estimates for interpretation. In unidimensional IRT, item characteristics can be determined with respect to the proficiencies of persons in a reference group, which are commonly set to have zero mean and unit variance. After the data calibration for this group, the item characteristics, such as the discrimination and difficulty parameters, can be assumed fixed for the item, and they can be reused for item pool construction, test construction, and equating.

In multidimensional IRT, item characteristics can also be set with respect to some reference group, where there may be underlying correlations among the proficiencies. During the MIRT calibration, because of the unknown proficiency correlation and the indeterminacy of the coordinate system, the calibration software assumes uncorrelated proficiencies and provides the correlation-adjusted and Varimax transformed a vectors as the item discrimination estimates. It is problematic if these a's are used each time to define the item discrimination power, since the coordinate system for interpreting the proficiency estimates is incorrect. The inconsistency can also easily be imagined because these a parameters are not fixed; they differ for groups with different correlations. This difference could lead to problematic MIRT equating and linking procedures, which assume item discrimination parameters that are invariant to group characteristics.

Therefore, in addition to the constraints used in unidimensional IRT, an interpretable base coordinate system for person proficiencies needs to be constructed for the reference group. Ideally, the item vectors in this multidimensional space represent the true discrimination power a*, whose estimation is free of the effect of the unknown proficiency correlation. One way to define the system is to implement the data calibration with proficiency correlation values from educated guesses, or from the correlations among NC subscores or unidimensional estimates. A second way is to use the Promax method to match the item discrimination matrix to the simple structure. The projection method introduced in this study may provide another promising way to define the coordinate system for person proficiencies and item characteristics.

After the coordinate system for person proficiencies is set up for the reference group, the item characteristics can be obtained and regarded as fixed when these items are administered in different tests, or to different groups of people, even when the proficiency correlation matrix differs from that in the reference group. Since the coordinate system is kept consistent across different test settings, the estimates from different tests can be put on the same scale for comparison and interpretation. One question is whether it is legitimate to continue using the commonly used correlation-adjusted a to represent the characteristics of items when they are administered to different groups of people.
To ensure the person-independent property of the item characteristics defined here, at least one assumption is required: that the correlation matrix among person proficiencies is invariant across populations. How plausible is this assumption? If the correlation were inherent in the proficiencies themselves, it might hold. However, it is commonly accepted that correlation is a characteristic of the population and can change across populations. Even if this invariance property held, the proficiencies could be compared under the same coordinate system, but not with regard to the correct interpretation of the constructs.

5.2 Limitations and Future Research

The commonly used NC subscores or unidimensional estimates, with item clusters defined by different content areas, may lead to unique subscores for each student. These scores can be easily interpreted as proficiencies related to these clusters, although people may not be statistically distinguishable by these proficiencies. For the projection method, it may be somewhat difficult or subjective to determine the number of item clusters and the item grouping in the cluster analysis step, which works best when the within-cluster item vectors are very close while the between-cluster item vectors are far apart. When this step is applied in complicated real settings, especially when proficiencies are highly correlated, the within-between pattern may not be clear. Different item groupings may result in different definitions, interpretations, and calculations of the construct estimates, since, theoretically speaking, the raw proficiency estimates can be projected onto the composite of any item cluster or even a single item. Thus, in practice, the item grouping pattern needs to be confirmed across different samples and different numbers of dimensions. Furthermore, the item grouping may require input from expert expectations about the test structure, which is also useful for deciding whether highly correlated proficiency estimates should be reported as one score or several.

In the real data analysis, the dimensionality for the MIRT calibration is unknown. Although numerous methods for determining dimensionality have been suggested by previous research, the possible dependency between estimates of dimensionality and item clusters means that ways to achieve a good balance between them should be examined in future research. For example, it may be interesting to see how the cluster analysis and projection method perform under different assumed dimensionalities, especially in overfactoring cases, since overfactoring is not expected to lead to seriously bad results. All in all, the solution for construct estimates can clearly be flexible for the analysis of the same data, because different choices of dimensionality, number of item clusters, item grouping, or even calibration software will lead to different subscore reports. More research should be conducted on the magnitude of these differences, in order to find an optimal solution for the estimation.

In this study, the EAP scoring method was adopted for proficiency estimation because of its easy calculation and its likely preference for reliable estimation in real tests. However, these biased scores caused problems for the calculation of Bias and RMSE, both of which are often used in parameter recovery studies. Large deviations between the estimated EAP score and the true score were expected.
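To make the scoring step concrete, the following is a minimal unidimensional EAP sketch with hypothetical logistic 2PL items; the analyses in this study used TESTFACT's multidimensional normal-ogive EAP instead. The posterior-mean computation makes visible the shrinkage toward the prior mean that produces the bias just described.

import numpy as np

def eap_score(responses, a, b, n_points=41):
    """responses, a, b: 1-D arrays over items; standard normal prior."""
    grid = np.linspace(-4.0, 4.0, n_points)       # quadrature nodes
    prior = np.exp(-0.5 * grid ** 2)              # unnormalized N(0, 1) weights
    z = a * (grid[:, None] - b)                   # 2PL logit a_i(theta - b_i)
    p = 1.0 / (1.0 + np.exp(-z))
    like = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    posterior = like * prior
    return float(np.sum(grid * posterior) / np.sum(posterior))

# With few items the estimate shrinks strongly toward 0, e.g.
# eap_score(np.array([1, 1, 0]), a=np.ones(3), b=np.zeros(3)) is roughly 0.3.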
As a result, only the relative values of Bias and RMSE were compared across different methods and situations.

The EAP scores in this study were calculated for the uncorrelated proficiency estimates, and these were then used in the projection calculation for the construct estimates. It may be more desirable to apply the EAP scoring method directly to the construct estimates, in order to further reduce the effect of the incorrect alignment of the coordinate system chosen by the software. One way to achieve this is to take the rotated item estimates based on this method, regard them as fixed parameter input for the TESTFACT software, and obtain the EAP score for the construct estimate directly.

In this study, uncorrelated proficiency coordinates and mixed structure items were used in the simulation. Although it can be argued that other situations, with different combinations of proficiency correlation and item structure, can easily be transformed into this one, future research should confirm the efficiency of this projection method in those situations. For example, the simulation study can focus on correlated proficiencies and simple structure items, and it can eventually be extended to more general cases with correlated proficiencies and mixed structure items. It should be noted that, in all cases, this projection transforms the uncorrelated proficiency estimates directly into the construct estimates, where the proficiency correlation effect and the item composite effect cannot be distinguished; or it may be argued that there is no need to separate them.

For the empirical data analysis, guessing parameters were omitted, because either the unidimensional estimates or the fixed values led to convergence problems and unusual results for the other item parameters in the MIRT calibration. Further analysis of the guessing effect in the MEAP data can be conducted when new software can incorporate its estimation directly into the MIRT calibration.

The article by Field et al. (2006) pointed out that reducing two-mode data to separate analyses of one-mode characteristics may lose the duality information in the data. From the perspective of social networks, two-mode data are defined for two sets of social units and contain measurements of a relation from the units in one set to those in the other; accordingly, one-mode data are a set of social actors with relations defined only among them (Doreian et al., 2004). The response matrix can be regarded as two-mode data, since it contains the interactions between persons and items. The simplest aggregation of information in the measurement field is classical test theory, where person information is represented by the NC total score and item information by the difficulty p-value. In contrast, all IRT calibrations, including MIRT, use the information from person-item interactions and simultaneously calibrate the two-mode response data into two sets of single-mode, scale-dependent item and person characteristics.

While IRT models try to find sufficient dimensions and parameters to ensure the local independence assumption between responses, the two-mode analysis in Field et al.'s article directly groups interactive actors and events (rows and columns) into the same block, which is very useful in social network analyses for understanding how groups of actors are linked by conducting groups of events.
If this two-mode method is applied to the response data, then, first of all, the blocking conclusion may be limited to the persons and items in the test. Second, the blocking can be due not only to the same or highly correlated proficiencies measured by the items but also to the difficulty of the items, and the two effects cannot be separated. Third, this blocking does not give a continuous proficiency estimate to each person as the IRT models do. In the measurement field, it is not enough merely to understand that one group of people is similar with regard to one group of items; most importantly, these people need to be put on the same continuum for ordering and comparison, regardless of the characteristics of local items. Even in the cluster analysis step for the item grouping, what is needed for this study is roughly the same orientation of the item vectors, not how the items cluster together with regard to the temporary setup of the coordinate system. All in all, both the submatrix blocking for two-mode data and the one-mode blocking/continuum provided by the analysis are useful, for different problems. In future research, if people need to be categorized into different blocks according to different subsets of items, this two-mode method can be applied in an exploratory way.

Although researchers may agree that tests in practice are truly multidimensional, many of them are reluctant to apply MIRT to real test settings, mostly because of the convergence difficulty in the MIRT calibration and the indeterminacy of the coordinate system for parameter estimates. To address these problems, a confirmatory MIRT model can be fitted after the exploratory version. With the confirmatory version, not all the parameters need to be freely estimated; for example, some elements in the item discrimination vectors can be fixed to zero or other values, or some correlations among proficiencies can be input into the parameter estimation (Yao & Boughton, 2007). Future research can focus on how to extract useful information from the exploratory MIRT calibration for the later confirmatory analysis and on whether the confirmatory analysis leads to more reliable results.

REFERENCES

Ackerman, T. A. (1989). Unidimensional IRT calibration of compensatory and noncompensatory multidimensional items. Applied Psychological Measurement, 13(2), 113-127.

Ackerman, T. A. (1991). The use of unidimensional parameter estimates of multidimensional items in adaptive testing. Applied Psychological Measurement, 15(1), 13-24.

Bock, R. D., Gibbons, R., & Muraki, E. (1988). Full-information item factor analysis. Applied Psychological Measurement, 12(3), 261-280.

Bock, R. D., Gibbons, R., Schilling, S., Muraki, E., Wilson, D., & Wood, R. (2003). TESTFACT 4.0 [Computer software and manual]: Test scoring, item statistics, and item factor analysis. Lincolnwood, IL: Scientific Software International.

Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27(6), 395-414.

DeMars, C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43(2), 145-168.

Doreian, P., Batagelj, V., & Ferligoj, A. (2004). Generalized blockmodeling of two-mode network data. Social Networks, 26(1), 29-53.

Drasgow, F., & Parsons, C. (1983).
Application of unidimensional item response theory models to multidimensional data. Applied Psychological Measurement, 7(2), 189-199.

Field, S., Frank, K. A., Schiller, K., Riegle-Crumb, C., & Muller, C. (2006). Identifying positions from affiliation networks: Preserving the duality of people and events. Social Networks, 28(2), 97-123.

Finch, H. (2006). Comparison of the performance of varimax and promax rotations: Factor structure recovery for dichotomous items. Journal of Educational Measurement, 43(1), 39-52.

Fraser, C. (1988). NOHARM II: A Fortran program for fitting unidimensional and multidimensional normal ogive models in latent trait theory. The University of New England, Center for Behavioral Studies, Armidale, Australia.

Fraser, C., & McDonald, R. (1988). NOHARM: Least squares item factor analysis. Multivariate Behavioral Research, 23(2), 267-269.

Gessaroli, M. E., & Champlain, A. F. D. (1996). Using an approximate chi-square statistic to test the number of dimensions underlying the responses to a set of items. Journal of Educational Measurement, 33(2), 157-179.

Haberman, S. J. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33(2), 204-229.

Hambleton, R. K., & Rovinelli, R. J. (1986). Assessing the dimensionality of a set of test items. Applied Psychological Measurement, 10(3), 287-302.

Hendrickson, A. E., & White, P. O. (1964). Promax: A quick method for rotation to oblique simple structure. The British Journal of Statistical Psychology, 17, 65-70.

Hirsch, T. M., & Miller, T. R. (1991, June). Evaluation of a multidimensional item response theory procedure for investigating test dimensionality. Paper presented at the annual meeting of the Psychometric Society, New Brunswick, NJ.

Kaiser, H. (1958). The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3), 187-200.

Kao, S.-C. (2007). The new goodness-of-fit index for the multidimensional item response model. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.

Kim, J.-P. (2001). Proximity measures and cluster analysis in multidimensional item response theory. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.

Ledesma, R. D., & Valero-Mora, P. (2007). Determining the number of factors to retain in EFA: An easy-to-use computer program for carrying out parallel analysis.
Practical Assessment, Research and Evaluation, 12(2), 1-11.

Levine, M. V., & Drasgow, F. (1982). Appropriateness measurement: Review, critique and validating studies. British Journal of Mathematical and Statistical Psychology, 35, 42-56.

Li, Y. H., & Lissitz, R. W. (2000). An evaluation of the accuracy of multidimensional IRT linking. Applied Psychological Measurement, 24(2), 115-138.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Luecht, R. M., & Miller, T. R. (1992). Unidimensional calibrations and interpretations of composite traits for multidimensional tests. Applied Psychological Measurement, 16(3), 279-293.

McDonald, R. P. (1997). Normal-ogive multidimensional model. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 257-270). New York: Springer.

Miller, T. R., & Hirsch, T. M. (1992). Cluster analysis of angular data in applications of multidimensional item-response theory. Applied Measurement in Education, 5(3), 193-211.

Muraki, E., & Engelhard, G. (1985). Full-information item factor analysis: Applications of EAP scores. Applied Psychological Measurement, 9(4), 417-430.

Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4(3), 207-230.

Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9(4), 401-412.

Reckase, M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York: Springer.

Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.

Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1988). Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25(3), 193-203.

Reckase, M. D., Carlson, J. E., Ackerman, T. A., & Spray, J. A. (1986, June). The interpretation of unidimensional IRT parameters estimated from multidimensional data. Paper presented at the annual meeting of the Psychometric Society, Toronto, Canada.

Reckase, M. D., & Hirsch, T. M. (1991, April). Interpretation of number-correct scores when the true number of dimensions assessed by a test is greater than two. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.

Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15(4), 361-373.

Roussos, L. A., Stout, W. F., & Marden, J. I. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35(1), 1-30.

Roznowski, M., Tucker, L. R., & Humphreys, L. G. (1991). Three approaches to determining the dimensionality of binary items. Applied Psychological Measurement, 15(2), 109-127.

Sinharay, S., Haberman, S., & Puhan, G. (2007). Subscores based on classical test theory: To report or not to report. Educational Measurement: Issues and Practice, 26(4), 21-28.

Spray, J. A., Davey, T. C., Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1990). Comparison of two logistic multidimensional item response theory models (Tech. Rep. No. ONR 90-8). Iowa City, IA: ACT.

Stoer, J., & Bulirsch, R. (2002). Introduction to numerical analysis. New York, NY: Springer-Verlag.

Stone, C. A., & Yeh, C.-C. (2006). Assessing the dimensionality and factor structure of multiple-choice exams: An empirical comparison of methods using the Multistate Bar Examination. Educational and Psychological Measurement, 66(2), 193-214.

Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52(4), 589-617.

Stout, W., Douglas, B., Junker, B., & Roussos, L. (1999). DIMTEST [Computer software]. The William Stout Institute for Measurement, Champaign, IL.

Stout, W., Froelich, A., & Gao, F. (2001). Using resampling to produce an improved DIMTEST procedure. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 357-375). New York: Springer.

Sympson, J. (1978). A model for testing with multidimensional items. In D. J. Weiss (Ed.), Proceedings of the 1977 Computerized Adaptive Testing Conference.
University of Minnesota, Minneapolis.

Tanaka, J. (1993). Multifaceted conceptions of fit in structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models. Newbury Park, CA: Sage.

Thurstone, L. (1947). Multiple factor analysis. Chicago, IL: The University of Chicago Press.

Wainer, H., Sheehan, K. M., & Wang, X. (2000). Some paths toward making Praxis scores more useful. Journal of Educational Measurement, 37(2), 113-140.

Wang, M. (1985). Fitting a unidimensional model to multidimensional item response data: The effect of latent space misspecification on the application of IRT (Tech. Rep. No. MW: 6-24-85). Iowa City, IA: University of Iowa.

Wang, M. (1986, April). Fitting a unidimensional model to multidimensional item response data. Paper presented at the ONR contractors conference, Gatlinburg, TN.

Wang, W.-C., Wilson, M. R., & Adams, R. J. (1997). Rasch models for multidimensionality between items and within items. In M. R. Wilson, G. Engelhard, & K. Draney (Eds.), Objective measurement: Theory into practice (Vol. 4, pp. 139-155). Norwood, NJ: Ablex.

Yao, L., & Boughton, K. A. (2007). A multidimensional item response modeling approach for improving subscale proficiency estimation and classification. Applied Psychological Measurement, 31(2), 83-105.

Yen, W. M. (1987, June). A Bayesian/IRT index of objective performance. Paper presented at the annual meeting of the Psychometric Society, Montreal, Quebec, Canada.

Zimowski, M. M., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG for Windows [Computer software]. Lincolnwood, IL: Scientific Software International.