This is to certify that the dissertation entitled

SOME GENERALIZATIONS OF THE RASCH MODEL: AN APPLICATION OF THE HIERARCHICAL GENERALIZED LINEAR MODEL

presented by Akihito Kamata has been accepted towards fulfillment of the requirements for the degree of Doctor of Philosophy in Counseling, Educational Psychology and Special Education.

Major professor
Date: December 15, 1998

SOME GENERALIZATIONS OF THE RASCH MODEL: AN APPLICATION OF THE HIERARCHICAL GENERALIZED LINEAR MODEL

By

Akihito Kamata

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology and Special Education

1998

ABSTRACT

SOME GENERALIZATIONS OF THE RASCH MODEL: AN APPLICATION OF THE HIERARCHICAL GENERALIZED LINEAR MODEL

By Akihito Kamata

In this dissertation the Rasch model is generalized as a special case of the hierarchical generalized linear model (HGLM), facilitating various extensions. First, the standard binary-response Rasch model is reformulated according to the HGLM specifications. Since the reformulated model is a special case of the logistic HGLM, it is referred to as a one-parameter hierarchical generalized linear logistic model (1-P HGLLM). Illustrative analyses using hypothetical data sets reveal that parameters estimated via the HLM program are very similar to the estimates from the BILOG program. A parameter recovery study reveals that both item and person parameters are estimated properly by the HLM program. Three extensions are presented, including (a) a model with a person-level predictor, (b) a multidimensional model, and (c) a multi-level model. In the first extension, a person-level predictor is added to the model such that the person abilities, as well as the item parameters, are decomposed into a linear combination of more than one parameter. The coefficient of the person-level predictor is properly estimated in the parameter recovery study, as well as in an illustrative analysis. In the second extension, a multidimensional model is formulated. It is shown that the correlation coefficients between latent traits are properly estimated by the HLM program, and that the multidimensional analysis can distinguish people who have the same raw scores. In the last extension, a three-level 1-P HGLLM, with an additional level for schools, is formulated. An illustrative analysis demonstrates that the empirical Bayes estimation enables one to distinguish people who have the same raw scores based on which school each individual attended. Also, the analysis shows that the empirical Bayes estimation in the three-level model can improve estimation of school means, as well as of person abilities.
Contributions of this work include (a) a pedagogical presentation of this formulation to learners of item response theory (IRT) models, to facilitate understanding of IRT models from a different perspective; (b) an easy application of this formulation to conduct a one-step analysis of binary-response test data; and (c) readily accessible multidimensional, as well as multi-level, analyses of binary-response test data. Several suggestions for future research are also mentioned.

This dissertation is dedicated to my parents, Toshikatsu and Toshiko Kamata.

ACKNOWLEDGEMENT

This dissertation would not have been completed without the assistance and encouragement of many people. I am deeply thankful to Dr. Betsy Becker, my advisor, friend, and chair of my dissertation committee, for her support, direction, patience, and encouragement throughout my doctoral studies. I wish to express my deep gratitude to all my committee members, Dr. Stephen Raudenbush, Dr. Kenneth Frank, Dr. Alexander von Eye, and Dr. Susan Phillips, for their thoughtful and valuable comments. They all encouraged me to pursue and complete this dissertation topic. I also thank all the other people who gave me valuable insights and comments on this dissertation work: Yasuo Miyazaki, Michael Rodriguez, and Dr. Mark Reckase, to name a few. I am also thankful to my mentors from my time as an undergraduate student at Yamanashi University, Professor Ryoichi Yamada and Dr. Kunihiko Ogawa, who encouraged me to pursue graduate degrees. They fought for their lives while I was working on this dissertation. Unfortunately, they did not have a chance to see its completion. Finally, I thank my parents, who taught me the value of the hard work that enabled me to complete this challenging project. Also, I thank my wife, Yasuyo, who always believes in me and has been so generous about the time I have spent on my doctoral studies and this dissertation.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER 1  INTRODUCTION
1.1. Background
1.2. Review of Literature
1.2.1. Linear Logistic Test Model
1.2.2. Many-Facet Model
1.2.3. Investigating Item Parameter Drift
1.2.4. Reformulating IRT Models as GLM
1.2.5. Random Coefficient Multinomial Logit Model
1.3. Statement of Purpose

CHAPTER 2  GENERALIZING THE STANDARD BINARY-RESPONSE RASCH MODEL
2.1. Model Formulation
2.1.1. The Standard Rasch Model
2.1.2. Formulation of 1-P HGLLM
2.1.3. Estimation
2.2. Illustrative Analysis of Hypothetical Data
2.3. Parameter Recovery Study
2.3.1. Methods
2.3.2. Results
2.4. Comments on the Simulated Data
2.5. Summary and Comments on Practical Issues

CHAPTER 3  MODEL WITH PERSON-LEVEL PREDICTORS
3.1. Model
3.2. Illustrative Analysis
3.3. Parameter Recovery
3.4. DIF Model
3.5. Summary and Comments on Practical Issues

CHAPTER 4  MULTIDIMENSIONAL MODELS
4.1. Model
4.2. Illustrative Analysis
4.3. Parameter Recovery
4.4. Summary and Comments on Practical Issues

CHAPTER 5  THREE-LEVEL MODEL
5.1. Model
5.2. Illustrative Analysis
5.3. Summary and Comments on Practical Issues

CHAPTER 6  CONCLUSIONS
6.1. Summary
6.2. Comments on Practical Issues
6.3. Suggestions for Future Research and Recommendations

APPENDIX A
APPENDIX B
REFERENCES
LIST OF TABLES

1. The item difficulties used in illustrative analyses and simulation studies
2. Person parameter estimates from the HLM and BILOG programs
3. Item parameter estimates from the HLM and BILOG programs
4. Results of parameter recovery study for the 1-P HGLLM
5. Item parameter estimates from the model with a person-level predictor
6. The layout of the simulation study for the model with a person-level predictor variable
7. The results of parameter recovery study for the model with a person-level predictor
8. Item parameter estimates from the DIF model
9. Person parameter estimates from the DIF model
10. Item parameter estimates of multidimensional data
11. Person parameter estimates of multidimensional data
12. The layout of the simulation study for the multidimensional model
13. The results from the parameter recovery study for the multidimensional model
14. Item parameter estimates from 2-level and 3-level models
15. Estimates of school means from 2-level and 3-level models

LIST OF FIGURES

1. Mean correlation between true and estimated item parameters and their standard deviations
2. Root mean squared errors of τ̂ and standard deviations of τ̂
3. The relationship between person parameter estimates from the model with a predictor and estimates from the model without a predictor
4. The root mean squared errors of the coefficient of the predictor
5. The mean and the standard deviation of the estimate of the slope coefficient of the predictor
6. Plot of person parameter estimates from UD and MD models
7. Root mean squared errors of τ̂
8. Mean correlation between true and estimated item parameters and their standard deviations
9. Mean correlation between latent traits and their standard deviations
10. Person parameter estimates from 2-level and 3-level models
11. The relationship between the estimates from the 2-level model and the linear combination of the estimates from the 3-level model
12. Person parameter estimates from 2-level and 3-level models: comparison to the true values

Chapter 1
Introduction

In this chapter, some background for this research, along with the importance of this study, is described first. Second, related literature is reviewed. The review clarifies that similar approaches were explored in the past using other models and, at the same time, what is new about this study. Third, the purposes of this study are stated.

1.1. Background

When investigating the effects of student characteristics on student performance on a test, henceforth termed "ability," a two-step analysis of test data with student-level predictors is common practice. Student abilities are estimated via a standard item response theory (IRT) model as the first step in such a two-step analysis. Student ability scores from an IRT model are originally expressed on the logit scale, but the scores are commonly transformed linearly to another scale when they are reported. Then, in the second step, the ability scores are used as an outcome variable, and student characteristic variables are used as predictors in a simple linear model, such as a multiple regression. Such an analysis may be done routinely whenever existing IRT-scale-based test scores are used as an outcome variable in a regression analysis. Two possible problems are associated with such a two-step analysis.
First, students' ability estimates obtained via an IRT model have standard errors of different magnitudes at different ability levels. Scores in the middle of the distribution are associated with lower standard errors, while scores farther from the middle of the distribution are associated with higher standard errors (Hambleton & Swaminathan, 1985). Thus the test scores, as an outcome variable, have heteroscedastic measurement errors. However, a two-step analysis typically ignores this heteroscedastic nature of the standard errors of measurement of the dependent variable. For this reason, a two-step analysis may not provide accurate results. Second, when marginal maximum likelihood estimation (MMLE) is used to estimate item parameters, person ability estimates derived from either the maximum likelihood or the mean or mode of the posterior distribution (i.e., EAP or MAP) are biased and inconsistent (Goldstein, 1980; Lord, 1984). Inconsistency of the outcome variable is problematic in regression models.

One possible solution to the first problem is to take into account the different standard errors of measurement of the dependent variable in the second step of the analysis, e.g., by applying a weighted least squares (WLS) solution in the regression estimation instead of the ordinary least squares solution. Unfortunately, this approach addresses only the problem of unequal standard errors of the dependent variable; it will not solve the problem of bias and inconsistency of the dependent variable. Also, this approach can produce distorted results if the standard errors are estimated poorly. Another solution is to perform a one-step analysis that includes student characteristic variables as predictors in an IRT model (e.g., Zwinderman, 1991). By including student-level predictors in an IRT model, the second step of the two-step analysis is embedded in the first step. In other words, a regression model that estimates the effects of student characteristics is embedded within an IRT model that estimates students' abilities. This way, the effects of the student-level predictors are estimated simultaneously with the item effects and the person parameters. As a result, the heteroscedastic nature of the standard errors, as well as the bias and inconsistency of the person parameter estimates, does not affect the estimates of the effects of student-level predictors on the outcome variable. Also, one can expect improved estimation of the effects of predictors on a latent trait via a one-step analysis rather than a two-step analysis. This study follows in this tradition, as described in the subsequent chapters. In addition, we can expect improved precision for estimates of item and person parameters (Mislevy, 1987), although that is not the focus of this study.

1.2. Review of Literature

To date, attempts to generalize IRT models have been made by several authors (detailed below). Such generalizations are achieved mainly by adding predictor variables as a linear combination of parameters to IRT models. Also, attempts have been made to reformulate IRT models in terms of the generalized linear model (GLM). This section summarizes such models that have been proposed in the past.

1.2.1. Linear Logistic Test Model

Fischer (1973, 1983a, 1995) was probably the first to incorporate a regression analysis in an IRT model. He generalized the standard binary Rasch model by decomposing the item difficulty parameter into a linear combination of more than one item-varying parameter.
More specifically, in his generalization the item difficulty parameter δ_i for item i is decomposed into p parameters α_l (l = 1, …, p), such that

δ_i = Σ_{l=1}^{p} w_il α_l + c,

where the α_l are called "basic parameters" (Fischer, 1973, 1983a), which are the decomposed parameters; w_il is a coefficient, or weight, for parameter l and item i; and c is a normalization constant. This approach enables one to include person characteristic variables as linear constraints in the Rasch model. The model can be applied to measure such things as the effect of experimental conditions on item difficulty and the impact of educational treatments on ability. This approach has also been applied to measuring change in unidimensional latent traits (Fischer, 1983a, 1983b). In this specific application of the model, any change in person parameters occurring between time points is described as a change in the item parameters, instead of a change in the person parameters.

1.2.2. Many-Facet Model

Linacre (1989), on the other hand, added an indicator variable for raters as a linear constraint to polytomous Rasch models. This becomes important when students' scores are rated by different raters, in order to detect different degrees of severity between the raters, such as in a performance test. Linacre considered the indicator variable for raters a "facet" additional to those for items and examinees. Therefore, he called the model a "many-facet Rasch model" (Linacre, 1989). This approach can also be seen as a three-factorial logit model.

1.2.3. Investigating Item Parameter Drift

Bock, Muraki, and Pfeiffenberger (1988) extended the three-parameter logistic model by adding a variable for the time points at which an item is tested. They decomposed each item difficulty parameter into a linear combination of an "item-specific" parameter and a time-point indicator. More specifically, the difficulty parameter for item j at time point k was expressed as

b_jk = b_j + β_1j t_k + β_2j t_k²,

where b_j is the "item-specific" difficulty parameter, t_k is an indicator for time point k, β_1j is a linear effect of time on item j, and β_2j is a quadratic effect of time on item j. In other words, the decomposition can be viewed as a "growth model" for the item parameters. As a result, this approach was used for detecting change in item difficulty parameter values across multiple time points, in effect, item difficulty parameter drift. In estimating the parameters, they treated the person parameters as random parameters and incorporated marginal maximum likelihood estimation (MMLE) to estimate the item parameters and the effects of time points.

1.2.4. Reformulating IRT Models as GLM

Mellenbergh and Vijn (1981) reformulated the Rasch model as a log-linear model. However, like Fischer's and Linacre's generalizations, their reformulation treated both item and person parameters as fixed parameters. As a result, in most applications the model would have so many parameters to be estimated (the number of items plus the number of examinees) that it could not rely on the parameter estimation algorithms used for log-linear models in general. Instead, Mellenbergh and Vijn had to rely on a modified conditional maximum likelihood (CML) estimation. On the other hand, Mellenbergh (1994) later successfully summarized different IRT models in terms of the generalized linear model (GLM) (McCullagh & Nelder, 1989), including both dichotomous and polytomous one- and two-parameter IRT models.
The three-parameter logistic model is classified as a so-called "left-side added" model (Thissen & Steinberg, 1986) and cannot be reformulated within the GLM family. Generally speaking, the GLM provides a way to estimate a function of the mean responses as a linear function of the values of some set of predictors. However, Mellenbergh did not address the issue of providing parameter-estimation algorithms for the reformulated models.

One critical limitation of the above generalizations is that, except for the model of Bock et al. (1988), they treat all parameters, including the item and person parameters and the parameters associated with predictors, as fixed parameters. As a result, the models are formulated within a single level, despite the fact that the predictor variables may represent measures at different levels. This becomes especially obvious when dealing with variables other than person or item characteristics as predictors. In the many-facet Rasch model, for example, when raters are included as a predictor variable and rater-characteristic variables are also of interest, the above approach would treat all variables as being associated with parameters that are fixed and independent of all other parameters. However, it may be more reasonable to treat each rater effect as a random parameter and to let the rater-characteristic variables be associated with fixed parameters that characterize all raters, because rater characteristics vary among raters.

1.2.5. Random Coefficient Multinomial Logit Model

More recently, Adams and Wilson (1996) proposed another model with person-level predictors, called the random coefficient multinomial logit model (RCMLM). It is general enough to include a wide range of models from the Rasch family, both dichotomous and polytomous. This model, unlike the other models mentioned above, considers person parameters to be random (or random coefficients, in their words). Adams, Wilson, and Wang (1997) further generalized the RCMLM to a multidimensional model (MRCMLM). These models are technically formulated as multi-level models because of the presence of the random parameters. However, the authors at first did not explicitly recognize them as multi-level models. The random-parameter formulation was used mainly for the purpose of parameter estimation and for decomposing each of the person parameters into a linear combination of multiple parameters. Later, Adams, Wilson, and Wu (1997) explicitly recognized the RCMLM and MRCMLM as multi-level models, in which person-characteristic variables could be added as fixed parameters related to the latent traits. This was probably the first time that a regular IRT model was explicitly conceptualized as a multi-level model. However, their approach was limited to a two-level formulation; therefore, their model was only able to include person-varying variables as linear predictors.

1.3. Statement of Purpose

This study shows another way to generalize the Rasch model as a multi-level model. As a result of this generalization, the new model can include level-2 predictors when the Rasch model is formulated as a two-level model. Although similar generalizations have been proposed by several others (e.g., Adams and his colleagues; Zwinderman, 1991), the generalization in this study is distinct from earlier work in two ways.
First, the generalized model in this study makes explicit connections between two seemingly unrelated but well understood models: the hierarchical generalized linear model (HGLM) (Raudenbush, 1995; Stiratelli, Laird, & Ware, 1984; Wong & Mason, 1985) and the Rasch model. Second, the generalized model in this study enables one to formulate models with more than two levels and to estimate their parameters, while the previous formulation (Adams, Wilson, & Wu, 1997) allowed only two levels.

First, the Rasch model is reformulated as a special case of the HGLM. The HGLM is an extension of the generalized linear model (McCullagh & Nelder, 1989) to hierarchical data. This study, accordingly, treats item response data as hierarchical data, in which items are nested within people. The generalized model is referred to as the one-parameter hierarchical generalized linear logistic model (1-P HGLLM). Estimation of person and item parameters per se is not the purpose of the 1-P HGLLM. In order for the approach to be applied in extended models, however, it is essential to properly estimate person and item parameters for the Rasch-equivalent case (the simplest case among the generalized models). Therefore, one primary purpose of this study is to demonstrate the equivalence between the 1-P HGLLM and the Rasch model, both algebraically and numerically.

This study then shows extensions of the generalized model. First, a model with a person-level predictor is presented. This includes a case where the person ability parameters are decomposed into two parameters, as well as a differential item functioning (DIF) model, where the item parameters are composed of two parameters, including a person-level predictor. Second, a multidimensional model, in which more than one latent trait is assumed, is presented. An illustrative analysis and a parameter recovery study are given for this case. Finally, a three-level model is presented, where an additional level (e.g., for schools, classes, etc.) is added. These extensions suggest various possible applications useful to educators, test practitioners, and researchers alike.

Chapter 2
Generalizing the Standard Binary-Response Rasch Model

In this chapter the standard binary-response Rasch model is generalized as a special case of the HGLM. First, the model is formulated according to the HGLM framework. Second, an illustrative data analysis is presented in order to show how an actual data analysis using the model would take place. Third, a simulation study demonstrates parameter estimation by the HLM program. This chapter deals with a very basic model, the Rasch model equivalent. However, estimating the Rasch item and person parameters per se is not the primary purpose of this generalization, as mentioned in the previous chapter. The primary purpose of the generalization is to expand the model to a Rasch model with predictors, a multidimensional Rasch model, and a multilevel Rasch model, which will be discussed in the following chapters. This chapter provides a sound basis for those extensions.

2.1. Model Formulation

In this section, the standard binary-response Rasch model is first presented for the purpose of making a clear connection with the HGLM. Then, the Rasch model is carefully reformulated using the HGLM framework. Specifications of the HGLM framework include a sampling distribution for the item responses, its expectation and variance, a link function, a level-1 structural model, and level-2 models.
2.1.1. The Standard Rasch Model

Let p_ij be the probability that person j (j = 1, …, n) gets item i correct, θ_j be the latent trait of person j, δ_i be the difficulty of item i (i = 1, …, k), and y_ij be a binary outcome indicating the score of the jth person on the ith item (y_ij = 1 if the person answers the item correctly, and y_ij = 0 if the person answers incorrectly). Then, the conditional distribution of the outcome y_ij, given p_ij, is a binomial distribution with parameters 1 and p_ij, which is also known as a Bernoulli distribution with parameter p_ij. Specifically,

y_ij | p_ij ~ B(1, p_ij).   (1)

Based on the above probability model, the Rasch model is defined to be

p_ij = exp[θ_j - δ_i] / (1 + exp[θ_j - δ_i]) = 1 / (1 + exp[-(θ_j - δ_i)]),   (2)

following Wright and Stone (1979), which is equivalent to stating

log[p_ij / (1 - p_ij)] = θ_j - δ_i.   (3)

In the above Rasch model, the parameters θ_j and δ_i are considered to be fixed; in effect, no distributional assumptions are made for the parameters. There are n + k - 2 parameters to be estimated in the model. However, the unique number of parameters to be estimated is reduced to 2k + 1, because people with the same raw score have the same ability estimate in the Rasch model.

As already mentioned, attempts to reformulate Rasch models in terms of a linear model have been made by several other authors. Some of the parameter estimation methods proposed by those authors are based on conditional maximum likelihood estimation (CMLE) (Andersen, 1972). In CMLE both θ and δ are considered to be fixed parameters, as the above Rasch model assumes. The CMLE is based on the sufficient statistics for the Rasch model, in effect, the numbers of correct responses for each person and each item. One advantage of CMLE is that the likelihood function does not contain θ = (θ_1, …, θ_n), yet it produces consistent and efficient estimates of the item parameters. Also, CMLE does not require any assumptions about the population distribution of the θ values, because both item and person parameters are considered to be fixed parameters. However, this approach is strictly limited to the Rasch family of models.

Another perspective on the standard Rasch model is that the latent trait θ_j can be considered to be a random variable (e.g., Andersen and Madsen, 1977; Thissen, 1982). In other words, the examinees represent a random sample from a population in which ability is distributed according to a specified density function. In general, the density function of the latent trait is a function of θ_j given ζ, where ζ is the vector containing the parameters of the examinee population ability distribution; we write

g(θ | ζ).   (4)

Often the standard normal distribution, θ_j ~ N(0, 1), is used for the density function g, but it could have other forms, of course. Item and person parameters based on the assumption of random person parameters are estimated via marginal maximum likelihood estimation (MMLE) (Bock & Aitkin, 1981; Bock & Lieberman, 1970). To date, MMLE is considered to be the only available likelihood method that makes use of the population distribution of θ. With MMLE, one integrates the person parameters out of the likelihood function so that the estimation of the item parameters does not depend on the person parameters. As a result, MMLE estimates the item parameters first and then estimates the person parameters afterwards, under the assumption that the estimated item parameters have known parameter values.
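To make the idea of integrating the person parameters out of the likelihood concrete, the short sketch below computes the marginal probability of one response pattern under the Rasch model with a standard normal ability distribution, using Gauss-Hermite quadrature. This is only a didactic illustration of the MMLE integrand, not the estimation routine used by BILOG; the function name and the choice of 21 quadrature points are my own.

```python
import numpy as np

def marginal_likelihood(y_j, deltas, n_quad=21):
    """Marginal probability of one person's response pattern y_j under the
    Rasch model, integrating the ability theta out over a N(0, 1) prior
    with Gauss-Hermite quadrature (a didactic sketch of the MMLE idea)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    theta = np.sqrt(2.0) * nodes            # change of variable for N(0, 1)
    w = weights / np.sqrt(np.pi)
    # P(y_ij = 1 | theta, delta_i) at every quadrature point, for every item
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - deltas[None, :])))
    like = np.prod(np.where(y_j[None, :] == 1, p, 1.0 - p), axis=1)
    return float(np.sum(w * like))

deltas = np.array([-1.0, 0.0, 1.0])         # hypothetical item difficulties
print(marginal_likelihood(np.array([1, 1, 0]), deltas))
```

Maximizing the product of such marginal probabilities over all examinees with respect to the item difficulties is what MMLE does before the person parameters are estimated in a second step.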
2.1.2. Formulation of 1-P HGLLM

In this section, the unidimensional 1-P HGLLM is formulated following the GLM framework. Then the 1-P HGLLM is shown to be algebraically equivalent to the Rasch model with θ_j being a random variable. According to the GLM framework, a sampling distribution of item responses, its expectation and variance, a link function, and a linear predictor model have to be specified. Then, following the HLM framework, level-2 models are formulated. Here the linear predictor model is considered to be the level-1 model, in effect, an item-level model. Consequently, the level-2 models are person-level models. In this formulation, the ability parameter θ_j is considered to be a random parameter.

For item i (i = 1, …, k) and person j (j = 1, …, n), a binomial sampling model with one trial is employed. This is the same as Equation 1 in the regular Rasch model. Thus, the expected value and variance of y_ij are

E(y_ij | p_ij) = p_ij  and  Var(y_ij | p_ij) = p_ij(1 - p_ij).   (5)

When the level-1 sampling model is binomial, a GLM can utilize one of several link functions, including the logit, probit, and complementary log-log functions. In this case, the logit link function

η_ij = log[p_ij / (1 - p_ij)]   (6)

is used. This is equivalent to Equation 3 if η_ij = θ_j - δ_i. Now the level-1 structural model, in effect the level-1 linear predictor model, is the item-level model,

η_ij = log[p_ij / (1 - p_ij)] = β_0j + β_1j X_1ij + β_2j X_2ij + … + β_(k-1)j X_(k-1)ij
     = β_0j + Σ_{q=1}^{k-1} β_qj X_qij,   (7)

where X_qij is the qth dummy variable for person j, with value -1 when q = i and 0 when q ≠ i, for item i. The coefficient β_0j is an intercept term, and β_qj is the coefficient associated with X_qij, where q = 1, …, k - 1. Equation 7 reduces to

log[p_ij / (1 - p_ij)] = η_ij = β_0j - β_qj   (8)

for the item i that is associated with the qth dummy variable, given X_qij = -1 for q = i and 0 for q ≠ i. This way, β_qj represents the effect of the qth dummy variable, and consequently the effect of item i when i = q. Further, Equation 7 can be written in matrix form as

η_j = W_j B_j,   (9)

where W_j = [d_j  X_j] and B_j = (β_0j, β_1j, …, β_(k-1)j)'. The purpose of writing Equation 7 in matrix form is to show how the data are laid out. Refer to Appendix A for the full-matrix representation of Equation 9. Here, d_j is a k × 1 column vector whose elements are all 1, and X_j is a k × (k - 1) matrix whose diagonal elements for rows 1 through k - 1 are -1 and whose off-diagonal elements are 0. No indicator variable is associated with the kth item because we set the constraint β_kj = 0. As a result, all elements in row k of X_j are zeros. This constraint is needed so that the design matrix has full rank. A parameter other than β_kj could be constrained to 0, of course, but β_kj was chosen for convenience. Here, β_0j is an intercept term, and a value of 1 is assigned to d_ij for all observations. Therefore, β_0j is considered to be an overall effect common to all items. On the other hand, β_qj represents the specific effect of the qth dummy variable, for q = 1, …, k. The constraint β_kj = 0 means that the effect of the kth item, which is subtracted from the overall effect, is assumed to be zero. Then the probability that person j answers item i correctly is expressed as

p_ij = 1 / (1 + exp[-η_ij]),   (10)

which follows from Equation 6.
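Because the level-1 model treats each item response as a separate record nested within a person, the usual person-by-item response matrix has to be rearranged into item-level records carrying the k - 1 dummy variables of Equation 7 before a two-level program can be applied. The following minimal sketch shows one way to build that layout; the function name and column ordering are my own and do not reflect the exact file format any particular program expects.

```python
import numpy as np

def to_item_level(y):
    """Convert an n-by-k binary response matrix into item-level records:
    one row per person-item pair, containing the person id, item id, the
    response y_ij, and k-1 dummy variables X_1 ... X_(k-1) coded -1 for the
    matching item and 0 otherwise (item k is the reference item)."""
    n, k = y.shape
    rows = []
    for j in range(n):
        for i in range(k):
            x = [0] * (k - 1)
            if i < k - 1:
                x[i] = -1            # dummy for item i, as in Equation 7
            rows.append([j + 1, i + 1, int(y[j, i])] + x)
    return np.array(rows)

y = np.array([[1, 0, 1],             # a tiny 2-person, 3-item example
              [0, 0, 1]])
print(to_item_level(y))
```

Stacking the rows for person j reproduces the matrix W_j of Equation 9 (with the response column set aside), so the long-format file and the matrix formulation carry the same information.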
It may at first seem a little odd to have the person subscript j on the βs in Equation 7, because the effects of items (item difficulties) should be constant across people. However, the level-1 model is the item-level model, and we do not assume that the β_qj are constant across people at this level of the model. In fact, the β_qj are not the parameters that are considered to be item difficulties. The parameters of interest are defined in the level-2 model, and may be characterized as constant across people in the level-2 model.

The level-2 models are person-level models. In the level-1 model, β_0j is the mean item effect (or overall mean) with the constraint β_kj = 0. Since β_0j is treated as a parameter that represents the common effect of all items in the level-1 model, it must be assumed in the level-2 models that β_0j is a random effect across people. This way, a latent trait that is common to all items but varies across people can be modeled. Also, while the level-1 model did not assume that β_1j through β_(k-1)j were equal across people, the level-2 models may specify that the item effects are constant across people. Therefore, the level-2 models are

β_0j = γ_00 + u_0j,
β_1j = γ_10,
β_2j = γ_20,
  ⋮
β_(k-1)j = γ_(k-1)0,   (11)

where u_0j is the random component of β_0j and is distributed as N(0, τ); that is, u_0j is normally distributed with a mean of 0 and a variance of τ. The level-1 model together with the level-2 models shows that the item parameters are fixed across people and vary across items, because no random terms are added to β_1j through β_(k-1)j, while the latent trait (person parameter) varies across people and is fixed across items. As a result, when the level-1 and level-2 models are combined, the linear predictor model, Equation 7, becomes η_ij = γ_00 + u_0j - γ_q0 for a specific person j and a specific item i that is associated with the qth dummy variable. Then, the probability that person j answers a specific item i correctly is expressed as

p_ij = 1 / (1 + exp{-[u_0j - (γ_q0 - γ_00)]}),   (12)

where i = q. This is algebraically equivalent to the Rasch model in Equation 2, where θ_j = u_0j and δ_i = γ_q0 - γ_00 for i = q. Note that both δ_i and γ_q0 - γ_00 are fixed item parameters. For the person parameters, θ_j in the Rasch model is a fixed parameter, while u_0j in the HGLM framework is a random variable, with u_0j ~ N(0, τ).

According to the work of Neyman and Scott (1948) on non-linear models for panel data, inconsistency of parameter estimators occurs if item and person parameters are estimated simultaneously. The 1-P HGLLM approach avoids this problem by treating the person parameters as random components of the intercept term, in effect, residuals in the level-2 model. In other words, it does not treat the person parameters as parameters to be estimated. As a result, only k + 1 parameters need to be estimated (i.e., the number of items plus one). When the Rasch model is reformulated in terms of a non-hierarchical GLM, both person and item parameters have to be treated as fixed parameters at the same level of the model (Mellenbergh, 1994). This results in many parameters to be estimated in a regression equation (i.e., k + n - 2 regression coefficients), and leads to inconsistency of parameter estimates. This is one strong advantage of applying the HGLM over the GLM when item responses are formulated in the framework of a linear logistic model.

2.1.3. Estimation
To estimate parameters in the 1-P HGLLM, I use the currently available algorithm in the HLM program (Bryk, Raudenbush, & Congdon, 1996), in which the HGLM is incorporated. The HGLM estimation is referred to as a "doubly-iterative algorithm" (Raudenbush, 1995), that is, a combination of the GLM and HLM estimation procedures. Both the GLM and HLM estimation procedures are performed iteratively, within and between the two procedures, and this is why the combination of the two procedures is called a doubly-iterative algorithm. The iterations in which the linearized dependent variable and weights are updated (the GLM step) are referred to as macro iterations, while the iterations of the weighted HLM analysis within each macro iteration are referred to as micro iterations. In the GLM estimation, the penalized quasi-likelihood (PQL) (Breslow & Clayton, 1993) is maximized in order to obtain the most plausible estimates of the linearized dependent variables, Z_ij, and the weights, w_ij. The Z_ij and w_ij are defined as

w_ij = p_ij(1 - p_ij)   (13)

and

Z_ij = η_ij + (y_ij - p_ij) / w_ij,   (14)

where p_ij, y_ij, and η_ij are defined above. The HLM estimation of the level-2 residuals u_0j, the person parameter estimates, is performed by empirical Bayes (EB) methods, and the estimation of the level-2 coefficients γ_q0, the item parameters, is done by generalized least squares (GLS).

The doubly-iterative algorithm works in the following way. First, initial estimates of the predicted probability values, p_ij^(0), the linearized dependent variables, Z_ij^(0), and the weights, w_ij^(0), are computed by Equations 13 and 14. Then, given those initial estimates, a weighted HLM analysis with Z_ij^(0) as the level-1 outcome is computed (micro iterations). Based on the new predicted values from the weighted HLM analysis, new linearized dependent variables, Z_ij^(1), and weights, w_ij^(1), are computed (macro iteration). Then, a new weighted HLM analysis with Z_ij^(1) as the level-1 outcome is computed (micro iterations). This iterative process is repeated until it reaches a predetermined convergence criterion. As a result, the HGLM produces approximate joint posterior modes of the distributions of the level-1 and level-2 parameters, given a variance-covariance matrix estimated from a normal approximation to the restricted likelihood, in effect, the PQL. This approach is able to estimate parameters in the presence of missing data. Also, abilities for people with zero or perfect scores can be estimated, whereas such cases are simply discarded by some Rasch IRT software. For more details on parameter estimation, readers are referred to Raudenbush (1995).
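To make the structure of the doubly-iterative scheme concrete, the sketch below implements a deliberately simplified PQL-style loop for the 1-P HGLLM in Python: each macro iteration recomputes the working variate and weights of Equations 13 and 14, and the micro step solves the mixed-model equations of the resulting weighted linear model. The function name, the dense-matrix linear algebra, and the EM-type update of τ are my own simplifications; this is not the algorithm as implemented in the HLM program.

```python
import numpy as np

def pql_1p_hgllm(y, n_macro=100, tol=1e-6):
    """Simplified PQL-style estimation for the 1-P HGLLM.

    y is an n-by-k binary response matrix. Returns gamma (gamma_00 followed
    by gamma_10 ... gamma_(k-1)0), the empirical Bayes residuals u_0j, and
    the estimated variance tau. A didactic sketch only.
    """
    n, k = y.shape
    # Level-1 design (Equation 7): a column of ones plus k-1 dummies coded -1;
    # item k is the reference item and has no dummy variable.
    x_item = np.hstack([np.ones((k, 1)),
                        np.vstack([-np.eye(k - 1), np.zeros((1, k - 1))])])
    X = np.tile(x_item, (n, 1))                  # fixed-effect design, (n*k, k)
    Z = np.kron(np.eye(n), np.ones((k, 1)))      # person indicators, (n*k, n)
    yv = y.reshape(-1).astype(float)

    gamma, u, tau = np.zeros(k), np.zeros(n), 1.0
    for _ in range(n_macro):                     # macro iterations
        eta = X @ gamma + Z @ u
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)                        # Equation 13
        z = eta + (yv - p) / w                   # Equation 14 (working variate)
        XtW, ZtW = X.T * w, Z.T * w
        # "Micro" step: mixed-model equations for the weighted linear model
        # z = X*gamma + Z*u + e, with Var(u_0j) = tau.
        A = np.block([[XtW @ X, XtW @ Z],
                      [ZtW @ X, ZtW @ Z + np.eye(n) / tau]])
        b = np.concatenate([XtW @ z, ZtW @ z])
        sol = np.linalg.solve(A, b)
        gamma_new, u_new = sol[:k], sol[k:]
        # EM-type update of tau using the posterior variances of u_0j.
        post_var = np.diag(np.linalg.inv(A))[k:]
        tau_new = float(np.mean(u_new ** 2 + post_var))
        converged = (np.max(np.abs(gamma_new - gamma)) < tol
                     and abs(tau_new - tau) < tol)
        gamma, u, tau = gamma_new, u_new, tau_new
        if converged:
            break
    return gamma, u, tau
```

On the Rasch scale, the item difficulties are then recovered as γ̂_q0 - γ̂_00 for items 1 through k - 1 and -γ̂_00 for item k, while the empirical Bayes residuals u_0j serve as the person ability estimates.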
2.2. Illustrative Analysis of Hypothetical Data

Although estimating person and item parameters per se is not the primary purpose of the 1-P HGLLM, it is essential that the model can estimate person and item parameters appropriately for the simplest case of the model. As mentioned in the previous section, the simplest case of the 1-P HGLLM is the case with no predictor variable in the model, essentially the Rasch-model-equivalent case. For the purpose of illustrating possible data analyses, a numerical example is presented here. A set of hypothetical data is used to illustrate the model and to demonstrate how parameters are estimated. First, person and item parameters in the 1-P HGLLM are estimated via the HLM program. The same data set is then analyzed with the BILOG program (Mislevy & Bock, 1990) to estimate person and item parameters, and the results are compared. The BILOG program was chosen for the comparison because it treats the latent trait as a random variable, as does the 1-P HGLLM. Also, since the PQL is an approximation of marginal maximum likelihood estimation (MMLE), the comparison should indicate how well the 1-P HGLLM approach estimates parameters. It is important to know how differently parameters are estimated in comparison to other methods.

For this study, the BILOG program is configured so that it estimates item parameters via MMLE with no prior distribution specified, and person parameters via an expected a posteriori (EAP) algorithm with the standard normal prior distribution.

Both person- and item-parameter estimates may be rescaled onto a specific common scale by adjusting their means and standard deviations, for example, to mean = 0 and variance = 1. Since the rescaling is based on linear transformations, it is sufficient to see that the estimates from both programs are highly correlated before rescaling. Once the estimates are obtained, it is quite simple to transform both sets of estimates to a common scale, and they will still be correlated at the same magnitude. Here, rescaling is done, but solely for the purpose of illustrating the procedure. Rescaling of the Rasch parameter estimates is done by the BILOG program in the following manner. First, the person parameter estimates are transformed to have the mean and standard deviation of the prior distribution. In this case, a mean of 0 and a standard deviation of 1 are used to standardize the estimates. Then the item parameter estimates are linearly transformed, using the slope and intercept constants from the linear transformation of the person parameter estimates. Several other rescaling options are available in the BILOG program (see Mislevy & Bock, 1990). Rescaling of the parameter estimates from the HLM program was done in the same manner.

Table 1
The item difficulties used in illustrative analyses and simulation studies

Item  Difficulty    Item  Difficulty
1     -2.000        16    -1.250
2     -1.500        17    -0.250
3     -1.000        18     0.250
4     -0.500        19     1.750
5     -0.250        20     0.000
6      0.500        21    -1.875
7      1.000        22    -1.625
8      1.500        23    -1.375
9      2.000        24    -1.125
10     0.000        25    -0.875
11    -1.750        26    -0.625
12    -0.750        27    -0.375
13     0.125        28    -0.125
14     0.750        29     2.125
15     1.250        30    -2.125

In the hypothetical data set, 10 items and 250 examinees are assumed. The item-difficulty-parameter values are arbitrarily chosen; the values used are shown in Table 1. Item difficulty values for 30 items are shown in this table because the table is also used to determine item difficulty values for other illustrative analyses and simulation studies, in which the number of items is as large as 30. According to the table, the item difficulties (-2.00, -1.50, -1.00, -0.50, -0.25, 0.00, 0.50, 1.00, 1.50, 2.00) are used. The 250 person-parameter values are independently sampled from the standard normal distribution, N(0, 1). These values are used as the true item difficulties and true ability measures. Conventional procedures are used to generate simulated dichotomous item responses according to the Rasch model (Equation 2). The probability of a correct response is computed from the Rasch model with the given true item- and sampled person-parameter values. Then, the probability value is compared with a random number sampled from a uniform distribution with a range between 0 and 1. A simulated response is scored correct and a value of 1 is assigned if the probability of a correct response is greater than or equal to the sampled number; the response is scored incorrect and a value of 0 is assigned otherwise. Then, the generated data set is analyzed by the HLM program and the BILOG program, and item and person parameters are estimated.
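As a concrete version of the data-generating step just described, the sketch below produces a 250 x 10 dichotomous response matrix under the Rasch model; the function name and the use of numpy's random number generator are my own choices, and any equivalent simulation routine would do.

```python
import numpy as np

def simulate_rasch(thetas, deltas, seed=0):
    """Generate dichotomous responses under the Rasch model (Equation 2):
    score 1 if the model probability of a correct response is greater than
    or equal to a uniform(0, 1) draw, and 0 otherwise."""
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-(thetas[:, None] - deltas[None, :])))
    return (p >= rng.uniform(size=p.shape)).astype(int)

deltas = np.array([-2.00, -1.50, -1.00, -0.50, -0.25,
                   0.00, 0.50, 1.00, 1.50, 2.00])    # the 10 difficulties above
thetas = np.random.default_rng(1).normal(size=250)   # true abilities ~ N(0, 1)
y = simulate_rasch(thetas, deltas)
print(y.shape, y.mean())
```

The resulting matrix y is what would then be restructured into item-level records and passed to the HLM and BILOG analyses.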
The model for this example is defined by Equations 7, 8, and 9, where k = 10, n = 250, and the dummy variable for the tenth item is dropped in order to achieve full rank for W_j:

η_ij = β_0j + β_1j X_1ij + β_2j X_2ij + β_3j X_3ij + β_4j X_4ij + β_5j X_5ij + β_6j X_6ij + β_7j X_7ij + β_8j X_8ij + β_9j X_9ij.   (15)

In matrix notation,

η_j = W_j B_j,   (16)

where η_j = (η_1j, η_2j, …, η_10j)', B_j = (β_0j, β_1j, …, β_9j)', and

       [ 1  -1   0   0   0   0   0   0   0   0 ]
       [ 1   0  -1   0   0   0   0   0   0   0 ]
       [ 1   0   0  -1   0   0   0   0   0   0 ]
       [ 1   0   0   0  -1   0   0   0   0   0 ]
W_j =  [ 1   0   0   0   0  -1   0   0   0   0 ]   (17)
       [ 1   0   0   0   0   0  -1   0   0   0 ]
       [ 1   0   0   0   0   0   0  -1   0   0 ]
       [ 1   0   0   0   0   0   0   0  -1   0 ]
       [ 1   0   0   0   0   0   0   0   0  -1 ]
       [ 1   0   0   0   0   0   0   0   0   0 ]

Then the entire design matrix across people is W = (W_1', …, W_250')'. The level-2 models are defined by Equation 11, where k = 10:

β_0j = γ_00 + u_0j,
β_1j = γ_10,
β_2j = γ_20,
β_3j = γ_30,
β_4j = γ_40,
β_5j = γ_50,
β_6j = γ_60,
β_7j = γ_70,
β_8j = γ_80,
β_9j = γ_90,   (18)

where u_0j ~ N(0, τ). Then, the item difficulty for item 10 is -γ_00, the item difficulties for the other items are γ_q0 - γ_00, and the ability for person j is u_0j.

Person parameter estimates from both the HLM and BILOG programs before transformation are shown in the first two columns of Table 2, and item parameter estimates before transformation are shown in the first two columns of Table 3.

Table 2
Person parameter estimates from the HLM and BILOG programs

Raw Score   Mean True Value   HLM (before)†   BILOG (before)‡   HLM (rescaled)*   BILOG (rescaled)*
0           -2.125            -1.681          -2.037            -2.439            -2.554
1           -1.447            -1.323          -1.546            -1.919            -1.939
2           -0.811            -0.988          -1.135            -1.432            -1.425
3           -0.780            -0.677          -0.772            -0.967            -0.969
4           -0.368            -0.354          -0.419            -0.514            -0.527
5           -0.227            -0.047          -0.054            -0.067            -0.069
6            0.391             0.263           0.330             0.381             0.412
7            0.580             0.578           0.701             0.839             0.885
8            0.936             0.905           1.050             1.312             1.313
9            1.490             1.248           1.357             1.810             1.698
10           1.589             1.616           1.648             2.343             2.061
Mean**      -0.007             0.000           0.001             0.000             0.000
Std. Dev.**  0.987             0.689           0.798             1.000             1.000

Note. * Both HLM and BILOG estimates are standardized. ** Means and standard deviations are weighted by the number of people in each raw score category. † Root mean squared error is 0.642. ‡ Root mean squared error is 0.643.

Table 3
Item parameter estimates from the HLM and BILOG programs

Item    True Value   HLM (before)   BILOG (before)   HLM (rescaled)*   BILOG (rescaled)*
1       -2.00        -1.516         -1.823           -2.198            -2.286
2       -1.50        -1.386         -1.671           -2.010            -2.096
3       -1.00        -0.754         -0.915           -1.093            -1.149
4       -0.50        -0.389         -0.470           -0.564            -0.591
5       -0.25        -0.064         -0.071           -0.093            -0.091
6        0.00         0.128          0.150            0.186             0.190
7        0.50         0.225          0.284            0.326             0.354
8        1.00         0.541          0.670            0.784             0.838
9        1.50         1.265          1.536            1.835             1.923
10       2.00         1.629          1.955            2.361             2.449
Mean                 -0.058         -0.065           -0.084            -0.084
Std. Dev.             0.976          1.241            1.415             1.556

Note. * Rescaled estimates are transformed based on the same linear transformation constants used for person parameter rescaling.

Although the two programs show a nearly perfect linear relationship (r = 0.999) for both person and item estimates, the BILOG estimates vary more than the HLM estimates do, for both person and item parameter estimates (standard deviations of 0.798 vs. 0.689 for person estimates, and 1.241 vs. 0.976 for item estimates). This is also reflected in larger absolute values for the BILOG estimates than for the HLM estimates. This is because of the empirical Bayes (EB) estimation of the person parameters. The EB estimation shrinks the person parameter estimates based on the overall estimated variance of the distribution of u_0j (i.e., τ).
In this example, τ was estimated to be 0.80, which is a considerable underestimation of the true value τ = 1.00. This underestimation of τ largely accounts for the smaller variance of the HLM parameter estimates relative to the true values. The root mean squared errors of the BILOG and HLM person parameter estimates are 0.643 and 0.642, respectively. These values indicate that the HLM and BILOG estimates are almost equal in terms of deviations from the true values. They also indicate that neither estimates the true values particularly well.

Transformed person and item parameter estimates are shown in the last two columns of Tables 2 and 3, respectively. It is evident that the rescaled person parameter estimates are very similar between the HLM and BILOG programs. This makes sense because they are correlated almost perfectly before transformation, and they were transformed to have the same mean and the same standard deviation. However, the transformed item parameter estimates appear somewhat different between the HLM and BILOG programs; some differ by 0.3 logits. This occurs because each set of item parameters is transformed according to the transformation applied to the corresponding set of person-parameter estimates; the two sets of item parameters are not transformed to have the same mean and standard deviation. They are still correlated almost perfectly (r = 0.999), because they are transformed linearly.

2.3. Parameter Recovery

The purpose of this simulation study is twofold. First, the simulation is intended to demonstrate parameter recovery for the simplest model presented, namely the binary-response Rasch model reformulated in the HGLM framework. In the previous section, I presented a numerical example based on only one set of simulated data, which illustrated how a set of item response data can be analyzed using the reformulated model and the HLM program. However, I was not able to evaluate the quality of those estimates, because the results were based on only one set of simulated data. In this simulation study, I replicate more than one data set under the same conditions to show that the HLM program can consistently reproduce parameter values.

Second, this simulation also explores the role of the number of replications in this specific application of the HLM program. In the past, the number of replications used in simulation studies of IRT models has varied across studies and has seemed arbitrary. For example, Drasgow (1989) used 10 replications to study the two-parameter logistic (2PL) model, and Seong (1990) used only 5 replications for 2PL models. Although Stone (1992) used 100 replications, pointing out that 5 or 10 replications might result in unstable results, it was not clear how much stability he gained from the increased number of replications. On the other hand, Yang (1995) used 50 replications and was able to present convincing results for binary-outcome hierarchical models with continuous explanatory variables using the HLM software. This simulation study examines several different numbers of replications and explores how differently they produce parameter estimates. It is hoped that this will give a good indication, although not an exact answer, of how many replications should be used for the other two simulation studies that I will conduct with more complicated models.
2.3.1. Methods

The variables of interest in this simulation study are (a) the number of replications (g = 3, 5, 10, 20, 50, and 100); (b) the sample size (n = 250, 500, and 1000), representing small, medium, and large sample sizes; and (c) the number of items (k = 10 and k = 20), representing small and large numbers of items (short and long tests). These three variables produce 6 x 3 x 2 = 36 conditions to investigate. Although 20 is not that large a number of items in real test settings, it was selected here to roughly double the number of parameters to be estimated. The exact number of parameters to be estimated is 11 when k = 10 (10 fixed parameters and 1 random parameter, the variance of the person parameters) and 21 when k = 20 (20 fixed parameters and 1 random parameter).

For each replication in each of the 36 conditions, person ability values are sampled from the standard normal distribution, N(0, 1). Item difficulty parameter values are determined by Table 1: the first ten difficulty values are used for the 10-item test, and the first 20 difficulties are used for the 20-item test. The response data are generated in the same manner as described in Section 2.2. Then, the generated item-level data are analyzed by the HLM program, and item and person parameters are estimated.

Estimated parameter values are summarized across the 36 conditions using several statistics and indicators. I compute (a) the mean correlation coefficient between estimated and true item-parameter values across replications; (b) the standard deviation of these correlation coefficients across replications; (c) the root mean squared error (RMSE) of τ̂ across replications; (d) the mean of the τ̂ values; and (e) the standard deviation of the τ̂ values. The RMSE of τ̂ is defined to be

RMSE(τ̂) = sqrt[ (1/g) Σ_{l=1}^{g} (τ̂_l - τ)² ],   (19)

where τ̂_l is the estimate of τ in the lth (l = 1, …, g) replication. As mentioned in the previous section, the measurement scales for item and person parameter estimates are arbitrary, and their values would have to be rescaled to be compared directly across replications. Here, since the non-rescaled estimated values are compared, direct comparisons between parameter estimates are not made. Instead, correlations between true and estimated item parameters are compared across replications. Also, the estimate of the variance of the person parameters, τ̂, is compared across replications.
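As a small illustration of how these summary statistics are assembled once the replications have been run, the sketch below assumes that each replication has produced a vector of item-difficulty estimates and a value of τ̂; the function and variable names are hypothetical, and the illustrative inputs are made up rather than results from the study.

```python
import numpy as np

def summarize_replications(true_deltas, est_deltas, tau_hats, true_tau=1.0):
    """Summary statistics across g replications: mean and SD of the
    correlation between true and estimated item difficulties, and the
    RMSE (Equation 19), mean, and SD of the tau estimates."""
    rs = np.array([np.corrcoef(true_deltas, d)[0, 1] for d in est_deltas])
    tau_hats = np.asarray(tau_hats, dtype=float)
    rmse_tau = np.sqrt(np.mean((tau_hats - true_tau) ** 2))
    return {"mean_r": rs.mean(), "sd_r": rs.std(ddof=1),
            "RMSE_tau": rmse_tau,
            "mean_tau": tau_hats.mean(), "sd_tau": tau_hats.std(ddof=1)}

# hypothetical results from g = 3 replications of a 10-item condition
true_deltas = np.array([-2.0, -1.5, -1.0, -0.5, -0.25, 0.0, 0.5, 1.0, 1.5, 2.0])
est_deltas = [true_deltas + np.random.default_rng(s).normal(0, 0.1, 10)
              for s in range(3)]
print(summarize_replications(true_deltas, est_deltas, tau_hats=[0.80, 0.82, 0.78]))
```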
However, despite the generally small differences across the 36 conditions, we can still observe some apparent differences between the conditions when the values are examined carefully. In Figure 1, the first 35 Table 4 Results of parameter recovery study for the 1-P HGLLM sample Results items size replication mean(r) sd(r) m_sd(,v) RMSE( 2) mean( t) sd( 2') 3 0.9884 0.0022 0.1799 0.2617 0.7970 0.0409 5 0.9947 0.0016 0.1194 0.2073 0.8188 0.0126 250 10 0.9945 0.0036 0.1317 0.2441 0.7801 0.0125 20 0.9944 0.0032 0.1319 0.1892 0.8461 0.0128 50 0.9934 0.0031 0.1418 0.2125 0.8506 0.0233 100 0.9933 0.0031 0.1406 0.2038 0.8488 0.0189 3 0.9946 0.0051 0.0972 0.1815 0.8263 0.0042 5 0.9983 0.0005 0.0655 0.1715 0.8464 0.0073 10 500 10 0.9960 0.0015 0.1088 0.1307 0.9124 0.0104 20 0.9973 0.0012 0.0922 0.1666 0.8502 0.0056 50 0.9965 0.0020 0.1014 0.1888 0.8349 0.0085 100 0.9962 0.0021 0.1059 0.2065 0.8130 0.0078 3 0.9983 0.0006 0.0700 0.1801 0.8212 0.0007 5 0.9989 0.0004 0.0630 0.2213 0.7808 0.0011 1000 10 0.9985 0.0006 0.0715 0.1776 0.8319 0.0037 20 0.9981 0.0009 0.0745 0.1647 0.8516 0.0054 50 0.9983 0.0009 0.0676 0.2032 0.8057 0.0036 100 0.9983 0.0008 0.0703 0.1999 0.8088 0.0034 3 0.9892 0.0053 0.1476 0.2236 0.8040 0.0173 5 0.9929 0.0015 0.1339 0.1812 0.8537 0.0143 250 10 0.9927 0.0027 0.1272 0.1089 0.9155 0.0052 20 0.9921 0.0019 0.1376 0.1003 0.9170 0.0033 50 0.9914 0.0029 0.1370 0.1494 0.8925 0.0110 100 0.9916 0.0030 0.1406 0.1394 0.9022 0.0100 3 0.9957 0.0006 0.0908 0.0978 0.9025 0.0001 5 0.9951 0.0021 0.0949 0.1005 0.9144 0.0035 20 500 10 0.9948 0.0015 0.1009 0.1412 0.8740 0.0045 20 0.9954 0.0020 0.0973 0.1138 0.9117 0.0054 50 0.9957 0.0013 0.0995 0.1264 0.8891 0.0037 100 0.9959 0.0013 0.0971 0.1204 0.9032 0.0052 3 0.9971 0.0018 0.0737 0.1057 0.8950 0.0002 5 0.9977 0.0008 0.0656 0.1116 0.8950 0.0018 1000 10 0.9978 0.0010 0.0664 0.1498 0.8543 0.0013 20 0.9979 0.0007 0.0675 0.1288 0.8808 0.0025 50 0.9980 0.0007 0.0666 0.1153 0.8942 0.0022 100 0.9979 0.0007 0.0675 0.1189 0.8923 0.0026 36 Figure 1 Mean correlation between true and estimated item parameters and their standard deviations 8 m=250 Q .. C .9 E 93 6 0 C m a) 2 ID (I) o, . d 1 l l T I 0 20 40 60 80 100 replications to n=250 O o_ . O — k=10 . ' ------- =20 U) C .9. 31‘: 05> fa o . “5 0 co 0. i o l W l f I l 0 20 40 60 80 100 replications 37 Figure 1 (cont’d) 100 =20 80 I 60 500 — k=1O k n 40 20 0 SN: 8? 8.88 8? 5:30:00 coo—2 replications 0 a . 1 10 8 00 12 = = kk :0 s 0 am. .10. m -o m a 4 10 2 q d d d 4 q 1 I0 wood wood wood 06 95:90:00 Lo om Figure 1 (cont’d) k=10 k=20 n=1000 4O com: 8?. 8? 5:96:00 cams. mmmd 100 80 60 0 replications 0 fl 1 w 1 O 8 0 0 1 2 = __ k k S m em 1 m .n. 4. W 9 r O m 3 4 r O 2 .... c ...... o 1 . a J - . . 0 wood Sod wood o.o mCOzm—w—LOO ho Cw three plots show mean correlations for three different sample sizes. First, higher correlation coefficient values between true and estimated item parameters are observed when there are more examinees. This observation makes sense because, in theory, the quality of item parameter estimates depends on how large the sample size is, and it does not depend on the number of items in a test. On the other hand, these three plots show that correlation coefficients between true and estimated item parameters are slightly weaker for larger numbers of items. The difference between the mean rs for k = 10 and 20 becomes smaller as the number of examinees increases. The correlation coefficients are lower for the longer test, while holding the number of examinees constant. 
This makes sense because, holding the number of examinees constant, the ratio of the number of items to the number of examinees is higher for the longer test. Since the difference between the two ratios is smaller when there are more examinees, the correlation coefficient does not decrease as much when the number of examinees is large, as with n = 1000, as it does when n is smaller. The same trends appear in the plots of the standard deviations of the correlations, the last three plots in Figure 1. However, those plots magnify the differences, which are all at the third decimal place.

The seventh column in Table 4 contains the RMSE of τ̂, defined above. The values are also plotted in the first three plots in Figure 2. The standard deviations of τ̂ across replications are shown in the last column of the table and plotted in the last three plots in Figure 2. Also, the actual mean values of τ̂ are shown in the eighth column of the table.

Figure 2. Root mean squared errors of τ̂ and standard deviations of τ̂, plotted against the number of replications for n = 250, 500, and 1000 (k = 10 and k = 20). [plots omitted]

As can be seen from the table, the mean of τ̂ is consistently smaller than the true value, 1.0, in all 36 cases. This result is consistent with those of Yang (1995), who empirically showed that the algorithm used in the HLM program, PQL estimation, tends to underestimate τ with binary-outcome models. Despite this limitation of the algorithm, consistently low RMSE values are observed. The results show that the τ̂ estimates are closer to the true value when there are more items: when k = 10 the mean τ̂ is around 0.8, and when k = 20 the mean τ̂ is around 0.9. On the other hand, the number of examinees does not seem to affect the RMSE values. Again, this makes sense because the precision of the person parameters is affected by the number of items in a test, but it is not affected by the number of examinees. The standard deviations of τ̂ become smaller with larger k, and the difference between k = 10 and k = 20 becomes smaller with larger n. Although I was not able to incorporate the improvement in this study, Yang (1998) has recently improved the PQL algorithm for estimating τ. By incorporating Yang's algorithm, the underestimation of τ would be expected to be less severe than observed in this study.

Among the values plotted in Figures 1 and 2, the estimates are somewhat unstable when the number of replications is 3, 5, or 10. Sometimes the patterns for k = 10 and k = 20 even reverse when the number of replications is 3, 5, or 10.
From these observations, the simulation might still produce reasonable results with g = 20, but it makes more sense to use g = 50 for the subsequent simulations, because results are much more stable with g = 50 than with g = 20. No added benefit is apparent with 100 replications.

2.4. Comments on the Simulated Data

In the conventional method of generating item response data described above, the marginal sums for both persons and items are determined as a result of independently generated individual 0-1 responses. In other words, it is not always true that a set of generated responses represents a given set of item and person parameter values. For this reason, a set of marginal sums is not actually randomly sampled from the possible sets of marginal sums. Since the marginal sums are sufficient statistics in the Rasch model, it is important to have a set of marginal sums that is randomly sampled from the population of possible marginal sums for the purpose of reproducing parameter values in a simulation study.

An alternative way to generate item responses is the method proposed by Snijders (1991). Snijders' method first generates marginal sums for both items and persons based on a set of item and person parameters. Then, it generates individual 0-1 responses based on those marginal sums. In this way, Snijders' method guarantees that a set of marginal sums is a random sample from the set of all possible marginal sums given the parameter values. As a result, the distributions of both sets of parameter estimates will be exactly the same as their theoretical distributions. One benefit of using Snijders' method is that we can be sure that any statistics computed from the parameter estimates are sampled from their theoretical sampling distributions. Therefore, this simulation method would allow us to illustrate how parameter estimates behave better than the conventional simulation method does. However, application of Snijders' method is limited to the Rasch model, because marginal sums are not sufficient statistics for two- and three-parameter IRT models.

2.5. Summary and Comments on Practical Issues

In this chapter it was shown that the regular binary-response Rasch model can be reformulated as a special case of the HGLM. Also, it was shown that the HLM program could estimate item and person parameters similar to those computed by the BILOG program, an MMLE-based IRT program. The results of the simulation study revealed that the variance of the person parameters tends to be underestimated with the HGLM because of the PQL estimation. The degree of underestimation depended on the number of items in a test and the number of examinees.

As stated earlier in this chapter, the primary purpose of this generalization is not to estimate Rasch parameters per se. Various Rasch/IRT parameter estimation programs are already available for the purpose of estimating item and person parameters. Thus, there is no need to use the HLM program solely for the purpose of estimating item and person parameters. The real purpose of the generalization is to extend the model to a Rasch model with predictors, a multidimensional Rasch model, and a multilevel Rasch model. These extensions are discussed in the following chapters. However, this reformulation of the regular binary-response Rasch model can be considered an important clarification of the connection between two seemingly unrelated statistical and psychometric models, in effect, an IRT model and the HGLM.
From this perspective, estimating item and person parameters using the HLM program could be meaningful for didactic purposes. IRT models, including the Rasch model, are often thought of as specialized statistical and psychometric models for item response data. Furthermore, specialized estimation algorithms, often in specialized software, are thought necessary for parameter estimation. For this reason, IRT models and parameter estimation are often treated completely separately from other statistical models by the learner of IRT models. This generalization clarifies that the Rasch model is a special case of the HGLM. Also, this study clarifies that item and person parameters can be estimated using the currently available algorithms in the HLM program, which are used for more general purposes.

Furthermore, the interpretation of IRT parameters also tends to be specialized and separated from other statistical models. More specifically, an item parameter is simply interpreted as an item difficulty in the Rasch model. This interpretation is accurate and useful in the context of item response data. However, at the same time, this interpretation is isolated from more general contexts, such as logistic regression, GLM, and HGLM, where it can be interpreted as an effect of the item. As a result, it is sometimes very difficult for the learner of IRT models to understand what the parameter really means. On the other hand, if the Rasch model is presented as a special case of the HGLM, it is relatively easy to interpret the parameters, because the item parameters are coefficients of dummy variables, and the learner can bring to bear their knowledge of logistic regression to make sense of IRT analyses.

Chapter 3
Model with Person-Level Predictors

In this chapter an extension of the 1-P HGLLM is presented. As indicated in Chapter 2, a direct extension of the 1-P HGLLM is to include person-level predictors in the model, in effect, person-characteristic variables. This approach achieves a one-step analysis of test data with person-level predictor variables. Through such a single analysis, rather than a two-step analysis, one can expect improved estimation of the effects of the predictors on the latent trait, because the effects of the predictors are estimated simultaneously with the ability parameters. As a result, the heteroscedastic nature of the standard errors of the ability estimates is taken into account, and inconsistency of the ability estimates is avoided.

3.1. Model

Assume a situation where one wishes to analyze the effect of the amount of instruction on a specific topic (Time) on reading achievement. In most cases, such an analysis is done with a simple regression model,

(Score)_j = β0 + β1(Time)_j + e_j ,    (20)

where (Score)_j is the reading test score for person j, and (Time)_j is the amount of instruction on the specific topic for person j. One would then be interested in the magnitude of β1. If β1 is different from 0, it would be interpreted as evidence that the amount of instruction has an impact on the reading achievement test score. However, an analysis based on this model may not give accurate results, especially when test scores are based on an IRT scale. The reason is that estimated scores based on IRT are associated with different magnitudes of standard errors across estimates. Extreme scores, such as zero or perfect scores, are associated with larger standard errors than scores in the middle of the range. In other words, the measurement errors of the dependent variable are not homoscedastic.
Therefore, unless those different standard errors of measurement are taken into account in the regression analysis, the result may be misleading. One way to avoid this problem is to perform a one-step analysis of test data with person-level predictors in the 1-P HGLLM. The level-1 model shown in Equation 7 is used, that is,

η_ij = β0j + β1j X1ij + β2j X2ij + ... + β(k-1)j X(k-1)ij .

Here, for the level-2 models, let the random component of β0j in the level-1 model be u0j*, in order to distinguish it from u0j in the model without any predictors (Equation 11). Then the level-2 models are

β0j = γ00 + γ01 W1j + ... + γ0p Wpj + u0j*
β1j = γ10
...
β(k-1)j = γ(k-1)0 ,    (21)

where Wsj (s = 1, ..., p) are person-level predictor measures for predictor s and person j. For the specific example with Time as a predictor, we write

β0j = γ00 + γ01(Time)_j + u0j*
β1j = γ10
...
β(k-1)j = γ(k-1)0 .    (22)

In this way, the regression model (Equation 20) is embedded in the 1-P HGLLM, which is equivalent to saying that the regression model is embedded in the Rasch model. Therefore, there is no need to further account for different magnitudes of the standard errors of measurement of the outcome variable. More technically, γ01 in the above equation will be estimated simultaneously with the person-specific ability, u0j*. The combined model then becomes

p_ij = 1 / (1 + exp(-{[γ01(Time)_j + u0j*] - (γq0 - γ00)})) ,    (23)

for i = q, where var(u0j*) = τ*. Therefore, the overall person ability in this formulation is γ01(Time)_j + u0j*, in contrast to u0j in the model without any predictor variables (see Equation 12). The relationship of u0j* to u0j is given algebraically by u0j = γ01(Time)_j + u0j*. In other words, the person ability from the simple Rasch model is decomposed into two parts in this model. On the other hand, the item parameters are the same as in the model without predictor variables, that is, γq0 - γ00 for i = q.

3.2. Illustrative Analysis

As a numerical example, a set of hypothetical data for a 20-item test is analyzed. In order to simulate a data set, it is assumed that the mean of the Time measures equals zero and the variance of Time equals one. Note that in practice this variable may not be normally distributed, because the amount of instruction time is often represented by two groups: people who received instruction on the topic and people who did not. However, the shape of the distribution of a predictor variable does not affect the estimate of the effect of the predictor, so the Time measures are assumed to be normally distributed for convenience. Second, it is assumed that Time and the true student abilities are correlated to a certain degree. Therefore, the student abilities and the Time measures are simultaneously sampled from a bivariate normal distribution. In this analysis, 250 students are assumed, and I specify that the student abilities and Time are correlated with ρ = 0.3. Then, the simulated true student ability measures, along with true item difficulties, were used to generate a zero-one response for each item. The zero-one response data are then analyzed by two models: (a) the model without Time, and (b) the model with Time as a predictor. Then, results from these two analyses are compared. The slope for Time, γ01, is estimated to be 0.270 in the second model, a one-step analysis. Since both ability and Time are standardized, the slope is equivalent to a correlation coefficient.
This value is reasonable compared to the true correlation coefficient of 0.30 between Time and student ability, although it is about a 10% underestimation of the true value. Also, the p-value associated with the γ̂01 estimate is smaller than 0.0005, indicating that the effect of Time is significantly different from zero.

Table 5 shows item parameter estimates from the two different models. The first column contains the estimates from the model with Time as a predictor, and the second column contains the estimates from the model without Time. Item parameters are estimated very similarly for the two models: they differ at most in the second decimal place and by no more than 0.04. This makes sense because the item parameters from the two models are algebraically the same and ought to be estimated similarly. The last column shows the true item difficulties. The RMSE is 0.225 for the model with Time and 0.240 for the model without Time.

Table 5
Item parameter estimates from the model with a person-level predictor

Item  with Time  without Time  true value
1     -1.698     -1.688        -2.000
2     -1.294     -1.289        -1.500
3     -0.846     -0.847        -1.000
4     -0.552     -0.558        -0.500
5     0.312      0.290         0.250
6     0.549      0.524         0.500
7     0.957      0.925         1.000
8     1.123      1.088         1.500
9     1.451      1.411         2.000
10    -0.065     -0.080        0.000
11    -1.667     -1.658        -1.750
12    -0.770     -0.772        -0.750
13    0.245      0.225         0.125
14    0.480      0.456         0.750
15    0.898      0.866         1.250
16    -1.178     -1.175        -1.250
17    -0.147     -0.160        -0.250
18    -0.049     -0.064        0.250
19    1.503      1.463         1.750
20    -0.098     -0.080        0.000

On the other hand, the person abilities in the model without Time are u0j, while the person abilities in the model with Time are γ01(Time)_j + u0j*, as mentioned above. Since u0j = γ01(Time)_j + u0j*, u0j* can be thought of as the person-specific component of the ability, while γ01(Time)_j is the contribution that Time makes to the jth person's ability in this particular example. In other words, u0j* is the ability when Time is taken into account. The relationship between û0j* and û0j is plotted in the first plot of Figure 3. The û0j are clustered into 21 raw-score groups. However, û0j* varies within raw-score groups, depending on what Time scores the individuals have. When γ̂01 = 0.27, the contribution of Time, is taken into account, the jth person's ability is computed as 0.27(Time)_j + û0j*. The relationship between û0j and 0.27(Time)_j + û0j* is plotted in the second graph of Figure 3. The graph shows that the two estimates are almost identical when the effect of Time is included. This confirms the algebraic relationship u0j = γ01(Time)_j + u0j* described above.

Figure 3. The relationship between person parameter estimates from the model with a predictor and estimates from the model without a predictor (û0j* versus û0j, and 0.27(Time)_j + û0j* versus û0j). [plots omitted]

3.3. Parameter Recovery

In this simulation study, the same two-level model with a person-level predictor variable, given in Equations 7 and 22, is examined. In order to simulate 50 replicated data sets, the method described in the previous section is utilized: students' abilities and Time measures are simultaneously sampled from a bivariate normal distribution. Item difficulties are determined based on Table 1, as in the simulation study in the previous chapter.
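The data-generation scheme just described can be sketched as follows. This is hypothetical Python/NumPy code, not part of the original analysis (which used the HLM program for estimation); the difficulty values are placeholders standing in for the Table 1 values.

import numpy as np

rng = np.random.default_rng(1)

def simulate_predictor_data(n, difficulties, rho):
    """Sample (ability, Time) pairs from a standard bivariate normal with
    correlation rho, then generate Rasch responses from the abilities."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    ability, time = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulties[None, :])))
    responses = (rng.random(p.shape) < p).astype(int)
    return ability, time, responses

# For example, the illustrative condition: 250 examinees, 20 items, rho = 0.3.
difficulties = np.linspace(-2.0, 2.0, 20)   # placeholder difficulties
ability, time, y = simulate_predictor_data(250, difficulties, rho=0.3)
print(np.corrcoef(ability, time)[0, 1])     # should be near 0.3 in a large sample

Because ability and Time are both standardized, the population slope of the embedded regression (Equation 22) equals the correlation rho used here, which is why the estimated slope can be read as a correlation coefficient.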
The variables of interest in this simulation study are (a) sample size (n = 250, 500, and 1000), (b) the number of items (k = 10 and 20), and (c) the magnitude of the correlation between the person parameter and the person-level predictor (ρ = 0.2, 0.5, and 0.9). These three variables produce 3 x 2 x 3 = 18 conditions. Table 6 shows the layout of the design of this simulation study.

Table 6
The layout of the simulation study for the model with a person-level predictor variable

          ρ = 0.2    ρ = 0.5    ρ = 0.9
k = 10    n = 250    n = 250    n = 250
          n = 500    n = 500    n = 500
          n = 1000   n = 1000   n = 1000
k = 20    n = 250    n = 250    n = 250
          n = 500    n = 500    n = 500
          n = 1000   n = 1000   n = 1000

I present the results in Table 7. Overall, the RMSEs of the γ̂01 estimates are very small (i.e., all smaller than 0.01). Also, the means of the estimates of the coefficient are all smaller than the true values, implying that the coefficient tends to be underestimated. The standard deviations of the estimates across the 50 replications are also small; they range approximately between 0.02 and 0.07.

Table 7
The results of the parameter recovery study for the model with a person-level predictor

ρ    k   n     RMSE(γ̂01)  mean(γ̂01)  sd(γ̂01)
0.2  10  250   0.0064      0.1770      0.0772
0.2  10  500   0.0033      0.1728      0.0508
0.2  10  1000  0.0020      0.1695      0.0335
0.2  20  250   0.0044      0.1758      0.0623
0.2  20  500   0.0030      0.1701      0.0458
0.2  20  1000  0.0018      0.1696      0.0298
0.5  10  250   0.0064      0.4432      0.0568
0.5  10  500   0.0058      0.4431      0.0511
0.5  10  1000  0.0036      0.4497      0.0330
0.5  20  250   0.0099      0.4295      0.0706
0.5  20  500   0.0049      0.4448      0.0436
0.5  20  1000  0.0047      0.4388      0.0310
0.9  10  250   0.0040      0.8715      0.0571
0.9  10  500   0.0029      0.8637      0.0396
0.9  10  1000  0.0017      0.8700      0.0292
0.9  20  250   0.0025      0.8744      0.0436
0.9  20  500   0.0015      0.8755      0.0310
0.9  20  1000  0.0015      0.8676      0.0210

Figure 4. The root mean squared errors of the coefficient of the predictor, plotted against sample size for ρ = 0.2, 0.5, and 0.9 (k = 10 and k = 20). [plots omitted]

Figure 4 shows plots of the RMSE values, separated by the three different true coefficient values. When ρ = 0.9, the RMSE values are lower than when either ρ = 0.2 or ρ = 0.5, while the RMSE values are relatively high when ρ = 0.5. However, these differences are only at the third decimal place. When ρ = 0.2 or ρ = 0.9, the RMSE values are somewhat lower with k = 20 than with k = 10, especially for smaller samples. However, this relationship is not observed when ρ = 0.5.

Figure 5 shows plots of the means and the standard deviations of the estimated coefficient across the 50 replications. The means of the estimates show that the coefficient tends to be underestimated, although it is underestimated only by about 0.025 when ρ = 0.2 and ρ = 0.9, while it is underestimated by about 0.05 to 0.075 when ρ = 0.5. The sample size does not seem to affect the mean of the estimates much; the means are about the same across the three sample sizes for all true coefficient values. On the other hand, the standard deviations of the estimates across replications are smaller with larger samples and with more items. The only exception is when ρ = 0.5 and n = 250, where the standard deviation is smaller for k = 10. In summary, the coefficient of the predictor variable tends to be underestimated. The degree of underestimation depends on the true coefficient values.
Although the results revealed that the coefficient tends to be underestimated more when the true coefficient is 0.5, it is not clear why this happens, and further investigation is suggested to reveal the reason for this observation. Also, the results revealed that the estimates are similar across different sample sizes. However, sample size affects the consistency of the estimates; that is to say, the larger the sample size, the smaller the standard deviation of the estimates. This simulation study is limited to only one predictor in the model. Therefore, a model with more than one predictor needs to be investigated.

Figure 5. The mean and the standard deviation of the estimate of the slope coefficient of the predictor, plotted against sample size for ρ = 0.2, 0.5, and 0.9 (k = 10 and k = 20; the true value is shown as a reference line). [plots omitted]

3.4. DIF Model

In the previous example, a person characteristic variable was added only to the first equation of the level-2 models. This was because the person parameter was to be decomposed into more than one parameter: person ability and an effect of the predictor. Similarly, we can specify that the item parameters be decomposed into more than one parameter. In other words, we can examine whether item parameters function differently depending on person characteristics.

As an example of the simplest case, gender is used as a predictor and added to the level-2 models of the 1-P HGLLM, Equation 11. Since the level-2 models are person-level models, the person characteristic variable should be added to the level-2 models, while Equation 7 is used as is. As a result, the level-2 models are

β0j = γ00 + γ01(gender)_j + u0j
β1j = γ10 + γ11(gender)_j
...
β(k-1)j = γ(k-1)0 + γ(k-1)1(gender)_j ,    (24)

where (gender)_j is a dummy variable for which 1 is given to one of the gender groups and 0 to the other, say 1 for females and 0 for males in this example. Then, the linear predictor for a specific item i, after the level-1 and level-2 models are combined, is

η_ij = γ00 + γ01(gender)_j + u0j - [γq0 + γq1(gender)_j]
     = u0j + γ00 - γq0 - (γq1 - γ01)(gender)_j
     = u0j - [γq0 - γ00 + (γq1 - γ01)(gender)_j] ,    (25)

for i = q (i = 1, ..., k, and q = 1, ..., k - 1). The combined model shows that γq0 - γ00 + (γq1 - γ01)(gender)_j is the difficulty of the item associated with the qth dummy variable for gender group j, while -γ00 - γ01(gender)_j is the difficulty for the reference item, the item whose dummy variable has been dropped, for that group.
Therefore, γq1 - γ01 is the effect of (gender)_j on the item associated with the qth dummy variable, for q = 1, ..., k - 1, while γ01 is the effect of (gender)_j on the reference item. In other words, for the items with q = 1, ..., k - 1, γq0 - γ00 + (γq1 - γ01) is the difficulty for females and γq0 - γ00 is the difficulty for males, while -γ00 - γ01 is the difficulty for females and -γ00 is the difficulty for males for the reference item. If any of the values γ̂q1 - γ̂01, for q = 1, ..., k - 1, or γ̂01 for the reference item, are significantly different from zero, they indicate that males and females perform differently on those items, given the same ability. This suggests that the item may be biased against one of the gender groups. However, such a statistical difference does not always indicate bias, because performance differences between gender groups might occur because of real differences between the genders. A gender difference may result in an effect that looks like item bias (some may argue that such an effect is sufficient evidence), but a gender difference is not conclusive evidence of item bias per se.

The above example raises an important issue, because it is sometimes of interest to test publishers whether a specific item is biased against a particular group of people. If an item is statistically detected to function differently between sub-populations, the situation is referred to as differential item functioning (DIF). In IRT, an item is considered to show DIF when its item characteristic curve (ICC) for the target sub-population (the focal group) and the ICC for the rest of the population (the reference group) are different. Since the Rasch model, and consequently the 1-P HGLLM, can differ only in terms of item difficulties, ICCs can differ only in their locations along the x-axis (difficulty), but not in their shapes (i.e., their slopes or lower asymptotes). Therefore, if we look at the ICCs for the 1-P HGLLM, we will only be comparing values of item difficulties between the target sub-population and the rest of the population. If the target group's performance is considerably lower than that of the rest of the population, given the same ability, the item has a lower probability of a correct response, and hence a higher estimated item difficulty, for that group.

The most widely used method to detect DIF for the Rasch model is described by Wright and Stone (1979) and Wright and Masters (1982). It simply estimates item difficulties and their standard errors separately for the focal group and the reference group, and tests whether they are significantly different from each other. The method described in this paper is equivalent to the conventional method because both approaches compare the difficulties between two sub-populations. However, the method described in this paper does the job in a one-step analysis, while the conventional method requires a two-step analysis. Again, this one-step analysis may increase the precision of the item-parameter estimates, which reduces the magnitude of the standard errors of the estimates. This makes the test statistics more sensitive for rejecting the null hypothesis and concluding that the two groups perform differently, given the same ability.

As a numerical example, a set of hypothetical data is analyzed in which one of 10 items is assumed to be biased against one sub-group of the population. The same item difficulties as in the first numerical example are assumed, except that the first item in the test is assumed to be biased.
The biased item 69 is assumed to be harder for people in the target gender group (say, female, in this example), so that they have less chance to answer the item correctly than males. 2000 person ability values are sampled from the standard normal distribution, and item responses are generated using the same method as mentioned in the previous section. The level-1 model is the same as the first example for the unidimensional l-P HGLLM, while the level-2 models are r160, =7oo +701(ge"der), +“o, 161; =710 +7“(gender)J flz, =72.) +721(gender), 163} =730+73l(gender)j +A”. = y“, +741(gender)j 765) =7so +75,(gender)j 75:3,- =760+761(gender), 767, =770+771(gender), ,4, =780 +781(ge”der), L160, =790+79,(gender)j , where 140} ~ N(O, r). (26) In this example, it is specified that 70,, the effect of gender on the first dummy variable, equals 0.5 in the logit scale, where the first item is the biased item. Here item indicator variable for the first item is dropped. There is no bias for the rest of the items (i.e., items 2 through 10). This is equivalent to saying 70, - 7,1, = 0 for q = 1, , 9, which correspond to items 2 through 10. Then, the data are analyzed by the HLM program. The null 70 hypotheses Ho : 70, = 0 (27) and H0:7ql-}’Ol=0 (28) for q = 1, ..... , 9, are tested separately. The first hypothesis is tested directly by the t-test for )0, that is provided by the standard HLM output. On the other hand, the rest of the null hypotheses are tested by general linear hypothesis tests, which result in Wald-type asymptotic chi-square tests with df= 1. This type of hypothesis testing is also readily available in the HLM program. Table 8 shows estimates of gender effects on each item from the model with a linear constraint. The first column shows values of iql, the estimated coefficients for the gender dummy variables, and the second column shows the values of 77,“ — f0, , the effect of gender on each item. The third column shows the p-values for the t-tests on fro, and the general linear hypothesis tests on )3,“ — 170,, described in the previous section. The value of fro, = — 0.444 indicates the parameter value of yo, = — 0.5 is recovered fairly well, although it is not strong evidence because this result is based on only one data set. The p-value associated with f0, shows that 70, is significantly different from 0, 71 that is, men and women perform differently on the first item. On the other hand, fr” — go, through i9, - fro, are not significantly different from 0. One exception is item 5 for which p = 0.037. This item is assumed no difference between the gender groups, however, the result shows significant difference between the gender groups, indicating a Type I error. For other items, the result does not provide evidence that men and women perform differently on these items. Table 9 shows person parameter estimates from a model with the person-level predictor (gender) and a model with no predictor. The values are not exactly the same, but they are correlated with r = 0.999. This result confirms that the inclusions of the linear constraints affect the item parameter estimates, not person parameter estimates. These results are based on only one replication of a set of simulated data, and more extensive simulation study is expected to confirm the recovery of the parameter values, including I. 
For future work, it is suggested that standard errors for the person parameters be compared between the two models on the same data, to examine Mislevy's (1987) claim that such a linear constraint should improve the precision of the person parameter estimates. Also, investigations of the statistical power for rejecting the null hypothesis, in comparison with other conventional methods, would be beneficial in order to assess the usefulness of this approach in detecting biased items.

Table 8
Item parameter estimates from the DIF model

                  γ̂q1     γ̂q1 - γ̂01  p-value
q = 0 (item 1)    -0.444   N/A         < 0.0005
q = 1 (item 2)    -0.377   0.066       > 0.500
q = 2 (item 3)    -0.409   0.035       > 0.500
q = 3 (item 4)    -0.513   -0.069      > 0.500
q = 4 (item 5)    -0.238   0.206       0.037
q = 5 (item 6)    -0.573   -0.129      0.188
q = 6 (item 7)    -0.668   -0.223      0.079
q = 7 (item 8)    -0.491   -0.049      > 0.500
q = 8 (item 9)    -0.526   -0.082      > 0.500
q = 9 (item 10)   -0.290   0.154       0.118

Table 9
Person parameter estimates from the DIF model

Raw Score  No Predictor  With Predictor
0          -1.720        -1.695
1          -1.347        -1.322
2          -0.997        -0.972
3          -0.662        -0.638
4          -0.337        -0.311
5          -0.015        0.005
6          0.308         0.324
7          0.637         0.649
8          0.976         0.983
9          1.337         1.333
10         1.713         1.707

3.5. Summary and Comments on Practical Issues

In this chapter two examples were presented under one family of extensions of the 1-P HGLLM: the inclusion of a person-level predictor variable. Although this study deals with only one person-level predictor in the model, more than one predictor can be included.

In the first example a person-level predictor is included in the first equation of the level-2 models. As a result, person abilities are decomposed into two parts. This type of analysis is analogous to conducting a two-step analysis of a person characteristic variable on test scores. Such results might be used for accountability purposes, such as accreditation of schools in a state-wide testing program. Also, such results can be integrated into test construction processes in order to detect possible bias in a test or its items. If this is the case, a reduction of the variance of the person abilities (τ) in the model with a person characteristic variable, in comparison to the unconditional model with no predictor variable, is of interest. If τ is considerably reduced, the predictor variable accounts for a large portion of the variation in test performance. This could be an indication of bias in the test, unless there is evidence of similar variation in the criterion. A likelihood ratio test can be used to test whether the reduction in τ is significant.

This approach can also be used to investigate effects of test conditions on test scores. For example, if one is interested in the effect of allowing more time for some examinees, an indicator variable describing the extra time given to each examinee can be included in the model as a predictor variable. This would be a useful analysis if more time were accidentally allowed for some groups of people in a large-scale testing program. Again, the amount of reduction in τ is of interest.

Another practical issue concerns the use of the adjusted ability estimates (u0j*), instead of u0j, as estimates of abilities. Technically, it is reasonable to use u0j* because it provides the estimate of ability after the effects of the predictor variables are taken into account, and the estimates are associated with a smaller overall error. However, in some settings this may not be reasonable for philosophical and political reasons.
According to this approach, two people who obtained the same raw score on the test would have different ability estimates, depending on their measures on the predictor variables. For example, suppose the amount of instruction on a topic (Time) has a positive relationship with ability level. If the coefficient for Time is positive, a person with more instruction would have a lower u0j* value than a person with less instruction, given the same raw score. One disadvantage of this approach is that the person with more instruction is penalized just because that person received more instruction. This could be an unfair judgement of one's ability, especially when performance on the test itself is of interest, as with a criterion-referenced test. On the other hand, it would be an advantage if appropriate justifications are made. An example would be when one believes that test scores are confounded and therefore need to be adjusted.

In summary, the one-step analysis described in this chapter is analogous to conducting a two-step analysis of test data. The two-step analysis involves biased and inconsistent estimates of student abilities, as well as heteroscedastic errors of measurement. Although it is obvious that these are problematic in a two-step analysis, the amount of improvement with the one-step procedure was not investigated here. However, other sources that use different formulations (e.g., Adams, Wilson & Wu, 1997; Zwinderman, 1991) indicate that considerable improvement can be expected when a large set of data is analyzed. This is also true when results are compared with results from an analysis using raw scores, because raw scores are susceptible to both ceiling and floor effects. In future research it is expected that improvements for this specific model will be investigated.

Chapter 4
Multidimensional Models

In this chapter another generalization of the 1-P HGLLM is presented, namely, a multidimensional 1-P HGLLM. It is shown that confirmatory multidimensional Rasch analyses, with both between- and within-item multidimensional structures, can be formulated and analyzed under the multidimensional 1-P HGLLM.

4.1. Model

A multidimensional model can be treated as an extension of the generalized Rasch model. In the unidimensional 1-P HGLLM of the previous sections, only one person-specific latent trait related to the probability of getting a specific item correct was modeled. Now, more than one person-specific latent trait determining such probabilities is assumed. For item i (i = 1, ..., k), person j (j = 1, ..., n), and latent trait s (s = 1, ..., m), the level-1 structural model is expressed as follows:

η_ij = β01j X01ij + ... + β0mj X0mij + β1·j X1ij + ... + β(k-m)·j X(k-m)ij
     = Σ_{s=1}^{m} β0sj X0sij + Σ_{q=1}^{k-m} βq·j Xqij .    (29)

Here, β0sj is a parameter associated with the sth latent trait. A value of 1 is assigned to the dummy variable X0sij if item i is associated with the sth latent trait, and 0 otherwise. As in the unidimensional model, βq·j is the effect of the qth dummy variable. Notice that the second subscript, which would indicate the corresponding latent trait, is dropped and represented by a dot (·), because each item does not have to be associated with only one latent trait. As for the unidimensional model, Equation 29 can be written in matrix form, to show the data layout, as
η_j = W_j B_j ,  with W_j = [D  X]  and  B_j = (B0j', B1')' ,    (30)

where B0j is now an m x 1 column vector consisting of the m latent-trait parameters β01j, ..., β0mj; D is a k x m matrix in which the sth column is a vector of dummy variables for the sth latent trait; B1 is a (k - m) x 1 column vector consisting of the k - m item parameters; and X is a k x (k - m) design matrix consisting of the k - m dummy variables. Note that one dummy variable from each of the m latent traits has to be dropped in order to achieve full rank in the W_j matrix. Then the level-2 models are

β01j = γ010 + u01j
...
β0mj = γ0m0 + u0mj
β1·j = γ1·0
...
β(k-m)·j = γ(k-m)·0 ,    (31)

where

(u01j, ..., u0mj)' ~ N(0, T) ,    (32)

and T is the m x m covariance matrix with elements τ_ss' (s, s' = 1, ..., m). Here, u0sj is the random component of the sth latent trait for the jth person, which implies that each person has a unique value for each latent trait. The variance of the sth latent trait is τ_ss, and it is constant across people. Also, τ_ss' (s ≠ s') is the covariance between the sth and s'th latent traits, and it too is constant across people. Note that when m = 1, all items are associated with the same latent trait and the model (Equations 29, 31, and 32) is exactly the same as the unidimensional 1-P HGLLM.

This multidimensional formulation can be applied directly for confirmatory analysis purposes. It was already mentioned that each item does not need to be associated with only one latent trait. Assume k_s items are associated with the sth latent trait; then k1 + ... + km ≥ k. A test is considered to be multidimensional between items (Adams, Wilson, & Wang, 1997) if k1 + ... + km = k, in effect, when the test consists of several unidimensional subscales. On the other hand, a test is considered to be multidimensional within items (Adams, Wilson, & Wang, 1997) if k1 + ... + km > k, in effect, when at least one of the items is associated with more than one latent trait. Both types of multidimensionality can be modeled using Equation 29, depending upon how the D matrix is defined.

In such multidimensionally constructed tests, one of our interests is how strongly the latent traits are correlated. The multidimensional 1-P HGLLM approach is able to estimate the variance-covariance matrix of the vector of latent traits, (u01j, ..., u0mj)', in Equation 32; consequently, correlation coefficients between latent traits can be estimated. If the latent traits are highly correlated, use of a unidimensional IRT model might still be reasonable or provide meaningful parameter estimates. However, if they are not highly correlated, this would suggest that the use of a unidimensional IRT calibration of item and person parameters is questionable, and multidimensional analysis should be encouraged.

In order to describe the two different model formulations (i.e., between- and within-item multidimensional formulations), assume a 15-item test in which 2 latent traits are involved. Then, the level-1 model is

η_ij = β01j X01ij + β02j X02ij + β1·j X1ij + ... + β13·j X13ij ,    (33)

following Equation 29. There are only 13 dummy variables, because one item from each item group (trait) does not have a dummy variable, that is, (8 - 1) + (7 - 1) = 13. The level-2 models are

β01j = γ010 + u01j
β02j = γ020 + u02j
β1·j = γ1·0
...
β13·j = γ13·0 ,  where  (u01j, u02j)' ~ N( (0, 0)', [τ11 τ12; τ21 τ22] ) ,    (34)

following Equation 31. As mentioned above, both "between" and "within" multidimensionality can be modeled depending on how the D matrix, contained in W_j, is defined.
First, assume between-item multidimensionality in the 15-item test, in which the first 8 items are associated with the first latent trait and the other 7 items are associated with the second latent trait. Then η_j = (η_1j, η_2j, ..., η_15j)', B_j = (β01j, β02j, β1·j, β2·j, ..., β13·j)', and

        [ 1  0  -1   0   0   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0  -1   0   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0  -1   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0  -1   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0  -1   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0   0  -1   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0   0   0  -1   0   0   0   0   0   0 ]
 W_j =  [ 1  0   0   0   0   0   0   0   0   0   0   0   0   0   0 ]    (35)
        [ 0  1   0   0   0   0   0   0   0  -1   0   0   0   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0  -1   0   0   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0  -1   0   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0  -1   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0  -1   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0   0  -1 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0   0   0 ]

represent the between-item multidimensionality. Notice that the dummy variables for the eighth and the fifteenth items are dropped so that W_j has full rank. As a result, -γ010 and -γ020 are the difficulties of items 8 and 15, respectively. Also, γ(1)·0 - γ010, γ(2)·0 - γ010, ..., γ(7)·0 - γ010 are the difficulties of items 1 through 7, while γ(9)·0 - γ020, γ(10)·0 - γ020, ..., γ(14)·0 - γ020 are the difficulties of items 9 through 14. The ability of person j is represented by the vector (u01j, u02j).

On the other hand, assume that the 15-item test has a within-item multidimensional structure, in which items 1 through 10 are associated with the first latent trait and items 9 through 15 are associated with the second latent trait. Then, with η_j and B_j as above,

        [ 1  0  -1   0   0   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0  -1   0   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0  -1   0   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0  -1   0   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0  -1   0   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0   0  -1   0   0   0   0   0   0   0 ]
        [ 1  0   0   0   0   0   0   0  -1   0   0   0   0   0   0 ]
 W_j =  [ 1  0   0   0   0   0   0   0   0   0   0   0   0   0   0 ]    (36)
        [ 1  1   0   0   0   0   0   0   0  -1   0   0   0   0   0 ]
        [ 1  1   0   0   0   0   0   0   0   0  -1   0   0   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0  -1   0   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0  -1   0   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0  -1   0 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0   0  -1 ]
        [ 0  1   0   0   0   0   0   0   0   0   0   0   0   0   0 ]

represents the within-item multidimensionality. Here, the entries of 1 in columns 1 and 2 of W_j for rows 9 and 10 show the items that are associated with both traits. Again, the dummy variables for the eighth and the fifteenth items are dropped in order to achieve full rank for W_j. As mentioned in the previous chapter, any dummy variables can be dropped, as long as the W_j matrix remains of full rank. Here, dropping the dummy variable for the tenth item, instead of the one for the eighth item, would have been more consistent with the earlier presentations, because the tenth item is the last item associated with the first latent trait. However, the eighth item is chosen to be dropped because it is the last item associated with only the first latent trait. As a result, -γ010 and -γ020 are the difficulties of items 8 and 15, respectively. For items 1 through 7, the difficulties are γ(1)·0 - γ010, γ(2)·0 - γ010, ..., γ(7)·0 - γ010, and for items 11 through 14, the difficulties are γ(11)·0 - γ020, γ(12)·0 - γ020, γ(13)·0 - γ020, and γ(14)·0 - γ020. However, since items 9 and 10 relate to both latent traits, their difficulties are γ(9)·0 - γ010 - γ020 and γ(10)·0 - γ010 - γ020, respectively. The abilities of person j are, again, represented by the vector (u01j, u02j).
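To make the two design matrices concrete, the sketch below builds W_j = [D X] for the 15-item example under both specifications, dropping the dummy variables for items 8 and 15 as above. It is a hypothetical Python/NumPy helper, not code from the original study.

import numpy as np

def build_W(trait_items, dropped_items, k=15):
    """Build W = [D  X] for a between- or within-item multidimensional model.
    trait_items: one list of item numbers (1-based) per latent trait.
    dropped_items: items whose -1 dummy column is dropped to achieve full rank."""
    m = len(trait_items)
    D = np.zeros((k, m))
    for s, items in enumerate(trait_items):
        for i in items:
            D[i - 1, s] = 1.0
    kept = [i for i in range(1, k + 1) if i not in dropped_items]
    X = np.zeros((k, len(kept)))
    for col, i in enumerate(kept):
        X[i - 1, col] = -1.0
    return np.hstack([D, X])

# Between-item structure: items 1-8 on trait 1, items 9-15 on trait 2 (Equation 35).
W_between = build_W([list(range(1, 9)), list(range(9, 16))], dropped_items={8, 15})
# Within-item structure: items 1-10 on trait 1, items 9-15 on trait 2 (Equation 36).
W_within = build_W([list(range(1, 11)), list(range(9, 16))], dropped_items={8, 15})
print(W_between.shape, np.linalg.matrix_rank(W_between))   # (15, 15), rank 15
print(W_within.shape, np.linalg.matrix_rank(W_within))     # (15, 15), rank 15

The rank checks illustrate the point made above: with one dummy variable dropped per latent trait, both versions of W_j are of full rank.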
4.2. Illustrative Analysis

As a numerical example, a set of hypothetical data for 15 items and 250 people is generated based on the between-item multidimensional model described above. Two latent traits are modeled in the data set, where the first 8 items are associated with the first latent trait and the next 7 items are associated with the second latent trait. The following item parameter values are arbitrarily chosen: -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, and 2.0. Also, 250 person parameter vectors are independently sampled from the standard bivariate normal distribution, where the two latent traits are assumed to be correlated with magnitude ρ = 0.5, such that

(u01j, u02j)' ~ N( (0, 0)', [1  0.5; 0.5  1] ) .    (37)

Then, dichotomous item response data are generated according to the unidimensional Rasch model for the two item groups separately. Person and item parameters and the τs are estimated by the HLM program.

Table 10
Item parameter estimates of multidimensional data

                 item  MD      UD
Latent Trait 1   1     -2.460  -2.672
                 2     -1.593  -1.653
                 3     -0.928  -0.941
                 4     -0.509  -0.511
                 5     0.064   0.064
                 6     0.654   0.657
                 7     1.240   1.261
                 8     1.724   1.781
Latent Trait 2   9     -1.390  -1.404
                 10    -0.882  -0.884
                 11    -0.273  -0.274
                 12    0.549   0.554
                 13    0.842   0.857
                 14    1.647   1.715
                 15    1.754   1.833

Table 10 shows the estimates of the item parameters from both the unidimensional (UD) and multidimensional (MD) models on the same data. For the UD analysis, two separate UD analyses were conducted for the two item groups. For the MD analysis, the two item groups are analyzed at the same time, using the MD model described above. Item estimates are very close between the UD and MD models; in fact, they are correlated with r = 0.999. This occurs because the model assumes between-item multidimensionality, where no item is associated with both latent traits. As a result, item parameters are estimated as if they were from two separate item groups.

On the other hand, there are at least two notable results for the person parameter estimates from the MD 1-P HGLLM. First, in this particular example, there are 72 possible combinations of raw scores from the two item groups, (k1 + 1)(k2 + 1) = 72, although only 55 patterns are observed here. However, if a single unidimensional model were employed to analyze the 15 items, there would be only 16 possible raw scores and 16 possible corresponding ability estimates. This shows that the multidimensional model lets us distinguish people more finely than the unidimensional model does. Second, although up to 72 possible pairings of the raw scores could be obtained if the two item groups were analyzed separately via unidimensional models, that approach determines the person parameter estimates assuming that the two item groups are independent. As a result, two people who have the same raw score on one item group will have the same person ability estimate for that item group no matter what raw scores they have on the other item group. On the other hand, when the data are analyzed using the multidimensional model, person parameter estimates are determined by the raw score on the target item group as well as the raw scores on the other item groups. In other words, two people who have the same raw score on one item group can have different person parameter estimates if they have different raw scores on any of the other item groups. As a result, in this example, there are 72 possible different person parameter estimates for each of the two item groups.
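The counting argument above can be illustrated with a small simulation sketch. This is hypothetical code, assuming the same between-item structure, item difficulties, and ρ = 0.5 used to generate the illustrative data set.

import numpy as np

rng = np.random.default_rng(2)

k1, k2, n = 8, 7, 250
rho = 0.5
difficulties = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5,
                         -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])

# Correlated abilities for the two latent traits (Equation 37).
theta = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)

# Each item is driven by the trait of its own subscale (between-item structure).
trait_of_item = np.array([0] * k1 + [1] * k2)
p = 1.0 / (1.0 + np.exp(-(theta[:, trait_of_item] - difficulties)))
y = (rng.random(p.shape) < p).astype(int)

raw1 = y[:, :k1].sum(axis=1)
raw2 = y[:, k1:].sum(axis=1)
print((k1 + 1) * (k2 + 1))                          # 72 possible (raw1, raw2) pairs
print(len(set(zip(raw1.tolist(), raw2.tolist()))))  # pairs actually observed (55 in the dissertation's data)
print(len(set((raw1 + raw2).tolist())))             # distinct total raw scores observed (at most 16)

A single unidimensional analysis can distinguish only the total raw scores counted by the last line, whereas the multidimensional analysis can assign a different estimate to each observed (raw1, raw2) pair.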
Table 11 shows estimates of the person parameters from both the UD and MD models. The first five columns contain estimates for the first latent trait, and the next five columns contain estimates for the second latent trait. The columns labeled UD contain estimates from the separate unidimensional analyses, while the columns labeled MD contain estimates from the single multidimensional analysis. Also, the columns labeled LS contain least squares (LS) estimates and the columns labeled EB contain empirical Bayes (EB) estimates. The estimates are grouped by the raw scores on the first latent trait, and the estimates for the second latent trait are ordered by the raw scores on trait 2 within the raw-score groups for the first latent trait.

Table 11
Person parameter estimates of multidimensional data

      Latent Trait 1                           Latent Trait 2
raw   UD LS   UD EB   MD LS   MD EB     raw   UD LS   UD EB   MD LS   MD EB
0     -2.851  -1.279  -2.913  -1.383    1     -1.461  -0.746  -1.542  -1.095
0     -2.851  -1.279  -2.853  -1.276    2     -0.600  -0.324  -0.609  -0.667
0     -2.851  -1.279  -2.800  -1.176    3     0.140   0.078   0.155   -0.267
0     -2.851  -1.279  -2.753  -1.082    4     0.840   0.471   0.851   0.117
0     -2.851  -1.279  -2.710  -0.991    5     1.561   0.865   1.544   0.493
1     -1.955  -0.928  -2.020  -1.158    0     -2.595  -1.203  -2.764  -1.441
1     -1.955  -0.928  -1.990  -1.047    1     -1.461  -0.746  -1.519  -0.986
1     -1.955  -0.928  -1.963  -0.946    2     -0.600  -0.324  -0.609  -0.567
1     -1.955  -0.928  -1.939  -0.852    3     0.140   0.078   0.148   -0.173
1     -1.955  -0.928  -1.917  -0.762    4     0.840   0.471   0.847   0.207
1     -1.955  -0.928  -1.896  -0.674    5     1.561   0.865   1.547   0.583
2     -1.201  -0.591  -1.227  -0.830    0     -2.595  -1.203  -2.683  -1.327
2     -1.201  -0.591  -1.217  -0.725    1     -1.461  -0.746  -1.495  -0.883
2     -1.201  -0.591  -1.208  -0.628    2     -0.600  -0.324  -0.607  -0.472
2     -1.201  -0.591  -1.199  -0.538    3     0.140   0.078   0.144   -0.083
2     -1.201  -0.591  -1.190  -0.451    4     0.840   0.471   0.844   0.295
2     -1.201  -0.591  -1.182  -0.364    5     1.561   0.865   1.551   0.670
2     -1.201  -0.591  -1.174  -0.278    6     2.362   1.269   2.318   1.049
3     -0.525  -0.264  -0.530  -0.512    0     -2.595  -1.203  -2.609  -1.220
3     -0.525  -0.264  -0.529  -0.412    1     -1.461  -0.746  -1.472  -0.785
3     -0.525  -0.264  -0.528  -0.319    2     -0.600  -0.324  -0.603  -0.381
3     -0.525  -0.264  -0.526  -0.231    3     0.140   0.078   0.142   0.005
3     -0.525  -0.264  -0.524  -0.145    4     0.840   0.471   0.843   0.381
3     -0.525  -0.264  -0.522  -0.060    5     1.561   0.865   1.557   0.755
3     -0.525  -0.264  -0.520  0.027     6     2.362   1.269   2.337   1.137
4     0.117   0.059   0.119   -0.202    0     -2.595  -1.203  -2.540  -1.117
4     0.117   0.059   0.117   -0.105    1     -1.461  -0.746  -1.448  -0.691
4     0.117   0.059   0.116   -0.015    2     -0.600  -0.324  -0.597  -0.292
4     0.117   0.059   0.116   0.072     3     0.140   0.078   0.141   0.090
4     0.117   0.059   0.116   0.157     4     0.840   0.471   0.843   0.465
4     0.117   0.059   0.116   0.243     5     1.561   0.865   1.563   0.841
4     0.117   0.059   0.116   0.330     6     2.362   1.269   2.357   1.225
4     0.117   0.059   0.114   0.421     7     3.332   1.694   3.301   1.626
5     0.755   0.381   0.752   0.104     0     -2.595  -1.203  -2.477  -1.017
5     0.755   0.381   0.753   0.199     1     -1.461  -0.746  -1.425  -0.598
5     0.755   0.381   0.754   0.289     2     -0.600  -0.324  -0.592  -0.204
5     0.755   0.381   0.756   0.375     3     0.140   0.078   0.141   0.176
5     0.755   0.381   0.760   0.546     5     1.561   0.865   1.570   0.928
5     0.755   0.381   0.762   0.635     6     2.362   1.269   2.379   1.315
5     0.755   0.381   0.763   0.729     7     3.332   1.694   3.351   1.721
6     1.423   0.707   1.406   0.504     1     -1.461  -0.746  -1.402  -0.507
6     1.423   0.707   1.414   0.593     2     -0.600  -0.324  -0.585  -0.117
6     1.423   0.707   1.422   0.680     3     0.140   0.078   0.141   0.262
6     1.423   0.707   1.431   0.767     4     0.840   0.471   0.844   0.637
6     1.423   0.707   1.440   0.855     5     1.561   0.865   1.578   1.016
6     1.423   0.707   1.450   0.947     6     2.362   1.269   2.401   1.407
6     1.423   0.707   1.459   1.044     7     3.332   1.694   3.406   1.819
7     2.159   1.040   2.109   0.812     1     -1.461  -0.746  -1.380  -0.415
7     2.159   1.040   2.150   0.991     3     0.140   0.078   0.142   0.349
7     2.159   1.040   2.173   1.080     4     0.840   0.471   0.845   0.725
7     2.159   1.040   2.197   1.172     5     1.561   0.865   1.586   1.107
7     2.159   1.040   2.222   1.267     6     2.362   1.269   2.426   1.503
8     3.023   1.386   2.990   1.312     3     0.140   0.078   0.142   0.439
8     3.023   1.386   3.211   1.711     7     3.332   1.694   3.532   2.034

As mentioned above, all 55 observed combinations of raw scores have unique values for both item groups when the EB estimates are obtained from the MD model. For example, the trait-1 person parameter estimate for a person who answered 6 items correctly in the first item group can range from 0.5039 to 1.0438, depending on the raw score on the second item group. On the other hand, when EB estimates are obtained by the UD models separately for the two item groups, that examinee gets the estimate of 0.7069 for latent trait 1 no matter what the raw score on the second item group is.

The plots in Figure 6 are based on the values in Table 11. The two plots in Figure 6-a show the person-parameter estimates from the two separate UD models; the first shows LS estimates and the second shows EB estimates. The EB estimates are considerably shrunken, while the shape of the plot is the same. The two plots in Figure 6-b show the estimates from the MD model. Again, the EB estimates are considerably shrunken, but the shape of the plot also differs from that for the LS estimates. Comparing the two EB plots clearly shows the difference between the estimates from the two models. From the MD analysis, the estimated variance matrix is

var(u01j, u02j)' = [0.719  0.421; 0.421  0.761] .

Figure 6. Plots of person parameter estimates from the UD and MD models: (a) estimates from two unidimensional models; (b) estimates from one multidimensional model (LS and EB estimates for the second latent trait plotted against those for the first latent trait). [plots omitted]
As a result, the correlation coefficient between the two latent traits estimated from the variances and covariance for the MD model is τ̂12/√(τ̂11 τ̂22) = 0.421/√(0.719 x 0.761) = 0.570. On the other hand, the observed correlation between the ability estimates for the two latent traits from the MD analysis is r̂ = 0.658, while r̂ = 0.251 for the UD analysis. Since the data were generated so that the two latent traits are correlated with magnitude ρ = 0.500 in the population, none of the correlation estimates is particularly close. However, the estimate of the correlation based on the estimated variances and covariance of the two latent traits from the MD analysis shows that the MD model is more appropriate, at least for that purpose. Since this result is based on only one replication of simulated data, a more extensive simulation study is needed in order to confirm the recovery of the correlation coefficients between the latent traits, as well as of the τ values themselves, for the MD model.

4.3. Parameter Recovery

In this simulation study I deal with the between-item multidimensional model, as in the previous numerical example. The variables of interest are (a) the number of latent traits (m = 2 and m = 3); (b) sample size (n = 250, 500, and 1000), representing small, medium, and large samples; (c) the magnitude of the correlation between the latent traits (ρ = 0.2, 0.5, and 0.9), intended to represent weak, medium, and strong correlations; and (d) the number of items per latent trait (k_s = 5 and k_s = 10). These variables create 2 x 3 x 3 x 2 = 36 conditions, and the analysis is replicated 50 times for each of the 36 conditions. Table 12 shows the layout of the 36 conditions to be investigated. The item difficulty parameter values are determined based on Table 1, as described previously. For each replication in each of the 36 conditions, person ability values are sampled from a standard multivariate normal distribution,

(u01j, u02j)' ~ N( (0, 0)', [1  ρ; ρ  1] )    (38)

when m = 2, and

(u01j, u02j, u03j)' ~ N( (0, 0, 0)', [1  ρ  ρ; ρ  1  ρ; ρ  ρ  1] )    (39)

when m = 3. Correlation coefficients between the latent traits are then computed as

r̂12 = τ̂12 / √(τ̂11 τ̂22)    (40)

for the 18 conditions with m = 2, and as

r̂12 = τ̂12/√(τ̂11 τ̂22),  r̂13 = τ̂13/√(τ̂11 τ̂33),  and  r̂23 = τ̂23/√(τ̂22 τ̂33)    (41)

for the 18 conditions with m = 3.

Estimated parameter values are compared across the 36 conditions using several statistics and indicators. Those include (a) the mean correlation coefficient between estimated and true parameter values, (b) the standard deviation of the correlation coefficients between estimated and true item-parameter values across the g = 50 replications, and (c) the root mean squared error (RMSE) of τ̂. The RMSE of τ̂ is defined to be

RMSE(τ̂_ss') = √[ Σ_{l=1}^{g} (τ̂_ss',l - τ_ss')² / g ]    (42)

for s ≠ s', where s, s' = 1, 2 when m = 2 and s, s' = 1, 2, 3 when m = 3, and where l indicates the lth replication (l = 1, ..., g). The RMSE of τ̂ is computed for each of the 36 conditions, and the RMSE values are compared across the 36 conditions in order to make inferences about the recovery of the τ values.
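The computations in Equations 40 through 42 are simple enough to sketch directly. The code below is a hypothetical illustration: it uses the covariance matrix reported for the two-trait illustrative analysis above, and made-up covariance estimates for the RMSE example.

import numpy as np

def trait_correlations(tau_hat):
    """Equations 40-41: r_ss' = tau_ss' / sqrt(tau_ss * tau_s's')."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    d = np.sqrt(np.diag(tau_hat))
    return tau_hat / np.outer(d, d)

def rmse(estimates, true_value):
    """Equation 42: RMSE of an estimated covariance element across g replications."""
    estimates = np.asarray(estimates, dtype=float)
    return np.sqrt(np.mean((estimates - true_value) ** 2))

# Covariance matrix reported for the two-trait illustrative analysis:
tau_hat = np.array([[0.719, 0.421],
                    [0.421, 0.761]])
print(trait_correlations(tau_hat)[0, 1])   # about 0.570

# Hypothetical covariance (tau_12) estimates from g = 3 replications, true value 0.5:
print(rmse([0.52, 0.47, 0.56], true_value=0.5))

Because the true variances equal 1 in Equations 38 and 39, the true off-diagonal element τ_ss' equals ρ, which is why the RMSE in Equation 42 speaks to the recovery of the correlation between the latent traits.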
Table 12
The layout of the simulation study for the multidimensional model

                      m = 2                     m = 3
              k_s = 5     k_s = 10      k_s = 5     k_s = 10
rho = 0.2     n = 250     n = 250       n = 250     n = 250
              n = 500     n = 500       n = 500     n = 500
              n = 1000    n = 1000      n = 1000    n = 1000
rho = 0.5     n = 250     n = 250       n = 250     n = 250
              n = 500     n = 500       n = 500     n = 500
              n = 1000    n = 1000      n = 1000    n = 1000
rho = 0.9     n = 250     n = 250       n = 250     n = 250
              n = 500     n = 500       n = 500     n = 500
              n = 1000    n = 1000      n = 1000    n = 1000

The RMSE values are compared across the 36 conditions in order to make inferences about the recovery of the correlation coefficients. Table 13 shows the obtained values of the statistics described above.

Overall, the RMSE(r̂) values are small, less than 0.1 in more than half of the conditions, indicating that in those conditions the r̂ estimates differ from the true correlation by no more than about 0.1 on average. Figure 7 shows three plots of RMSE(r̂). The first plot shows the results when the true ρ equals 0.2. The cases are roughly grouped into two clusters, one consisting of the cases with k_s = 5 and the other of the cases with k_s = 10. This indicates that the RMSE of r̂ is affected more by the number of items within each latent trait than by the number of latent traits to be estimated. In other words, r appears to be estimated more poorly when there are fewer items within each latent trait. This result is consistent with the unidimensional case, where the number of items affects the precision of the estimate of τ. Here, in the multidimensional case, the number of items within each latent trait, rather than the total number of items in the test, is the important factor.

The second plot in Figure 7 is for ρ = 0.5. In this plot, the tendency described above is even more obvious; the RMSEs for k_s = 10 are much smaller than those for k_s = 5. The last plot is for ρ = 0.9. In this plot, the differences between k_s = 5 and k_s = 10 are very small; the correlation coefficients seem to be estimated equally well in all the cases. It is difficult to explain why the differences are small when ρ = 0.9 and large when ρ = 0.5.
Table 13
The results from the parameter recovery study for the multidimensional model

m   k_s   rho   n      RMSE    mean r̂   var(r̂)   r(γ)    se(γ)
2   5     0.2   250    0.205   0.215    0.030     0.993   0.003
2   5     0.2   500    0.133   0.216    0.014     0.997   0.002
2   5     0.2   1000   0.077   0.234    0.007     0.998   0.001
2   5     0.5   250    0.171   0.576    0.032     0.994   0.003
2   5     0.5   500    0.139   0.585    0.013     0.997   0.002
2   5     0.5   1000   0.110   0.581    0.005     0.998   0.001
2   5     0.9   250    0.076   0.943    0.003     0.994   0.003
2   5     0.9   500    0.061   0.950    0.001     0.997   0.002
2   5     0.9   1000   0.060   0.954    0.001     0.998   0.001
2   10    0.2   250    0.106   0.234    0.011     0.992   0.003
2   10    0.2   500    0.064   0.210    0.004     0.996   0.001
2   10    0.2   1000   0.063   0.208    0.003     0.998   0.001
2   10    0.5   250    0.103   0.508    0.008     0.991   0.003
2   10    0.5   500    0.069   0.518    0.002     0.996   0.002
2   10    0.5   1000   0.043   0.526    0.002     0.998   0.001
2   10    0.9   250    0.054   0.925    0.003     0.991   0.003
2   10    0.9   500    0.052   0.944    0.001     0.995   0.001
2   10    0.9   1000   0.044   0.937    0.001     0.998   0.001
3   5     0.2   250    0.170   0.232    0.024     0.991   0.003
3   5     0.2   500    0.123   0.225    0.012     0.996   0.001
3   5     0.2   1000   0.085   0.233    0.006     0.998   0.001
3   5     0.5   250    0.179   0.548    0.028     0.992   0.003
3   5     0.5   500    0.132   0.578    0.010     0.996   0.002
3   5     0.5   1000   0.104   0.574    0.005     0.998   0.001
3   5     0.9   250    0.073   0.927    0.003     0.992   0.003
3   5     0.9   500    0.059   0.937    0.001     0.997   0.001
3   5     0.9   1000   0.050   0.946    0.000     0.998   0.001
3   10    0.2   250    0.111   0.213    0.011     0.992   0.002
3   10    0.2   500    0.073   0.212    0.006     0.996   0.001
3   10    0.2   1000   0.062   0.207    0.002     0.998   0.001
3   10    0.5   250    0.091   0.510    0.009     0.992   0.002
3   10    0.5   500    0.065   0.519    0.003     0.996   0.001
3   10    0.5   1000   0.051   0.517    0.002     0.998   0.001
3   10    0.9   250    0.057   0.912    0.003     0.992   0.003
3   10    0.9   500    0.047   0.926    0.001     0.996   0.001
3   10    0.9   1000   0.037   0.925    0.001     0.998   0.001

Note. RMSE, mean r̂, and var(r̂) refer to the estimated correlations between latent traits across the 50 replications; r(γ) and se(γ) are the mean and standard deviation of the correlations between true and estimated item parameters.

Figure 7. Root mean squared errors of r̂, plotted against sample size (n = 250 to 1000) for the four combinations of s = 2, 3 latent traits and k_s = 5, 10 items per trait; separate panels are shown for ρ = 0.2, ρ = 0.5, and ρ = 0.9.

The top three graphs in Figure 8 (the top graph in each panel) show the mean correlation coefficients between the true and estimated item parameters when the three different correlations between the latent traits are used to generate the data. The three graphs are very similar and show no notable differences; the magnitude of the correlation between the latent traits does not seem to affect the accuracy of item parameter estimation. Item parameters were estimated somewhat less precisely with smaller sample sizes, such as n = 250, but these differences are all in the third decimal place, and all of the correlation coefficients are greater than 0.990. The standard deviations of those correlations are plotted in the bottom three graphs of Figure 8. Although the graphs show a tendency for the standard deviations to be smaller with larger sample sizes, the differences are again in the third decimal place; all of the values are smaller than 0.004.

Figure 8. Mean correlation between true and estimated item parameters and their standard deviations, plotted against sample size for s = 2, 3 and k_s = 5, 10; separate panels are shown for ρ = 0.2, ρ = 0.5, and ρ = 0.9 (top graphs: mean correlations; bottom graphs: standard deviations).

The mean correlation coefficients between latent traits are plotted in the top three graphs of Figure 9. The tendencies are quite similar to those in the plots of the RMSEs of r̂ in Figure 7. This makes sense because the correlations are computed from the τ̂ components. Overall, the correlation coefficients tend to be overestimated, especially when k_s = 5. This happens because the variance components of τ tend to be underestimated, while the covariance components of τ tend not to be underestimated as much as the variance components are. This is evident from the results, where RMSE(τ̂_ss) = 0.225 for the variance components and RMSE(τ̂_ss') = 0.086 for the covariance components (s ≠ s').
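To see how this pattern inflates the correlation estimate, consider a purely hypothetical example (the numbers below are chosen only for illustration and are not taken from the study): suppose the true variances and covariance are τ11 = τ22 = 1.0 and τ12 = 0.50, so ρ = 0.50, but the variances are estimated with a 20% downward bias while the covariance is biased downward by only 4%. Then

\[
r = \frac{0.50}{\sqrt{1.0 \times 1.0}} = 0.50,
\qquad
\hat{r} = \frac{0.48}{\sqrt{0.80 \times 0.80}} = 0.60,
\]

so the estimated correlation exceeds the true value even though every component is underestimated.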
When the true correlation between the latent traits is ρ = 0.9, the differences between the mean correlations for the four conditions in Figure 9 are very small; the mean r̂ values are all between 0.9 and 0.95. In addition, when ρ = 0.9, the standard deviations of r̂ across the 50 replications are all small; they are always less than 0.003 (the standard deviations of r̂ are plotted in the bottom graph of each panel of Figure 9). Similarly, when ρ = 0.2, the differences in the mean r̂ between the four groups are small; the mean r̂ values are all between 0.2 and 0.24. However, the standard deviations of those r̂ values are not as small as when ρ = 0.9. When ρ = 0.5, the mean r̂ values range between 0.5 and 0.6; they are fairly close to 0.5 when k_s = 10, but closer to 0.6 when k_s = 5. The patterns observed for the standard deviations are very similar to those for ρ = 0.2.

In summary, the simulation study shows that the τ values and the correlation coefficients between latent traits are well estimated when there are more items within each latent trait, regardless of the number of dimensions to be estimated. Item parameters are well estimated regardless of the number of items within each latent trait.

Figure 9. Mean correlation between latent traits and their standard deviations, plotted against sample size for s = 2, 3 and k_s = 5, 10; separate panels are shown for ρ = 0.2, ρ = 0.5, and ρ = 0.9 (top graphs: mean correlations; bottom graphs: standard deviations).

4.4. Summary and Comments on Practical Issues

In this chapter the 1-P HGLLM was extended to both between-item and within-item multidimensional models. Between-item multidimensionality can exist in a test that is constructed so that more than one group of items is intended to measure different abilities. A good example is a testlet-based test. A testlet is defined to be "a group of items related to a single content that is developed as a unit and contains a fixed number of predetermined paths that an examinee may follow" (Wainer & Kiely, 1987, p. 190). For example, a series of reading-test items that are based on the same reading passage can be thought of as a testlet.
Another example is a series of science-test items that are based on the same scenario. When a test is composed of several testlets, it is referred to as a "testlet-based test". Examples include the reading section of the Test of English as a Foreign Language (TOEFL) and the Michigan Educational Assessment of Progress (MEAP) reading and science tests. On the other hand, within-item multidimensionality can exist in a test that is constructed so that some items measure distinct latent traits and some items measure more than one latent trait. A good example is a science test in which some items are strictly about either physical or natural sciences (not both) and some items require knowledge of both physical and natural sciences.

The multidimensional extension of the 1-P HGLLM can be utilized to estimate the magnitude of correlation between latent traits in a multidimensionally constructed test, such as the examples given above. It will also be a useful tool in construct validation for test and questionnaire construction. In such a case, a group of items intended to measure the same psychological construct is represented by a common latent trait.

Chapter 5
Three-Level Model

In this chapter another extension of the 1-P HGLLM, a three-level model, is presented. In the three-level formulation an additional level is considered; here the school level is used as the example. An illustrative analysis is presented as well.

5.1. Model

Suppose the third level of the model represents schools. Let p_ijm be the probability that person j in school m answers item i correctly. The additional subscript m indicates schools, in contrast to the two-level model, in which p has only two subscripts. The level-1 model is an item-level model, as in the two-level case. It is written as

\[
\eta_{ijm} = \beta_{0jm} + \beta_{1jm}X_{1ijm} + \beta_{2jm}X_{2ijm} + \cdots + \beta_{(k-1)jm}X_{(k-1)ijm}, \tag{43}
\]

where i = 1, ..., k, j = 1, ..., n, and m = 1, ..., r. It is identical to the level-1 model in the two-level model, except for the additional subscript for schools. X_qijm is the qth dummy variable (q = 1, ..., k − 1) for item i, for person j in school m; β_0jm is the overall effect across items, and β_qjm is the effect of the qth item, the effect of the kth item being constrained to 0. As in the two-level formulation, the β_qjm are not assumed to be constant across people, or across schools, in this level-1 model.

The level-2 models for the β_qjm parameters are person-level models; they allow the intercept β_0jm to vary across people while specifying that the item effects are constant across people. The person-level models for person j in school m are written as

\[
\begin{cases}
\beta_{0jm} = \gamma_{00m} + u_{0jm} \\
\beta_{1jm} = \gamma_{10m} \\
\beta_{2jm} = \gamma_{20m} \\
\quad\vdots \\
\beta_{(k-1)jm} = \gamma_{(k-1)0m},
\end{cases} \tag{44}
\]

where u_0jm ~ N(0, τ_π). Again, these models are identical to the level-2 models in the two-level formulation, except for the extra subscript m. Here u_0jm indicates how much person j in school m deviates from the average ability of school m, which is denoted r_00m below. γ_00m is the overall effect of school m across items, and γ_q0m is the effect of the qth item in school m, the effect of the kth item again being 0. The level-2 models do not assume that item effects are constant across schools.

Now an additional level-3 model, a school-level model, specifies that the item effects are constant across schools and that the overall effect across items (γ_00m) contains a school effect as well. For school m, we have

\[
\begin{cases}
\gamma_{00m} = \gamma_{000} + r_{00m} \\
\gamma_{10m} = \gamma_{100} \\
\gamma_{20m} = \gamma_{200} \\
\quad\vdots \\
\gamma_{(k-1)0m} = \gamma_{(k-1)00},
\end{cases} \tag{45}
\]

where γ_000 is the fixed component of γ_00m and r_00m is the random component of γ_00m.
On the other hand, γ_10m through γ_(k−1)0m have only fixed components. Here r_00m ~ N(0, τ_β). As a result, when the same dummy coding is used as in the two-level model, the combined model is

\[
p_{ijm} = \frac{1}{1 + \exp\{-[(r_{00m} + u_{0jm}) - (\gamma_{i00} - \gamma_{000})]\}}, \tag{46}
\]

where γ_i00 − γ_000 is the item difficulty for item i (i = 1, ..., k − 1) and −γ_000 is the item difficulty for item k. This parallels the two-level model, in which the item difficulty for the ith item is expressed as γ_i0 − γ_00. In this three-level formulation, r_00m + u_0jm is the ability parameter for person j in school m. Unlike the ability parameters in the two-level model, the ability parameters in the three-level model consist of two parts. First, r_00m is the random effect associated with school m and can be interpreted as the average ability of students in school m. Second, u_0jm is the person-specific ability of person j in school m, indicating how much the ability of person j deviates from the average ability of students in school m.

5.2. Illustrative Analysis

In order to illustrate parameter estimation in this 3-level model, a hypothetical data set is analyzed. In the data set, binary response data are generated based on the following specifications. A sample of 1000 students is assumed to come from the standard normal distribution. The 1000 students are further assumed to be in 20 different schools, with 50 students in each school. The 20 school means are sampled from a normal distribution with a mean of 0 and a variance of 0.5. This data generation is achieved by generating school mean abilities and individual abilities separately. First, 20 school means μ_m (m = 1, ..., 20) are sampled from N(0, σ²), where σ² = 0.5. Then 50 individual student ability measures are sampled separately for each school from N(μ_m, 1 − σ²), where μ_m is the school mean for school m generated in the previous step, and σ² = 0.5 in this case. This two-step data generation results in a sample of 1000 students from N(0, 1), while the school means are from N(0, 0.5). First, the data are analyzed with the 2-level model, in which school effects are ignored and only item and person parameters are estimated. Then the same data set is analyzed with the 3-level model, in which school mean abilities are estimated in addition to the item and person parameters.

The item parameter estimates from both the 2- and 3-level models are shown in Table 14. As the table shows, the item parameters are estimated almost identically in the 2- and 3-level models; they differ only in the fourth decimal place, at most. This makes sense because γ_i0 − γ_00 in Equation 12 is algebraically equivalent to γ_i00 − γ_000 in Equation 46.

Similar results are observed for the least squares (LS) person ability estimates. Here, let û_0j, for j = 1, ..., n, be the LS estimates of person abilities from the 2-level model, in order to distinguish them from the empirical Bayes (EB) estimates that have been used throughout this study. Also, let û_0jm + r̂_00m be the LS estimates of person abilities from the 3-level model. Figure 10-a shows the relationship between û_0j and û_0jm + r̂_00m. The two sets of values are almost identical, except for a small amount of variation at the highest and lowest raw scores. The computed correlation coefficient between the two estimates is 0.9996, further evidence that û_0j and û_0jm + r̂_00m are essentially identical. This observation makes sense because u_0j equals u_0jm + r_00m algebraically, from Equations 12 and 46.
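The following is a minimal sketch, in Python, of the two-step data generation described in Section 5.2 together with response generation under Equation 46. The ten item difficulties and the random seed are hypothetical choices for illustration; the estimation of the 2- and 3-level models themselves is carried out with the HLM program and is not reproduced here.

```python
# Two-step data generation for the illustrative 3-level analysis:
# 20 schools, 50 students per school, school-mean variance 0.5.
import numpy as np

rng = np.random.default_rng(1)

n_schools, n_per_school, sigma2 = 20, 50, 0.5
difficulties = np.linspace(-2.0, 2.0, 10)           # hypothetical item difficulties

# Step 1: school means r_00m ~ N(0, sigma2).
school_means = rng.normal(0.0, np.sqrt(sigma2), size=n_schools)
# Step 2: person deviations u_0jm ~ N(0, 1 - sigma2) within each school,
# so that the marginal ability distribution is N(0, 1).
person_dev = rng.normal(0.0, np.sqrt(1.0 - sigma2), size=(n_schools, n_per_school))
ability = school_means[:, None] + person_dev         # r_00m + u_0jm

# Equation 46: p_ijm = 1 / (1 + exp(-[(r_00m + u_0jm) - delta_i])),
# where delta_i = gamma_i00 - gamma_000 is the difficulty of item i.
logits = ability[:, :, None] - difficulties[None, None, :]
responses = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))   # (school, person, item)

print("marginal ability SD:", ability.std())          # should be close to 1
print("SD of school means:", school_means.std())      # should be close to sqrt(0.5)
```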
On the other hand, when EB estimation is employed, the person abilities are estimated quite differently in the 2- and 3-level models. Figure 10-b shows a scatter plot of the EB person ability estimates from the 2- and 3-level models. Here the EB person ability estimates from the 2-level model are û_0j, and the EB estimates from the 3-level model are r̂_00m + û_0jm.

Table 14
Item parameter estimates from 2-level and 3-level models

          2-level      3-level      true value
item1    -1.620590    -1.621330    -2.00
item2    -1.180070    -1.180280    -1.50
item3    -0.762940    -0.762980    -1.00
item4    -0.294100    -0.294100    -0.50
item5     0.212798     0.212799     0.25
item6     0.455718     0.455726     0.50
item7     0.919703     0.919794     1.00
item8     1.398376     1.398796     1.50
item9     1.710175     1.711024     2.00
item10    0.096074     0.096074     0.00

Figure 10. Person parameter estimates from the 2-level and 3-level models: (a) LS estimates, û_0jm + r̂_00m from the 3-level model plotted against û_0j from the 2-level model; (b) EB estimates, r̂_00m + û_0jm from the 3-level model plotted against û_0j from the 2-level model.

People who obtained the same raw score have the same EB ability estimate in the 2-level model, because the 2-level model is identical to the Rasch model. However, people with the same raw score have different EB ability estimates in the 3-level model. Through EB estimation, the 3-level model estimates person abilities so that individual abilities are normally distributed within schools. As a result, via EB estimation, the 3-level model estimates abilities differently for people with the same raw score, depending on which school they are in.

The correlation coefficient between û_0j and r̂_00m + û_0jm is 0.91. Although this indicates a strong relationship, û_0j and r̂_00m + û_0jm are quite different, as Figure 10 shows. However, if the weights for r̂_00m and û_0jm are allowed to differ from 1 so as to maximize the correlation between û_0j and a·r̂_00m + b·û_0jm, where a and b are constants, the multiple correlation of û_0j with a·r̂_00m + b·û_0jm becomes 0.9996, which is almost 1.00. More specifically, the correlation between û_0j and 0.591·r̂_00m + 1.489·û_0jm is 0.9996 (see Figure 11). This is because the EB estimates are the result of a "double shrinkage," in which the LS estimates are shrunk by both the level-2 and level-3 reliabilities.

The fact that the EB 3-level estimates distinguish among people who obtained the same raw score improves the relationship between the estimated abilities and the true abilities.

Figure 11. The relationship between the estimates from the 2-level model and the linear combination of the estimates from the 3-level model (0.591 × school mean + 1.489 × person deviation, plotted against the estimates from the 2-level model).
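How such weights can be obtained is sketched below: the 2-level EB estimates are regressed (without intercept) on the two 3-level EB components, and the multiple correlation is computed. The arrays in the sketch are synthetic placeholders, since the actual EB estimates come from the HLM program; only the procedure is illustrated.

```python
# Least-squares recovery of the weights a and b in a*r00m-hat + b*u0jm-hat.
import numpy as np

def eb_weights(eb_2level, eb_school, eb_person):
    """Return the least-squares weights (a, b) and the multiple correlation."""
    X = np.column_stack([eb_school, eb_person])
    coef, *_ = np.linalg.lstsq(X, eb_2level, rcond=None)
    fitted = X @ coef
    r = np.corrcoef(fitted, eb_2level)[0, 1]
    return coef, r

# Hypothetical placeholder estimates, only to make the sketch runnable:
rng = np.random.default_rng(2)
eb_school = rng.normal(size=1000)
eb_person = rng.normal(size=1000)
eb_2level = 0.6 * eb_school + 1.5 * eb_person + rng.normal(scale=0.05, size=1000)

(a, b), r = eb_weights(eb_2level, eb_school, eb_person)
print(f"a = {a:.3f}, b = {b:.3f}, multiple correlation = {r:.4f}")
```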
The first scatter plot in Figure 12 shows the relationship between the EB person ability estimates from the 2-level model and the true parameter values. There are only 11 possible ability groups, because the 2-level model is equivalent to the Rasch model, and the Rasch model estimates person abilities based only on raw scores. This scatter plot shows a large amount of estimation error. For example, in the 2-level model, people who obtained a raw score of 8 are estimated as having an ability of 1.04 on the logit scale, yet the true ability values for those people range from approximately −0.2 to +2.9 (see the third cluster from the right in the plot). This large variability is seen in the other score groups as well. When there are more items in a test, this variability should decrease to some degree, as discussed in Chapter 2. The variability is somewhat reduced when the person abilities are obtained from EB estimation in the 3-level model, as the lower graph in Figure 12 shows. The root mean squared error (RMSE) of the person parameter estimates is 0.628 for the EB 2-level estimation and 0.538 for the EB 3-level estimation, about 14% less than the RMSE from the 2-level estimation. In other words, EB 3-level estimation reduces the amount of estimation error in the person ability parameters. This also suggests that the 3-level model has the capacity to include level-3 predictors and estimate their effects, provided abilities are clustered within level-3 units.

Figure 12. Person parameter estimates from the 2-level and 3-level models compared to the true values (upper graph: true values plotted against the EB estimates from the 2-level model; lower graph: true values plotted against the EB estimates from the 3-level model).

Estimation of school means is dramatically improved when EB 3-level estimation is employed. The EB estimates of the school means from both the 2- and 3-level models, along with the true school means, are listed in Table 15. The school means for the 2-level model are obtained by simply computing the mean of the ability estimates within each school, in effect the mean of û_0j within each school, while the school means from the 3-level model are r̂_00m. The estimates from the 3-level model are closer to the true values than the estimates from the 2-level model for all but two of the 20 schools in this analysis (see the asterisks in Table 15). The root mean squared error of the school means is 0.120 for the 3-level model, a 66% reduction from 0.354 for the 2-level model. On the other hand, the estimated school means from the 3-level model are less variable, resulting in a smaller standard deviation of estimated school means (0.518 for the 3-level model) than the standard deviation of the true school means (0.636). This is again a result of the double shrinkage, in which the level-3 random effects are corrected by the reliabilities of both the second and third levels. Overall, I conclude that the 3-level model approach in this example enables one to estimate school means better than the conventional Rasch model or the 2-level 1-P HGLLM approach.
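As a quick arithmetic check of the reductions just reported, the percentages follow directly from the RMSE values given in the text:

```python
# Percent reduction in RMSE, using the values reported above
# (0.628 vs. 0.538 for person abilities, 0.354 vs. 0.120 for school means).
def pct_reduction(rmse_2level, rmse_3level):
    return 100.0 * (rmse_2level - rmse_3level) / rmse_2level

print(f"person abilities: {pct_reduction(0.628, 0.538):.1f}% reduction")  # ~14%
print(f"school means:     {pct_reduction(0.354, 0.120):.1f}% reduction")  # ~66%
```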
Table 15
Estimates of school means from 2-level and 3-level models

school    2-level    3-level    true value
1          0.6463     1.0489     1.0241
2*        -0.0037    -0.0025    -0.0500
3         -0.1987    -0.3063    -0.2732
4*         0.3185     0.5116     0.3950
5          0.2287     0.3670     0.3175
6          0.7211     1.1774     1.1659
7          0.2299     0.3671     0.3246
8          0.3079     0.4935     0.5247
9         -0.1777    -0.2738    -0.3143
10        -0.1756    -0.2726    -0.3504
11        -0.8477    -1.4355    -1.6018
12        -0.5384    -0.8717    -1.0193
13        -0.6587    -1.0831    -1.3294
14         0.4360     0.7012     0.8988
15         0.6667     1.0831     1.1190
16        -0.2320    -0.3658    -0.5703
17         0.0269     0.0479    -0.03964
18        -0.2413    -0.3779    -0.4936
19        -0.0393    -0.0554    -0.1419
20        -0.4690    -0.7540    -0.8681

Note. Only the schools marked with an asterisk show 2-level estimates closer to the true value.

In summary, while u_0j equals u_0jm + r_00m algebraically, this relationship holds only for the LS estimates; it does not hold for the EB estimates. Instead, the EB weighting scheme is such that û_0j equals 0.591·r̂_00m + 1.489·û_0jm in this example. The EB estimates from the 3-level model thus distinguish among people who obtain the same raw score. This yields improved person ability estimates as well as improved school mean estimates, at least in this specific illustrative analysis. The relationship is worth investigating further in a more extensive manner.

5.3. Summary and Comments on Practical Issues

The three-level formulation becomes quite useful when school mean abilities are of interest in addition to students' individual abilities. For example, when a large-scale assessment is used for the purpose of school accreditation, indicators that represent students' average performance within schools are needed. For a criterion-referenced test, the percentage of students who exceed the passing score is often used as an indicator of school performance. That is a good indicator if the percentage of students who do not exceed the passing score is the concern for accreditation. However, if the concern is more with the average performance of students within schools, estimates of the mean ability of each school are more appropriate. The three-level 1-P HGLLM approach can provide accurate estimates of school mean abilities, as described in this chapter.

This three-level formulation further enables one to include school-characteristic variables, as well as student-characteristic variables, in more complex models. This would be analogous to conducting a two-level HLM analysis with the Rasch model embedded therein. As mentioned in Section 3.1, this is a one-step analysis that avoids the bias and inconsistency of MMLE-based person parameter estimates, as well as heteroscedastic measurement errors in the outcome variable. Furthermore, the fact that the 3-level formulation reduces measurement error in both the person ability estimates and the school mean estimates should improve the estimates of the effects of person- and school-characteristic variables. Again, this one-step analysis of test data can be applied for the purpose of school accreditation. Judging students' performance based only on a test might be unfair for accreditation; differences in demographic variables, such as socioeconomic status (SES), might confound a school's average performance. If this is the case, using the level-3 school mean ability to represent students' performance after adjusting for the effects of such confounding variables would be appropriate.

One drawback of this 3-level formulation is that the model assumes equal variances of student abilities across schools.
If this assumption is not met, it can result in unfair judgments about individual abilities. Robustness to violations of this assumption should be investigated. Also, the fact that students in better schools would get credit just because they are in good schools may not be acceptable philosophically and politically, even though it is evident that the 3-level formulation improves person parameter estimates on average. A similar argument applies to the use of students' ability estimates when person-characteristic variables are taken into account, as discussed in Chapter 2. The impact of using person parameter estimates from the 3-level formulation is another potential area for future investigation. Despite such uncertainties, the EB 3-level analysis is promising as a way of both conducting a one-step analysis of student- and school-characteristic variables and estimating school mean abilities.

Chapter 6
Conclusions

In this chapter a summary of this dissertation is provided. Also, several comments on practical issues are given, some of them in addition to the comments given in the previous three chapters. Finally, several recommendations for future research are suggested.

6.1. Summary

One purpose of this study was to show the equivalence of the Rasch model and the one-parameter hierarchical generalized linear logistic model (1-P HGLLM), both algebraically and numerically. In Chapter 2, I showed that the standard binary-response Rasch model can be reformulated as a special case of the HGLM. This study also confirmed that person parameters estimated by the currently available algorithms in the HLM program were very close to the estimates from the BILOG program, although the HLM estimates had smaller variance. Also, item parameter estimates from the HLM and BILOG programs were somewhat different when rescaling was done.

Another purpose of this study was to present various possible extensions of the 1-P HGLLM. In the previous three chapters, three extensions of the 1-P HGLLM were presented: a two-level model with a person-level predictor, a two-level multidimensional model, and a three-level model. These extensions are not exhaustive. Other possible extensions of the 1-P HGLLM include (a) a two-level model with more than one person predictor variable in the level-2 models, (b) a two-level model with one or more item-characteristic variables in the level-1 model, (c) a two-level multidimensional model with person-characteristic variables in the level-2 models or item-characteristic variables in the level-1 model, (d) a three-level multidimensional model, and (e) a three-level multidimensional model with school-characteristic variables in the level-3 models, person-characteristic variables in the level-2 models, or item-characteristic variables in the level-1 model. Parameters for all of these cases can be estimated with the currently available HLM program.

These extensions have a variety of potential uses in practical settings. For example, it would be possible to predict item difficulties from item characteristics, using a model with item-characteristic variables, instead of item indicators, in the level-1 model. This would be very useful for a test in which item tryouts are not desirable or not possible for some reason, such as high item security. One would be able to learn which item characteristics determine item difficulty from previously administered items; a minimal regression sketch of this idea follows.
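The sketch below illustrates the idea with an ordinary linear regression of estimated item difficulties on item characteristics. The two characteristics (word count and number of solution steps) and all numeric values are hypothetical; in the 1-P HGLLM itself the same idea would be implemented in one step by replacing the item indicator variables in the level-1 model with the item-characteristic variables.

```python
# Predicting item difficulty from item characteristics (hypothetical features).
import numpy as np

rng = np.random.default_rng(3)

n_items = 40
word_count = rng.integers(10, 120, size=n_items)          # hypothetical feature 1
n_steps = rng.integers(1, 6, size=n_items)                # hypothetical feature 2
difficulty = -1.0 + 0.01 * word_count + 0.4 * n_steps + rng.normal(0, 0.3, n_items)

X = np.column_stack([np.ones(n_items), word_count, n_steps])
coef, *_ = np.linalg.lstsq(X, difficulty, rcond=None)

# Predicted difficulty of a newly written (untried) item:
new_item = np.array([1.0, 80.0, 3.0])                     # intercept, words, steps
print("estimated coefficients:", coef)
print("predicted difficulty for the new item:", new_item @ coef)
```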
Then, item-characteristic analyses of newly developed items could predict their difficulties before the items are given to examinees. A combination of this approach and a live equating procedure would help ensure a sound assessment even when items are not tried out.

The most important contribution of this study is that this generalization of the Rasch model as a hierarchical model allows one to add an additional level of data in a relatively simple way. Thus one can include multi-level linear predictors (e.g., school-level predictors as well as person-level predictors) in one analysis, although I did not deal with this extension in this study. This is an important contribution because the inclusion of linear predictors in an IRT model has been limited to two-level models in the past. In addition, I have shown that the parameters can be estimated with the currently available algorithms in the HLM program. Also, I showed that the generalized model can represent a multidimensional Rasch model. This, too, is an important contribution, because it can be readily applied to confirmatory multidimensional analysis in a multi-level Rasch model as well as in a regular single-level Rasch model.

Furthermore, IRT models, including the Rasch model, are often thought of as specialized statistical and psychometric models for item response data. Specialized estimation algorithms, often in specialized software, have been considered necessary for parameter estimation. For this reason, IRT models and their parameter estimation are often treated completely separately from other statistical models. This study clarified that the Rasch model is a special case of the HGLM and that its parameters can be estimated within the framework of the HGLM using algorithms in the HLM program, which are used for more general purposes. This perspective would be useful for didactic purposes, as discussed in Chapter 2.

6.2. Comments on Practical Issues

As discussed in the previous three chapters, this HGLM approach to test data analysis is widely applicable. However, I am not insisting that the 1-P HGLLM should replace the Rasch model. As I mentioned in Chapter 2, I do not believe that the 1-P HGLLM should be used for the sole purpose of item and person parameter estimation. The advantage of the 1-P HGLLM is that it can handle tasks that a conventional Rasch analysis does not deal with, including simultaneous analysis of person-level predictors and item responses, confirmatory multidimensional analysis, and 3-level analysis of item response data.

In a large-scale testing program, for example, the 1-P HGLLM could provide advantages at various stages of test construction and test result reporting. Including person-characteristic variables in an analysis could provide important information during pilot testing, in order to identify possibly biased items. Conducting a confirmatory multidimensional analysis would provide additional information for construct validation if the test is multidimensionally constructed. Also, when school-level performance is reported instead of individual performance, the 3-level model would provide better estimates of mean school ability, as discussed in detail in Chapter 5.

Another practical benefit, not mentioned in previous chapters but important, is that the 1-P HGLLM can handle missing data. This means that all students do not have to take the same set of items for the data to be analyzed.
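A minimal sketch of the data layout that makes this possible is shown below: in the long (person-by-item) format used for the level-1 model, a student simply contributes no record for an item he or she was not administered. The three-booklet design and all values are hypothetical.

```python
# Long-format item response data with missingness by design.
import pandas as pd

records = [
    # person, school, item, response
    (1, 1, "item01", 1), (1, 1, "item02", 0), (1, 1, "item03", 1),   # booklet A
    (2, 1, "item03", 1), (2, 1, "item04", 1), (2, 1, "item05", 0),   # booklet B
    (3, 2, "item01", 0), (3, 2, "item04", 1), (3, 2, "item05", 1),   # booklet C
]
long_data = pd.DataFrame(records, columns=["person", "school", "item", "response"])
print(long_data)
# Item 3 links booklets A and B, and items 1, 4, and 5 link booklet C to the
# others, so all item parameters can be placed on a common scale.
```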
This flexibility would allow one to estimate person and school abilities from matrix-sampled test data, along with student- and school-level predictors, all at the same time. One could also perform anchor-item test equating in one analysis, because all item parameters can be estimated on a common scale even if students take different sets of items.

Despite these many advantages, one drawback of the 1-P HGLLM in practical settings is that it does not provide standard errors for the person ability estimates, although it provides an estimate of the variance of the latent trait distribution. This may become a concern if a testing program decides to report person-ability estimates from the 1-P HGLLM. Parameter estimates are expected to be accompanied by their standard errors, because estimates with smaller standard errors are preferable to ones with larger standard errors. For this reason, it will be crucial for the 1-P HGLLM to provide standard errors for the random components of the model, in effect the individual ability estimates and the school abilities, if these are to be used in reporting ability estimates. This limitation does not, however, affect the reliability of the item-parameter estimates or of the estimates of the effects of predictor variables.

This study did not conduct a systematic investigation of the sample sizes required for 1-P HGLLM analyses; therefore, only approximate recommendations can be given. The three simulation studies in Chapters 2, 3, and 4 provided similar results in terms of the effect of sample size on the quality of the estimates. A sample size of 250 produced means of the estimates across replications that were as good as those from sample sizes of 500 and 1000. However, the standard deviations of the estimates across replications became much smaller when the sample size was increased to 500, while the differences between 500 and 1000 were not as large. Therefore, a sample size of 500 is recommended for item-parameter estimation, estimation of the effects of person-characteristic variables, and estimation of the correlation coefficients between latent traits. However, no sample size between 250 and 500 was investigated in the simulation studies, and more complicated designs with more parameters to be estimated were not investigated either. Further systematic investigation is desirable to provide more comprehensive recommendations on this issue.

Overall, anyone who analyzes test data in relation to demographic variables is encouraged to have software available for conducting 1-P HGLLM analyses. Although it takes some effort to become familiar with the software, the flexibility of the 1-P HGLLM analysis is worth the investment of time. The same recommendation applies to a testing program that wants or needs to estimate and report school-level results.

6.3. Suggestions for Future Research and Recommendations

Several recommendations can be made regarding practical issues. First, the reformulation of the Rasch model in terms of the HGLM should be presented for didactic purposes, as discussed in Chapter 2. Presenting this formulation would provide an alternative view of IRT models, which might help learners of IRT obtain a more general view of it, in terms of both model formulation and the interpretation of parameters. Second, the practical impact of using person parameter estimates from the 3-level formulation should be investigated.
It was noted in Chapter 5 that estimating person abilities from the 3-level model generally gives an advantage to people who are in a group with a higher mean; this may be acceptable in many cases, but not always. Finally, the obvious next step is to utilize this generalized model to answer real research questions using real data sets. For example, using the 3-level formulation, an evaluation of school performance along with person- and school-level characteristic variables could be of interest to a testing program, as mentioned in Chapter 5. Other applications of the 1-P HGLLM, compared with traditional approaches, are also encouraged.

Several technical problems also remain unsolved. First, it was demonstrated in Chapter 2 that item parameter estimates were somewhat different between the 1-P HGLLM and BILOG, while person parameter estimates were very close. The difference might be a result of differences in the error structures of the two models: the Rasch model considers errors in both item and person parameters, while the 1-P HGLLM is formulated so that all random variation is attributed to person variation. If this is the case, then together with the fact that the 1-P HGLLM yields a smaller variance of the person estimates (see Chapter 2), it might indicate superior parameter estimation for the 1-P HGLLM (i.e., the 1-P HGLLM may be better at controlling error). This issue needs further attention.

Second, it was mentioned in Chapter 5 that the 3-level formulation assumes equal variances of the level-3-unit distributions. In the illustrative analysis, the model assumed equal variances for the 20 schools. For this reason, results from data with unequal variation in level-3 units may be misleading. Robustness to violations of this assumption should be investigated in future research.

Third, it was mentioned in the previous section that test equating can be done in one analysis, because the 1-P HGLLM can handle missing data. More accurate equating results may be obtained because of the possible advantage that the 1-P HGLLM handles errors better (see above) and because of the one-step nature of the analysis in comparison to a two-step analysis (see Chapter 3). However, since there is no evidence at this point that the 1-P HGLLM always provides better results, a valuable task would be to investigate how much accuracy can be gained compared with other conventional equating procedures.

Fourth, there is a possibility that an estimate of the variance of a level-2 random component (τ) could be used as a measure of item fit for the corresponding item. If a model is specified so that each level-2 equation has its own random component, each random component can be conceptualized as the variation of the corresponding item difficulty across people. Large variation in the random component would mean that the effect of the item varies considerably across people (i.e., different difficulty for different people). Since an item should have the same difficulty across people, this would serve as a measure of item fit. This approach also has potential for use in differential item functioning (DIF) analysis along with level-2 predictor variables. It is important that this approach be examined further in future research.

Finally, although this study was limited to the Rasch model (i.e., the one-parameter logistic model), I hope to extend the generalization to the two-parameter logistic model, as well as to polytomous response models.
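As a sketch of the fourth suggestion above (using the two-level notation of Chapter 2; this specification was not estimated in this study), each level-2 equation would receive its own random term, and the estimated variance of that term would serve as the item-fit indicator for item q:

\[
\beta_{qj} = \gamma_{q0} + u_{qj}, \qquad u_{qj} \sim N(0, \tau_q), \qquad q = 1, \ldots, k-1,
\]

where a large estimate of τ_q indicates that the effect of item q varies considerably across people, that is, poor fit of item q to the Rasch model.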
In summary, the application of the hierarchical generalized linear model (HGLM) to item response theory (IRT) is new, yet applicable to a wide range of applied research, and the formulation is useful for pedagogical purposes. However, further investigation is needed to explore the behavior of parameter estimates from the HLM program, especially for the multidimensional models. Some practical benefits and advantages of using parameter estimates from this approach also remain to be investigated.

APPENDICES

APPENDIX A

The full matrix representation of Equation 9, for person j

For η_ij, X, and β_j as defined in Section 2.1.2,

\[
\begin{pmatrix}
\eta_{1j} \\ \eta_{2j} \\ \vdots \\ \eta_{(k-1)j} \\ \eta_{kj}
\end{pmatrix}_{(k\times1)}
=
\begin{pmatrix}
X_{01j} & X_{11j} & X_{21j} & \cdots & X_{(k-1)1j} \\
X_{02j} & X_{12j} & X_{22j} & \cdots & X_{(k-1)2j} \\
\vdots  & \vdots  & \vdots  &        & \vdots      \\
X_{0(k-1)j} & X_{1(k-1)j} & X_{2(k-1)j} & \cdots & X_{(k-1)(k-1)j} \\
X_{0kj} & X_{1kj} & X_{2kj} & \cdots & X_{(k-1)kj}
\end{pmatrix}_{(k\times k)}
\begin{pmatrix}
\beta_{0j} \\ \beta_{1j} \\ \vdots \\ \beta_{(k-1)j}
\end{pmatrix}_{(k\times1)} . \tag{A1}
\]

Assign −1 to X_qij if q = i, and 0 if q ≠ i, except when q = 0, in which case X_0ij = 1. Then,

\[
\begin{pmatrix}
\eta_{1j} \\ \eta_{2j} \\ \vdots \\ \eta_{(k-1)j} \\ \eta_{kj}
\end{pmatrix}
=
\begin{pmatrix}
1 & -1 & 0 & \cdots & 0 \\
1 & 0 & -1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 0 & 0 & \cdots & -1 \\
1 & 0 & 0 & \cdots & 0
\end{pmatrix}
\begin{pmatrix}
\beta_{0j} \\ \beta_{1j} \\ \vdots \\ \beta_{(k-1)j}
\end{pmatrix} . \tag{A2}
\]

The equation above represents a set of k equations for person j, specifically

\[
\eta_{1j} = \beta_{0j} - \beta_{1j}, \quad
\eta_{2j} = \beta_{0j} - \beta_{2j}, \quad \ldots, \quad
\eta_{(k-1)j} = \beta_{0j} - \beta_{(k-1)j}, \quad
\eta_{kj} = \beta_{0j} . \tag{A3}
\]

APPENDIX B

The full matrix representation of Equation 17

For η_ij, X_0sij, X_qij, β_0sj, and β_qj as defined in Section 3.2.1,

\[
\begin{pmatrix}
\eta_{1j} \\ \eta_{2j} \\ \vdots \\ \eta_{(k-1)j} \\ \eta_{kj}
\end{pmatrix}_{(k\times1)}
=
\begin{pmatrix}
X_{011j} & \cdots & X_{0m1j} & X_{11j} & X_{21j} & \cdots & X_{(k-m)1j} \\
X_{012j} & \cdots & X_{0m2j} & X_{12j} & X_{22j} & \cdots & X_{(k-m)2j} \\
\vdots   &        & \vdots   & \vdots  & \vdots  &        & \vdots      \\
X_{01(k-1)j} & \cdots & X_{0m(k-1)j} & X_{1(k-1)j} & X_{2(k-1)j} & \cdots & X_{(k-m)(k-1)j} \\
X_{01kj} & \cdots & X_{0mkj} & X_{1kj} & X_{2kj} & \cdots & X_{(k-m)kj}
\end{pmatrix}_{(k\times k)}
\begin{pmatrix}
\beta_{01j} \\ \vdots \\ \beta_{0mj} \\ \beta_{1j} \\ \beta_{2j} \\ \vdots \\ \beta_{(k-m)j}
\end{pmatrix}_{(k\times1)} . \tag{B1}
\]

Assign 1 to X_0sij if item i is associated with the sth latent trait of the m traits, and 0 otherwise. Assign −1 to X_qij if q = i, and 0 if q ≠ i.
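The coding scheme in Appendices A and B can be illustrated with a short sketch (k = 5 items and the β values below are arbitrary choices for the illustration):

```python
# Dummy-coded design matrix of Appendix A: column 0 is the intercept indicator
# (X_0ij = 1), and column q carries -1 in the row of item q (q = 1, ..., k-1),
# so that eta_ij = beta_0j - beta_ij for i < k and eta_kj = beta_0j (Equation A3).
import numpy as np

def design_matrix(k):
    X = np.zeros((k, k))
    X[:, 0] = 1.0                      # X_0ij = 1 for every item
    for i in range(1, k):
        X[i - 1, i] = -1.0             # X_qij = -1 when q = i, 0 otherwise
    return X

k = 5
X = design_matrix(k)
beta = np.array([0.3, -1.0, -0.5, 0.0, 0.8])   # hypothetical beta_0j ... beta_(k-1)j
print(X)
print("eta_j =", X @ beta)             # last entry equals beta_0j = 0.3
```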
REFERENCES

Adams, R. J., & Wilson, M. (1996). Formulating the Rasch model as a mixed coefficients multinomial logit. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory and practice (Vol. 3, pp. 143-166). Norwood, NJ: Ablex.

Adams, R. J., Wilson, M., & Wang, W. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1-23.

Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22(1), 47-76.

Andersen, E. B. (1972). The solution of a set of conditional estimation equations. Journal of the Royal Statistical Society, 34, 42-54.

Andersen, E. B., & Madsen, M. (1977). Estimating parameters of the latent population distribution. Psychometrika, 42, 357-374.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: An application of the EM algorithm. Psychometrika, 46, 443-459.

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-187.

Bock, R. D., Muraki, E., & Pfeiffenberger, W. (1988). Item pool maintenance in the presence of item parameter drift. Journal of Educational Measurement, 25(4), 275-285.

Breslow, N., & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9-25.

Bryk, A. S., Raudenbush, S. W., & Congdon, R. (1996). HLM: Hierarchical linear and nonlinear modeling with the HLM/2L and HLM/3L programs. Chicago: Scientific Software Inc.

Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.

Fischer, G. H. (1983a). Logistic latent trait models with linear constraints. Psychometrika, 48(1), 3-26.

Fischer, G. H. (1983b). Some latent trait models for measuring change in qualitative observations. In D. J. Weiss (Ed.), New horizons in testing (pp. 309-329). New York: Academic Press.

Fischer, G. H. (1995). The linear logistic test model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Goldstein, H. (1980). Dimensionality, bias, independence and measurement scale problems in latent trait test score models. British Journal of Mathematical and Statistical Psychology, 33, 234-260.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.

Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago: MESA Press.

Lord, F. M. (1984). Maximum likelihood and Bayesian parameter estimation in IRT (RR-84-30-ONR). Princeton, NJ: Educational Testing Service.

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman and Hall.

Mellenbergh, G. J. (1994). Generalized linear item response theory. Psychological Bulletin, 115(2), 300-307.

Mellenbergh, G. J., & Vijn, P. (1981). The Rasch model as a loglinear model. Applied Psychological Measurement, 5(3), 369-376.

Mislevy, R. J. (1987). Exploiting auxiliary information about examinees in the estimation of item parameters. Applied Psychological Measurement, 11(1), 81-91.

Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item analysis and test scoring with binary logistic models. Chicago: Scientific Software Inc.

Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32.

Raudenbush, S. W. (1995). Posterior modal estimation for hierarchical generalized linear models with application to dichotomous and count data. Unpublished manuscript, Michigan State University.

Snijders, T. A. B. (1991). Enumeration and simulation methods for 0-1 matrices with given marginals. Psychometrika, 56, 397-417.

Stiratelli, R., Laird, N., & Ware, J. H. (1984). Random effects models for serial observations with binary responses. Biometrics, 40, 961-971.

Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter logistic model. Psychometrika, 47, 175-186.

Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567-577.

Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24(3), 185-201.

Wong, G. Y., & Mason, W. M. (1985). The hierarchical logistic regression model for multilevel analysis. Journal of the American Statistical Association, 80, 513-524.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.

Yang, M. (1995). A simulation study for the assessment of the non-linear hierarchical model estimation via approximate maximum likelihood. Unpublished manuscript, Michigan State University.

Yang, M. (1998). Increasing the efficiency in estimating multilevel Bernoulli models. Unpublished doctoral dissertation, Michigan State University.

Zwinderman, A. H. (1991). A generalized Rasch model for manifest predictors. Psychometrika, 56(4), 589-600.