INVESTIGATING UNOBSERVED HETEROGENEITY USING ITEM RESPONSE THEORY MIXTURE MODELS

By

Dipendra Raj Subedi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

2009

ABSTRACT

INVESTIGATING UNOBSERVED HETEROGENEITY USING ITEM RESPONSE THEORY MIXTURE MODELS

By

Dipendra Raj Subedi

Many item response theory (IRT) scaling and scoring models assume that examinee samples have comparable test-taking behaviors and comparable performance among different subgroups (e.g., gender and ethnicity) or in different test-taking contexts (e.g., geographic location or test-taking mode). However, in some situations this assumption of test-taking homogeneity may not hold, and test-taking heterogeneity is said to exist. When these sources of heterogeneity are unobservable (e.g., when examinees exhibit unexpected guessing behaviors), IRT mixture modeling (MixIRT) may be preferable to traditional IRT modeling (i.e., the two- and three-parameter logistic models) for adjusting the parameter estimation inaccuracies that might otherwise occur in the presence of unobserved heterogeneity. Therefore, the goals of this study were to investigate: a) the estimation accuracy of MixIRT models when test-taking heterogeneity exists, and b) the efficiency of MixIRT models in identifying subsets of examinees whose item responses do not fit the specified IRT model. Additionally, given the difficulty of estimating MixIRT parameters, Bayesian modeling with the Markov chain Monte Carlo method was used, and the robustness of MixIRT modeling was investigated through a simulation study. The simulation study examined several realistic testing factors, including test-taker sample size, test length, and the proportion of test-taking heterogeneity in the form of examinee guessing behavior. Varying these testing factors allowed evaluation of the impact of test-taking heterogeneity on the accuracy of parameter estimation. The results of the simulation study showed that the MixIRT model provided more accurate parameter estimates than traditional IRT models and was quite efficient in identifying subsets of examinees with anomalous test-taking behaviors. A real data example corroborated the simulation study results.

Copyright by DIPENDRA RAJ SUBEDI 2009

Dedication

To my beloved parents: Bhoj Raj Subedi and Shanti Subedi

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to the many people who assisted me throughout my doctoral studies in the Measurement and Quantitative Methods program at Michigan State University.
First of all, I would like to express my sincere gratitude to my dissertation director and co-chair of the guidance committee, Dr. Mark Reckase. I am deeply indebted to him for his guidance and tremendous support throughout my dissertation. Dr. Kimberly Maier, co-chair of the guidance committee, provided continuous support from the beginning of my doctoral studies. The other dissertation committee members, Dr. Yeow Meng Thum and Dr. Punya Mishra, also provided excellent help and insightful comments during the various stages of my dissertation. My special appreciation goes to Dr. Raymond Mapuranga for his excellent help and camaraderie from the beginning of my doctoral studies to the completion of this dissertation; I especially appreciate his editorial comments on my dissertation. I would also like to thank Dr. Joseph Martineau at the Michigan Department of Education for giving me an excellent opportunity to gain hands-on experience with various psychometric analyses while I was still a graduate student. I am very grateful to Professor Murari Suvedi and Dr. Bishwa Adhikari for their continuous support from the beginning of my graduate studies at MSU.

I would like to express my special thanks to the Graduate School at MSU for providing me a dissertation completion fellowship. Thanks to my writing group members at MSU, Michael Sherry, Sungworn Ngudgratoke, and Young Yee Kim, for their constructive feedback and comments on my writing. Also, thanks to my professors at MSU, who provided me various teaching and research assistantships. Thanks to Adam Wyse and Minh Duong for their wonderful friendship. I would also like to take this opportunity to express my appreciation to my colleagues and mentors at the American Institutes for Research, Washington, DC, for their understanding during the final editing stage of my dissertation.

Finally, my special thanks go to my family. First, to my wife Shanta Subedi, for her love, patience, and continuous support throughout this journey. To my parents, who stood by me at every stage of my career with their unconditional support. To my brother Bishwa Subedi and other family members for their wonderful support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS

CHAPTER 1 INTRODUCTION
1.1 Background
1.2 Anomalous Examinee Test-taking Behavior
1.3 Test-taking Heterogeneity
1.4 Traditional Item Response Theory Modeling
1.5 Motivation
1.6 Purpose

CHAPTER 2 LITERATURE REVIEW
2.1 Modeling Sources of Unobserved Heterogeneity
2.2 Mixture Distributions and Mixture IRT Models
2.3 Psychometric Applications of Mixture Modeling
2.4 Mixture IRT Model Parameter Estimation
2.4.1 Frequentist Approach to Parameter Estimation
2.4.2 Bayesian Approach
2.4.3 Markov chain Monte Carlo Algorithm
2.5 Research Questions

CHAPTER 3 METHODOLOGY
3.1 Models
3.1.1 Model 1: Mixture IRT model with completely random guessing behavior (MixIRT-R)
3.1.2 Specification of the MixIRT-R Model in WinBUGS
3.1.3 Model 2: Mixture IRT model with ability-based guessing (MixIRT-A)
3.2 Simulation Study
3.2.1 Simulation Factors or Study Design
3.2.2 Generation of Simulated Parameters and Item Responses
3.2.3 Parameter Estimation
3.2.4 Evaluation Criteria and Analysis of Simulated Data
3.2.5 Simulation Study using Mixture IRT Model with Ability-based Guessing
3.3 Empirical Data Analysis
3.3.1 Analysis of Empirical Data

CHAPTER 4 RESULTS
4.1 Descriptive Statistics of Simulated Item Parameters
4.2 Evaluation of Parameter Estimate Convergence
4.3 Results of MixIRT-R Model Simulation Analyses
4.3.1 Results from the Parameter Recovery Study
4.3.2 Classification Accuracy of the MixIRT-R Model
4.4 Results from Simulation Analyses using MixIRT-A Model
4.5 Results from Empirical Data Analysis
4.5.1 Results Based on the Random Guessing Model
4.5.2 Results Based on the Ability-based Guessing Model

CHAPTER 5 DISCUSSION AND CONCLUSIONS
5.1 Interpretations of the Results
5.1.1 Results from Parameter Recovery Study
5.1.2 Results on Classification Accuracy
5.1.3 Results from Empirical Study
5.2 Study Limitations
5.3 Implications
5.4 Future Directions
5.5 Summary of the Findings and Conclusions

APPENDICES
REFERENCES

LIST OF TABLES

3.1 Summary of Parameter Recovery Study Factors
4.1 Descriptive Statistics for the Simulated Item Parameters
4.2 Descriptive Statistics of MixIRT Estimates for Selected Parameters
4.3 Bias and RMSE of Item Difficulty Parameter Estimates
4.4 Bias and RMSE of Item Discrimination Parameter Estimates
4.5 Correlations between True and Estimated Item Parameters
4.6 Bias and RMSE of Ability Parameter Estimates for all Simulation Conditions
4.7 Correlations between Simulated and Estimated Ability Parameters for all Simulated Conditions
4.8 Classification Accuracy in MixIRT-R Model
4.9 Descriptive Statistics of Simulated Item Parameters in MixIRT-A Model
4.10 RMSE of Discrimination and Difficulty Parameter Estimates using MixIRT-A Model
4.11 Correlation of Discrimination and Difficulty Parameter Estimates using MixIRT-A Model
4.12 RMSE of Ability Parameter Estimates in MixIRT-A Model
4.13 Correlation of Ability Parameter Estimates in MixIRT-A Model
4.14 Classification Accuracy of MixIRT-A Model
4.15 MixIRT-R Estimates for Training Sample
4.16 MixIRT-R Estimates for Validation Sample
4.17 Distribution of Proficiency Levels in Original and Modified Training Sample
4.18 Distribution of Proficiency Levels in Original and Modified Validation Sample
4.19 Test Statistics from Two-sample Kolmogorov-Smirnov Test
4.20 Distribution of Proficiency Levels in Original and Modified Training Sample
4.21 Distribution of Proficiency Levels in Original and Modified Validation Sample
4.22 Test Statistics from Two-sample Kolmogorov-Smirnov Test

LIST OF FIGURES

4.1 Sample plots for convergence assessment of discrimination parameter estimate
4.2 25-item test average bias results for difficulty parameter estimates
4.3 50-item test average bias results for difficulty parameter estimates
4.4 25-item test average RMSE results for discrimination parameter estimates
4.5 50-item test average RMSE results for discrimination parameter estimates
4.6 25-item test average correlations between true and estimated a-parameters
4.7 50-item test average correlations between true and estimated a-parameters
4.8 25-item test average correlations between true and estimated b-parameters
4.9 50-item test average correlations between true and estimated b-parameters
4.10 Recovery of a and b parameters in the 2PL model for sample size of 500, test length of 25, and 10% proportion of guessers
4.11 Recovery of a and b parameters in the MixIRT model for sample size of 500, test length of 25, and 10% proportion of guessers
4.12 Recovery of a and b parameters in the 2PL model for sample size of 2000, test length of 50, and 10% proportion of guessers
4.13 Recovery of a and b parameters in the MixIRT model for sample size of 2000, test length of 50, and 10% proportion of guessers
4.14 25-item test average RMSE results for ability parameter estimates
4.15 50-item test average RMSE results for ability parameter estimates
4.16 RMSE of discrimination parameter estimates in MixIRT-A model
4.17 RMSE of difficulty parameter estimates in MixIRT-A model
4.18 RMSE of ability parameter estimates in MixIRT-A model
4.19 Number of examinees identified as guessers in training and validation sample

KEY TO ABBREVIATIONS

IRT: Item response theory
2PL: Two-parameter logistic model
3PL: Three-parameter logistic model
MixIRT: Mixture item response theory model
MIRT: Multidimensional item response theory model
RMSE: Root mean squared error
MixIRT-R: Mixture item response theory model with random guessing
MixIRT-A: Mixture item response theory model with ability-based guessing

CHAPTER 1 INTRODUCTION

1.1 Background

The K-12 test-based accountability system has gained increasing attention from researchers, educators, and policy-makers since the implementation of the "No Child Left Behind" legislation (NCLB, 2001). This legislation was designed to improve existing educational practice and student academic achievement through improved teaching and curriculum (Hamilton, Stecher, & Klein, 2002). This renewed attention to K-12 education also led to increased interest in comparative and international assessments because of the ever-widening subject matter knowledge gap between US students and their counterparts in other industrialized countries.
For example, results from the Third International Mathematics and Science Study showed that US fourth- and eighth-grade students had comparatively lower mathematics and science achievement than students in many other developed countries (Gonzales et al., 2000; Lemke & Gonzales, 2006). This increased focus on large-scale assessment has also resulted in added scrutiny of psychometric modeling approaches. Specifically, the accuracy of parameter estimates from traditional psychometric models has come into question because these models do not efficiently account for anomalous examinee behaviors such as cheating and guessing. These undesirable behaviors can lead to aberrant responses that obscure the accuracy of inferences drawn about student knowledge and academic skills.

1.2 Anomalous Examinee Test-taking Behavior

As noted previously, the accuracy of psychometric parameter estimates used in large-scale assessments is fallible when examinees exhibit anomalous test-taking behavior. Common anomalous behaviors include guessing, cheating, and low motivation. Cheating can be defined as any action that decreases the accuracy of the intended inferences based on the examinee's performance, thus threatening the validity of the inference about the test-taker (Cizek, 2001). Examinee motivation can be defined as the degree of effort test-takers expend, particularly when given a low-stakes assessment. Several recent studies have investigated each of these testing phenomena; however, this study focuses only on the heterogeneity introduced by test-takers' guessing behavior.

Guessing may occur when test-takers run out of time on the test, when they are less motivated, or when they find test items difficult. Guessing behavior, however, varies depending upon the nature of the test (low-stakes or high-stakes), item difficulty, examinee ability, the time available to complete the test, and cross-cultural differences among examinees. The validity of the inference made using scores is partially dependent on the amount of effort put forth by the examinee while taking the test (Wise, 2006). Furthermore, when examinees do not give adequate effort, they tend to guess randomly, which makes it difficult to estimate the test-taker's true subject matter proficiency (Budescu & Bar-Hillel, 1993). Therefore, anomalous test-taking behavior is a particular concern in low-stakes tests, where examinees are more likely to have low motivation and to cheat or guess excessively. In low-stakes tests like the National Assessment of Educational Progress, attempts to mitigate the negative consequences of unusual examinee behavior include the use of shorter tests with specialized data collection designs like balanced incomplete blocking (Johnson, 1992). Even formula scoring does not adequately deter guessing on tests (Frary, 1988). It is also particularly important to identify and account for anomalous examinee behavior in the current NCLB era because examinee test scores are an integral aspect of data-driven educational policy. For example, adequate yearly progress (AYP) decisions are based strongly on examinee test scores, and yet few model-based or sample-specific adjustments are made in the estimation of psychometric parameters.

1.3 Test-taking Heterogeneity

It is important to study the undesirable test-taking behaviors described previously because they can have negative consequences on the interpretation and accuracy of psychometric models.
In particular, the psychometric models used in low-stakes tests carry the underlying assumption that the same item parameters and ability distributions apply to all examinees taking the test. This is known as the assumption of test-taking homogeneity (e.g., Baker & Kim, 2004; Bock & Zimowski, 1997; Lord, 1980). But as noted previously, examinees often exhibit unconventional test-taking patterns such as cheating, excessive guessing, and low motivation. When such anomalous behaviors exist, test-taking heterogeneity is said to exist; it is often evidenced by sample variability among different groups of test-takers. An example pertaining to this study would be guessers and non-guessers. Similarly, test-taking behavior may differ across groups of examinees. For example, the Graduate Record Examinations' (GRE) verbal assessment is administered to native and non-native English speakers whose different English proficiency could impact their performance on the test and not allow the two groups to be analyzed together.

In situations where the group membership of test-takers is not observable, unobservable test-taking heterogeneity is said to exist. In contrast, if the source of heterogeneity can be observed in the data (e.g., gender, ethnicity), the observable heterogeneity makes it convenient to stratify test-takers for validation using multi-group analyses (Muthén & Lehman, 1985). Multi-group analyses are important in psychometrics, and when test-taker characteristics are not observable, a set of models called latent class models can be used for multi-group analyses. In the case of low-stakes tests, a form of latent class models called mixture models has recently been used by researchers for multi-group analyses when test-taking heterogeneity is unobservable (Bock & Zimowski, 1997). As a result, mixture models have been extended to item response theory (Lord, 1980), a framework that analyzes the interaction between examinee ability and test items and that underlies the models commonly used in low-stakes tests. IRT is the most common modeling approach used in many assessments, including NAEP, TIMSS, and PISA.

1.4 Traditional Item Response Theory Modeling

Traditional item response theory (IRT) modeling allows examinee performance on each test item to be succinctly quantified across all examinees. The three assumptions of traditional IRT are dimensionality, local independence, and the existence of a monotonically increasing function (Hambleton, 1989). First, the dimensionality assumption indicates that a test should measure only one ability, personality trait, or attitude; this is called unidimensionality. When more than one ability is assumed to exist, the models are called multidimensional IRT models (Hambleton & Swaminathan, 1985; Reckase, 1997). Local independence implies that no item should provide clues to the answers of other items in a test (Hambleton & Swaminathan, 1985). Finally, the assumption of a monotonically increasing function relates the probability of success on an item to the ability measured by the item. A common traditional IRT model is the three-parameter logistic (3PL) model, which is represented mathematically as:

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}, \qquad i = 1, 2, \ldots, n \qquad (1.1)$$

where $P_i(\theta)$ is the probability that a test-taker with ability $\theta$ answers item $i$ correctly, $a_i$ is the item discrimination, $b_i$ is the item difficulty, and $c_i$ is the pseudo-guessing parameter (Hambleton & Swaminathan, 1985). Another common model, the two-parameter logistic (2PL) model, is obtained by setting $c_i = 0$ in Equation 1.1, and the one-parameter logistic (1PL) model is obtained by additionally setting $a_i = 1$.
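For readers who prefer code, the following Python sketch (an illustration added here, not code from this dissertation) evaluates Equation 1.1; setting c = 0 recovers the 2PL model, and additionally setting a = 1 recovers the 1PL model. The parameter values are invented for demonstration.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (Equation 1.1)."""
    # e^{a(theta-b)} / (1 + e^{a(theta-b)}) is the logistic function of a(theta-b)
    logit = a * (theta - b)
    return c + (1.0 - c) / (1.0 + np.exp(-logit))

theta = np.array([-2.0, 0.0, 2.0])          # three hypothetical ability levels
print(p_3pl(theta, a=1.2, b=0.5, c=0.2))    # 3PL
print(p_3pl(theta, a=1.2, b=0.5, c=0.0))    # 2PL: c = 0
print(p_3pl(theta, a=1.0, b=0.5, c=0.0))    # 1PL: c = 0 and a = 1
```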
Although these traditional IRT models are useful for quantifying examinee ability, they are not able to account for unobservable test-taking heterogeneity, which may result in inaccurate parameter estimates. As noted above, mixture models are capable of accounting for unobservable heterogeneity, and extensions of these models within the IRT framework have produced so-called mixture IRT models, or MixIRT for short. MixIRT models provide greater flexibility in modeling complex item response distributions (McLachlan & Peel, 2000). Hence, in the low-stakes testing context, MixIRT models would be particularly useful, which provides the impetus for investigating their robustness in modeling anomalous test-taking behavior, as described in the section which follows.

1.5 Motivation

Given the limitations of traditional IRT models in accounting for test-taking heterogeneity, it is important to investigate the efficiency and accuracy of MixIRT models in estimating parameters. Moreover, the estimation of latent distributions (e.g., Mislevy, 1984) is an important area of psychometric research because even the most intuitively appealing and creative models are not useful unless their parameters can be estimated accurately. Specifically, modeling unobserved test-taking heterogeneity such as aberrant item responses is crucial because ignoring it can lead to biased parameter estimates and may yield inflated estimates of measurement precision and test reliability (Lord & Novick, 1968; Muthén, 1989). Furthermore, the inaccurate estimation of examinee latent traits can have consequential impacts such as false interpretation of student ability and erroneous measurement of school and teacher effectiveness (Ansari, Jedidi, & Dube, 2002).

Given the limitations of IRT modeling articulated above, specifically in the modeling of guessing, the commonly used 3PL model is unlikely to suffice for psychometric modeling in large-scale assessments, because it restricts the guessing parameter to be item dependent. Most importantly, the 3PL model is incapable of identifying whether individual test-takers actually guess; rather, it models guessing over the entire sample and hence models guessing inadequately. Therefore, a subsidiary motivation of this study is to explicate the implications of modeling guessing or random response behavior at the item level, as in current IRT practice, rather than at the person level. Moreover, the MixIRT approach taken in this study is more appropriate than traditional IRT for providing evidence of the impact of guessing on individual test items, and it has the secondary advantage of possibly identifying students with low motivation.

1.6 Purpose

Mixture model parameters are estimated using either frequentist or Bayesian approaches. As described in Chapter Two, several practical problems arise in the frequentist approach to mixture model parameter estimation (Frühwirth-Schnatter, 2006). On the other hand, Bayesian estimation methods can handle high-dimensional problems and allow exploration of the distributions of parameters, regardless of the distributional forms of the likelihood functions or parameters.
In addition, model complexity increases with the number of parameters to be estimated, as in a mixture model, particularly one with a large number of mixture components. Therefore, this study focused on a Bayesian approach to parameter estimation in mixture IRT models, with specific emphasis on item parameter estimation, test-taker cluster identification, and proficiency level classification. These issues are of increased interest among researchers, policymakers, and educators in the current era of test-based accountability systems. In particular, this study compared the performance of Bayesian mixture IRT modeling to common IRT models in estimating person and item parameters and in identifying aberrant responses and low-motivation test-takers.

The remainder of the dissertation is divided into four chapters. Chapter Two reviews the literature that lays out the important empirical and theoretical foundation for this dissertation. The third chapter presents the methodology and the research design, the implementation of Bayesian estimation methods, and the mixture model analysis. The results from both the simulation and empirical data analyses are presented in Chapter Four. Finally, Chapter Five provides discussion, limitations, suggestions for further research, and a summary of results and conclusions.

CHAPTER 2 LITERATURE REVIEW

As noted in the previous chapter, the purpose of this study is to investigate and illustrate the efficacy of using mixture models and a Bayesian approach in estimating item parameters and test-taker ability under the IRT framework. Therefore, the purpose of this chapter is to introduce important Bayesian and mixture modeling concepts that are pertinent to this study. In the sections which follow, descriptions of mixture distributions, mixture model parameter estimation, Bayesian statistical modeling, prior research on psychometric applications of mixture models, and the modeling of guessing behaviors in tests are provided.

2.1 Modeling Sources of Unobserved Heterogeneity

The latent structure model (Goodman, 1974; Lazarsfeld & Henry, 1968) is used to explain underlying, unobservable, or latent categorical relationships, and it offers an efficient way of uncovering distinct subpopulations, incorporating correlated non-normally distributed outcomes, and classifying individuals into classes. That is, these models can serve as possible elucidations of the observed relationships among a set of manifest variables (Goodman, 1974). Depending upon the nature of the variables used, various types of models can be defined under this framework. Specifically, mixture modeling is categorized as a subset of latent structure models in which latent variables that represent subpopulations are used for modeling population membership. Mixture models in the context of IRT are presented next.

2.2 Mixture Distributions and Mixture IRT Models

Mixture distributions are comprised of a finite or infinite number of components, possibly of different distributional types, that can describe different features of data. A mixture model is a flexible tool for modeling complex data through an appropriate choice of components that accurately represent the data's true characteristics (McLachlan & Peel, 2000). As a result, mixture models are a valuable tool for analyzing a wide variety of latent trait phenomena. Mathematically, a mixture model can be represented by the observation of $n$ independent random variables $x_1, x_2, \ldots, x_n$ from a $k$-component mixture density, as denoted by Equation 2.1:

$$p(x_i) = \sum_{j=1}^{k} \pi_j f_j(x_i), \qquad i = 1, \ldots, n \qquad (2.1)$$

where $\pi_j > 0$, $j = 1, \ldots, k$; $\pi_1 + \cdots + \pi_k = 1$; and $f_j(x)$, $1 \le j \le k$, are the component densities of the mixture, with $\pi_1, \ldots, \pi_k$ the mixing proportions. These proportions allow estimation of the size of the subgroups in the sample.
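As a concrete, purely illustrative instance of Equation 2.1, the sketch below evaluates a two-component normal mixture; the normal components and the example weights are assumptions for illustration, not choices made in this chapter.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, weights, means, sds):
    """k-component normal mixture density: p(x) = sum_j pi_j * f_j(x) (Equation 2.1)."""
    weights = np.asarray(weights, dtype=float)
    # Mixing proportions must be positive and sum to one.
    assert np.all(weights > 0) and np.isclose(weights.sum(), 1.0)
    return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, sds))

# Two hypothetical latent subgroups: 80% of the sample centered at 0, 20% at 2.
x = np.linspace(-4.0, 5.0, 7)
print(mixture_density(x, weights=[0.8, 0.2], means=[0.0, 2.0], sds=[1.0, 0.7]))
```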
Mixture IRT (MixIRT) models are a combination of latent class analysis (LCA) and IRT models (Asparouhov & Muthén, 2008); LCA is a statistical method used to identify homogeneous groups, or classes, from categorical multivariate data. The development of MixIRT models has been motivated primarily by the diverse phenomena encountered when modeling data from populations that are potentially non-homogeneous (von Davier & Rost, 2007), such as a heterogeneous population of guessers. In addition, MixIRT models are useful for testing the population invariance of item parameters and the ability distribution. Basically, these models are based on the assumption that the population under investigation is composed of two or more latent subpopulations, dictated by different degrees of latent traits, each of which responds differentially to psychological tasks and stimuli (Draney, Wilson, Gluck, & Spiel, 2008). One of the most general MixIRT models is the mixed Rasch model (Rost, 1990), in which each examinee is parameterized both by a class membership parameter ($g = 1, \ldots, G$) and a within-class ability parameter ($\theta_g$). The probability of a correct response ($U = 1$) to item $i$ is represented mathematically as:

$$P(U_{ig} = 1 \mid \theta_g) = \frac{e^{(\theta_g - b_{ig})}}{1 + e^{(\theta_g - b_{ig})}}$$

where $b_{ig}$ is the difficulty of item $i$ within class $g$.

3.1.3 Model 2: Mixture IRT model with ability-based guessing (MixIRT-A)

Under the MixIRT-A model, the probability of a correct response by examinee $i$ to item $j$ is:

$$P(U_{ij} = 1) = \frac{\exp\!\left[a_j(\theta_i - b_j) - \eta_i\, I(b_j > \theta_i + \delta_i)\,\big(a_j(\theta_i - b_j) - c_j\big)\right]}{1 + \exp\!\left[a_j(\theta_i - b_j) - \eta_i\, I(b_j > \theta_i + \delta_i)\,\big(a_j(\theta_i - b_j) - c_j\big)\right]} \qquad (3.4)$$

where $\eta_i = 1$ if examinee $i$ is a guesser and 0 otherwise, and $\delta_i$ is a parameter that measures the difficulty threshold at which a guesser begins to guess. In other words, some examinees may use their full potential, or even try eliminating one or two choices before making their guess; others may not use their full potential, thus guessing on those items that are difficult for them. The indicator function, represented as $I(\cdot)$ in Equation 3.4, becomes 1 only if the difficulty parameter of item $j$ is larger than the ability of examinee $i$, with some degree of adjustment controlled by the threshold parameter $\delta_i$. (Note that when $\eta_i\, I(\cdot) = 1$, the exponent in Equation 3.4 reduces to $c_j$, so $c_j$ governs the guessing probability on item $j$.) The current study allowed $\delta_i$ to vary among examinees because different examinees have different thresholds in terms of their tendency to guess.

The priors for $\theta_i$, $a_j$, $b_j$, $\tau_\theta$, and $\tau_b$ are the same as those used in Model 1 above. Appendix A.2 provides the WinBUGS code used to implement the MixIRT-A model. The prior for $\eta_i$ is similar to that used for $g_i$ in MixIRT-R, which corresponds to a categorical representation of group identification. The hyperparameter of this distribution is parameterized by a Dirichlet distribution, which is a conjugate prior for estimating the probability that a particular examinee is likely to be a guesser.
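A minimal sketch of the response probability in Equation 3.4 may help fix ideas. This is an illustrative re-implementation, not the dissertation's estimation code (which is the WinBUGS program in Appendix A.2), and the example parameter values are invented.

```python
import numpy as np

def p_mixirt_a(theta_i, delta_i, eta_i, a_j, b_j, c_j):
    """P(correct) under the ability-based guessing model (Equation 3.4).

    eta_i = 1 flags examinee i as a guesser; the indicator fires only when
    item j is too difficult for examinee i, i.e., b_j > theta_i + delta_i.
    """
    core = a_j * (theta_i - b_j)
    guess_term = eta_i * float(b_j > theta_i + delta_i) * (core - c_j)
    z = core - guess_term  # reduces to c_j when a guesser meets a too-difficult item
    return np.exp(z) / (1.0 + np.exp(z))

# A guesser (eta = 1) facing an item well above his or her ability:
print(p_mixirt_a(theta_i=-1.0, delta_i=0.2, eta_i=1, a_j=1.2, b_j=1.5, c_j=-1.1))
# The same pairing for a non-guesser (eta = 0) follows the ordinary 2PL curve:
print(p_mixirt_a(theta_i=-1.0, delta_i=0.2, eta_i=0, a_j=1.2, b_j=1.5, c_j=-1.1))
```

With c_j = -1.1 the guesser's success probability is about 0.25, roughly the chance level on a four-option multiple-choice item; that value is an assumption chosen for the example.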
In addition, a simulation study allowed exploration of the impact of guessing behavior on parameter estimation. Hence, in order to evaluate the extent to which the MixIRT model can precisely recover the item parameters using Bayesian estimation, a parameter recovery study was conducted. The precision of parameter estimation was evaluated in terms of bias, RMSE, and correlation between estimated and simulated parameters. As mentioned earlier, the proposed method provides better item parameter recovery when it produces small bias, small RMSE, and high correlation between estimated and simulated parameters. 29 3. 2.1 Simulation Factors or Study Design As mentioned earlier, a simulation study allows the evaluation of how different testing characteristics influence the estimation of mixture model parameters. In the context of this study, it is possible to explore the impact of unobserved test-taking heterogeneity (guessing proportion) on parameter estimation. Typical test characteristics, which are encountered in applied testing situations, include sample size, test length, and proportion of guessing. Taking this into account, this study used the factors listed in the Table 3.1, which are commonly used in parameter recovery studies (e. g., Goldman & Raju, 1986; Hulin et al., 1982; Kim & Cohen, 1998). Table 3.1 Summary of Parameter Recovery Study Factors Factors Levels Sample Size 500, 2000 Test Length 25, 50 proportion of “guessing” 0%, 5%, 10% Estimation model MixIRT, 2PL, 3PL This simulation study used a MixIRT model with simulated random guessing behavior as labeled as MixIRT-R model above. The estimation from 0% guessing serves as a baseline. This study investigated the impact of different guessing proportion (5% and 10%) on parameter estimation. The guessing preportion represents the percentage of examinees who are a guesser in a test. The two-parameter logistic (2PL) model was used for generating data. The performance of MixIRT-R and 2PL model was compared with 30 3PL model because 3PL is commonly used in practice for parameter estimation when guessing behavior is suspected in multiple choice items. Each condition in this study was replicated 15 times. Although this may appear to be too few replications from a fiequentist perspective, this is actually more than the number of replications used in Bayesian IRT-based simulation studies. This reduction in replications is partly a result of the computational intensity of WinBUGS software which can take upto 6 hours to run 25,000 iterations for the item responses with 2000 examinees and 50 items. Examples from the literature have used only five (e.g., Bolt & Lall, 2003) or ten replications (Cao & Stokes, 2008). The general procedures employed to simulate item and ability parameters, and simulation of item responses are presented next. 3. 2.2 Generation of Simulated Parameters and Item Responses The simulation of parameters and item responses was based on typical methods found in IRT literature (Hulin et al., 1982; Kim & Cohen, 1998). Ability parameters were assumed to follow a normal distribution; thus ability parameters were randomly sampled from a standard normal distribution (mean=0, standard deviation=l). Similarly, item discrimination parameters were assumed to follow a lognormal distribution. Thus, discrimination parameters were randomly sampled from a lognorrnal distribution [611' ~ lognorrnal (0,0.3)]. The item difficulty parameters were also assumed to follow a normal distribution. 
Therefore, difficulty parameters were randomly sampled from a normal distribution with a mean of 0 and a standard deviation of 0.7; the standard deviation was set slightly below 1 to avoid items that were too easy or too difficult. The a and b parameters were randomly paired with each other, so any nonzero correlations among the item parameters were attributable to chance. These item parameters may be thought of as simulating an idealistic scenario, or one that a psychometrician using the 2PL model would hope to obtain.

The probability of a correct response to item j by simulated examinee i was then computed using the two-parameter logistic IRT model (Birnbaum, 1968). A response vector of dichotomous item scores for each examinee was obtained by generating, for each item, a uniform random number (between 0 and 1) and comparing it with the probability of an examinee of that ability level passing the item. If the computed probability exceeded the random number, the item was scored as correct (1); otherwise, it was scored as incorrect (0). In order to simulate the guessers, the item responses of a randomly selected 5% or 10% of the examinees were modified so that their response patterns mimicked guessing behavior. The original data with no guessing (labeled the 0% proportion of guessing) served as baseline data for comparative purposes. The estimation of the modified item responses allowed evaluation of the impact of guessing on parameter estimation, and it also showed how the 2PL and 3PL models could not account for test-taking heterogeneity.
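The generation steps just described might be sketched as follows. The seed, the matrix layout, and the exact definition of a guesser's modified responses (purely random responses with a 25% success rate, as on a four-option multiple-choice item) are assumptions for illustration; the text above does not specify those details.

```python
import numpy as np

rng = np.random.default_rng(2009)  # seed is arbitrary

def simulate_data(n_persons=500, n_items=25, prop_guessers=0.10):
    """Simulate 2PL responses, then overwrite a random subset with guessing."""
    theta = rng.normal(0.0, 1.0, n_persons)       # ability ~ N(0, 1)
    a = rng.lognormal(0.0, 0.3, n_items)          # discrimination ~ lognormal(0, 0.3)
    b = rng.normal(0.0, 0.7, n_items)             # difficulty ~ N(0, 0.7)
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # 2PL probabilities
    u = (rng.uniform(size=p.shape) < p).astype(int)       # correct if uniform draw < P
    n_guess = int(prop_guessers * n_persons)
    guessers = rng.choice(n_persons, n_guess, replace=False)
    # Guessers' rows are overwritten with random responses; the 25% success
    # rate is an assumption, not a detail taken from the dissertation.
    u[guessers] = (rng.uniform(size=(n_guess, n_items)) < 0.25).astype(int)
    return u, theta, a, b, guessers

responses, theta, a, b, guessers = simulate_data()
```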
In this study, five diagnostic measures were used to evaluate the sampler performance: (i) Brooks, Gelman, and Rubin (BGR) diagnostic plots; (ii) Monte Carlo errors; (iii) history plots; (iv) autocorrelation plots; and (v) density plots. 33 The Gelman-Rubin convergence statistic R compares the ratio of the pooled chain variance to the within chain variance (Gelman & Rubin, 1992). Once convergence is reached, R converges to 1. WinBUGS plots 3 items; where the Gelman-Rubin statistic is plotted in red, which is preferred to converge to 1. In blue, the average width of the 80% intervals within each individual chain and the width of the 80% interval of the pooled runs is plotted in green. The blue and green lines should stabilize to some number though it is not necessarily required to be 1. Monte Carlo error (MC error) is a measure like the standard error of the mean but adjusted for autocorrelation. Generally, autocorrelations for the MCMC sequence that decay slowly as a function of lag imply poor mixing of the MCMC series and could indicate a high-degree of correlations between the parameters or lack of identification of the model. Finally, history and density plots are also useful to monitor the convergence of estimates. Analysis to evaluate the sensitivity to the initial values and the mixing and convergence of the Gibbs sampler was carried out. The reasonable convergence was reached in each condition by running 3 chains of 25,000 iterations with the first 10,000 discarded as bum-in. For additional replications, however, a single chain of 25,000 iterations was run with the first 10,000 iterations discarded as bum-in period. The estimate of each parameter was based on final 15,000 iterations. 3. 2.4 Evaluation Criteria and Analysis of Sim ulated Data Three commonly used summary statistics were used as evaluation criteria: bias, Root Mean Squared Error (RMSE), and correlation. Before computing the bias and 34 RMSE, the estimated parameters were transformed to the same scale as the true parameters. RMSE is the square root of the average of the squared differences between true and estimated parameters across all the items for item parameters and across all the subjects for the ability parameter. For example, in case of item parameter recovery, the RMSE and Bias for each parameter 1] = a, b are expressed as: 2 J R fi.r—77. RMSE: ZZ( JJ*RJ) (3.5) j=lr=l Bias = i i (fijr — ”j ) (3.6) > where 771' is the true value and 771', is the corresponding estimate. J is the total number of items, and R is the number of replications. It should be noted that for ease of interpretation, the results for all J items were combined across the R replications for each simulation condition. Thus, the bias and RMSE presented in the results section are basically the averages of those values across each simulation condition. Bias index does not indicate in an absolute sense the degree of estimation accuracy. In bias, equal positive and negative errors are cancelled with each other producing a zero bias just as would perfect estimation. The bias then suggests whether there is a systematic tendency to overestimate or underestimate a parameter. A positive bias implies parameter overestimation and a negative bias implies parameter underestimation. 35 The correlation between simulated and estimated parameters was also used as an evaluation criterion because that reflects how well the estimated parameters are correlated with the simulated parameters. 
The Pearson correlation between estimated and simulated parameter values is given by: J __ "=1 r = J (3.7) J _ J 2(fij—fi)2 EVE—732 j=l j=l This study also used classification accuracy as additional criteria to evaluate how well the MixIRT model classified examinees into a model generated class. For example, to evaluate how well the MixIRT model identified the examinees likely to be in the guessers class, the classification accuracy can be expressed in percentage as: Classification Accuracy : Number of guesserszdentzfied correctly X 100 (3.8) Actual number of guessers Since the group membership was modeled as a categorical variable, the median was computed for the estimate. The classification accuracy was computed separately for each group (non-guessers and guessers). Because the sample size was different for different groups, weighted classification accuracy was also computed by averaging the classification accuracy values after weighting by sample size. 3. 2.5 Simulation Study using Mixture IRT Model with Ability-based Guessing Although a large part of the simulation study carried out in this dissertation was described in Section 3.2, the assumption in which guessing was defined might not be 36 realistic in all practical testing situations. Thus, the goal of this second simulation study was to use a MixIRT-A model that modeled a different guessing strategy. Specifically, this model accounted for ability-based guessing, as specified as MixIRT-A earlier. Once again, the objective was to show how the simplicity of the 2PL model failed to account for the heterogeneity in testing populations, and to show how Mixture IRT model can account for such heterogeneity. This simulation study, however, simplified the study design by considering only the simulation condition in which the estimation model is varied for a specific test length and sample size. Specifically, the estimation from the 2PL model was compared with the MixIRT-A model for a test of 40 items administered to 1000 examinees. The next chapter provides a summary of simulated item parameters and the results fiom this analysis. 3.3 Empirical Data Analysis This study used the data from a large scale assessment obtained from a statewide mathematics assessment administered to Fall 2006 Grade 8 students in a Midwestern state. The data was obtained from over 100,000 students. Although the original test also comprised of some constructed response items, this study used the item responses from 54 multiple-choice items only. Due to the longer computational time required for running MCMC analysis in WinBUGS, samples of 1000 randomly selected test-takers were used. These moderate sized samples were used to carry out empirical analysis. The primary objective of this analysis was to demonstrate an application of .the MixIRT model (both MixIRT-R and MixIRT-A) using real data. 37 3.3.1 Analysis of Empirical Data The empirical data analysis started with selecting random samples from the statewide assessment mentioned above. First, two samples of size 1000 were selected randomly. Then, WinBUGS was used to estimate model parameters (item and ability) and the group membership of the examinees. In order to demonstrate the application of the MixIRT model in identifying the guessers and showing the impact of guessing on parameter estimation, this study estimated the ability parameters with or without guessers in the sample. The calibration was performed twice. 
First, the model estimated the ability parameters and identified the examinees likely to be from a guesser class. Then, the model was rerun with those guessers removed. It is important to clarify how an examinee was classified as a guesser in this study. As noted earlier, the probability of an examinee likely to be a guesser was estimated fiom the item response pattern of the examinee. This probability was actually based on the average over a large number of MCMC iterations. If the probability was equal to or greater than 0.5, the examinee was classified as a guesser. The changes in ability parameter estimation were evaluated in terms of proficiency level classification and the difference between the distribution of ability parameters. The percentage of proficient students is a conceptually simple score- reporting metric that became widely used for school accountability decisions under the NCLB Act. In this accountability framework, students are generally classified into four or five different levels based on their performance in a statewide assessment. In most states, there are four proficiency levels: Advanced, Proficient, Basic, and Below Basic. This study also used the same convention to represent the proficiency levels. Based on the 38 ability estimates from a MixIRT model, the distribution of examinees into particular proficiency levels was made as realistic as possible by deriving three cut-scores on the 0- scale that provided the same percentage of examinees into each level reported by this assessment. Evaluation of results from this perspective have potential to provide some policy implications of the findings. This study used the two independent sample Kolmogorov-Smimov test (Kolrnogorov, 1933; Smirnov, 193 9) to evaluate whether the difference in distribution of 0 from the two samples was statistically significant. This nonparametric statistical test is often referred to as distribution free method as it does not rely on assumptions that the data are drawn from a given probability distribution. Specifically, the Kolmogorov- Smirnov test evaluates whether the shapes of the distributions of the two groups are comparable. In order to test the statistical significance of the differences between proficiency levels classified by two samples, a chi-square test was performed. Pearson’s chi-square is the most widely used chi-square test, in which the chi-square statistic is calculated by the difference between each observed and theoretical frequency of each possible outcome. Its formula is given in Equation 3.9 . 2 n Oi-Ei2 Z =Z( E ) (3.9) i=1 ' l 2 where Z is the test statistic that asymptotically approaches a chi-square distribution. 0,- is an observed fi'equency; E i is an expected frequency under the null hypothesis; It is the number of possible outcomes for each event. Pearson’s chi-square statistic is used 39 to test whether or not an observed frequency distribution differs from a theoretical distribution. The next chapter provides the results obtained from the simulation study under both guessing models (MixIRT-R and MixIRT-A models described in this chapter). It also outlines the results from the analysis of empirical data from a statewide large scale assessment. 40 CHAPTER 4 RESULTS This chapter presents findings from the simulation and real data analyses. 
The next chapter provides the results obtained from the simulation study under both guessing models (the MixIRT-R and MixIRT-A models described in this chapter), and it also outlines the results from the analysis of the empirical data from a statewide large-scale assessment.

CHAPTER 4 RESULTS

This chapter presents findings from the simulation and real data analyses. Recall that the primary goal of this study was to explore the feasibility of using mixture IRT (MixIRT) models to estimate the differential performance of examinees in different latent classes in a sample, i.e., sample heterogeneity. To accomplish this goal, a series of simulation factors was investigated in fully crossed designs, including two sample sizes (500 and 2,000 simulees), two test lengths (25 and 50 items), and three proportions of guessing (0%, 5%, and 10%). The estimation of model parameters (item and ability) was compared among three models: MixIRT, 2PL, and 3PL.

This chapter is comprised of five sections. The first section summarizes the descriptive statistics of the simulated item parameters. The second section presents the convergence of the estimates in WinBUGS, because using MCMC sampling for statistical inference requires convergence of the MCMC chain to its stationary distribution. The third section presents the results obtained from the simulation study under the random guessing model, described as MixIRT-R in Chapter Three. The fourth section summarizes the results from the simulation study under the ability-based guessing model, described as MixIRT-A in Chapter Three. The final section outlines the results from the analysis of the empirical data.

4.1 Descriptive Statistics of Simulated Item Parameters

Table 4.1 presents descriptive statistics of the simulated item parameters for both test lengths. Given that these item parameters were randomly selected from specific distributions, the two tests differed slightly in difficulty, with the longer test (n = 50) being slightly easier than the shorter test (n = 25). Since this occurred by chance due to the difference in test lengths, it should not affect the interpretation of the results. The discrimination parameters ranged from 0.588 to 1.758 for the shorter test and from 0.687 to 1.749 for the longer test. The difficulty parameters ranged from -1.896 to 2.086 for the shorter test and from -2.108 to 2.152 for the longer test. These item parameters are similar to those found in many practical assessments and previous studies. To generalize the results from a simulation study to practical settings, the simulated parameters should be as realistic as possible; therefore, extreme values of the a- and b-parameters were avoided in the simulation. A complete list of item parameters is given in Appendix C.1 for the test length of 25 and in Appendix C.2 for the test length of 50.

Table 4.1 Descriptive Statistics for the Simulated Item Parameters

Test Length   Item Parameter   Mean     Standard Deviation   Maximum   Minimum
25            a                 1.030   0.313                1.758      0.588
25            b                -0.120   0.903                2.086     -1.896
50            a                 1.076   0.243                1.749      0.687
50            b                -0.266   0.797                2.152     -2.108

4.2 Evaluation of Parameter Estimate Convergence

Using MCMC sampling for statistical inference requires convergence of the MCMC chain to its stationary distribution. Five diagnostic measures, as described in Chapter Three, were used to evaluate convergence: (i) Brooks, Gelman, and Rubin (BGR) diagnostic plots; (ii) Monte Carlo errors; (iii) history plots; (iv) autocorrelation plots; and (v) density plots. It should be noted that no diagnostic can prove convergence, but together these multiple criteria provide an indication that convergence might have occurred. These criteria may help in evaluating MCMC convergence to ensure that the samples are fairly representative of the underlying stationary distribution of the Markov chain.
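For reference, the basic Gelman-Rubin statistic can be computed from parallel chains as in the sketch below; this is the textbook form of R, not the exact corrected version behind the WinBUGS BGR plot, and the simulated chains are placeholders.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic Gelman-Rubin R statistic for one parameter.

    chains : (m, n) array of m parallel chains, n post-burn-in draws each.
    Values near 1 suggest the chains have mixed and converged.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    B = n * chain_means.var(ddof=1)           # between-chain variance
    var_pooled = (n - 1) / n * W + B / n      # pooled posterior variance estimate
    return np.sqrt(var_pooled / W)

rng = np.random.default_rng(42)
chains = rng.normal(1.8, 0.15, size=(3, 15000))  # e.g., 3 chains for one a-parameter
print(gelman_rubin(chains))                       # approximately 1.00 for well-mixed chains
```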
Figure 4.1 presents BGR diagnostic plots, history plots, autocorrelation plots, and density plots for the discrimination parameter of a randomly selected item estimated using the MixIRT-R model. This item has a true discrimination parameter of 1.757 and an estimated parameter of 1.818. Similar plots for the estimation of the difficulty parameter of a randomly selected item and for the estimation of the ability parameter of a randomly selected examinee are given in Appendix B. These plots were chosen from the dataset in which the guessing percentage was 10% for a sample size of 500 and a test length of 25. This condition was chosen because a small sample size and a short test generally yielded poorer parameter recovery and sometimes produced chains that had difficulty arriving at convergence; evaluating convergence in this condition is therefore likely to capture representative findings from this study.

Figure 4.1 Sample plots for convergence assessment of the discrimination parameter estimate for a[25], chains 1:3 (from top: BGR plot, history plot, autocorrelation plot, density plot)

BGR Plots

The BGR plot shown in Figure 4.1 indicates that the Gelman-Rubin statistic, plotted in red, has converged to 1. The average width of the 80% intervals within each individual chain is plotted in blue, and the width of the 80% interval of the pooled runs is plotted in green. Both the blue and green lines have stabilized, indicating adequate convergence of the chains. It is important to note that the colors shown in the BGR plot may be difficult to distinguish in grayscale prints.

Monte Carlo Error

The Monte Carlo error facilitates the evaluation of convergence by suggesting how long the simulation should be run to ensure adequate convergence. Table 4.2 presents the descriptive statistics of the MixIRT estimates for selected parameters; for convenience of illustration, only the results for the first five items and the first five examinees are shown. As a rule of thumb, the simulation should be run until the Monte Carlo error for each parameter of interest is less than about 5% of the sample standard deviation (Spiegelhalter et al., 2003). From Table 4.2, it is clear that the Monte Carlo error is less than 1/20th of the standard deviation of each estimate, indicating adequate convergence.

History Plots

The history plots in Figure 4.1 suggest that convergence has been achieved, since the three chains essentially overlap each other and cannot be easily differentiated. Furthermore, convergence appears to have been reached well before the burn-in period of 10,000 iterations used in this study.

Table 4.2 Descriptive Statistics of MixIRT Estimates for Selected Parameters

Node    Mean      Standard Deviation   MC Error
a_1      0.9335   0.1462               0.0020
a_2      1.0500   0.1536               0.0023
a_3      0.9025   0.1463               0.0018
a_4      1.1600   0.1711               0.0035
a_5      1.5760   0.2484               0.0052
b_1      0.2472   0.1262               0.0021
b_2      0.1520   0.1164               0.0020
b_3      0.2184   0.1295               0.0021
b_4     -1.1760   0.1808               0.0047
b_5     -1.7730   0.2218               0.0063
θ_1      1.8550   0.5176               0.0041
θ_2      0.5016   0.3885               0.0035
θ_3      0.8399   0.4132               0.0036
θ_4     -0.5894   0.3922               0.0041
θ_5      0.2236   0.3801               0.0032
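The 5% rule of thumb is easy to check mechanically. The sketch below estimates the Monte Carlo error by batch means, one common autocorrelation-adjusted estimator (the batch size is an assumption), and applies the criterion; the simulated draws stand in for an actual chain.

```python
import numpy as np

def mc_error_batch_means(draws, batch_size=50):
    """Monte Carlo error of a chain via batch means (autocorrelation-adjusted)."""
    n_batches = len(draws) // batch_size
    trimmed = draws[: n_batches * batch_size]
    batch_means = trimmed.reshape(n_batches, batch_size).mean(axis=1)
    return batch_means.std(ddof=1) / np.sqrt(n_batches)

def passes_rule_of_thumb(draws):
    """Spiegelhalter et al. (2003): MC error should be < 5% of the posterior SD."""
    return mc_error_batch_means(draws) < 0.05 * draws.std(ddof=1)

rng = np.random.default_rng(7)
draws = rng.normal(0.9335, 0.1462, 15000)  # e.g., a chain for the a_1 row of Table 4.2
print(passes_rule_of_thumb(draws))
```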
This lack of autocorrelation between successive draws indicates satisfactory convergence.

Density Plots

The density plots in Figure 4.1 also suggested convergence of the estimates, because the density resembled the appropriate distribution for a discrimination parameter.

Thus, after evaluating all the diagnostic measures, adequate convergence was concluded. The additional plots given in the appendix also suggested adequate convergence.

4.3 Results of MixIRT-R Model Simulation Analyses

4.3.1 Results from the Parameter Recovery Study

Recall that parameter estimation was evaluated by comparing the estimated model parameters (i.e., discrimination, difficulty, and ability parameters) to the true (simulated) parameters. As mentioned earlier, this study used bias, RMSE, and the correlation between estimated and simulated parameters as evaluation criteria; a sketch of these computations is given below. The results are presented both numerically and graphically. Table 4.3 summarizes the bias and RMSE of the item difficulty parameter (b) estimates, computed as described in Equation 3.6 and Equation 3.5 respectively. Similarly, Table 4.4 summarizes the bias and RMSE of the item discrimination parameter (a) estimates. The bias and RMSE values for the b and a parameters are also plotted separately for test lengths of 25 and 50. Only selected plots are presented here; the remaining plots can be found in Appendix D. Figure 4.2 shows the average bias for recovery of the item difficulty parameters when the test length is 25, whereas Figure 4.3 shows the average bias when the test length is 50. The plots corresponding to the RMSE values for recovery of the item discrimination parameters are shown in Figures 4.4 and 4.5 for test lengths of 25 and 50 respectively. It should be noted that the labels on the x-axis reflect the guessing proportion and sample size; for example, 10P500 indicates a sample size of 500 simulees when the percentage of simulees that were guessing was 10%.

As can be seen in Table 4.3, when the 2PL model was used with a test length of 25 and a sample size of 500, the RMSE of the difficulty estimates increased from 0.129 to 0.152 when the percentage of guessers increased from 0% to 5%, and increased further to 0.192 when the simulated proportion of examinee guessing increased to 10%. Similarly, for the 50-item test with 500 simulees, the RMSE increased from 0.127 to 0.146 for a 5% guessing percentage and to 0.174 for 10% guessing. Both bias and RMSE values were generally lower with the MixIRT-R model than with the 2PL model. However, both bias and RMSE tended to increase for both models when the percentage of guessers increased to either 5% or 10%.

One of the primary objectives in varying study factors like test length and sample size was to evaluate their capacity to recover the stipulated item and person parameters. These results show that smaller bias and RMSE were produced by larger sample sizes. The only exception to this sample size finding occurred with the use of the 2PL model for a 50-item test when there were 2000 simulees. For example, in the condition with a 25-item test and a 5% guessing percentage, the RMSE value dropped from 0.152 to 0.110 with the 2PL model when the sample size was increased from 500 to 2000. Additionally, the RMSE value dropped from 0.141 to 0.080 in the MixIRT-R model estimation when the sample size was increased from 500 to 2000. No clear pattern of results existed for bias when the test length was increased from 25 to 50.
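The three evaluation criteria are simple functions of the true and estimated parameter vectors. The following minimal sketch, with hypothetical variable names rather than the study's actual scripts, shows the computations behind the bias, RMSE, and correlation entries in the tables that follow.

import numpy as np

def recovery_criteria(true_vals, est_vals):
    # Bias (Equation 3.6), RMSE (Equation 3.5), and the Pearson correlation
    # between true (simulated) and estimated parameters for one condition.
    true_vals = np.asarray(true_vals, dtype=float)
    est_vals = np.asarray(est_vals, dtype=float)
    diff = est_vals - true_vals
    bias = diff.mean()
    rmse = np.sqrt((diff ** 2).mean())
    corr = np.corrcoef(true_vals, est_vals)[0, 1]
    return bias, rmse, corr

# For example, recovery_criteria(true_b, est_b) would yield one row of
# Table 4.3 for a single run; the tables report mean values.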
Table 4.3 Bias and RMSE of Item Difficulty Parameter Estimates

                                  0% Guessing        5% Guessing        10% Guessing
IRT      Number    Sample         Proportion         Proportion         Proportion
Model    of Items  Size        Bias     RMSE      Bias     RMSE      Bias     RMSE
2PL      25        500         0.004    0.129     0.058    0.152     0.096    0.192
         25        2000       -0.002    0.069     0.061    0.110     0.103    0.159
         50        500         0.005    0.127     0.056    0.146     0.088    0.174
         50        2000        0.006    0.062     0.063    0.104     0.245    0.252
MixIRT   25        500        -0.012    0.130    -0.036    0.141    -0.058    0.156
         25        2000       -0.008    0.069    -0.030    0.080     0.007    0.156
         50        500        -0.001    0.128    -0.016    0.134    -0.024    0.137
         50        2000        0.004    0.061    -0.014    0.065     0.019    0.070

Table 4.4 Bias and RMSE of Item Discrimination Parameter Estimates

                                  0% Guessing        5% Guessing        10% Guessing
IRT      Number    Sample         Proportion         Proportion         Proportion
Model    of Items  Size        Bias     RMSE      Bias     RMSE      Bias     RMSE
2PL      25        500         0.021    0.144     0.045    0.160     0.080    0.202
         25        2000        0.031    0.079     0.065    0.115     0.097    0.170
         50        500         0.032    0.135     0.054    0.163     0.090    0.202
         50        2000        0.038    0.077     0.074    0.116     0.105    0.168
MixIRT   25        500         0.016    0.144     0.015    0.143     0.020    0.151
         25        2000        0.029    0.079     0.035    0.085     0.095    0.178
         50        500         0.031    0.135     0.028    0.141     0.039    0.147
         50        2000        0.037    0.077     0.042    0.081     0.042    0.086

[Figure 4.2 25-item test average bias results for difficulty parameter estimates]

[Figure 4.3 50-item test average bias results for difficulty parameter estimates]

[Figure 4.4 25-item test average RMSE results for discrimination parameter estimates]

[Figure 4.5 50-item test average RMSE results for discrimination parameter estimates]

Table 4.5 summarizes the average correlations between the true (simulated) and estimated item parameters. These values are presented graphically in Figures 4.6 and 4.7 for the discrimination parameters and in Figures 4.8 and 4.9 for the difficulty parameters. Clearly, larger correlations were associated with larger sample sizes for both the 2PL and MixIRT-R models. The impact of guessing was strong in the recovery of the a parameters for the 2PL model. For example, as shown in Table 4.5, the correlation between true and estimated a parameters dropped from 0.877 to 0.807 with the 2PL model when the proportion of guessers increased from 5% to 10%. The correlations were similar (about 0.9) for both the 2PL and the MixIRT-R model when no guessers were included in the sample, for the condition with a sample size of 500 and a test length of 25.
Table 4.5 Correlations between True and Estimated Item Parameters

                                0% Guessing      5% Guessing      10% Guessing
IRT      Number    Sample       Proportion       Proportion       Proportion
Model    of Items  Size        raa'    rbb'     raa'    rbb'     raa'    rbb'
2PL      25        500         0.909   0.989    0.877   0.985    0.807   0.976
         25        2000        0.972   0.997    0.932   0.993    0.839   0.978
         50        500         0.866   0.985    0.779   0.982    0.665   0.974
         50        2000        0.965   0.997    0.907   0.992    0.770   0.979
MixIRT   25        500         0.909   0.989    0.906   0.988    0.902   0.987
         25        2000        0.971   0.997    0.967   0.996    0.956   0.994
         50        500         0.867   0.985    0.852   0.984    0.842   0.984
         50        2000        0.964   0.997    0.961   0.996    0.956   0.996

Note: raa' is the correlation between true (a) and estimated (a') parameters; rbb' is the correlation between true (b) and estimated (b') parameters.

[Figure 4.6 25-item test average correlations between true and estimated a-parameters]

[Figure 4.7 50-item test average correlations between true and estimated a-parameters]

[Figure 4.8 25-item test average correlations between true and estimated b-parameters]

[Figure 4.9 50-item test average correlations between true and estimated b-parameters]

The results pertaining to the recovery of the item parameters are also displayed using scatterplots in Figures 4.10 to 4.13. These figures show recovery for both the 2PL and MixIRT-R models when the percentage of guessers in the sample was 10%. Figures 4.10 and 4.11 are scatterplots of true and estimated parameters for the condition with a sample size (N) of 500 and a test length (n) of 25 for the 2PL and MixIRT-R models respectively. Similarly, Figures 4.12 and 4.13 present the scatterplots for a sample size of 2000 and a test length of 50 for the 2PL and MixIRT-R models respectively. In a scatterplot, each dot represents the estimated value of a particular parameter for the given value of the true parameter; for perfect recovery, all dots would fall on the identity line. Clearly, consistent with the findings presented earlier, the recovery of the difficulty parameters (b) was better than that of the discrimination parameters (a) in both models. The recovery of both parameters was better in the MixIRT-R model than in the 2PL model.

The results regarding the recovery of the ability (θ) parameters are summarized numerically in Tables 4.6 and 4.7, and graphically in Figures 4.14 and 4.15. Only sample plots are included in this chapter. The results in these tables were similar for both the 2PL and MixIRT-R models, but clearly distinct for the 3PL model.
The findings indicate that guessing did not have a meaningful impact on the correlations between estimated and simulated ability parameters. Specifically, the correlations between estimated and simulated θ parameters were approximately 0.9 or higher.

[Figure 4.10 Recovery of a and b parameters in the 2PL model for a sample size of 500, test length of 25, and 10% proportion of guessers]

[Figure 4.11 Recovery of a and b parameters in the MixIRT model for a sample size of 500, test length of 25, and 10% proportion of guessers]

[Figure 4.12 Recovery of a and b parameters in the 2PL model for a sample size of 2000, test length of 50, and 10% proportion of guessers]

[Figure 4.13 Recovery of a and b parameters in the MixIRT model for a sample size of 2000, test length of 50, and 10% proportion of guessers]

Table 4.6 Bias and RMSE of Ability Parameter Estimates for all Simulation Conditions

                                  0% Guessing        5% Guessing        10% Guessing
IRT      Number    Sample         Proportion         Proportion         Proportion
Model    of Items  Size        Bias     RMSE      Bias     RMSE      Bias     RMSE
2PL      25        500        -0.004    0.402    -0.058    0.404    -0.096    0.414
         25        2000        0.002    0.404    -0.061    0.408    -0.103    0.417
         50        500        -0.005    0.290    -0.056    0.296    -0.088    0.307
         50        2000       -0.006    0.292    -0.063    0.297     0.011    0.304
3PL      25        500        -0.315    0.508    -0.315    0.506    -0.316    0.506
         25        2000       -0.212    0.451    -0.229    0.458    -0.242    0.468
         50        500        -0.294    0.411    -0.297    0.411    -0.295    0.409
         50        2000       -0.214    0.358    -0.222    0.356     0.000    0.307
MixIRT   25        500         0.012    0.404     0.036    0.417     0.058    0.429
         25        2000        0.008    0.404     0.030    0.415    -0.007    0.439
         50        500         0.001    0.291     0.016    0.303     0.024    0.312
         50        2000       -0.004    0.293     0.014    0.306     0.000    0.316

Table 4.7 Correlations between Simulated and Estimated Ability Parameters for all Simulated Conditions

                                0% Guessing   5% Guessing   10% Guessing
IRT      Number    Sample      Proportion    Proportion    Proportion
Model    of Items  Size        rθθ'          rθθ'          rθθ'
2PL      25        500         0.910         0.910         0.909
         25        2000        0.913         0.913         0.912
         50        500         0.955         0.954         0.953
         50        2000        0.955         0.955         0.952
3PL      25        500         0.909         0.909         0.908
         25        2000        0.912         0.912         0.911
         50        500         0.953         0.953         0.953
         50        2000        0.954         0.955         0.953
MixIRT   25        500         0.909         0.898         0.889
         25        2000        0.913         0.906         0.902
         50        500         0.954         0.948         0.943
         50        2000        0.955         0.949         0.945

[Figure 4.14 25-item test average RMSE results for ability parameter estimates]

[Figure 4.15 50-item test average RMSE results for ability parameter estimates]
4.3.2 Classification Accuracy of the MixIRT-R Model

As noted previously, one of the purposes of this study was to investigate the accuracy with which Bayesian estimation of the MixIRT model can correctly identify guessers in a sample. Specifically, the goal was to evaluate the accuracy of classifying examinees into guesser and non-guesser groups. This research purpose can be addressed only through a simulation study, because in a real assessment the true class membership is unknown. Therefore, using estimates from the parameter recovery study described earlier, classification accuracy was ascertained by the extent to which simulees were correctly categorized as guessers or non-guessers.

Table 4.8 provides results of weighted and unweighted classification accuracy for different guessing proportions when using the MixIRT-R model (a sketch of this computation appears below). Overall, the classification accuracy was over 98% for the non-guessing group and ranged from about 76% to 85% for the guessing group. The weighted classification accuracy, computed by weighting the group accuracies by the associated group sizes, was 97.20% when the sample size was 500 and the guessing proportion was 10%. This accuracy increased to 98.06% when the sample size increased from 500 to 2000 simulees. Similarly, when the proportion of guessing was 5%, the weighted classification accuracies were 96.92% and 98.00% for sample sizes of 500 and 2000 respectively. Interestingly, both the classification accuracy and the weighted classification accuracy were 100% when no guessers were present (labeled as 0%).

Table 4.8 Classification Accuracy in MixIRT-R Model

Proportion    Sample   True Class   Averaged Estimated   Classification   Weighted Classification
of Guessing   Size     (Group*)     Guesser %            Accuracy %       Accuracy %**
10%           500      NG            1.22                 98.78            97.20
                       G            83.00                 83.00
              2000     NG            1.27                 98.73            98.06
                       G            85.20                 85.20
5%            500      NG            0.76                 99.24            96.92
                       G            76.00                 76.00
              2000     NG            0.96                 99.04            98.00
                       G            78.20                 78.20
0%            500      NG            0                   100              100
                       G            NA                    NA
              2000     NG            0                   100              100
                       G            NA                    NA

*NG = non-guessers, G = guessers
**Weighted by sample size

4.4 Results from Simulation Analyses using the MixIRT-A Model

As noted previously, the guessing factor may not be easy to model in practice, and hence the only way to illustrate it is through a simulation study. The goal of this second simulation study was to use a MixIRT model to incorporate a different guessing strategy (i.e., the assumption of ability-based guessing), which can be modeled using the MixIRT-A model of Chapter 3. This second simulation study therefore shows how the 2PL model is limited in its parameter estimation accuracy because it cannot account for sample heterogeneity. However, this simulation design was simplified by considering only conditions in which the estimation model was varied for a specific test length and sample size. Specifically, estimation results using the 2PL and MixIRT-A models for 40-item tests administered to 1000 examinees were compared.

Table 4.9 summarizes descriptive statistics of the simulated item parameters used in the second simulation study. The a parameters ranged from 0.633 to 1.897, with a mean of 1.015 and a standard deviation of 0.274. The b parameters ranged from -2.274 to 1.945, with a mean of 0.093 and a standard deviation of 0.855. A complete list of item parameters is given in Appendix C.3.
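For reference, the classification and accuracy computations summarized in Table 4.8 above can be sketched as follows. This is an illustrative reconstruction with hypothetical variable names, not the study's code: an examinee is assigned to the guesser class when the posterior probability of membership, i.e., the proportion of retained MCMC iterations in which the sampled class indicator equals the guesser label, is at least 0.5 (the rule described in Chapter 5), and the group-level accuracies are then weighted by group size.

import numpy as np

def classify_and_score(G_draws, true_class, threshold=0.5):
    # G_draws: (n_iterations, n_examinees) sampled class labels after
    # burn-in, coded as in the WinBUGS model (1 = guesser, 2 = non-guesser).
    # true_class: simulated class labels, same coding.
    p_guess = (np.asarray(G_draws) == 1).mean(axis=0)   # posterior P(guesser)
    est_class = np.where(p_guess >= threshold, 1, 2)
    true_class = np.asarray(true_class)
    acc, weighted, n = {}, 0.0, len(true_class)
    for label, name in [(1, "guesser"), (2, "non-guesser")]:
        members = true_class == label
        if members.any():
            acc[name] = (est_class[members] == label).mean()
            weighted += acc[name] * members.sum() / n
        else:
            acc[name] = None          # e.g., the 0%-guessing conditions
    acc["weighted"] = weighted        # group accuracies weighted by group size
    return acc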
Table 4.9 Descriptive Statistics of Simulated Item Parameters in MixIRT-A Model

Item Parameter   Mean    Standard Deviation   Maximum   Minimum
a                1.015   0.274                1.897      0.633
b                0.093   0.855                1.945     -2.274

The same five diagnostic measures used in the first simulation study were used to evaluate the convergence of the estimates. The recovery of item and ability parameters was evaluated using RMSE and the correlations between estimated and simulated parameters. The results from this simulation study are presented in Tables 4.10 to 4.14 and Figures 4.16 to 4.18.

Table 4.10 RMSE of Discrimination and Difficulty Parameter Estimates using MixIRT-A Model

             No Guessers        Guessers
IRT Model    a        b         a        b
2PL          0.100    0.096     0.187    0.199
MixIRT       0.102    0.097     0.133    0.084

[Figure 4.16 RMSE of discrimination parameter estimates in MixIRT-A model]

The recovery of the discrimination and difficulty parameters indicates that both the 2PL and MixIRT-A models produced comparable results when no guessers were present, i.e., when no heterogeneity existed. However, when some simulees were simulated as guessing on items that were likely to be difficult for their given ability level, the MixIRT-A model outperformed the 2PL model. This was reflected by smaller RMSE and larger correlations between estimated and simulated item parameters. Recovery of the difficulty parameters was better than that of the discrimination parameters, and guessing had a large impact on the discrimination parameter estimates. For example, in the presence of guessing, the correlation between estimated and simulated discrimination parameters dropped from 0.949 to 0.705 in the 2PL model. However, guessing did not have much impact on the recovery of the difficulty parameters: the correlations between true and estimated parameters remained fairly high, with values greater than 0.98 in both models.

[Figure 4.17 RMSE of difficulty parameter estimates in MixIRT-A model]

Table 4.11 Correlation of Discrimination and Difficulty Parameter Estimates using MixIRT-A Model

             No Guessers      Guessers
IRT Model    raa'    rbb'     raa'    rbb'
2PL          0.949   0.993    0.705   0.988
MixIRT       0.948   0.994    0.886   0.995

The recovery of the ability parameters was also evaluated in terms of RMSE and correlations. Table 4.12 and Figure 4.18 show the recovery of the ability parameter estimates. In the case of the 2PL model, the RMSE increased from 0.325 to 0.411 in the presence of guessing. However, the increase in RMSE for the MixIRT-A model was small, from 0.326 to 0.343. When guessing was allowed, the correlation decreased from 0.942 to 0.917 in the 2PL model and from 0.942 to 0.929 in the MixIRT-A model.

Table 4.12 RMSE of Ability Parameter Estimates in MixIRT-A Model

IRT Model    No Guessers Mean RMSE   Guessers Mean RMSE
2PL          0.325                   0.411
MixIRT       0.326                   0.343
[Figure 4.18 RMSE of ability parameter estimates in MixIRT-A model]

Table 4.13 Correlation of Ability Parameter Estimates in MixIRT-A Model

IRT Model    No Guessers   Guessers
2PL          0.942         0.917
MixIRT       0.942         0.929

As mentioned earlier, classification accuracy is an important criterion for evaluating the degree to which the proposed MixIRT-A model accurately classifies examinees into their true (simulated) class or group. Table 4.14 provides weighted and unweighted classification accuracy results for the MixIRT-A model. The results indicate that this model correctly identified 63.12% of the guessers, which indicates a lack of power in identifying guessers. Misclassifications also occurred in the other direction: 7.52% of the non-guesser class were incorrectly classified as guessers. Similarly, even for a sample with no guessers, the model incorrectly classified 3% of the examinees as guessers.

Table 4.14 Classification Accuracy of MixIRT-A Model

             True Class           Average Estimated   Classification   Weighted Classification
             (Group*)      N      Guesser %           Accuracy %       Accuracy %**
Guessing     NG            748     7.52               92.48            85.08
Allowed      G             252    63.12               63.12
No           NG           1000     3.00               97.00            97.00
Guessing     G               0    NA                  NA

*NG = non-guessers, G = guessers
**Weighted by sample size

4.5 Results from Empirical Data Analysis

To address the fourth research question, for which the goal was to investigate the impact of excluding aberrant item responses (from guessers) on proficiency level classification, real data from a statewide mathematics assessment were used. Since guessing behavior can only occur on multiple-choice items, the analyses were conducted on examinee responses to 54 multiple-choice items. Because of the extensive MCMC computational time, only two randomly selected samples of size 1000 were used in a cross-validation. The first sample is referred to as the training sample and the second as the validation sample. First, the results based on the random guessing model (MixIRT-R) are presented in section 4.5.1. The analysis was also carried out using the MixIRT model with ability-based guessing (MixIRT-A) so as to compare the classification of examinees into guessers and non-guessers; those results are presented in section 4.5.2.

4.5.1 Results Based on the Random Guessing Model

Tables 4.15 and 4.16 present sample WinBUGS output, particularly highlighting the estimates of class membership. The node in these tables refers to the variable monitored in WinBUGS; PI[1] and PI[2] refer to the classes corresponding to guessers and non-guessers respectively. Interestingly, the estimates were similar for both samples, showing that about four to five percent of examinees were likely to belong to the guesser class in this particular assessment. The 95% credible intervals around the estimates, and the fact that the MC error is less than 1/20th of the standard deviation, indicate that these estimates are fairly precise (a sketch of these summary computations is given below).

Table 4.15 MixIRT-R Estimates for Training Sample

Node    Mean    Standard Deviation   MC Error*   2.50%   Median   97.50%
PI[1]   0.044   0.008                <0.001      0.029   0.044    0.062
PI[2]   0.956   0.008                <0.001      0.938   0.956    0.971

*MC error: Monte Carlo error

Table 4.16 MixIRT-R Estimates for Validation Sample

Node    Mean    Standard Deviation   MC Error    2.50%   Median   97.50%
PI[1]   0.050   0.009                <0.001      0.034   0.050    0.068
PI[2]   0.950   0.009                <0.001      0.932   0.950    0.966

The estimates of the guessing probability for each examinee also produced very similar results for both samples.
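The quantities reported in Tables 4.15 and 4.16 (posterior mean, standard deviation, MC error, and the 2.5th, 50th, and 97.5th percentiles) can be reproduced from the retained draws of any monitored node. The sketch below is illustrative only; it uses a batch-means estimate of the Monte Carlo error, whereas WinBUGS's own estimator may differ in detail.

import numpy as np

def node_summary(draws, n_batches=50):
    # Posterior summary for one monitored node, cf. Tables 4.15 and 4.16.
    draws = np.asarray(draws, dtype=float)
    sd = draws.std(ddof=1)
    usable = len(draws) - len(draws) % n_batches      # trim to equal batches
    batch_means = draws[:usable].reshape(n_batches, -1).mean(axis=1)
    mc_error = batch_means.std(ddof=1) / np.sqrt(n_batches)
    lo, med, hi = np.percentile(draws, [2.5, 50.0, 97.5])
    return {"mean": draws.mean(), "sd": sd, "mc_error": mc_error,
            "2.5%": lo, "median": med, "97.5%": hi,
            "mc_rule_ok": mc_error < 0.05 * sd}       # the 1/20th-of-SD rule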
Based on the group membership estimate for each examinee, the numbers of guessers identified by the MixIRT-R model were 40 and 41 in the training and validation samples respectively.

Three cut-scores on the θ scale, with values of -1.08, -0.53, and 0.39, were used for categorizing examinees into four proficiency levels. As mentioned in the previous chapter, these cut-scores were chosen in such a way that the proportion of examinees at each proficiency level in the current sample matched that obtained from the actual statewide assessment. In order to evaluate the impact of removing guessers from parameter estimation, the guessers identified by the MixIRT-R model were removed from the sample and the model parameters were estimated again. The results presented below are summarized for the same set of examinees, i.e., only non-guessers, before and after removing the guessers from the calibration.

Table 4.17 Distribution of Proficiency Levels in Original and Modified Training Sample

              Original proficiency level     Modified proficiency level
              Frequency    Percent           Frequency    Percent
Advanced      280          29.17             281          29.27
Proficient    373          38.85             367          38.23
Basic         243          25.31             234          24.38
Below Basic    64           6.67              78           8.13
Total         960         100                960         100

A closer look at these results does not indicate any noticeable difference in the proportions of examinees at each proficiency level before and after removing the guessers identified by the MixIRT-R model. For example, the percentage of examinees that were classified as proficient (proficient or advanced) changed only slightly, from 68.02 to 67.50.

Table 4.18 summarizes the distribution of proficiency levels for the validation sample. The results from this sample were fairly similar to those obtained for the training sample. There was a small difference between the original and modified classifications as proficient (proficient or advanced), as indicated by a change from 68.20% to 68.30%.

Table 4.18 Distribution of Proficiency Levels in Original and Modified Validation Sample

              Original proficiency level     Modified proficiency level
              Frequency    Percent           Frequency    Percent
Advanced      281          29.30             283          29.51
Proficient    373          38.89             372          38.79
Basic         240          25.03             211          22.00
Below Basic    65           6.78              93           9.70
Total         959         100                959         100

Testing the statistical significance of these differences provides useful information for evaluating the meaningfulness of the sample differences. As noted in Chapter 3, one way of comparing the ability parameter frequency distributions was to use the Kolmogorov-Smirnov test. In addition to this test, a chi-square test was performed to test the statistical significance of the differences between the proficiency-level classifications in the two samples. Table 4.19 provides the Kolmogorov-Smirnov test results.

Table 4.19 Test Statistics from Two-sample Kolmogorov-Smirnov Test

                                        Training sample     Validation sample
                                        θ-Distributions     θ-Distributions
Most Extreme   Kolmogorov-Smirnov Z     0.456               0.708
Differences    Asymp. Sig. (2-tailed)   0.985               0.698

The Kolmogorov-Smirnov test results in Table 4.19 suggest that the difference between the two distributions is not statistically significant. Similarly, the results from the chi-square test suggest that the difference between the original and modified proficiency levels is not significant for the training sample (χ² = 1.60, df = 3, p = 0.66). The chi-square test also suggested no significant difference for the validation sample (χ² = 6.83, df = 3, p = 0.08). Both tests are illustrated in the sketch below.
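Both tests can be reproduced with standard statistical software. The sketch below is illustrative and uses SciPy rather than the package used in the study; note that SciPy's ks_2samp reports the D statistic, while the dissertation reports the SPSS-style Kolmogorov-Smirnov Z, so the test statistics differ in scale even though the conclusions are comparable. The ability vectors here are placeholders for the estimates before and after removing guessers.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta_original = rng.normal(0.0, 1.0, 960)   # placeholder: theta estimates, original calibration
theta_modified = rng.normal(0.0, 1.0, 960)   # placeholder: theta estimates, guessers removed

# Two-sample Kolmogorov-Smirnov test on the ability distributions
d_stat, ks_p = stats.ks_2samp(theta_original, theta_modified)

# Chi-square test on the 4 x 2 table of proficiency-level frequencies
# (rows: Advanced, Proficient, Basic, Below Basic; columns: original, modified)
table = [[280, 281], [373, 367], [243, 234], [64, 78]]   # Table 4.17 counts
chi2, p, dof, expected = stats.chi2_contingency(table)   # dof = 3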
In an attempt to map the characteristics of the examinees classified into the guesser class by this analysis, no specific conclusions could be made in terms of gender and ethnicity. The only variable that seemed related to guessing was economic disadvantage (ED), a measure of socio-economic status operationalized by free or reduced-price lunch status. That is, examinees with ED = 1 were more likely to be classified as guessers than examinees with ED = 0.

4.5.2 Results Based on the Ability-based Guessing Model

The results based on the MixIRT-A model are presented for both the training and validation samples. This model identified 7% of examinees as guessers in the training sample and 10% in the validation sample. Interestingly, among the 70 examinees in the training sample and 100 examinees in the validation sample that were classified as guessers by this model, 36 and 37 respectively were also classified as guessers by the previous model (MixIRT-R). This result is presented in Figure 4.19.

[Figure 4.19 Number of examinees identified as guessers by the MixIRT-R and MixIRT-A models in the training and validation samples]

Table 4.20 presents the distribution of proficiency levels in the original and the modified training sample. Interestingly, the proportion of examinees who were proficient (proficient or advanced) decreased from 68.17% (original sample) to 63.87% (modified sample). This is a large change in proficiency level.

Table 4.20 Distribution of Proficiency Levels in Original and Modified Training Sample

              Original proficiency level     Modified proficiency level
              Frequency    Percent           Frequency    Percent
Advanced      262          28.17             240          25.81
Proficient    372          40.00             354          38.06
Basic         233          25.05             240          25.81
Below Basic    63           6.77              96          10.32
Total         930         100                930         100

Table 4.21 presents the corresponding distributions for the validation sample. Here the proportion of examinees who were proficient (proficient or advanced) decreased from 68.11% (original sample) to 61.56% (modified sample), which likewise shows a large change in proficiency level.

Table 4.21 Distribution of Proficiency Levels in Original and Modified Validation Sample

              Original proficiency level     Modified proficiency level
              Frequency    Percent           Frequency    Percent
Advanced      259          28.78             215          23.89
Proficient    354          39.33             339          37.67
Basic         226          25.11             253          28.11
Below Basic    61           6.78              93          10.33
Total         900         100                900         100

Table 4.22 Test Statistics from Two-sample Kolmogorov-Smirnov Test

                                        Training sample     Validation sample
                                        θ-Distributions     θ-Distributions
Most Extreme   Kolmogorov-Smirnov Z     1.322               1.721
Differences    Asymp. Sig. (2-tailed)   0.061               0.005

The statistical tests of the differences between the original and modified distributions suggested mixed findings at the α-level of 0.05. For the training sample, the Kolmogorov-Smirnov test was non-significant (Z = 1.322, p = 0.061). However, for the validation sample, it showed a significant difference (Z = 1.721, p = 0.005). In the chi-square test, the differences in proficiency levels (cell frequencies) between the original and modified samples were statistically significant for both the training and validation samples: χ² = 8.363, df = 3, p = 0.039 for the training sample, and χ² = 12.58, df = 3, p = 0.006 for the validation sample.

The next chapter provides discussion and conclusions for this study. It summarizes the results, interprets those findings, and lists some implications of those results.
CHAPTER 5

DISCUSSION AND CONCLUSIONS

The primary goal of this study was to explore the effectiveness of mixture IRT (MixIRT) models in estimating the differential performance of latent classes in a sample (i.e., sample heterogeneity). The variables (e.g., guesser or non-guesser status) used for classifying examinees are referred to as sources of heterogeneity. When the sources of test-taking heterogeneity are unobservable (e.g., examinees' tendency to guess), so that group membership has to be inferred from the data, unobserved test-taking heterogeneity is said to exist. Therefore, in this study the MixIRT model was used to investigate different examinee test-taking behaviors through a simulation study that varied (a) sample size, (b) test length, and (c) proportion of guessing. These factors were selected because they were thought to be relevant in many testing applications, such as item pool design, IRT-based test bank development, and pre-equating, where precision of parameter estimation is paramount. Furthermore, varying these factors allowed study of the extent to which differing degrees of test-taking heterogeneity influence model parameter estimation, particularly for different test lengths and sample sizes.

Given that MixIRT models are an extension of IRT models, their parameter estimation is complicated by the intractability of the mathematical forms when frequentist techniques are used. Therefore, Bayesian estimation was used instead, because it can handle high-dimensional problems and because the distributions of the parameters can be characterized regardless of the forms of the likelihood and the prior distributions. Through a simulation study, the precision of parameter estimation was evaluated in the MixIRT model for various realistic testing factors.

As mentioned in Chapter Three, this study used two forms of the MixIRT model to incorporate different guessing strategies, viz. MixIRT-R and MixIRT-A. Considering the extensive computational time required for the MCMC procedures of the Bayesian methods that were used, only two levels of test length and sample size were considered. Since the impact on model parameter estimation of unobserved test-taking heterogeneity, represented in this study as a proportion of guessers, was the primary factor of interest, the proportion of guessers per sample was varied. Two percentages of guessing were used, representing 5% and 10% of the total examinees as guessers. Data with no guessers, represented as a 0% guessing proportion, were used as a baseline against which to compare the results.

Another purpose of this study was to compare the parameter estimation accuracy of the MixIRT model to that of two commonly used IRT models: the 2PL and 3PL models. A parameter recovery study was used to conduct this comparison, which was carried out by varying the three estimation models (i.e., 2PL, 3PL, and MixIRT) across all the study factors in a fully crossed design. The precision of parameter recovery was evaluated based on three commonly used evaluation criteria: bias, RMSE, and the Pearson correlation. The interpretation of the results was based on both numeric and graphic representations.

The study's third objective was to evaluate the accuracy of MixIRT Bayesian estimation in identifying guessers when guessers were present in a sample. For this purpose, the MixIRT model estimated the probability that each examinee belonged to the latent class of guessers or of non-guessers.
The model's classification accuracy, which indicates the extent to which simulees are correctly categorized as guessers or non-guessers, was then evaluated. As noted earlier, the probability that an examinee was a guesser was estimated from the examinee's item response pattern, with the probability based on the average over a large number of MCMC iterations. In this study, an examinee was classified as a guesser if that probability was equal to or greater than 0.5.

The study's final purpose was to investigate the impact of excluding aberrant guessing responses on examinee proficiency level classification. In other words, the ability continuum was divided into four levels so that the impact could be studied in terms of proficiency classification. For the proficiency classification, real data were used as a further illustration of the MixIRT model's usefulness. This goal has potential for contributing to a better understanding of issues pertaining to cut-score variation and its policy implications.

It is important to clarify that this study does not suggest that guessing is a bad thing from a student's perspective, especially in circumstances such as when there is no penalty for guessing or when examinees run out of time. However, from the measurement or psychometric point of view, guessing introduces construct-irrelevant variance, which is a major concern in validity studies. Therefore, the objective of this study was to document the impact of guessing on parameter estimation and, thereby, on proficiency level classification. In simple terms, the practical example illustrated in this dissertation was similar to using a correction for guessing to obtain a corrected distribution. The goal was thus to illustrate how the proposed mixture modeling approach has the potential to address this important issue encountered in many large scale assessments.

5.1 Interpretations of the Results

5.1.1 Results from the Parameter Recovery Study

As noted previously, one of this study's major goals was to evaluate the accuracy of the parameter estimates by comparing them to the true (simulated) parameters. The results, presented numerically in Tables 4.3 and 4.4 and graphically in Figures 4.2 to 4.5, show that both the bias and RMSE values for the discrimination and difficulty parameters are generally lower in MixIRT-R model estimation than in the 2PL model. When no guessers were present in the sample, the bias and RMSE values were similar for the MixIRT-R and 2PL models. The low values of these indices show that the parameters are estimated reasonably well when no aberrant responses are present in the data. However, the bias and RMSE values tended to be higher for both models when the proportion of guessers in the sample increased to 5% or 10%. This suggests that the aberrant responses of even 10% of examinees have a substantial impact on the precision of item parameter estimation. Since commonly used IRT models (e.g., 1PL, 2PL, 3PL) are not designed to handle test-taking heterogeneity, alternate modeling approaches are necessary. A mixture model provides such an avenue by allowing different latent classes to have their own sets of model parameters.

One of the primary objectives of varying study factors like test length and sample size was to evaluate their capacity to recover the stipulated item and person parameters. No clear interpretation could be drawn from the available evidence about the impact of test length on bias and RMSE.
Moreover, as other studies have also shown, larger sample sizes resulted in smaller bias and RMSE. This was, however, not the case for the 2PL model when the test length was 50 and the sample size was 2000. These findings play an important role in judging the quality of IRT-based test banks and pre-equating used in large scale assessments.

The average correlations between true (simulated) and estimated item parameters, presented in Table 4.5 and Figures 4.6 to 4.9, show the recovery of the item discrimination and item difficulty parameters. Stronger correlations were associated with larger sample sizes for both the 2PL and MixIRT-R models. This finding is also consistent with the literature on IRT parameter recovery. The impact of guessing was profound in the recovery of the item discrimination parameters for the 2PL model. For example, the correlation between true and estimated a parameters decreased from 0.877 to 0.807 with the 2PL model when the proportion of guessers increased from 5% to 10%. The correlations were similar for both the 2PL and MixIRT-R models when no guessers were included in the sample. This suggests that when unobserved test-taking heterogeneity is absent (i.e., no guessers are present in the sample), it may not be necessary to use complex models like the MixIRT. Nevertheless, this situation may not arise in practice, as guessing is widely known to occur in many large scale assessments. Overall, the difficulty parameters had better recovery than the discrimination parameters. This is consistent with findings from earlier research, which showed that discrimination parameters are usually estimated more poorly than difficulty parameters.

Person parameter recovery results are summarized in Tables 4.6 and 4.7, and sample plots of the parameter recovery results are presented in Figures 4.14 and 4.15. The 2PL and MixIRT-R bias and RMSE values were similar, suggesting that ability parameter recovery was fairly similar for both models. However, among the three models compared in this study, the 3PL model performed the worst, as indicated by large bias and large RMSE. One possible reason for this poor performance could be the type of guessing behavior introduced in this simulation: here, guessing was defined as examinee behavior and estimated as a person parameter using a probabilistic model, whereas studies in the IRT framework that use the 3PL model generally simulate data by associating guessing with the items through the c parameter. In addition, c parameters are often recovered very poorly (Martin et al., 2006; Pelton, 2002), because they are estimated as lower asymptotes based on a small number of examinees. In this study, the model fit index, the Deviance Information Criterion (Spiegelhalter, Best, Carlin, & van der Linde, 2002), showed that the fit of the 2PL model was better than that of the 3PL model even when a 10% guessing proportion was present. Finally, the true (simulated) parameters were generated based on the 2PL model, and the introduction of guessers might have had a noticeable impact that could not be captured by the 3PL model.

Furthermore, based on the correlations between estimated and simulated ability parameters, the results were fairly similar for all three models and the magnitude of the correlations was generally strong. This indicates that the influence of unobserved test-taking heterogeneity was more noticeable in item parameter estimation than in person parameter estimation.
This finding suggests that the proposed mixture modeling approach is most appropriate in applications where precise estimation of item parameters is paramount, such as pre-equating and IRT-based item banking or item pool development.

5.1.2 Results on Classification Accuracy

To investigate the accuracy of Bayesian MixIRT model estimation in correctly classifying guessers, this study evaluated the results using an index called classification accuracy. Table 4.8 shows the classification accuracy when using the MixIRT model. The classification accuracy was over 98% for membership in the non-guesser group and ranged from about 76% to 85% for membership in the guesser group. In terms of weighted classification accuracy, the MixIRT model performed well in classifying examinees into the groups where they belonged, as reflected by accuracies of 96.92% or higher in all simulated conditions. The classification accuracy was 100% for the conditions in which there were no guessers in the sample. This finding suggests that MixIRT models can be used even in the absence of unobserved test-taking heterogeneity. However, due to the complexity of mixture models and the costs associated with estimating a large number of parameters, there is no advantage to using the MixIRT model when no guessers are present.

5.1.3 Results from the Empirical Study

Finally, results from the real data example are presented in Tables 4.15 and 4.16. The MixIRT-R model identified that nearly 5% of examinees were likely to be guessers in this sample. The precision of these estimates is reflected in the 95% credible intervals around the estimates and the fact that the MC error is less than 1/20th of the standard deviation.

The impact of excluding guessers from parameter estimation was also expressed in terms of classification into proficiency levels. As mentioned in Chapter 3, this study used four proficiency levels: Advanced, Proficient, Basic, and Below Basic, which are commonly used in the current test-based accountability system under NCLB. To evaluate the degree to which the proficiency level classifications differed between the two samples, with and without the guessers, a chi-square test was performed to compare whether the proportions of students at each proficiency level differed between the two samples. In addition, comparison of the two ability (θ) estimate distributions using a two-sample Kolmogorov-Smirnov test showed that the differences were not statistically significant for either sample. This suggests that the specified MixIRT model did not find guessing to be a potential cause of observed differences in proficiency classification for this assessment. Furthermore, the influence of a small proportion of guessers in the sample, i.e., less than 5%, did not have much effect on parameter estimation and decisions regarding proficiency level classification. It is possible that, for an assessment in which more examinees engage in guessing, the impact on parameter estimation, as well as on proficiency level classification, could be noticeable.

The impact of guessing was noticeable in the analysis using the MixIRT-A model. This model identified 7% of examinees as guessers in the training sample and 10% in the validation sample. Interestingly, among the 70 examinees in the training sample and 100 examinees in the validation sample that were classified as guessers by this model, 36 and 37 respectively were also classified as guessers by the previous model, i.e., MixIRT-R. This shows that the two models are related.
Naturally, the ability-based guessing model is expected to identify more guessers than the random-guessing model.

As before, the impact of excluding guessers from the sample was evaluated by finding the differences in the proficiency level and ability distributions of examinees before and after removing the guessers from the calibration. Interestingly, the proportion of examinees who were proficient (proficient or advanced) changed from the original to the modified sample for both the training and validation samples. For example, the percent proficient changed from 68.17 to 63.87 in the training sample, showing a noticeable impact of removing guessers from the calibration. Similarly, for the validation sample, Table 4.21 showed that the proportion classified as proficient (proficient or advanced) changed from the original to the modified sample; the change from 68.11% to 61.56% likewise indicates a noticeable impact on proficiency level classification.

In terms of inferential statistics, the tests of the differences between the two distributions in the original and modified samples had mixed findings at the α-level of 0.05. The Kolmogorov-Smirnov Z shows a non-significant result for the training sample but a significant difference for the validation sample. For the chi-square test, the differences in proficiency levels between the original and modified samples were statistically significant for both samples. This may have large ramifications from a policy perspective, because even a few percentage points of change in proficiency levels receive attention from teachers, school administrators, and policy makers. In this context, it may be prudent to locate a cut-score on a region of the continuum where few examinees are situated, so that a change in the cut-score would result in negligible changes in proficiency classification. Changes in students' proficiency estimates are also of interest to a wider audience that includes parents, teachers, school administrators, educational researchers, and policy makers. This places a unique responsibility on educational researchers and psychometricians to answer the question of whether changes in students' proficiency estimates are associated with actual improvement in their ability or with measurement or scaling issues.

5.2 Study Limitations

Like any simulation study, there are questions regarding the generalizability of the findings to real testing situations. Utmost care was taken to ensure that the simulated conditions matched practical settings. However, due to the limited flexibility of modeling in the program used for MCMC sampling in this study, i.e., WinBUGS, it was not possible to take full advantage of Bayesian inference. For example, a user has limited control over the sampling procedures implemented in WinBUGS. It was not possible to compute the DIC value for the mixture model in order to compare model fit statistics; therefore, the model fit evaluation of the mixture model was limited to a likelihood ratio test.

The findings documented in this study were based on 15 replications, which may also raise a question about the generalizability of the findings. However, this decision was made due to the slow performance of WinBUGS. It should be noted that parameter estimation using Gibbs sampling requires a substantial amount of time, especially when estimated using WinBUGS.
In this study, the required computing time for each dataset varied anywhere from 1 hour to 6 hours of computer time (on an Intel Centrino Duo processor at 1.66 GHz with 2 GB RAM), depending upon the sample size and test length. Therefore, this study used a limited number of levels within each of the simulated factors.

The MCMC estimation was performed using WinBUGS. It should therefore be noted that the results obtained may not be due solely to the theoretical differences between the models and study factors, but may also be related to how the software implements the MCMC methods. In other words, use of alternative software or estimation methods could potentially lead to different results. Thus, comparing the results obtained from the Bayesian estimation method implemented in WinBUGS with those from other programs, such as BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2003) or Mplus (Muthén & Muthén, 1998-2007), may provide an additional perspective on this issue.

In the real-data application, this study classified the examinees into latent classes of guessers and non-guessers based on the probability of each examinee belonging to each class. The recommendation of this study to delete the guessers from the calibration to improve the parameter estimation may not always be realistic in many large scale assessments (e.g., state assessments) because states are required to report scores for each examinee.

5.3 Implications

There are several possible implications of this dissertation. First, this dissertation explored the effectiveness of using MixIRT models to estimate the differential performance of latent classes in a sample, i.e., sample heterogeneity. By studying the impact of various study factors like sample size, test length, and proportion of guessing on parameter recovery, it provides useful information for various testing applications like item pool design, IRT-based test bank development, and pre-equating. This study also provides some direction on identifying aberrant item responses in any large scale assessment and on mapping the profile of guessers. This study focused on an important psychometric issue, namely parameter estimation: if parameters are not well estimated, the proportions in Adequate Yearly Progress (AYP) categories will not be accurately reported. This study has the potential to increase our understanding of the challenges and conditions in modeling complex behavioral phenomena, such as test-taking behaviors. Furthermore, the illustrative use of WinBUGS may encourage researchers and practitioners to utilize Bayesian methods to investigate alternate modeling strategies when their data do not fit a single IRT model. Finally, this dissertation also provided a substantive and policy-relevant illustration of the consequences of ignoring test-taking heterogeneity, especially by showing how simplistic applications of the 2PL and 3PL models could adversely impact not only examinees but also schools, teachers, and policymakers.

5.4 Future Directions

There are several possible future directions for this study. First, a future study could model complex guessing strategies by taking the interaction of ability, item difficulty, and item location into account. Although the 3PL model performed less well than the 2PL in this study, potentially because the guessing simulation favored the 2PL, in practical situations we may not know which model would perform better. Therefore, a comparative study of the 2PL and 3PL using real data may provide some useful findings; the two response functions are contrasted in the sketch below.
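As background for such a comparison, the two response functions differ only in the lower asymptote c, which the 3PL attaches to the item rather than to the person. A small illustrative sketch:

import numpy as np

def p_2pl(theta, a, b):
    # Two-parameter logistic item response function.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def p_3pl(theta, a, b, c):
    # Three-parameter logistic: c is the item's lower asymptote
    # ('pseudo-guessing' parameter).
    return c + (1.0 - c) * p_2pl(theta, a, b)

# For a low-ability examinee the 3PL floor mimics chance-level responding:
# p_2pl(-3.0, 1.0, 0.0) is about 0.05, while p_3pl(-3.0, 1.0, 0.0, 0.2)
# is about 0.24.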
Similarly, the estimates from this study could be compared with those from other estimation programs, such as BILOG-MG (Zimowski et al., 2003), Mplus (Muthén & Muthén, 1998-2007), and mdltm (von Davier, 2005). Although BILOG-MG is not intended to model unobserved test-taking heterogeneity, the recovery of parameters could still be compared, especially for the sample with no guessers (the 0% guessing proportion in this study) and for the sample after removing guessers. Also, comparing the MixIRT model parameter estimation from WinBUGS with that from Mplus or mdltm might provide an additional perspective on the direction that could be taken if the MixIRT model approach is to be realized in practical situations.

In terms of the number of classes in the mixture model, the present study was limited to two latent classes. Future studies could explore such investigations using more than two classes. The complexity of mixture modeling is further increased by simulating mixtures of one- and two-parameter IRT models, or even mixtures of unidimensional and multidimensional IRT models. Such mixture IRT modeling has the potential to provide useful information for applications such as subscore reporting and cognitive diagnostic modeling. Future studies could investigate the practicality of estimating such complex models in the mixture IRT framework.

As indicated in several earlier studies using WinBUGS, the use of lower-level programming languages such as FORTRAN or C++ to implement Gibbs sampling may provide more flexibility in addition to reduced computational time. Some of the limited flexibility of modeling in this study should not be attributed to Bayesian estimation so much as to the estimation tool used in this study, i.e., WinBUGS. Therefore, we may gain some modeling and computational efficiency by moving in the direction of a lower-level programming language.

5.5 Summary of the Findings and Conclusions

In summary, this study shows that the MixIRT model can precisely recover the model parameters. It also found that ignoring unobserved test-taking heterogeneity, such as the presence of guessers in a sample, had a noticeable impact on the precision of recovery of both item and ability parameters. The item parameters were estimated more precisely with the MixIRT model than with the 2PL model. Finally, the mixture IRT model classified examinees into guessers and non-guessers reasonably well.

The impact of guessing on ability estimation was not severe when the percentage of guessers was low, i.e., less than 5%. However, when the proportion of guessers was higher, say 7% to 10%, the impact was noticeable, as indicated by significant changes in the numbers of examinees classified into proficiency levels.

This study investigated an important psychometric issue in large scale assessment, namely modeling unobserved test-taking heterogeneity, using an IRT mixture model. It identified guessers by estimating, from each examinee's response pattern, the probability of membership in the guesser class. This study also documented the impact of excluding the guessers from the calibration to improve the parameter estimation, which has a large bearing on improving the quality of IRT-based item banking and the inferences drawn from tests assembled from the item pool. Since states are required to report scores for each examinee, the recommendation suggested by the results of this study, to delete the guessers from the calibration to improve parameter estimation, may not always be practical in many large scale assessments.
However, this study suggests that the proposed mixture modeling approach can be applied in many large scale assessment contexts, such as IRT-based item banking, to improve the quality of pre-equating and any inferences drawn from the item parameter estimates. This dissertation explored a psychometric perspective of modeling guessing as a person characteristic rather than associating it with an item property, as is commonly done with the three-parameter logistic model. Finally, the use of real data and the illustration of the MixIRT model's usefulness in documenting changes in proficiency level classifications has potential for improving the understanding of issues pertaining to cut-score variation and its policy implications.

APPENDICES

List of Appendices
A. WinBUGS CODE FOR THE MixIRT MODELS
B. FIGURES FOR EVALUATING CONVERGENCE OF THE ESTIMATES
C. ADDITIONAL TABLES
D. ADDITIONAL PLOTS

APPENDIX A

Appendix A.1 WinBUGS code for Model 1 (MixIRT model with random guessing)

# Mixture IRT model
model {
  for (i in 1:N) {                     # N is the number of examinees
    for (j in 1:J) {                   # J is the number of items
      # G[i] = 1 (guesser): chance-level response probability 1/5;
      # G[i] = 2 (non-guesser): 2PL response probability
      P[i,j] <- (2 - G[i])/5 + (G[i] - 1) * 1/(1 + exp(-(a[j]*(theta[i] - b[j]))));
      r[i,j] ~ dbern(P[i,j]);
    }
    G[i] ~ dcat(PI[]);
    pg[i] <- equals(G[i], 1);          # indicator of membership in the guesser class
  }
  # priors
  PI[1:2] ~ ddirch(alpha[]);
  for (j in 1:J) {
    a[j] ~ dnorm(1, 2) I(0,);          # truncated normal
    # a[j] ~ dlnorm(0, 2);             # log-normal alternative
    b[j] ~ dnorm(0, 1);
  }
  for (i in 1:N) {
    theta[i] ~ dnorm(0, taut);         # prior for ability parameter
  }
  taut ~ dgamma(0.01, 0.01);
}

Appendix A.2 WinBUGS code for Model 2 (MixIRT model with ability-based guessing)

model {
  for (i in 1:N) {                     # N is the number of examinees
    for (j in 1:J) {                   # J is the number of items
      logit(P[i,j]) <- (a[j]*(theta[i] - b[j]))
          - (alpha[i] - 1) * step(b[j] - theta[i] - delta[i]) * (a[j]*(theta[i] - b[j]) + 1.4);
      check[i,j] <- (alpha[i] - 1) * step(b[j] - theta[i] - delta[i]);
      # The constant -1.4 is used instead of estimating c, because the
      # inverse logit of -1.4 is approximately 0.20
      # step(b[j] - theta[i] - delta[i]) = 1 if item j is difficult for
      # examinee i relative to the threshold delta[i]
      # alpha[i] estimates the group membership (alpha[i] = 2 for a guesser)
      r[i,j] ~ dbern(P[i,j]);
    }
    sumcheck[i] <- sum(check[i,1:J]);
    check2[i] <- sumcheck[i];
    alpha[i] ~ dcat(PI[]);             # group membership is categorical
  }
  # priors
  for (j in 1:J) {
    b[j] ~ dnorm(0, 1);                # normal prior for difficulty parameter
    a[j] ~ dnorm(1, 2) I(0,);          # truncated normal for discrimination parameter
  }
  for (i in 1:N) {
    theta[i] ~ dnorm(0, taut);         # prior for ability parameter
    delta[i] ~ dnorm(0, 10);           # threshold for differing degrees of guessing
  }
  taut ~ dgamma(0.01, 0.01);           # hyper-parameter for precision
  PI[1] ~ dbeta(1, 1);
  PI[2] <- 1.0 - PI[1];
}

APPENDIX B

FIGURES FOR EVALUATING CONVERGENCE OF THE MCMC METHODS

[Figure B.1 Convergence diagnostic plots for the difficulty parameter of a randomly selected item (true b = 2.08, estimated b = 2.125): B.1a BGR plot, B.1b history plot, B.1c autocorrelation plot, B.1d density plot]

[Figure B.2 Convergence diagnostic plots for the ability parameter of a randomly selected person (true θ = 1.365, estimated θ = 0.862): B.2a BGR plot, B.2b history plot, B.2c autocorrelation plot, B.2d density plot]
APPENDIX B

FIGURES FOR EVALUATING CONVERGENCE OF THE MCMC METHODS

Figure B.1 Convergence diagnostic plots for the difficulty parameter of a randomly selected item [True b = 2.08, Estimated b = 2.125]
Figure B.1a. BGR plot for difficulty parameter estimate b[20], chains 1:3
Figure B.1b. History plot for difficulty parameter estimate b[20], chains 1:3
Figure B.1c. Autocorrelation plot for difficulty parameter estimate b[20], chains 1:3
Figure B.1d. Density plot for difficulty parameter estimate b[20], chains 1:3 (sample: 45000)

Figure B.2 Convergence diagnostic plots for the ability parameter of a randomly selected person [True theta = 1.365, Estimated theta = 0.862]
Figure B.2a. BGR plot for ability parameter estimate theta[13], chains 1:3
Figure B.2b. History plot for ability parameter estimate theta[13], chains 1:3
Figure B.2c. Autocorrelation plot for ability parameter estimate theta[13], chains 1:3
Figure B.2d. Density plot for ability parameter estimate theta[13], chains 1:3 (sample: 45000)

Appendix C. ADDITIONAL TABLES

Table C.1 Simulated Item Parameters (Test Length = 25)

Item     Discrimination   Difficulty
Number        (a)            (b)
  1          0.98            0.17
  2          1.22            0.15
  3          0.97            0.23
  4          1.17           -1.22
  5          1.48           -1.90
  6          0.99           -1.63
  7          1.08            0.98
  8          1.64           -0.29
  9          0.85           -0.51
 10          1.08           -0.18
 11          1.49            0.14
 12          0.78            0.72
 13          0.75           -0.98
 14          0.69            0.10
 15          0.79            0.65
 16          0.99           -0.85
 17          0.59           -1.61
 18          0.63            0.19
 19          0.68            0.12
 20          0.94            2.09
 21          1.35           -0.04
 22          0.99           -0.01
 23          0.87           -0.41
 24          0.98            0.97
 25          1.76            0.09

Table C.2 Simulated Item Parameters (Test Length = 50)

Item     Discrimination   Difficulty     Item     Discrimination   Difficulty
Number        (a)            (b)         Number        (a)            (b)
  1          0.70            0.53          26          1.19            0.19
  2          1.17           -1.11          27          1.15           -0.28
  3          0.72           -1.04          28          1.28            0.71
  4          0.85           -0.48          29          1.37            1.26
  5          0.96           -1.19          30          1.01           -0.89
  6          1.06            0.45          31          0.92           -0.02
  7          0.98           -0.22          32          1.06           -0.89
  8          1.54           -1.03          33          1.28            0.60
  9          1.50           -0.71          34          1.36            0.40
 10          1.17           -0.79          35          0.87           -0.41
 11          1.02           -0.06          36          1.26           -0.45
 12          0.82           -1.93          37          0.96           -0.60
 13          0.89           -0.56          38          0.92            0.74
 14          0.70           -1.11          39          1.14           -0.20
 15          0.91            0.26          40          1.26           -0.12
 16          0.89            0.48          41          0.87           -1.01
 17          1.55            0.12          42          0.69            0.25
 18          1.21           -0.08          43          0.99            2.15
 19          1.02           -2.11          44          1.15            0.23
 20          0.93            0.02          45          1.02           -1.14
 21          1.02           -0.70          46          1.48            0.20
 22          1.21           -0.21          47          0.92           -1.15
 23          1.75           -0.26          48          1.22            0.12
 24          1.09           -0.93          49          0.82           -1.35
 25          1.24            0.46          50          0.73            0.58

Table C.3 Simulated Item Parameters (Test Length = 40)

Item     Discrimination   Difficulty     Item     Discrimination   Difficulty
Number        (a)            (b)         Number        (a)            (b)
  1          0.853           0.575         21          1.298           0.694
  2          1.176          -0.226         22          0.785           1.945
  3          1.227          -1.140         23          0.798           0.088
  4          1.175           0.369         24          0.800           0.021
  5          0.858           0.873         25          0.911           0.776
  6          0.673          -0.835         26          0.633          -0.004
  7          0.833          -2.274         27          1.281           1.128
  8          0.844           0.797         28          0.832           1.406
  9          1.026           0.061         29          1.334           0.708
 10          1.231          -1.493         30          1.807           0.913
 11          1.897          -0.491         31          1.093           0.323
 12          0.999           0.935         32          0.889           0.153
 13          0.974          -0.460         33          1.189          -0.555
 14          0.926          -0.212         34          0.710           0.009
 15          0.769           0.004         35          1.019          -0.884
 16          1.135          -0.032         36          1.004           1.526
 17          0.961          -0.404         37          0.951          -0.132
 18          1.176          -0.926         38          0.814          -0.586
 19          1.300           0.568         39          0.743          -0.793
 20          0.687           0.583         40          0.985           0.715
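For completeness, response data of the kind calibrated in this study can be generated from item parameters such as those in Tables C.1 to C.3. The sketch below is illustrative rather than the study's actual generating code: it assumes five-option items, so a guesser answers every item correctly with probability 0.2 (consistent with the 1/5 term in the Appendix A model), draws abilities from a standard normal for illustration, and takes the proportion of guessers as an input. Function and variable names are illustrative.

import numpy as np

rng = np.random.default_rng(2)

def simulate_mixture_responses(a, b, n_examinees, prop_guessers, c=0.2):
    """Simulate 0/1 responses from a guesser / 2PL mixture.

    a, b          : item discriminations and difficulties
                    (e.g., the Table C.1 values)
    prop_guessers : proportion of examinees answering at chance
    """
    a, b = np.asarray(a), np.asarray(b)
    theta = rng.normal(size=n_examinees)               # abilities ~ N(0,1)
    guesser = rng.random(n_examinees) < prop_guessers  # latent class draw
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))  # 2PL
    p[guesser] = c                                     # chance responding
    return (rng.random(p.shape) < p).astype(int), guesser

# Example with the first five items of Table C.1 and 10% guessers:
a = [0.98, 1.22, 0.97, 1.17, 1.48]
b = [0.17, 0.15, 0.23, -1.22, -1.90]
y, guesser = simulate_mixture_responses(a, b, n_examinees=1000,
                                        prop_guessers=0.10)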
Appendix D. ADDITIONAL PLOTS

Figure D.1 Convergence diagnostic plots for the discrimination parameter of a randomly selected item, a[3], chains 1:3 [True a = 0.966, Estimated a = 1.008]. From top: history plot, density plot (sample: 45000), autocorrelation plot (left), and BGR plot (right).

Figure D.2 Convergence diagnostic plots for the difficulty parameter estimate of a randomly selected item, b[3], chains 1:3 [True b = 0.2278, Estimated b = 0.4627]. From top: history plot, density plot (sample: 45000), autocorrelation plot (left), and BGR plot (right).

Figure D.3 Convergence diagnostic plots for the ability parameter estimate of a randomly selected examinee, theta[87], chains 1:3 [True theta = 2.545, Estimated theta = 2.137]. From top: history plot, density plot (sample: 45000), autocorrelation plot (left), and BGR plot (right).

REFERENCES

Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17, 251-269.

Albert, J. H., & Ghosh, M. (2000). Item response modeling. In D. Dey, S. K. Ghosh, & B. K. Mallick (Eds.), Generalized linear models: A Bayesian perspective (pp. 173-193). New York: Addison-Wesley.

Ansari, A., Jedidi, K., & Dube, L. (2002). Heterogeneous factor analysis models: A Bayesian approach. Psychometrika.

Asparouhov, T., & Muthén, B. (2008). Multilevel mixture models. In G. R. Hancock & K. M. Samuelsen (Eds.), Advances in latent variable mixture models. Charlotte, NC: Information Age Publishing.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques. CRC Press.

Bazan, J. L., Branco, M. D., & Bolfarine, H. (2006). A skew item response model. Bayesian Analysis, 1(4), 861-892.

Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541-561.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 433-448). New York: Springer-Verlag.

Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27(6), 395-414.

Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.

Budescu, D., & Bar-Hillel, M. (1993). To guess or not to guess: A decision-theoretic view of formula scoring. Journal of Educational Measurement, 30(4), 277-291.

Cao, J., & Stokes, S. (2008). Bayesian IRT guessing models for partial guessing behaviors. Psychometrika, 73(2), 209-230.

Casella, G., & George, E. (1992). Explaining the Gibbs sampler. The American Statistician, 46(3), 167-174.

Cizek, G. J. (2001). An overview of issues concerning cheating on large-scale tests. Paper presented at the annual meeting of the National Council on Measurement in Education, April 2001, Seattle, WA.

Cohen, A. S., & Bolt, D. M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42(2), 133-148.

Congdon, P. (2005). Markov chain Monte Carlo and Bayesian statistics. In B. Everitt & D. Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 3, pp. 1134-1143). Wiley.

Cowles, M. (2004). Review of WinBUGS 1.4. American Statistician, 58(4).

Cowles, M., & Carlin, B. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review.
Journal of the American Statistical Association, 91(434), 883-904.

De Ayala, R. J., Kim, S.-H., Stapleton, L., & Dayton, C. M. (2002). Differential item functioning: A mixture distribution conceptualization. International Journal of Testing, 2(3-4), 243-276.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-38.

Draney, K., Wilson, M., Gluck, J., & Spiel, C. (2008). Mixture models in a developmental context. In G. R. Hancock & K. Samuelsen (Eds.), Advances in latent variable mixture models. Charlotte, NC: Information Age Publishing.

Fox, J.-P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66(2), 271-288.

Frary, R. B. (1988). Formula scoring of multiple-choice tests (correction for guessing). Retrieved October 21, 2008, from http://www.ncme.org/pubs/items/ITEMS_Mod_4.pdf

Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. Springer.

Gelfand, A., Hills, S., Racine-Poon, A., & Smith, A. (1990). Illustration of Bayesian inference in normal data models using Gibbs sampling. Journal of the American Statistical Association, 85, 972-985.

Gelfand, A., & Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457-511.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Gilks, W., Richardson, S., & Spiegelhalter, D. (Eds.). (1996). Markov chain Monte Carlo in practice. Chapman & Hall/CRC.

Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. Chapman & Hall/CRC.

Goldman, S. H., & Raju, N. S. (1986). Recovery of one- and two-parameter logistic item parameters: An empirical study. Educational and Psychological Measurement, 46(1), 11-21.

Gonzales, P., Calsyn, C., Jocelyn, L., Mak, K., Kastberg, D., Arafeh, S., et al. (2000). Pursuing excellence: Comparisons of international eighth-grade mathematics and science achievement from a U.S. perspective, 1995 and 1999 (No. NCES 2001-028). Washington, DC: National Center for Education Statistics.

Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231.

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). New York: Macmillan.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer.

Hamilton, L. S., Stecher, B. M., & Klein, S. P. (Eds.). (2002). Making sense of test-based accountability in education. Santa Monica, CA: RAND.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.

Hulin, C., Lissak, R., & Drasgow, F. (1982). Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study. Applied Psychological Measurement, 6(3), 249-260.

Johnson, E. G. (1992). The design of the National Assessment of Educational Progress. Journal of Educational Measurement, 29, 95-110.

Johnson, V. E. (1997).
On Bayesian analysis of multirater ordinal data: An application to automated essay grading. Journal of the American Statistical Association, 91, 42-51.

Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling. New York: Springer-Verlag.

Kiefer, J., & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 27(4), 887-906.

Kim, J. S., & Bolt, D. M. (2007). Estimating item response theory models using Markov chain Monte Carlo methods. Educational Measurement: Issues and Practice, 26(4), 38-51.

Kim, S. H., & Cohen, A. S. (1998). An evaluation of a Markov chain Monte Carlo method for the two-parameter logistic models. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Kolmogorov, A. N. (1933). On the empirical determination of a distribution function. Giornale dell'Istituto Italiano degli Attuari, 4, 83-91.

Lambert, P. C., Sutton, A. J., Burton, P. R., Abrams, K. R., & Jones, D. R. (2005). How vague is vague? A simulation study of the impact of the use of vague prior distributions in MCMC using WinBUGS. Statistics in Medicine, 24, 2401-2428.

Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin.

Lemke, M., & Gonzales, P. (2006). U.S. student and adult performance on international assessments of educational achievement: Findings from The Condition of Education 2006 (No. NCES 2006-073). Washington, DC: U.S. Department of Education.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS -- a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325-337.

Maier, K. S. (2002). Modeling incomplete scaled questionnaire data with a partial credit hierarchical measurement model. Journal of Educational and Behavioral Statistics, 27, 271-289.

San Martín, E., del Pino, G., & De Boeck, P. (2006). IRT models for ability-based guessing. Applied Psychological Measurement, 30(3), 183-203.

McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: John Wiley & Sons.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087-1091.

Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, 49(3), 359-381.

Mislevy, R. J., & Verhelst, N. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika, 55(2), 195-215.

Muthén, B. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54(4), 557-585.

Muthén, B., & Lehman, J. (1985). Multiple group IRT modeling: Applications to item bias analysis. Journal of Educational Statistics, 10(2), 133-142.

Muthén, L., & Muthén, B. (1998-2007). Mplus user's guide. Los Angeles, CA: Muthén & Muthén.

Patz, R. J., & Junker, B. W. (1999a). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24(4), 342-366.

Patz, R. J., & Junker, B. W. (1999b). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24(2), 146-178.

Pearson, K. (1894).
Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, A, 185, 71-110.

Pelton, T. W. (2002). The accuracy of unidimensional measurement models in the presence of deviations from the underlying assumptions. Unpublished doctoral dissertation, Brigham Young University.

Reckase, M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York: Springer-Verlag.

Redner, R. A., & Walker, H. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195-239.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.

Sahu, S. K. (2002). Bayesian estimation and model choice in item response models. Journal of Statistical Computation and Simulation, 72, 217-232.

Samuelsen, K. (2005). Examining differential item functioning from a latent class perspective. Unpublished doctoral dissertation, University of Maryland, College Park.

Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34(3), 213-232.

Smirnov, N. V. (1939). On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bulletin of Moscow University, 2, 3-16.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64(4), 583-616.

Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2003). WinBUGS 1.4 user manual. Cambridge: MRC Biostatistics Unit.

Tsutakawa, R. K., & Soltys, M. J. (1988). Approximation for Bayesian ability estimation. Journal of Educational Statistics, 13, 117-130.

von Davier, M. (2005). mdltm: Software for the general diagnostic model and for estimating mixtures of multidimensional discrete latent trait models [Computer software]. Princeton, NJ: Educational Testing Service.

von Davier, M., & Carstensen, C. H. (2007). Multivariate and mixture distribution Rasch models: Extensions and applications. New York, NY: Springer Science.

von Davier, M., & Rost, J. (2007). Mixture distribution item response models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26). Elsevier.

Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3-PL useful in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245-270). Boston, MA: Kluwer-Nijhoff.

Wise, S. L. (2006). An investigation of the differential effort received by items on a low-stakes computer-based test. Applied Measurement in Education, 19(2), 95-114.

Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10(1), 1-17.

Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19-38.

Yamamoto, K. (1987). A model that combines IRT and latent class models. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.

Yamamoto, K. (1989).
Hybrid model of IRT and latent class models (No. RR-89-41). Princeton, NJ: Educational Testing Service.

Yamamoto, K. (1995). Estimating the effects of test length and test time on parameter estimation using the HYBRID model (No. RR-95-16). Princeton, NJ: Educational Testing Service.

Yang, X. (2007). Methods of identifying individual guessers from item response data. Educational and Psychological Measurement, 67(5), 745-764.

Zimowski, M., Muraki, E., Mislevy, R., & Bock, R. (2003). BILOG-MG 3: Item analysis and test scoring with binary logistic models [Computer software]. Chicago, IL: Scientific Software International.