INVESTIGATING UNOBSERVED HETEROGENEITY USING ITEM RESPONSE THEORY MIXTURE MODELS

By

Dipendra Raj Subedi

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

2009

ABSTRACT

INVESTIGATING UNOBSERVED HETEROGENEITY USING ITEM RESPONSE THEORY MIXTURE MODELS

By

Dipendra Raj Subedi

Many item response theory (IRT) scaling and scoring models assume that examinee samples have comparable test-taking behaviors and comparable performance among different subgroups (e.g., gender and ethnicity) or in different test-taking contexts (e.g., geographic location or test-taking mode). However, in some situations this assumption of test-taking homogeneity may not hold, and test-taking heterogeneity is said to exist. When these sources of heterogeneity are unobservable (e.g., when examinees exhibit unexpected guessing behaviors), IRT mixture modeling (MixIRT) may be preferable to traditional IRT modeling (i.e., the two- and three-parameter logistic models) for adjusting the parameter estimation inaccuracies that might otherwise occur in the presence of unobserved heterogeneity. Therefore, the goals of this study were to investigate: a) the estimation accuracy of MixIRT models when test-taking heterogeneity exists, and b) the efficiency of MixIRT models in identifying subsets of examinees whose item responses do not fit the specified IRT model. Additionally, given the difficulty of estimating MixIRT parameters, Bayesian modeling with the Markov chain Monte Carlo method was used, and the robustness of MixIRT modeling was investigated through a simulation study. The simulation study examined several realistic testing factors, including test-taker sample size, test length, and the proportion of test-taking heterogeneity in the form of examinee guessing behavior. Varying these testing factors allowed evaluation of the impact of test-taking heterogeneity on the accuracy of parameter estimation. The results of the simulation study showed that the MixIRT model provided more accurate parameter estimates than traditional IRT models and was quite efficient in identifying subsets of examinees with anomalous test-taking behaviors. A real data example corroborated the simulation study results.

Copyright by DIPENDRA RAJ SUBEDI 2009

Dedication

To my beloved parents: Bhoj Raj Subedi and Shanti Subedi

ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to the many people who assisted me throughout my doctoral studies in the Measurement and Quantitative Methods program at Michigan State University.
First of all, I would like to express my sincere gratitude to my dissertation director and co-chair of the guidance committee, Dr. Mark Reckase. I am deeply indebted to him for his guidance and tremendous support throughout my dissertation. Dr. Kimberly Maier, co-chair of the guidance committee, provided continuous support from the beginning of my doctoral studies. The other dissertation committee members, Dr. Yeow Meng Thum and Dr. Punya Mishra, also provided excellent help and insightful comments during the various stages of my dissertation. My special appreciation goes to Dr. Raymond Mapuranga for his excellent help and camaraderie from the beginning of my doctoral studies to the completion of this dissertation; I especially appreciate his editorial comments on my dissertation. I would also like to thank Dr. Joseph Martineau at the Michigan Department of Education for giving me an excellent opportunity to gain hands-on experience with various psychometric analyses while I was still a graduate student. I am very grateful to Professor Murari Suvedi and Dr. Bishwa Adhikari for their continuous support from the beginning of my graduate studies at MSU.

I would like to express my special thanks to the Graduate School at MSU for providing me a dissertation completion fellowship. Thanks to my writing group members at MSU, Michael Sherry, Sungworn Ngudgratoke, and Young Yee Kim, for their constructive feedback and comments on my writing. Also, thanks to my professors at MSU, who provided me various teaching and research assistantships. Thanks to Adam Wyse and Minh Duong for their wonderful friendship. I would also like to take this opportunity to express my appreciation to my colleagues and mentors at the American Institutes for Research, Washington, DC, for their understanding during the final editing stage of my dissertation.

Finally, my special thanks go to my family. First, to my wife Shanta Subedi, for her love, patience, and continuous support throughout this journey. To my parents, who stood by me at every stage of my career with their unconditional support. To my brother Bishwa Subedi and other family members for their wonderful support.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS

CHAPTER 1 INTRODUCTION
1.1 Background
1.2 Anomalous Examinee Test-taking Behavior
1.3 Test-taking Heterogeneity
1.4 Traditional Item Response Theory Modeling
1.5 Motivation
1.6 Purpose

CHAPTER 2 LITERATURE REVIEW
2.1 Modeling Sources of Unobserved Heterogeneity
2.2 Mixture Distributions and Mixture IRT Models
2.3 Psychometric Applications of Mixture Modeling
2.4 Mixture IRT Model Parameter Estimation
2.4.1 Frequentist Approach to Parameter Estimation
2.4.2 Bayesian Approach
2.4.3 Markov chain Monte Carlo Algorithm
2.5 Research Questions

CHAPTER 3 METHODOLOGY
3.1 Models
3.1.1 Model 1: Mixture IRT model with completely random guessing behavior (MixIRT-R)
3.1.2 Specification of the MixIRT-R Model in WinBUGS
3.1.3 Model 2: Mixture IRT model with ability-based guessing (MixIRT-A)
3.2 Simulation Study
3.2.1 Simulation Factors or Study Design
3.2.2 Generation of Simulated Parameters and Item Responses
3.2.3 Parameter Estimation
3.2.4 Evaluation Criteria and Analysis of Simulated Data
3.2.5 Simulation Study using Mixture IRT Model with Ability-based Guessing
3.3 Empirical Data Analysis
3.3.1 Analysis of Empirical Data

CHAPTER 4 RESULTS
4.1 Descriptive Statistics of Simulated Item Parameters
4.2 Evaluation of Parameter Estimate Convergence
4.3 Results of MixIRT-R Model Simulation Analyses
4.3.1 Results from the Parameter Recovery Study
4.3.2 Classification Accuracy of the MixIRT-R Model
4.4 Results from Simulation Analyses using MixIRT-A Model
4.5 Results from Empirical Data Analysis
4.5.1 Results Based on the Random Guessing Model
4.5.2 Results Based on the Ability-based Guessing Model

CHAPTER 5 DISCUSSION AND CONCLUSIONS
5.1 Interpretations of the Results
5.1.1 Results from Parameter Recovery Study
5.1.2 Results on Classification Accuracy
5.1.3 Results from Empirical Study
5.2 Study Limitations
5.3 Implications
5.4 Future Directions
5.5 Summary of the Findings and Conclusions

APPENDICES
REFERENCES

LIST OF TABLES

3.1 Summary of Parameter Recovery Study Factors
4.1 Descriptive Statistics for the Simulated Item Parameters
4.2 Descriptive Statistics of MixIRT Estimates for Selected Parameters
4.3 Bias and RMSE of Item Difficulty Parameter Estimates
4.4 Bias and RMSE of Item Discrimination Parameter Estimates
4.5 Correlations between True and Estimated Item Parameters
4.6 Bias and RMSE of Ability Parameter Estimates for all Simulation Conditions
4.7 Correlations between Simulated and Estimated Ability Parameters for all Simulated Conditions
4.8 Classification Accuracy in MixIRT-R Model
4.9 Descriptive Statistics of Simulated Item Parameters in MixIRT-A Model
4.10 RMSE of Discrimination and Difficulty Parameter Estimates using MixIRT-A Model
4.11 Correlation of Discrimination and Difficulty Parameter Estimates using MixIRT-A Model
4.12 RMSE of Ability Parameter Estimates in MixIRT-A Model
4.13 Correlation of Ability Parameter Estimates in MixIRT-A Model
4.14 Classification Accuracy of MixIRT-A Model
4.15 MixIRT-R Estimates for Training Sample
4.16 MixIRT-R Estimates for Validation Sample
4.17 Distribution of Proficiency Levels in Original and Modified Training Sample
4.18 Distribution of Proficiency Levels in Original and Modified Validation Sample
4.19 Test Statistics from Two-sample Kolmogorov-Smirnov Test
4.20 Distribution of Proficiency Levels in Original and Modified Training Sample
4.21 Distribution of Proficiency Levels in Original and Modified Validation Sample
4.22 Test Statistics from Two-sample Kolmogorov-Smirnov Test

LIST OF FIGURES

4.1 Sample plots for convergence assessment of discrimination parameter estimate
4.2 25-item test average bias results for difficulty parameter estimates
4.3 50-item test average bias results for difficulty parameter estimates
4.4 25-item test average RMSE results for discrimination parameter estimates
4.5 50-item test average RMSE results for discrimination parameter estimates
4.6 25-item test average correlations between true and estimated a-parameters
4.7 50-item test average correlations between true and estimated a-parameters
4.8 25-item test average correlations between true and estimated b-parameters
4.9 50-item test average correlations between true and estimated b-parameters
4.10 Recovery of a and b parameters in the 2PL model for sample size of 500, test length of 25, and 10% proportion of guessers
4.11 Recovery of a and b parameters in the MixIRT model for sample size of 500, test length of 25, and 10% proportion of guessers
4.12 Recovery of a and b parameters in the 2PL model for sample size of 2000, test length of 50, and 10% proportion of guessers
4.13 Recovery of a and b parameters in the MixIRT model for sample size of 2000, test length of 50, and 10% proportion of guessers
4.14 25-item test average RMSE results for ability parameter estimates
4.15 50-item test average RMSE results for ability parameter estimates
4.16 RMSE of discrimination parameter estimates in MixIRT-A model
4.17 RMSE of difficulty parameter estimates in MixIRT-A model
4.18 RMSE of ability parameter estimates in MixIRT-A model
4.19 Number of examinees identified as guessers in training and validation sample

KEY TO ABBREVIATIONS

IRT: Item response theory
2PL: Two-parameter logistic model
3PL: Three-parameter logistic model
MixIRT: Mixture item response theory model
MIRT: Multidimensional item response theory model
RMSE: Root mean squared error
MixIRT-R: Mixture item response theory model with random guessing
MixIRT-A: Mixture item response theory model with ability-based guessing

CHAPTER 1 INTRODUCTION

1.1 Background

The K-12 test-based accountability system has gained increasing attention from researchers, educators, and policy-makers since the implementation of the "No Child Left Behind" legislation (NCLB, 2001). This legislation was designed to improve existing educational practice and student academic achievement through improved teaching and curriculum (Hamilton, Stecher, & Klein, 2002). This renewed attention to K-12 education also led to increased interest in comparative and international assessments because of the ever-widening subject matter knowledge gap between US students and their counterparts in other industrialized countries.
For example, results from the Third International Mathematics and Science Study showed that US fourth- and eighth-grade students had comparatively lower mathematics and science achievement than students in many other developed countries (Gonzales et al., 2000; Lemke & Gonzales, 2006). This increased focus on large-scale assessment has also resulted in added scrutiny of psychometric modeling approaches. Specifically, the accuracy of parameter estimates from traditional psychometric models has come into question because these models do not efficiently account for anomalous examinee behaviors such as cheating and guessing. These undesirable behaviors can lead to aberrant responses that obscure the accuracy of inferences drawn about student knowledge and academic skills.

1.2 Anomalous Examinee Test-taking Behavior

As noted previously, the accuracy of psychometric parameter estimates used in large-scale assessments is fallible when examinees exhibit anomalous test-taking behavior. Common anomalous behaviors include guessing, cheating, and low motivation. Cheating can be defined as any action that decreases the accuracy of the intended inferences based on the examinee's performance, thus threatening the validity of the inference about the test-taker (Cizek, 2001). Examinee motivation can be defined as the degree of effort test-takers expend, particularly when given a low-stakes assessment. Several recent studies have investigated each of these testing phenomena; however, this study focuses only on the heterogeneity introduced by test-takers' guessing behavior.

Guessing may occur when test-takers run out of time on the test, when they are less motivated, or when they find test items difficult. Guessing behavior, however, varies depending upon the nature of the test (low-stakes or high-stakes), item difficulty, examinee ability, the time available to complete the test, and cross-cultural differences among examinees. The validity of the inference made using scores is partially dependent on the amount of effort put forth by the examinee while taking the test (Wise, 2006). Furthermore, when examinees do not give adequate effort, they tend to guess randomly, which makes it difficult to estimate the test-taker's true subject matter proficiency (Budescu & Bar-Hillel, 1993). Therefore, anomalous test-taking behavior is a particular concern in low-stakes tests, where examinees are more likely to have low motivation and to cheat or guess excessively. In low-stakes tests like the National Assessment of Educational Progress, attempts to mitigate the negative consequences of unusual examinee behavior include the use of shorter tests with specialized data collection designs like balanced incomplete blocking (Johnson, 1992). Even formula scoring does not adequately deter guessing on tests (Frary, 1988). It is also particularly important to identify and account for anomalous examinee behavior in the current NCLB era because examinee test scores are an integral aspect of data-driven educational policy. For example, adequate yearly progress (AYP) decisions are based strongly on examinee test scores, and yet few model-based or sample-specific adjustments are made in the estimation of psychometric parameters.

1.3 Test-taking Heterogeneity

It is important to study the undesirable test-taking behaviors described previously because they can have negative consequences on the interpretation and accuracy of psychometric models.
In particular, the psychometric models used in low-stakes tests carry the underlying assumption that the same item parameters and ability distributions apply to all examinees taking the test. This is known as the assumption of test-taking homogeneity (e.g., Baker & Kim, 2004; Bock & Zimowski, 1997; Lord, 1980). But as noted previously, examinees often exhibit unconventional test-taking patterns such as cheating, excessive guessing, and low motivation. When such anomalous behaviors exist, test-taking heterogeneity is said to exist; it is often evidenced by sample variability among different groups of test-takers. An example pertaining to this study would be guessers and non-guessers. Similarly, test-taking behavior may differ across groups of examinees. For example, the Graduate Record Examinations' (GRE) verbal assessment is administered to native and non-native English speakers whose different English proficiency could impact their performance on the test and not allow the two groups to be analyzed together.

In situations where the group membership of test-takers is not observable, unobservable test-taking heterogeneity is said to exist. In contrast, if the source of heterogeneity can be observed in the data (e.g., gender, ethnicity), the observable heterogeneity makes it convenient to stratify test-takers for validation using multi-group analyses (Muthén & Lehman, 1985). Multi-group analyses are important in psychometrics, and when test-taker characteristics are not observable, a set of models called latent class models can be used for multi-group analyses. In the case of low-stakes tests, a form of latent class models called mixture models has recently been used by researchers for multi-group analyses when test-taking heterogeneity is unobservable (Bock & Zimowski, 1997). As a result, mixture models have been extended to item response theory (Lord, 1980), a framework that analyzes the interaction between examinee ability and test items and that underlies the models commonly used in low-stakes tests. IRT is the most common modeling approach used in many assessments, including NAEP, TIMSS, and PISA.

1.4 Traditional Item Response Theory Modeling

Traditional item response theory (IRT) modeling allows examinee performance on each test item to be succinctly quantified across all examinees. The three assumptions of traditional IRT are dimensionality, local independence, and the existence of a monotonically increasing function (Hambleton, 1989). First, the dimensionality assumption indicates that a test should measure only one ability, personality trait, or attitude; this is called unidimensionality. When more than one ability is assumed to exist, the models are called multidimensional IRT models (Hambleton & Swaminathan, 1985; Reckase, 1997). Local independence implies that no item should provide clues to the answers of other items in a test (Hambleton & Swaminathan, 1985). Finally, the assumption of a monotonically increasing function relates the probability of success on an item to the ability measured by the item. A common traditional IRT model is the three-parameter logistic (3PL) model, which is represented mathematically as:

$$P_i(\theta) = c_i + (1 - c_i)\,\frac{e^{a_i(\theta - b_i)}}{1 + e^{a_i(\theta - b_i)}}, \qquad i = 1, 2, \ldots, n \qquad (1.1)$$

where $P_i(\theta)$ is the probability that a test-taker with ability $\theta$ answers item $i$ correctly, $a_i$ is the item discrimination, $b_i$ is the item difficulty, and $c_i$ is the pseudo-guessing parameter (Hambleton & Swaminathan, 1985). Another common model, the two-parameter logistic (2PL) model, is obtained by setting $c_i = 0$ in Equation 1.1, and the one-parameter logistic (1PL) model is obtained by additionally setting $a_i = 1$.
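For readers who prefer code, the following Python sketch (an illustration added here, not code from this dissertation) evaluates Equation 1.1; setting c = 0 recovers the 2PL model, and additionally setting a = 1 recovers the 1PL model. The parameter values are invented for demonstration.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (Equation 1.1)."""
    # e^{a(theta-b)} / (1 + e^{a(theta-b)}) is the logistic function of a(theta-b)
    logit = a * (theta - b)
    return c + (1.0 - c) / (1.0 + np.exp(-logit))

theta = np.array([-2.0, 0.0, 2.0])          # three hypothetical ability levels
print(p_3pl(theta, a=1.2, b=0.5, c=0.2))    # 3PL
print(p_3pl(theta, a=1.2, b=0.5, c=0.0))    # 2PL: c = 0
print(p_3pl(theta, a=1.0, b=0.5, c=0.0))    # 1PL: c = 0 and a = 1
```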
Although these traditional IRT models are useful for quantifying examinee ability, they are not able to account for unobservable test-taking heterogeneity, which may result in inaccurate parameter estimates. As noted above, mixture models are capable of accounting for unobservable heterogeneity, and extensions of these models within the IRT framework have produced so-called mixture IRT models, or MixIRT for short. MixIRT models provide greater flexibility in modeling complex item response distributions (McLachlan & Peel, 2000). Hence, in the low-stakes testing context, MixIRT models would be particularly useful, which provides the impetus for investigating their robustness in modeling anomalous test-taking behavior, as described in the section which follows.

1.5 Motivation

Given the limitations of traditional IRT models in accounting for test-taking heterogeneity, it is important to investigate the efficiency and accuracy of MixIRT models in estimating parameters. Moreover, the estimation of latent distributions (e.g., Mislevy, 1984) is an important area of psychometric research because even the most intuitively appealing and creative models are not useful unless their parameters can be estimated accurately. Specifically, modeling unobserved test-taking heterogeneity such as aberrant item responses is crucial because ignoring it can lead to biased parameter estimates and may yield inflated estimates of measurement precision and test reliability (Lord & Novick, 1968; Muthén, 1989). Furthermore, the inaccurate estimation of examinee latent traits can have consequential impacts such as false interpretation of student ability and erroneous measurement of school and teacher effectiveness (Ansari, Jedidi, & Dube, 2002).

Given the limitations of IRT modeling articulated above, specifically in the modeling of guessing, the commonly used 3PL model is unlikely to suffice for psychometric modeling in large-scale assessments, because it restricts the guessing parameter to be item dependent. Most importantly, the 3PL model is incapable of identifying whether individual test-takers actually guess; rather, it models guessing over the entire sample and hence models guessing inadequately. Therefore, a subsidiary motivation of this study is to explicate the implications of modeling guessing or random response behavior at the item level, as in current IRT practice, rather than at the person level. Moreover, the MixIRT approach taken in this study is more appropriate than traditional IRT for providing evidence of the impact of guessing on individual test items, and it has the secondary advantage of possibly identifying students with low motivation.

1.6 Purpose

Mixture model parameters are estimated using either frequentist or Bayesian approaches. As described in Chapter Two, several practical problems arise in the frequentist approach to mixture model parameter estimation (Frühwirth-Schnatter, 2006). On the other hand, Bayesian estimation methods can handle high-dimensional problems and allow exploration of the distributions of parameters, regardless of the distributional forms of the likelihood functions or parameters.
In addition, model complexity increases with the number of parameters to be estimated, as in a mixture model, particularly one with a large number of mixture components. Therefore, this study focused on a Bayesian approach to parameter estimation in mixture IRT models, with specific emphasis on item parameter estimation, test-taker cluster identification, and proficiency level classification. These issues are of increased interest among researchers, policymakers, and educators in the current era of test-based accountability systems. In particular, this study compared the performance of Bayesian mixture IRT modeling to common IRT models in estimating person and item parameters and in identifying aberrant responses and low-motivation test-takers.

The remainder of the dissertation is divided into four chapters. Chapter Two reviews the literature that lays out the important empirical and theoretical foundation for this dissertation. The third chapter presents the methodology and the research design, the implementation of Bayesian estimation methods, and the mixture model analysis. The results from both the simulation and empirical data analyses are presented in Chapter Four. Finally, Chapter Five provides discussion, limitations, suggestions for further research, and a summary of results and conclusions.

CHAPTER 2 LITERATURE REVIEW

As noted in the previous chapter, the purpose of this study is to investigate and illustrate the efficacy of using mixture models and a Bayesian approach in estimating item parameters and test-taker ability under the IRT framework. Therefore, the purpose of this chapter is to introduce important Bayesian and mixture modeling concepts that are pertinent to this study. In the sections which follow, descriptions of mixture distributions, mixture model parameter estimation, Bayesian statistical modeling, prior research on psychometric applications of mixture models, and the modeling of guessing behaviors in tests are provided.

2.1 Modeling Sources of Unobserved Heterogeneity

The latent structure model (Goodman, 1974; Lazarsfeld & Henry, 1968) is used to explain underlying, unobservable, or latent categorical relationships, and it offers an efficient way of uncovering distinct subpopulations, incorporating correlated non-normally distributed outcomes, and classifying individuals into classes. That is, these models can serve as possible elucidations of the observed relationships among a set of manifest variables (Goodman, 1974). Depending upon the nature of the variables used, various types of models can be defined under this framework. Specifically, mixture modeling is categorized as a subset of latent structure models in which latent variables that represent subpopulations are used for modeling population membership. Mixture models in the context of IRT are presented next.

2.2 Mixture Distributions and Mixture IRT Models

Mixture distributions are comprised of a finite or infinite number of components, possibly of different distributional types, that can describe different features of data. A mixture model is a flexible tool for modeling complex data through an appropriate choice of components that accurately represent the data's true characteristics (McLachlan & Peel, 2000). As a result, mixture models are a valuable tool for analyzing a wide variety of latent trait phenomena. Mathematically, a mixture model can be represented by the observation of $n$ independent random variables $x_1, x_2, \ldots, x_n$ from a $k$-component mixture density, as denoted by Equation 2.1:

$$p(x_i) = \sum_{j=1}^{k} \pi_j f_j(x_i), \qquad i = 1, \ldots, n \qquad (2.1)$$

where $\pi_j > 0$, $j = 1, \ldots, k$; $\pi_1 + \cdots + \pi_k = 1$; and $f_j(x)$, $1 \le j \le k$, are the component densities of the mixture, with $\pi_1, \ldots, \pi_k$ the mixing proportions. These proportions allow estimation of the size of the subgroups in the sample.
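As a concrete, purely illustrative instance of Equation 2.1, the sketch below evaluates a two-component normal mixture; the normal components and the example weights are assumptions for illustration, not choices made in this chapter.

```python
import numpy as np
from scipy.stats import norm

def mixture_density(x, weights, means, sds):
    """k-component normal mixture density: p(x) = sum_j pi_j * f_j(x) (Equation 2.1)."""
    weights = np.asarray(weights, dtype=float)
    # Mixing proportions must be positive and sum to one.
    assert np.all(weights > 0) and np.isclose(weights.sum(), 1.0)
    return sum(w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, sds))

# Two hypothetical latent subgroups: 80% of the sample centered at 0, 20% at 2.
x = np.linspace(-4.0, 5.0, 7)
print(mixture_density(x, weights=[0.8, 0.2], means=[0.0, 2.0], sds=[1.0, 0.7]))
```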
Mixture IRT (MixIRT) models are a combination of latent class analysis (LCA) and IRT models (Asparouhov & Muthén, 2008); LCA is a statistical method used to identify homogeneous groups, or classes, from categorical multivariate data. The development of MixIRT models has been motivated primarily by the diverse phenomena encountered when modeling data from populations that are potentially non-homogeneous (von Davier & Rost, 2007), such as a heterogeneous population of guessers. In addition, MixIRT models are useful for testing the population invariance of item parameters and the ability distribution. Basically, these models are based on the assumption that the population under investigation is composed of two or more latent subpopulations, dictated by different degrees of latent traits, each of which responds differentially to psychological tasks and stimuli (Draney, Wilson, Gluck, & Spiel, 2008). One of the most general MixIRT models is the mixed Rasch model (Rost, 1990), in which each examinee is parameterized both by a class membership parameter ($g = 1, \ldots, G$) and a within-class ability parameter ($\theta_g$). The probability of a correct response ($U = 1$) to item $i$ is represented mathematically as:

$$P(U_{ig} = 1 \mid \theta_g) = \frac{e^{(\theta_g - b_{ig})}}{1 + e^{(\theta_g - b_{ig})}}$$

where $b_{ig}$ is the difficulty of item $i$ within class $g$.

3.1.3 Model 2: Mixture IRT model with ability-based guessing (MixIRT-A)

Under the MixIRT-A model, the probability of a correct response by examinee $i$ to item $j$ is:

$$P(U_{ij} = 1) = \frac{\exp\!\left[a_j(\theta_i - b_j) - \eta_i\, I(b_j > \theta_i + \delta_i)\,\big(a_j(\theta_i - b_j) - c_j\big)\right]}{1 + \exp\!\left[a_j(\theta_i - b_j) - \eta_i\, I(b_j > \theta_i + \delta_i)\,\big(a_j(\theta_i - b_j) - c_j\big)\right]} \qquad (3.4)$$

where $\eta_i = 1$ if examinee $i$ is a guesser and 0 otherwise, and $\delta_i$ is a parameter that measures the difficulty threshold at which a guesser begins to guess. In other words, some examinees may use their full potential, or even try eliminating one or two choices before making their guess; others may not use their full potential, thus guessing on those items that are difficult for them. The indicator function, represented as $I(\cdot)$ in Equation 3.4, becomes 1 only if the difficulty parameter of item $j$ is larger than the ability of examinee $i$, with some degree of adjustment controlled by the threshold parameter $\delta_i$. (Note that when $\eta_i\, I(\cdot) = 1$, the exponent in Equation 3.4 reduces to $c_j$, so $c_j$ governs the guessing probability on item $j$.) The current study allowed $\delta_i$ to vary among examinees because different examinees have different thresholds in terms of their tendency to guess.

The priors for $\theta_i$, $a_j$, $b_j$, $\tau_\theta$, and $\tau_b$ are the same as those used in Model 1 above. Appendix A.2 provides the WinBUGS code used to implement the MixIRT-A model. The prior for $\eta_i$ is similar to that used for $g_i$ in MixIRT-R, which corresponds to a categorical representation of group identification. The hyperparameter of this distribution is parameterized by a Dirichlet distribution, which is a conjugate prior for estimating the probability that a particular examinee is likely to be a guesser.
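A minimal sketch of the response probability in Equation 3.4 may help fix ideas. This is an illustrative re-implementation, not the dissertation's estimation code (which is the WinBUGS program in Appendix A.2), and the example parameter values are invented.

```python
import numpy as np

def p_mixirt_a(theta_i, delta_i, eta_i, a_j, b_j, c_j):
    """P(correct) under the ability-based guessing model (Equation 3.4).

    eta_i = 1 flags examinee i as a guesser; the indicator fires only when
    item j is too difficult for examinee i, i.e., b_j > theta_i + delta_i.
    """
    core = a_j * (theta_i - b_j)
    guess_term = eta_i * float(b_j > theta_i + delta_i) * (core - c_j)
    z = core - guess_term  # reduces to c_j when a guesser meets a too-difficult item
    return np.exp(z) / (1.0 + np.exp(z))

# A guesser (eta = 1) facing an item well above his or her ability:
print(p_mixirt_a(theta_i=-1.0, delta_i=0.2, eta_i=1, a_j=1.2, b_j=1.5, c_j=-1.1))
# The same pairing for a non-guesser (eta = 0) follows the ordinary 2PL curve:
print(p_mixirt_a(theta_i=-1.0, delta_i=0.2, eta_i=0, a_j=1.2, b_j=1.5, c_j=-1.1))
```

With c_j = -1.1 the guesser's success probability is about 0.25, roughly the chance level on a four-option multiple-choice item; that value is an assumption chosen for the example.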
In addition, a simulation study allowed exploration of the impact of guessing behavior on parameter estimation. Hence, in order to evaluate the extent to which the MixIRT model can precisely recover the item parameters using Bayesian estimation, a parameter recovery study was conducted. The precision of parameter estimation was evaluated in terms of bias, RMSE, and correlation between estimated and simulated parameters. As mentioned earlier, the proposed method provides better item parameter recovery when it produces small bias, small RMSE, and high correlation between estimated and simulated parameters. 29 3. 2.1 Simulation Factors or Study Design As mentioned earlier, a simulation study allows the evaluation of how different testing characteristics influence the estimation of mixture model parameters. In the context of this study, it is possible to explore the impact of unobserved test-taking heterogeneity (guessing proportion) on parameter estimation. Typical test characteristics, which are encountered in applied testing situations, include sample size, test length, and proportion of guessing. Taking this into account, this study used the factors listed in the Table 3.1, which are commonly used in parameter recovery studies (e. g., Goldman & Raju, 1986; Hulin et al., 1982; Kim & Cohen, 1998). Table 3.1 Summary of Parameter Recovery Study Factors Factors Levels Sample Size 500, 2000 Test Length 25, 50 proportion of “guessing” 0%, 5%, 10% Estimation model MixIRT, 2PL, 3PL This simulation study used a MixIRT model with simulated random guessing behavior as labeled as MixIRT-R model above. The estimation from 0% guessing serves as a baseline. This study investigated the impact of different guessing proportion (5% and 10%) on parameter estimation. The guessing preportion represents the percentage of examinees who are a guesser in a test. The two-parameter logistic (2PL) model was used for generating data. The performance of MixIRT-R and 2PL model was compared with 30 3PL model because 3PL is commonly used in practice for parameter estimation when guessing behavior is suspected in multiple choice items. Each condition in this study was replicated 15 times. Although this may appear to be too few replications from a fiequentist perspective, this is actually more than the number of replications used in Bayesian IRT-based simulation studies. This reduction in replications is partly a result of the computational intensity of WinBUGS software which can take upto 6 hours to run 25,000 iterations for the item responses with 2000 examinees and 50 items. Examples from the literature have used only five (e.g., Bolt & Lall, 2003) or ten replications (Cao & Stokes, 2008). The general procedures employed to simulate item and ability parameters, and simulation of item responses are presented next. 3. 2.2 Generation of Simulated Parameters and Item Responses The simulation of parameters and item responses was based on typical methods found in IRT literature (Hulin et al., 1982; Kim & Cohen, 1998). Ability parameters were assumed to follow a normal distribution; thus ability parameters were randomly sampled from a standard normal distribution (mean=0, standard deviation=l). Similarly, item discrimination parameters were assumed to follow a lognormal distribution. Thus, discrimination parameters were randomly sampled from a lognorrnal distribution [611' ~ lognorrnal (0,0.3)]. The item difficulty parameters were also assumed to follow a normal distribution. 
Therefore, difficulty parameters were randomly sampled from a normal distribution with a mean of 0 and a standard deviation of 0.7; the standard deviation was set slightly below 1 to avoid items that were too easy or too difficult. The a and b parameters were randomly paired with each other, so any nonzero correlations among the item parameters were attributable to chance. These item parameters may be thought of as simulating an idealistic scenario, or one that a psychometrician using the 2PL model would hope to obtain.

The probability of a correct response to item j by simulated examinee i was then computed using the two-parameter logistic IRT model (Birnbaum, 1968). A response vector of dichotomous item scores for each examinee was obtained by generating, for each item, a uniform random number (between 0 and 1) and comparing it with the probability of an examinee of that ability level passing the item. If the computed probability exceeded the random number, the item was scored as correct (1); otherwise, it was scored as incorrect (0). In order to simulate the guessers, the item responses of a randomly selected 5% or 10% of the examinees were modified so that their response patterns mimicked guessing behavior. The original data with no guessing (labeled the 0% proportion of guessing) served as baseline data for comparative purposes. The estimation of the modified item responses allowed evaluation of the impact of guessing on parameter estimation, and it also showed how the 2PL and 3PL models could not account for test-taking heterogeneity.
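The generation steps just described might be sketched as follows. The seed, the matrix layout, and the exact definition of a guesser's modified responses (purely random responses with a 25% success rate, as on a four-option multiple-choice item) are assumptions for illustration; the text above does not specify those details.

```python
import numpy as np

rng = np.random.default_rng(2009)  # seed is arbitrary

def simulate_data(n_persons=500, n_items=25, prop_guessers=0.10):
    """Simulate 2PL responses, then overwrite a random subset with guessing."""
    theta = rng.normal(0.0, 1.0, n_persons)       # ability ~ N(0, 1)
    a = rng.lognormal(0.0, 0.3, n_items)          # discrimination ~ lognormal(0, 0.3)
    b = rng.normal(0.0, 0.7, n_items)             # difficulty ~ N(0, 0.7)
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))   # 2PL probabilities
    u = (rng.uniform(size=p.shape) < p).astype(int)       # correct if uniform draw < P
    n_guess = int(prop_guessers * n_persons)
    guessers = rng.choice(n_persons, n_guess, replace=False)
    # Guessers' rows are overwritten with random responses; the 25% success
    # rate is an assumption, not a detail taken from the dissertation.
    u[guessers] = (rng.uniform(size=(n_guess, n_items)) < 0.25).astype(int)
    return u, theta, a, b, guessers

responses, theta, a, b, guessers = simulate_data()
```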
In this study, five diagnostic measures were used to evaluate the sampler performance: (i) Brooks, Gelman, and Rubin (BGR) diagnostic plots; (ii) Monte Carlo errors; (iii) history plots; (iv) autocorrelation plots; and (v) density plots. 33 The Gelman-Rubin convergence statistic R compares the ratio of the pooled chain variance to the within chain variance (Gelman & Rubin, 1992). Once convergence is reached, R converges to 1. WinBUGS plots 3 items; where the Gelman-Rubin statistic is plotted in red, which is preferred to converge to 1. In blue, the average width of the 80% intervals within each individual chain and the width of the 80% interval of the pooled runs is plotted in green. The blue and green lines should stabilize to some number though it is not necessarily required to be 1. Monte Carlo error (MC error) is a measure like the standard error of the mean but adjusted for autocorrelation. Generally, autocorrelations for the MCMC sequence that decay slowly as a function of lag imply poor mixing of the MCMC series and could indicate a high-degree of correlations between the parameters or lack of identification of the model. Finally, history and density plots are also useful to monitor the convergence of estimates. Analysis to evaluate the sensitivity to the initial values and the mixing and convergence of the Gibbs sampler was carried out. The reasonable convergence was reached in each condition by running 3 chains of 25,000 iterations with the first 10,000 discarded as bum-in. For additional replications, however, a single chain of 25,000 iterations was run with the first 10,000 iterations discarded as bum-in period. The estimate of each parameter was based on final 15,000 iterations. 3. 2.4 Evaluation Criteria and Analysis of Sim ulated Data Three commonly used summary statistics were used as evaluation criteria: bias, Root Mean Squared Error (RMSE), and correlation. Before computing the bias and 34 RMSE, the estimated parameters were transformed to the same scale as the true parameters. RMSE is the square root of the average of the squared differences between true and estimated parameters across all the items for item parameters and across all the subjects for the ability parameter. For example, in case of item parameter recovery, the RMSE and Bias for each parameter 1] = a, b are expressed as: 2 J R fi.r—77. RMSE: ZZ( JJ*RJ) (3.5) j=lr=l Bias = i i (fijr — ”j ) (3.6) > where 771' is the true value and 771', is the corresponding estimate. J is the total number of items, and R is the number of replications. It should be noted that for ease of interpretation, the results for all J items were combined across the R replications for each simulation condition. Thus, the bias and RMSE presented in the results section are basically the averages of those values across each simulation condition. Bias index does not indicate in an absolute sense the degree of estimation accuracy. In bias, equal positive and negative errors are cancelled with each other producing a zero bias just as would perfect estimation. The bias then suggests whether there is a systematic tendency to overestimate or underestimate a parameter. A positive bias implies parameter overestimation and a negative bias implies parameter underestimation. 35 The correlation between simulated and estimated parameters was also used as an evaluation criterion because that reflects how well the estimated parameters are correlated with the simulated parameters. 
The Pearson correlation between estimated and simulated parameter values is given by: J __ "=1 r = J (3.7) J _ J 2(fij—fi)2 EVE—732 j=l j=l This study also used classification accuracy as additional criteria to evaluate how well the MixIRT model classified examinees into a model generated class. For example, to evaluate how well the MixIRT model identified the examinees likely to be in the guessers class, the classification accuracy can be expressed in percentage as: Classification Accuracy : Number of guesserszdentzfied correctly X 100 (3.8) Actual number of guessers Since the group membership was modeled as a categorical variable, the median was computed for the estimate. The classification accuracy was computed separately for each group (non-guessers and guessers). Because the sample size was different for different groups, weighted classification accuracy was also computed by averaging the classification accuracy values after weighting by sample size. 3. 2.5 Simulation Study using Mixture IRT Model with Ability-based Guessing Although a large part of the simulation study carried out in this dissertation was described in Section 3.2, the assumption in which guessing was defined might not be 36 realistic in all practical testing situations. Thus, the goal of this second simulation study was to use a MixIRT-A model that modeled a different guessing strategy. Specifically, this model accounted for ability-based guessing, as specified as MixIRT-A earlier. Once again, the objective was to show how the simplicity of the 2PL model failed to account for the heterogeneity in testing populations, and to show how Mixture IRT model can account for such heterogeneity. This simulation study, however, simplified the study design by considering only the simulation condition in which the estimation model is varied for a specific test length and sample size. Specifically, the estimation from the 2PL model was compared with the MixIRT-A model for a test of 40 items administered to 1000 examinees. The next chapter provides a summary of simulated item parameters and the results fiom this analysis. 3.3 Empirical Data Analysis This study used the data from a large scale assessment obtained from a statewide mathematics assessment administered to Fall 2006 Grade 8 students in a Midwestern state. The data was obtained from over 100,000 students. Although the original test also comprised of some constructed response items, this study used the item responses from 54 multiple-choice items only. Due to the longer computational time required for running MCMC analysis in WinBUGS, samples of 1000 randomly selected test-takers were used. These moderate sized samples were used to carry out empirical analysis. The primary objective of this analysis was to demonstrate an application of .the MixIRT model (both MixIRT-R and MixIRT-A) using real data. 37 3.3.1 Analysis of Empirical Data The empirical data analysis started with selecting random samples from the statewide assessment mentioned above. First, two samples of size 1000 were selected randomly. Then, WinBUGS was used to estimate model parameters (item and ability) and the group membership of the examinees. In order to demonstrate the application of the MixIRT model in identifying the guessers and showing the impact of guessing on parameter estimation, this study estimated the ability parameters with or without guessers in the sample. The calibration was performed twice. 
First, the model estimated the ability parameters and identified the examinees likely to be from a guesser class. Then, the model was rerun with those guessers removed. It is important to clarify how an examinee was classified as a guesser in this study. As noted earlier, the probability of an examinee likely to be a guesser was estimated fiom the item response pattern of the examinee. This probability was actually based on the average over a large number of MCMC iterations. If the probability was equal to or greater than 0.5, the examinee was classified as a guesser. The changes in ability parameter estimation were evaluated in terms of proficiency level classification and the difference between the distribution of ability parameters. The percentage of proficient students is a conceptually simple score- reporting metric that became widely used for school accountability decisions under the NCLB Act. In this accountability framework, students are generally classified into four or five different levels based on their performance in a statewide assessment. In most states, there are four proficiency levels: Advanced, Proficient, Basic, and Below Basic. This study also used the same convention to represent the proficiency levels. Based on the 38 ability estimates from a MixIRT model, the distribution of examinees into particular proficiency levels was made as realistic as possible by deriving three cut-scores on the 0- scale that provided the same percentage of examinees into each level reported by this assessment. Evaluation of results from this perspective have potential to provide some policy implications of the findings. This study used the two independent sample Kolmogorov-Smimov test (Kolrnogorov, 1933; Smirnov, 193 9) to evaluate whether the difference in distribution of 0 from the two samples was statistically significant. This nonparametric statistical test is often referred to as distribution free method as it does not rely on assumptions that the data are drawn from a given probability distribution. Specifically, the Kolmogorov- Smirnov test evaluates whether the shapes of the distributions of the two groups are comparable. In order to test the statistical significance of the differences between proficiency levels classified by two samples, a chi-square test was performed. Pearson’s chi-square is the most widely used chi-square test, in which the chi-square statistic is calculated by the difference between each observed and theoretical frequency of each possible outcome. Its formula is given in Equation 3.9 . 2 n Oi-Ei2 Z =Z( E ) (3.9) i=1 ' l 2 where Z is the test statistic that asymptotically approaches a chi-square distribution. 0,- is an observed fi'equency; E i is an expected frequency under the null hypothesis; It is the number of possible outcomes for each event. Pearson’s chi-square statistic is used 39 to test whether or not an observed frequency distribution differs from a theoretical distribution. The next chapter provides the results obtained from the simulation study under both guessing models (MixIRT-R and MixIRT-A models described in this chapter). It also outlines the results from the analysis of empirical data from a statewide large scale assessment. 40 CHAPTER 4 RESULTS This chapter presents findings from the simulation and real data analyses. 
The next chapter provides the results obtained from the simulation study under both guessing models (the MixIRT-R and MixIRT-A models described in this chapter), and it also outlines the results from the analysis of the empirical data from a statewide large-scale assessment.

CHAPTER 4 RESULTS

This chapter presents findings from the simulation and real data analyses. Recall that the primary goal of this study was to explore the feasibility of using mixture IRT (MixIRT) models to estimate the differential performance of examinees in different latent classes in a sample, i.e., sample heterogeneity. To accomplish this goal, a series of simulation factors was investigated in fully crossed designs, including two sample sizes (500 and 2,000 simulees), two test lengths (25 and 50 items), and three proportions of guessing (0%, 5%, and 10%). The estimation of model parameters (item and ability) was compared among three models: MixIRT, 2PL, and 3PL.

This chapter is comprised of five sections. The first section summarizes the descriptive statistics of the simulated item parameters. The second section presents the convergence of the estimates in WinBUGS, because using MCMC sampling for statistical inference requires convergence of the MCMC chain to its stationary distribution. The third section presents the results obtained from the simulation study under the random guessing model, described as MixIRT-R in Chapter Three. The fourth section summarizes the results from the simulation study under the ability-based guessing model, described as MixIRT-A in Chapter Three. The final section outlines the results from the analysis of the empirical data.

4.1 Descriptive Statistics of Simulated Item Parameters

Table 4.1 presents descriptive statistics of the simulated item parameters for both test lengths. Given that these item parameters were randomly selected from specific distributions, the two tests differed slightly in difficulty, with the longer test (n = 50) being slightly easier than the shorter test (n = 25). Since this occurred by chance due to the difference in test lengths, it should not affect the interpretation of the results. The discrimination parameters ranged from 0.588 to 1.758 for the shorter test and from 0.687 to 1.749 for the longer test. The difficulty parameters ranged from -1.896 to 2.086 for the shorter test and from -2.108 to 2.152 for the longer test. These item parameters are similar to those found in many practical assessments and previous studies. To generalize the results from a simulation study to practical settings, the simulated parameters should be as realistic as possible; therefore, extreme values of the a- and b-parameters were avoided in the simulation. A complete list of item parameters is given in Appendix C.1 for the test length of 25 and in Appendix C.2 for the test length of 50.

Table 4.1 Descriptive Statistics for the Simulated Item Parameters

Test Length   Item Parameter   Mean     Standard Deviation   Maximum   Minimum
25            a                 1.030   0.313                1.758      0.588
25            b                -0.120   0.903                2.086     -1.896
50            a                 1.076   0.243                1.749      0.687
50            b                -0.266   0.797                2.152     -2.108

4.2 Evaluation of Parameter Estimate Convergence

Using MCMC sampling for statistical inference requires convergence of the MCMC chain to its stationary distribution. Five diagnostic measures, as described in Chapter Three, were used to evaluate convergence: (i) Brooks, Gelman, and Rubin (BGR) diagnostic plots; (ii) Monte Carlo errors; (iii) history plots; (iv) autocorrelation plots; and (v) density plots. It should be noted that no diagnostic can prove convergence, but together these multiple criteria provide an indication that convergence might have occurred. These criteria may help in evaluating MCMC convergence to ensure that the samples are fairly representative of the underlying stationary distribution of the Markov chain.
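For reference, the basic Gelman-Rubin statistic can be computed from parallel chains as in the sketch below; this is the textbook form of R, not the exact corrected version behind the WinBUGS BGR plot, and the simulated chains are placeholders.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic Gelman-Rubin R statistic for one parameter.

    chains : (m, n) array of m parallel chains, n post-burn-in draws each.
    Values near 1 suggest the chains have mixed and converged.
    """
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    B = n * chain_means.var(ddof=1)           # between-chain variance
    var_pooled = (n - 1) / n * W + B / n      # pooled posterior variance estimate
    return np.sqrt(var_pooled / W)

rng = np.random.default_rng(42)
chains = rng.normal(1.8, 0.15, size=(3, 15000))  # e.g., 3 chains for one a-parameter
print(gelman_rubin(chains))                       # approximately 1.00 for well-mixed chains
```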
Figure 4.1 presents BGR diagnostic plots, history plots, autocorrelation plots, and density plots for the discrimination parameter of a randomly selected item estimated using the MixIRT-R model. This item has a true discrimination parameter of 1.757 and an estimated parameter of 1.818. Similar plots for the estimation of the difficulty parameter of a randomly selected item and for the estimation of the ability parameter of a randomly selected examinee are given in Appendix B. These plots were chosen from the dataset in which the guessing percentage was 10% for a sample size of 500 and a test length of 25. This condition was chosen because a small sample size and a short test generally yielded poorer parameter recovery and sometimes produced chains that had difficulty arriving at convergence; evaluating convergence in this condition is therefore likely to capture representative findings from this study.

Figure 4.1 Sample plots for convergence assessment of the discrimination parameter estimate for a[25], chains 1:3 (from top: BGR plot, history plot, autocorrelation plot, density plot)

BGR Plots

The BGR plot shown in Figure 4.1 indicates that the Gelman-Rubin statistic, plotted in red, has converged to 1. The average width of the 80% intervals within each individual chain is plotted in blue, and the width of the 80% interval of the pooled runs is plotted in green. Both the blue and green lines have stabilized, indicating adequate convergence of the chains. It is important to note that the colors shown in the BGR plot may be difficult to distinguish in grayscale prints.

Monte Carlo Error

The Monte Carlo error facilitates the evaluation of convergence by suggesting how long the simulation should be run to ensure adequate convergence. Table 4.2 presents the descriptive statistics of the MixIRT estimates for selected parameters; for convenience of illustration, only the results for the first five items and the first five examinees are shown. As a rule of thumb, the simulation should be run until the Monte Carlo error for each parameter of interest is less than about 5% of the sample standard deviation (Spiegelhalter et al., 2003). From Table 4.2, it is clear that the Monte Carlo error is less than 1/20th of the standard deviation of each estimate, indicating adequate convergence.

History Plots

The history plots in Figure 4.1 suggest that convergence has been achieved, since the three chains essentially overlap each other and cannot be easily differentiated. Furthermore, convergence appears to have been reached well before the burn-in period of 10,000 iterations used in this study.

Table 4.2 Descriptive Statistics of MixIRT Estimates for Selected Parameters

Node    Mean      Standard Deviation   MC Error
a_1      0.9335   0.1462               0.0020
a_2      1.0500   0.1536               0.0023
a_3      0.9025   0.1463               0.0018
a_4      1.1600   0.1711               0.0035
a_5      1.5760   0.2484               0.0052
b_1      0.2472   0.1262               0.0021
b_2      0.1520   0.1164               0.0020
b_3      0.2184   0.1295               0.0021
b_4     -1.1760   0.1808               0.0047
b_5     -1.7730   0.2218               0.0063
θ_1      1.8550   0.5176               0.0041
θ_2      0.5016   0.3885               0.0035
θ_3      0.8399   0.4132               0.0036
θ_4     -0.5894   0.3922               0.0041
θ_5      0.2236   0.3801               0.0032
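The 5% rule of thumb is easy to check mechanically. The sketch below estimates the Monte Carlo error by batch means, one common autocorrelation-adjusted estimator (the batch size is an assumption), and applies the criterion; the simulated draws stand in for an actual chain.

```python
import numpy as np

def mc_error_batch_means(draws, batch_size=50):
    """Monte Carlo error of a chain via batch means (autocorrelation-adjusted)."""
    n_batches = len(draws) // batch_size
    trimmed = draws[: n_batches * batch_size]
    batch_means = trimmed.reshape(n_batches, batch_size).mean(axis=1)
    return batch_means.std(ddof=1) / np.sqrt(n_batches)

def passes_rule_of_thumb(draws):
    """Spiegelhalter et al. (2003): MC error should be < 5% of the posterior SD."""
    return mc_error_batch_means(draws) < 0.05 * draws.std(ddof=1)

rng = np.random.default_rng(7)
draws = rng.normal(0.9335, 0.1462, 15000)  # e.g., a chain for the a_1 row of Table 4.2
print(passes_rule_of_thumb(draws))
```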
This lack of autocorrelation between successive draws indicates satisfactory convergence.

Density Plots

The density plots in Figure 4.1 also suggested convergence of the estimates, because the density resembled the appropriate distribution for a discrimination parameter.

Thus, after evaluating all the diagnostic measures, adequate convergence was concluded. The additional plots given in the appendix also suggested adequate convergence.

4.3 Results of MixIRT-R Model Simulation Analyses

4.3.1 Results from the Parameter Recovery Study

Recall that parameter estimation was evaluated by comparing the estimated model parameters (i.e., discrimination, difficulty, and ability parameters) to the true (simulated) parameters. As mentioned earlier, this study used bias, RMSE, and the correlation between estimated and simulated parameters as evaluation criteria; a sketch of these computations is given below. The results are presented both numerically and graphically. Table 4.3 summarizes the bias and RMSE of the item difficulty parameter (b) estimates, computed as described in Equation 3.6 and Equation 3.5 respectively. Similarly, Table 4.4 summarizes the bias and RMSE of the item discrimination parameter (a) estimates. The bias and RMSE values for the b and a parameters are also plotted separately for test lengths of 25 and 50. Only selected plots are presented here; the remaining plots can be found in Appendix D. Figure 4.2 shows the average bias for recovery of the item difficulty parameters when the test length is 25, whereas Figure 4.3 shows the average bias when the test length is 50. The plots corresponding to the RMSE values for recovery of the item discrimination parameters are shown in Figures 4.4 and 4.5 for test lengths of 25 and 50 respectively. It should be noted that the labels on the x-axis reflect the guessing proportion and sample size; for example, 10P500 indicates a sample size of 500 simulees when the percentage of simulees that were guessing was 10%.

As can be seen in Table 4.3, when the 2PL model was used with a test length of 25 and a sample size of 500, the RMSE of the difficulty estimates increased from 0.129 to 0.152 when the percentage of guessers increased from 0% to 5%, and increased further to 0.192 when the simulated proportion of examinee guessing increased to 10%. Similarly, for the 50-item test with 500 simulees, the RMSE increased from 0.127 to 0.146 for a 5% guessing percentage and to 0.174 for 10% guessing. Both bias and RMSE values were generally lower with the MixIRT-R model than with the 2PL model. However, both bias and RMSE tended to increase for both models when the percentage of guessers increased to either 5% or 10%.

One of the primary objectives in varying study factors like test length and sample size was to evaluate their capacity to recover the stipulated item and person parameters. These results show that smaller bias and RMSE were produced by larger sample sizes. The only exception to this sample size finding occurred with the use of the 2PL model for a 50-item test when there were 2000 simulees. For example, in the condition with a 25-item test and a 5% guessing percentage, the RMSE value dropped from 0.152 to 0.110 with the 2PL model when the sample size was increased from 500 to 2000. Additionally, the RMSE value dropped from 0.141 to 0.080 in the MixIRT-R model estimation when the sample size was increased from 500 to 2000. No clear pattern of results existed for bias when the test length was increased from 25 to 50.
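The three evaluation criteria are simple functions of the true and estimated parameter vectors. The following minimal sketch, with hypothetical variable names rather than the study's actual scripts, shows the computations behind the bias, RMSE, and correlation entries in the tables that follow.

import numpy as np

def recovery_criteria(true_vals, est_vals):
    # Bias (Equation 3.6), RMSE (Equation 3.5), and the Pearson correlation
    # between true (simulated) and estimated parameters for one condition.
    true_vals = np.asarray(true_vals, dtype=float)
    est_vals = np.asarray(est_vals, dtype=float)
    diff = est_vals - true_vals
    bias = diff.mean()
    rmse = np.sqrt((diff ** 2).mean())
    corr = np.corrcoef(true_vals, est_vals)[0, 1]
    return bias, rmse, corr

# For example, recovery_criteria(true_b, est_b) would yield one row of
# Table 4.3 for a single run; the tables report mean values.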
Table 4.3 Bias and RMSE of Item Difficulty Parameter Estimates

                                  0% Guessing        5% Guessing        10% Guessing
IRT      Number    Sample         Proportion         Proportion         Proportion
Model    of Items  Size        Bias     RMSE      Bias     RMSE      Bias     RMSE
2PL      25        500         0.004    0.129     0.058    0.152     0.096    0.192
         25        2000       -0.002    0.069     0.061    0.110     0.103    0.159
         50        500         0.005    0.127     0.056    0.146     0.088    0.174
         50        2000        0.006    0.062     0.063    0.104     0.245    0.252
MixIRT   25        500        -0.012    0.130    -0.036    0.141    -0.058    0.156
         25        2000       -0.008    0.069    -0.030    0.080     0.007    0.156
         50        500        -0.001    0.128    -0.016    0.134    -0.024    0.137
         50        2000        0.004    0.061    -0.014    0.065     0.019    0.070

Table 4.4 Bias and RMSE of Item Discrimination Parameter Estimates

                                  0% Guessing        5% Guessing        10% Guessing
IRT      Number    Sample         Proportion         Proportion         Proportion
Model    of Items  Size        Bias     RMSE      Bias     RMSE      Bias     RMSE
2PL      25        500         0.021    0.144     0.045    0.160     0.080    0.202
         25        2000        0.031    0.079     0.065    0.115     0.097    0.170
         50        500         0.032    0.135     0.054    0.163     0.090    0.202
         50        2000        0.038    0.077     0.074    0.116     0.105    0.168
MixIRT   25        500         0.016    0.144     0.015    0.143     0.020    0.151
         25        2000        0.029    0.079     0.035    0.085     0.095    0.178
         50        500         0.031    0.135     0.028    0.141     0.039    0.147
         50        2000        0.037    0.077     0.042    0.081     0.042    0.086

[Figure 4.2 25-item test average bias results for difficulty parameter estimates]

[Figure 4.3 50-item test average bias results for difficulty parameter estimates]

[Figure 4.4 25-item test average RMSE results for discrimination parameter estimates]

[Figure 4.5 50-item test average RMSE results for discrimination parameter estimates]

Table 4.5 summarizes the average correlations between the true (simulated) and estimated item parameters. These values are presented graphically in Figures 4.6 and 4.7 for the discrimination parameters and in Figures 4.8 and 4.9 for the difficulty parameters. Clearly, larger correlations were associated with larger sample sizes for both the 2PL and MixIRT-R models. The impact of guessing was strong in the recovery of the a parameters for the 2PL model. For example, as shown in Table 4.5, the correlation between true and estimated a parameters dropped from 0.877 to 0.807 with the 2PL model when the proportion of guessers increased from 5% to 10%. The correlations were similar (about 0.9) for both the 2PL and the MixIRT-R model when no guessers were included in the sample, for the condition with a sample size of 500 and a test length of 25.
Table 4.5 Correlations between True and Estimated Item Parameters

                                0% Guessing      5% Guessing      10% Guessing
IRT      Number    Sample       Proportion       Proportion       Proportion
Model    of Items  Size        raa'    rbb'     raa'    rbb'     raa'    rbb'
2PL      25        500         0.909   0.989    0.877   0.985    0.807   0.976
         25        2000        0.972   0.997    0.932   0.993    0.839   0.978
         50        500         0.866   0.985    0.779   0.982    0.665   0.974
         50        2000        0.965   0.997    0.907   0.992    0.770   0.979
MixIRT   25        500         0.909   0.989    0.906   0.988    0.902   0.987
         25        2000        0.971   0.997    0.967   0.996    0.956   0.994
         50        500         0.867   0.985    0.852   0.984    0.842   0.984
         50        2000        0.964   0.997    0.961   0.996    0.956   0.996

Note: raa' is the correlation between true (a) and estimated (a') parameters; rbb' is the correlation between true (b) and estimated (b') parameters.

[Figure 4.6 25-item test average correlations between true and estimated a-parameters]

[Figure 4.7 50-item test average correlations between true and estimated a-parameters]

[Figure 4.8 25-item test average correlations between true and estimated b-parameters]

[Figure 4.9 50-item test average correlations between true and estimated b-parameters]

The results pertaining to the recovery of the item parameters are also displayed using scatterplots in Figures 4.10 to 4.13. These figures show recovery for both the 2PL and MixIRT-R models when the percentage of guessers in the sample was 10%. Figures 4.10 and 4.11 are scatterplots of true and estimated parameters for the condition with a sample size (N) of 500 and a test length (n) of 25 for the 2PL and MixIRT-R models respectively. Similarly, Figures 4.12 and 4.13 present the scatterplots for a sample size of 2000 and a test length of 50 for the 2PL and MixIRT-R models respectively. In a scatterplot, each dot represents the estimated value of a particular parameter for the given value of the true parameter; for perfect recovery, all dots would fall on the identity line. Clearly, consistent with the findings presented earlier, the recovery of the difficulty parameters (b) was better than that of the discrimination parameters (a) in both models. The recovery of both parameters was better in the MixIRT-R model than in the 2PL model.

The results regarding the recovery of the ability (θ) parameters are summarized numerically in Tables 4.6 and 4.7, and graphically in Figures 4.14 and 4.15. Only sample plots are included in this chapter. The results in these tables were similar for both the 2PL and MixIRT-R models, but clearly distinct for the 3PL model.
The findings indicate that guessing did not have a meaningful impact on the correlations between estimated and simulated ability parameters. Specifically, the correlations between estimated and simulated θ parameters were approximately 0.9 or higher.

[Figure 4.10 Recovery of a and b parameters in the 2PL model for a sample size of 500, test length of 25, and 10% proportion of guessers]

[Figure 4.11 Recovery of a and b parameters in the MixIRT model for a sample size of 500, test length of 25, and 10% proportion of guessers]

[Figure 4.12 Recovery of a and b parameters in the 2PL model for a sample size of 2000, test length of 50, and 10% proportion of guessers]

[Figure 4.13 Recovery of a and b parameters in the MixIRT model for a sample size of 2000, test length of 50, and 10% proportion of guessers]

Table 4.6 Bias and RMSE of Ability Parameter Estimates for all Simulation Conditions

                                  0% Guessing        5% Guessing        10% Guessing
IRT      Number    Sample         Proportion         Proportion         Proportion
Model    of Items  Size        Bias     RMSE      Bias     RMSE      Bias     RMSE
2PL      25        500        -0.004    0.402    -0.058    0.404    -0.096    0.414
         25        2000        0.002    0.404    -0.061    0.408    -0.103    0.417
         50        500        -0.005    0.290    -0.056    0.296    -0.088    0.307
         50        2000       -0.006    0.292    -0.063    0.297     0.011    0.304
3PL      25        500        -0.315    0.508    -0.315    0.506    -0.316    0.506
         25        2000       -0.212    0.451    -0.229    0.458    -0.242    0.468
         50        500        -0.294    0.411    -0.297    0.411    -0.295    0.409
         50        2000       -0.214    0.358    -0.222    0.356     0.000    0.307
MixIRT   25        500         0.012    0.404     0.036    0.417     0.058    0.429
         25        2000        0.008    0.404     0.030    0.415    -0.007    0.439
         50        500         0.001    0.291     0.016    0.303     0.024    0.312
         50        2000       -0.004    0.293     0.014    0.306     0.000    0.316

Table 4.7 Correlations between Simulated and Estimated Ability Parameters for all Simulated Conditions

                                0% Guessing   5% Guessing   10% Guessing
IRT      Number    Sample      Proportion    Proportion    Proportion
Model    of Items  Size        rθθ'          rθθ'          rθθ'
2PL      25        500         0.910         0.910         0.909
         25        2000        0.913         0.913         0.912
         50        500         0.955         0.954         0.953
         50        2000        0.955         0.955         0.952
3PL      25        500         0.909         0.909         0.908
         25        2000        0.912         0.912         0.911
         50        500         0.953         0.953         0.953
         50        2000        0.954         0.955         0.953
MixIRT   25        500         0.909         0.898         0.889
         25        2000        0.913         0.906         0.902
         50        500         0.954         0.948         0.943
         50        2000        0.955         0.949         0.945

[Figure 4.14 25-item test average RMSE results for ability parameter estimates]

[Figure 4.15 50-item test average RMSE results for ability parameter estimates]
4.3.2 Classification Accuracy of the MixIRT-R Model

As noted previously, one of the purposes of this study was to investigate the accuracy with which Bayesian estimation of the MixIRT model can correctly identify guessers in a sample. Specifically, the goal was to evaluate the accuracy of classifying examinees into guesser and non-guesser groups. This research purpose can be addressed only through a simulation study, because in a real assessment the true class membership is unknown. Therefore, using estimates from the parameter recovery study described earlier, classification accuracy was ascertained by the extent to which simulees were correctly categorized as guessers or non-guessers.

Table 4.8 provides results of weighted and unweighted classification accuracy for different guessing proportions when using the MixIRT-R model (a sketch of this computation appears below). Overall, the classification accuracy was over 98% for the non-guessing group and ranged from about 76% to 85% for the guessing group. The weighted classification accuracy, computed by weighting the group accuracies by the associated group sizes, was 97.20% when the sample size was 500 and the guessing proportion was 10%. This accuracy increased to 98.06% when the sample size increased from 500 to 2000 simulees. Similarly, when the proportion of guessing was 5%, the weighted classification accuracies were 96.92% and 98.00% for sample sizes of 500 and 2000 respectively. Interestingly, both the classification accuracy and the weighted classification accuracy were 100% when no guessers were present (labeled as 0%).

Table 4.8 Classification Accuracy in MixIRT-R Model

Proportion    Sample   True Class   Averaged Estimated   Classification   Weighted Classification
of Guessing   Size     (Group*)     Guesser %            Accuracy %       Accuracy %**
10%           500      NG            1.22                 98.78            97.20
                       G            83.00                 83.00
              2000     NG            1.27                 98.73            98.06
                       G            85.20                 85.20
5%            500      NG            0.76                 99.24            96.92
                       G            76.00                 76.00
              2000     NG            0.96                 99.04            98.00
                       G            78.20                 78.20
0%            500      NG            0                   100              100
                       G            NA                    NA
              2000     NG            0                   100              100
                       G            NA                    NA

*NG = non-guessers, G = guessers
**Weighted by sample size

4.4 Results from Simulation Analyses using the MixIRT-A Model

As noted previously, the guessing factor may not be easy to model in practice, and hence the only way to illustrate it is through a simulation study. The goal of this second simulation study was to use a MixIRT model to incorporate a different guessing strategy (i.e., the assumption of ability-based guessing), which can be modeled using the MixIRT-A model of Chapter 3. This second simulation study therefore shows how the 2PL model is limited in its parameter estimation accuracy because it cannot account for sample heterogeneity. However, this simulation design was simplified by considering only conditions in which the estimation model was varied for a specific test length and sample size. Specifically, estimation results using the 2PL and MixIRT-A models for 40-item tests administered to 1000 examinees were compared.

Table 4.9 summarizes descriptive statistics of the simulated item parameters used in the second simulation study. The a parameters ranged from 0.633 to 1.897, with a mean of 1.015 and a standard deviation of 0.274. The b parameters ranged from -2.274 to 1.945, with a mean of 0.093 and a standard deviation of 0.855. A complete list of item parameters is given in Appendix C.3.
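For reference, the classification and accuracy computations summarized in Table 4.8 above can be sketched as follows. This is an illustrative reconstruction with hypothetical variable names, not the study's code: an examinee is assigned to the guesser class when the posterior probability of membership, i.e., the proportion of retained MCMC iterations in which the sampled class indicator equals the guesser label, is at least 0.5 (the rule described in Chapter 5), and the group-level accuracies are then weighted by group size.

import numpy as np

def classify_and_score(G_draws, true_class, threshold=0.5):
    # G_draws: (n_iterations, n_examinees) sampled class labels after
    # burn-in, coded as in the WinBUGS model (1 = guesser, 2 = non-guesser).
    # true_class: simulated class labels, same coding.
    p_guess = (np.asarray(G_draws) == 1).mean(axis=0)   # posterior P(guesser)
    est_class = np.where(p_guess >= threshold, 1, 2)
    true_class = np.asarray(true_class)
    acc, weighted, n = {}, 0.0, len(true_class)
    for label, name in [(1, "guesser"), (2, "non-guesser")]:
        members = true_class == label
        if members.any():
            acc[name] = (est_class[members] == label).mean()
            weighted += acc[name] * members.sum() / n
        else:
            acc[name] = None          # e.g., the 0%-guessing conditions
    acc["weighted"] = weighted        # group accuracies weighted by group size
    return acc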
Table 4.9 Descriptive Statistics of Simulated Item Parameters in MixIRT-A Model

Item Parameter   Mean    Standard Deviation   Maximum   Minimum
a                1.015   0.274                1.897      0.633
b                0.093   0.855                1.945     -2.274

The same five diagnostic measures used in the first simulation study were used to evaluate the convergence of the estimates. The recovery of item and ability parameters was evaluated using RMSE and the correlations between estimated and simulated parameters. The results from this simulation study are presented in Tables 4.10 to 4.14 and Figures 4.16 to 4.18.

Table 4.10 RMSE of Discrimination and Difficulty Parameter Estimates using MixIRT-A Model

             No Guessers        Guessers
IRT Model    a        b         a        b
2PL          0.100    0.096     0.187    0.199
MixIRT       0.102    0.097     0.133    0.084

[Figure 4.16 RMSE of discrimination parameter estimates in MixIRT-A model]

The recovery of the discrimination and difficulty parameters indicates that both the 2PL and MixIRT-A models produced comparable results when no guessers were present, i.e., when no heterogeneity existed. However, when some simulees were simulated as guessing on items that were likely to be difficult for their given ability level, the MixIRT-A model outperformed the 2PL model. This was reflected by smaller RMSE and larger correlations between estimated and simulated item parameters. Recovery of the difficulty parameters was better than that of the discrimination parameters, and guessing had a large impact on the discrimination parameter estimates. For example, in the presence of guessing, the correlation between estimated and simulated discrimination parameters dropped from 0.949 to 0.705 in the 2PL model. However, guessing did not have much impact on the recovery of the difficulty parameters: the correlations between true and estimated parameters remained fairly high, with values greater than 0.98 in both models.

[Figure 4.17 RMSE of difficulty parameter estimates in MixIRT-A model]

Table 4.11 Correlation of Discrimination and Difficulty Parameter Estimates using MixIRT-A Model

             No Guessers      Guessers
IRT Model    raa'    rbb'     raa'    rbb'
2PL          0.949   0.993    0.705   0.988
MixIRT       0.948   0.994    0.886   0.995

The recovery of the ability parameters was also evaluated in terms of RMSE and correlations. Table 4.12 and Figure 4.18 show the recovery of the ability parameter estimates. In the case of the 2PL model, the RMSE increased from 0.325 to 0.411 in the presence of guessing. However, the increase in RMSE for the MixIRT-A model was small, from 0.326 to 0.343. When guessing was allowed, the correlation decreased from 0.942 to 0.917 in the 2PL model and from 0.942 to 0.929 in the MixIRT-A model.

Table 4.12 RMSE of Ability Parameter Estimates in MixIRT-A Model

IRT Model    No Guessers Mean RMSE   Guessers Mean RMSE
2PL          0.325                   0.411
MixIRT       0.326                   0.343
[Figure 4.18 RMSE of ability parameter estimates in MixIRT-A model]

Table 4.13 Correlation of Ability Parameter Estimates in MixIRT-A Model

IRT Model    No Guessers   Guessers
2PL          0.942         0.917
MixIRT       0.942         0.929

As mentioned earlier, classification accuracy is an important criterion for evaluating the degree to which the proposed MixIRT-A model accurately classifies examinees into their true (simulated) class or group. Table 4.14 provides weighted and unweighted classification accuracy results for the MixIRT-A model. The results indicate that this model correctly identified 63.12% of the guessers, which indicates a lack of power in identifying guessers. Misclassifications also occurred in the other direction: 7.52% of the non-guesser class were incorrectly classified as guessers. Similarly, even for a sample with no guessers, the model incorrectly classified 3% of the examinees as guessers.

Table 4.14 Classification Accuracy of MixIRT-A Model

             True Class           Average Estimated   Classification   Weighted Classification
             (Group*)      N      Guesser %           Accuracy %       Accuracy %**
Guessing     NG            748     7.52               92.48            85.08
Allowed      G             252    63.12               63.12
No           NG           1000     3.00               97.00            97.00
Guessing     G               0    NA                  NA

*NG = non-guessers, G = guessers
**Weighted by sample size

4.5 Results from Empirical Data Analysis

To address the fourth research question, for which the goal was to investigate the impact of excluding aberrant item responses (from guessers) on proficiency level classification, real data from a statewide mathematics assessment were used. Since guessing behavior can only occur on multiple-choice items, the analyses were conducted on examinee responses to 54 multiple-choice items. Because of the extensive MCMC computational time, only two randomly selected samples of size 1000 were used in a cross-validation. The first sample is referred to as the training sample and the second as the validation sample. First, the results based on the random guessing model (MixIRT-R) are presented in section 4.5.1. The analysis was also carried out using the MixIRT model with ability-based guessing (MixIRT-A) so as to compare the classification of examinees into guessers and non-guessers; those results are presented in section 4.5.2.

4.5.1 Results Based on the Random Guessing Model

Tables 4.15 and 4.16 present sample WinBUGS output, particularly highlighting the estimates of class membership. The node in these tables refers to the variable monitored in WinBUGS; PI[1] and PI[2] refer to the classes corresponding to guessers and non-guessers respectively. Interestingly, the estimates were similar for both samples, showing that about four to five percent of examinees were likely to belong to the guesser class in this particular assessment. The 95% credible intervals around the estimates, and the fact that the MC error is less than 1/20th of the standard deviation, indicate that these estimates are fairly precise (a sketch of these summary computations is given below).

Table 4.15 MixIRT-R Estimates for Training Sample

Node    Mean    Standard Deviation   MC Error*   2.50%   Median   97.50%
PI[1]   0.044   0.008                <0.001      0.029   0.044    0.062
PI[2]   0.956   0.008                <0.001      0.938   0.956    0.971

*MC error: Monte Carlo error

Table 4.16 MixIRT-R Estimates for Validation Sample

Node    Mean    Standard Deviation   MC Error    2.50%   Median   97.50%
PI[1]   0.050   0.009                <0.001      0.034   0.050    0.068
PI[2]   0.950   0.009                <0.001      0.932   0.950    0.966

The estimates of the guessing probability for each examinee also produced very similar results for both samples.
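The quantities reported in Tables 4.15 and 4.16 (posterior mean, standard deviation, MC error, and the 2.5th, 50th, and 97.5th percentiles) can be reproduced from the retained draws of any monitored node. The sketch below is illustrative only; it uses a batch-means estimate of the Monte Carlo error, whereas WinBUGS's own estimator may differ in detail.

import numpy as np

def node_summary(draws, n_batches=50):
    # Posterior summary for one monitored node, cf. Tables 4.15 and 4.16.
    draws = np.asarray(draws, dtype=float)
    sd = draws.std(ddof=1)
    usable = len(draws) - len(draws) % n_batches      # trim to equal batches
    batch_means = draws[:usable].reshape(n_batches, -1).mean(axis=1)
    mc_error = batch_means.std(ddof=1) / np.sqrt(n_batches)
    lo, med, hi = np.percentile(draws, [2.5, 50.0, 97.5])
    return {"mean": draws.mean(), "sd": sd, "mc_error": mc_error,
            "2.5%": lo, "median": med, "97.5%": hi,
            "mc_rule_ok": mc_error < 0.05 * sd}       # the 1/20th-of-SD rule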
Based on the group membership estimate for each examinee, the numbers of guessers identified by the MixIRT-R model were 40 and 41 in the training and validation samples respectively.

Three cut-scores on the θ scale, with values of -1.08, -0.53, and 0.39, were used for categorizing examinees into four proficiency levels. As mentioned in the previous chapter, these cut-scores were chosen in such a way that the proportion of examinees at each proficiency level in the current sample matched that obtained from the actual statewide assessment. In order to evaluate the impact of removing guessers from parameter estimation, the guessers identified by the MixIRT-R model were removed from the sample and the model parameters were estimated again. The results presented below are summarized for the same set of examinees, i.e., only non-guessers, before and after removing the guessers from the calibration.

Table 4.17 Distribution of Proficiency Levels in Original and Modified Training Sample

              Original proficiency level     Modified proficiency level
              Frequency    Percent           Frequency    Percent
Advanced      280          29.17             281          29.27
Proficient    373          38.85             367          38.23
Basic         243          25.31             234          24.38
Below Basic    64           6.67              78           8.13
Total         960         100                960         100

A closer look at these results does not indicate any noticeable difference in the proportions of examinees at each proficiency level before and after removing the guessers identified by the MixIRT-R model. For example, the percentage of examinees that were classified as proficient (proficient or advanced) changed only slightly, from 68.02 to 67.50.

Table 4.18 summarizes the distribution of proficiency levels for the validation sample. The results from this sample were fairly similar to those obtained for the training sample. There was a small difference between the original and modified classifications as proficient (proficient or advanced), as indicated by a change from 68.20% to 68.30%.

Table 4.18 Distribution of Proficiency Levels in Original and Modified Validation Sample

              Original proficiency level     Modified proficiency level
              Frequency    Percent           Frequency    Percent
Advanced      281          29.30             283          29.51
Proficient    373          38.89             372          38.79
Basic         240          25.03             211          22.00
Below Basic    65           6.78              93           9.70
Total         959         100                959         100

Testing the statistical significance of these differences provides useful information for evaluating the meaningfulness of the sample differences. As noted in Chapter 3, one way of comparing the ability parameter frequency distributions was to use the Kolmogorov-Smirnov test. In addition to this test, a chi-square test was performed to test the statistical significance of the differences between the proficiency-level classifications in the two samples. Table 4.19 provides the Kolmogorov-Smirnov test results.

Table 4.19 Test Statistics from Two-sample Kolmogorov-Smirnov Test

                                        Training sample     Validation sample
                                        θ-Distributions     θ-Distributions
Most Extreme   Kolmogorov-Smirnov Z     0.456               0.708
Differences    Asymp. Sig. (2-tailed)   0.985               0.698

The Kolmogorov-Smirnov test results in Table 4.19 suggest that the difference between the two distributions is not statistically significant. Similarly, the results from the chi-square test suggest that the difference between the original and modified proficiency levels is not significant for the training sample (χ² = 1.60, df = 3, p = 0.66). The chi-square test also suggested no significant difference for the validation sample (χ² = 6.83, df = 3, p = 0.08). Both tests are illustrated in the sketch below.
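Both tests can be reproduced with standard statistical software. The sketch below is illustrative and uses SciPy rather than the package used in the study; note that SciPy's ks_2samp reports the D statistic, while the dissertation reports the SPSS-style Kolmogorov-Smirnov Z, so the test statistics differ in scale even though the conclusions are comparable. The ability vectors here are placeholders for the estimates before and after removing guessers.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta_original = rng.normal(0.0, 1.0, 960)   # placeholder: theta estimates, original calibration
theta_modified = rng.normal(0.0, 1.0, 960)   # placeholder: theta estimates, guessers removed

# Two-sample Kolmogorov-Smirnov test on the ability distributions
d_stat, ks_p = stats.ks_2samp(theta_original, theta_modified)

# Chi-square test on the 4 x 2 table of proficiency-level frequencies
# (rows: Advanced, Proficient, Basic, Below Basic; columns: original, modified)
table = [[280, 281], [373, 367], [243, 234], [64, 78]]   # Table 4.17 counts
chi2, p, dof, expected = stats.chi2_contingency(table)   # dof = 3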
In an attempt to map the characteristics of the examinees classified into the guesser class by this analysis, no specific conclusions could be made in terms of gender and ethnicity. The only variable that seemed related to guessing was economic disadvantage (ED), a measure of socio-economic status operationalized by free or reduced-price lunch status. That is, examinees with ED = 1 were more likely to be classified as guessers than examinees with ED = 0.

4.5.2 Results Based on the Ability-based Guessing Model

The results based on the MixIRT-A model are presented for both the training and validation samples. This model identified 7% of examinees as guessers in the training sample and 10% in the validation sample. Interestingly, among the 70 examinees in the training sample and 100 examinees in the validation sample that were classified as guessers by this model, 36 and 37 respectively were also classified as guessers by the previous model (MixIRT-R). This result is presented in Figure 4.19.

[Figure 4.19 Number of examinees identified as guessers by the MixIRT-R and MixIRT-A models in the training and validation samples]

Table 4.20 presents the distribution of proficiency levels in the original and the modified training sample. Interestingly, the proportion of examinees who were proficient (proficient or advanced) decreased from 68.17% (original sample) to 63.87% (modified sample). This is a large change in proficiency level.

Table 4.20 Distribution of Proficiency Levels in Original and Modified Training Sample

              Original proficiency level     Modified proficiency level
              Frequency    Percent           Frequency    Percent
Advanced      262          28.17             240          25.81
Proficient    372          40.00             354          38.06
Basic         233          25.05             240          25.81
Below Basic    63           6.77              96          10.32
Total         930         100                930         100

Table 4.21 presents the corresponding distributions for the validation sample. Here the proportion of examinees who were proficient (proficient or advanced) decreased from 68.11% (original sample) to 61.56% (modified sample), which likewise shows a large change in proficiency level.

Table 4.21 Distribution of Proficiency Levels in Original and Modified Validation Sample

              Original proficiency level     Modified proficiency level
              Frequency    Percent           Frequency    Percent
Advanced      259          28.78             215          23.89
Proficient    354          39.33             339          37.67
Basic         226          25.11             253          28.11
Below Basic    61           6.78              93          10.33
Total         900         100                900         100

Table 4.22 Test Statistics from Two-sample Kolmogorov-Smirnov Test

                                        Training sample     Validation sample
                                        θ-Distributions     θ-Distributions
Most Extreme   Kolmogorov-Smirnov Z     1.322               1.721
Differences    Asymp. Sig. (2-tailed)   0.061               0.005

The statistical tests of the differences between the original and modified distributions suggested mixed findings at the α-level of 0.05. For the training sample, the Kolmogorov-Smirnov test was non-significant (Z = 1.322, p = 0.061). However, for the validation sample, it showed a significant difference (Z = 1.721, p = 0.005). In the chi-square test, the differences in proficiency levels (cell frequencies) between the original and modified samples were statistically significant for both the training and validation samples: χ² = 8.363, df = 3, p = 0.039 for the training sample, and χ² = 12.58, df = 3, p = 0.006 for the validation sample.

The next chapter provides discussion and conclusions for this study. It summarizes the results, interprets those findings, and lists some implications of those results.
CHAPTER 5

DISCUSSION AND CONCLUSIONS

The primary goal of this study was to explore the effectiveness of mixture IRT (MixIRT) models in estimating the differential performance of latent classes in a sample (i.e., sample heterogeneity). The variables (e.g., guesser or non-guesser status) used for classifying examinees are referred to as sources of heterogeneity. When the sources of test-taking heterogeneity are unobservable (e.g., examinees' tendency to guess), so that group membership has to be inferred from the data, unobserved test-taking heterogeneity is said to exist. Therefore, in this study the MixIRT model was used to investigate different examinee test-taking behaviors through a simulation study that varied (a) sample size, (b) test length, and (c) proportion of guessing. These factors were selected because they were thought to be relevant in many testing applications, such as item pool design, IRT-based test bank development, and pre-equating, where precision of parameter estimation is paramount. Furthermore, varying these factors allowed study of the extent to which differing degrees of test-taking heterogeneity influence model parameter estimation, particularly for different test lengths and sample sizes.

Given that MixIRT models are an extension of IRT models, their parameter estimation is complicated by the intractability of the mathematical forms when frequentist techniques are used. Therefore, Bayesian estimation was used instead, because it can handle high-dimensional problems and because the distributions of the parameters can be characterized regardless of the forms of the likelihood and the prior distributions. Through a simulation study, the precision of parameter estimation was evaluated in the MixIRT model for various realistic testing factors.

As mentioned in Chapter Three, this study used two forms of the MixIRT model to incorporate different guessing strategies, viz. MixIRT-R and MixIRT-A. Considering the extensive computational time required for the MCMC procedures of the Bayesian methods that were used, only two levels of test length and sample size were considered. Since the impact on model parameter estimation of unobserved test-taking heterogeneity, represented in this study as a proportion of guessers, was the primary factor of interest, the proportion of guessers per sample was varied. Two percentages of guessing were used, representing 5% and 10% of the total examinees as guessers. Data with no guessers, represented as a 0% guessing proportion, were used as a baseline against which to compare the results.

Another purpose of this study was to compare the parameter estimation accuracy of the MixIRT model to that of two commonly used IRT models: the 2PL and 3PL models. A parameter recovery study was used to conduct this comparison, which was carried out by varying the three estimation models (i.e., 2PL, 3PL, and MixIRT) across all the study factors in a fully crossed design. The precision of parameter recovery was evaluated based on three commonly used evaluation criteria: bias, RMSE, and the Pearson correlation. The interpretation of the results was based on both numeric and graphic representations.

The study's third objective was to evaluate the accuracy of MixIRT Bayesian estimation in identifying guessers when guessers were present in a sample. For this purpose, the MixIRT model estimated the probability that each examinee belonged to the latent class of guessers or of non-guessers.
The model's classification accuracy, which indicates the extent to which simulees are correctly categorized as guessers or non-guessers, was then evaluated. As noted earlier, the probability that an examinee was a guesser was estimated from the examinee's item response pattern, with the probability based on the average over a large number of MCMC iterations. In this study, an examinee was classified as a guesser if that probability was equal to or greater than 0.5.

The study's final purpose was to investigate the impact of excluding aberrant guessing responses on examinee proficiency level classification. In other words, the ability continuum was divided into four levels so that the impact could be studied in terms of proficiency classification. For the proficiency classification, real data were used as a further illustration of the MixIRT model's usefulness. This goal has potential for contributing to a better understanding of issues pertaining to cut-score variation and its policy implications.

It is important to clarify that this study does not suggest that guessing is a bad thing from a student's perspective, especially in circumstances such as when there is no penalty for guessing or when examinees run out of time. However, from the measurement or psychometric point of view, guessing introduces construct-irrelevant variance, which is a major concern in validity studies. Therefore, the objective of this study was to document the impact of guessing on parameter estimation and, thereby, on proficiency level classification. In simple terms, the practical example illustrated in this dissertation was similar to using a correction for guessing to obtain a corrected distribution. The goal was thus to illustrate how the proposed mixture modeling approach has the potential to address this important issue encountered in many large scale assessments.

5.1 Interpretations of the Results

5.1.1 Results from the Parameter Recovery Study

As noted previously, one of this study's major goals was to evaluate the accuracy of the parameter estimates by comparing them to the true (simulated) parameters. The results, presented numerically in Tables 4.3 and 4.4 and graphically in Figures 4.2 to 4.5, show that both the bias and RMSE values for the discrimination and difficulty parameters are generally lower in MixIRT-R model estimation than in the 2PL model. When no guessers were present in the sample, the bias and RMSE values were similar for the MixIRT-R and 2PL models. The low values of these indices show that the parameters are estimated reasonably well when no aberrant responses are present in the data. However, the bias and RMSE values tended to be higher for both models when the proportion of guessers in the sample increased to 5% or 10%. This suggests that the aberrant responses of even 10% of examinees have a substantial impact on the precision of item parameter estimation. Since commonly used IRT models (e.g., 1PL, 2PL, 3PL) are not designed to handle test-taking heterogeneity, alternate modeling approaches are necessary. A mixture model provides such an avenue by allowing different latent classes to have their own sets of model parameters.

One of the primary objectives of varying study factors like test length and sample size was to evaluate their capacity to recover the stipulated item and person parameters. No clear interpretation could be drawn from the available evidence about the impact of test length on bias and RMSE.
Moreover, as other studies have also shown, larger sample sizes resulted in smaller bias and RMSE. This was, however, not the case for the 2PL model when the test length was 50 and the sample size was 2000. These findings play an important role in judging the quality of IRT-based test banks and pre-equating used in large scale assessments.

The average correlations between true (simulated) and estimated item parameters, presented in Table 4.5 and Figures 4.6 to 4.9, show the recovery of the item discrimination and item difficulty parameters. Stronger correlations were associated with larger sample sizes for both the 2PL and MixIRT-R models. This finding is also consistent with the literature on IRT parameter recovery. The impact of guessing was profound in the recovery of the item discrimination parameters for the 2PL model. For example, the correlation between true and estimated a parameters decreased from 0.877 to 0.807 with the 2PL model when the proportion of guessers increased from 5% to 10%. The correlations were similar for both the 2PL and MixIRT-R models when no guessers were included in the sample. This suggests that when unobserved test-taking heterogeneity is absent (i.e., no guessers are present in the sample), it may not be necessary to use complex models like the MixIRT. Nevertheless, this situation may not arise in practice, as guessing is widely known to occur in many large scale assessments. Overall, the difficulty parameters had better recovery than the discrimination parameters. This is consistent with findings from earlier research, which showed that discrimination parameters are usually estimated more poorly than difficulty parameters.

Person parameter recovery results are summarized in Tables 4.6 and 4.7, and sample plots of the parameter recovery results are presented in Figures 4.14 and 4.15. The 2PL and MixIRT-R bias and RMSE values were similar, suggesting that ability parameter recovery was fairly similar for both models. However, among the three models compared in this study, the 3PL model performed the worst, as indicated by large bias and large RMSE. One possible reason for this poor performance could be the type of guessing behavior introduced in this simulation: here, guessing was defined as examinee behavior and estimated as a person parameter using a probabilistic model, whereas studies in the IRT framework that use the 3PL model generally simulate data by associating guessing with the items through the c parameter. In addition, c parameters are often recovered very poorly (Martin et al., 2006; Pelton, 2002), because they are estimated as lower asymptotes based on a small number of examinees. In this study, the model fit index, the Deviance Information Criterion (Spiegelhalter, Best, Carlin, & van der Linde, 2002), showed that the fit of the 2PL model was better than that of the 3PL model even when a 10% guessing proportion was present. Finally, the true (simulated) parameters were generated based on the 2PL model, and the introduction of guessers might have had a noticeable impact that could not be captured by the 3PL model.

Furthermore, based on the correlations between estimated and simulated ability parameters, the results were fairly similar for all three models and the magnitude of the correlations was generally strong. This indicates that the influence of unobserved test-taking heterogeneity was more noticeable in item parameter estimation than in person parameter estimation.
This finding suggests that the proposed mixture modeling approach is most appropriate in applications where precise estimation of item parameters is paramount, such as pre-equating and IRT-based item banking or item pool development.

5.1.2 Results on Classification Accuracy

To investigate the accuracy of Bayesian MixIRT model estimation in correctly classifying guessers, this study evaluated the results using an index called classification accuracy. Table 4.8 shows the classification accuracy when using the MixIRT model. The classification accuracy was over 98% for membership in the non-guesser group and ranged from about 76% to 85% for membership in the guesser group. In terms of weighted classification accuracy, the MixIRT model performed well in classifying examinees into the groups where they belonged, as reflected by accuracies of 96.92% or higher in all simulated conditions. The classification accuracy was 100% for the conditions in which there were no guessers in the sample. This finding suggests that MixIRT models can be used even in the absence of unobserved test-taking heterogeneity. However, due to the complexity of mixture models and the costs associated with estimating a large number of parameters, there is no advantage to using the MixIRT model when no guessers are present.

5.1.3 Results from the Empirical Study

Finally, results from the real data example are presented in Tables 4.15 and 4.16. The MixIRT-R model identified that nearly 5% of examinees were likely to be guessers in this sample. The precision of these estimates is reflected in the 95% credible intervals around the estimates and the fact that the MC error is less than 1/20th of the standard deviation.

The impact of excluding guessers from parameter estimation was also expressed in terms of classification into proficiency levels. As mentioned in Chapter 3, this study used four proficiency levels: Advanced, Proficient, Basic, and Below Basic, which are commonly used in the current test-based accountability system under NCLB. To evaluate the degree to which the proficiency level classifications differed between the two samples, with and without the guessers, a chi-square test was performed to compare whether the proportions of students at each proficiency level differed between the two samples. In addition, comparison of the two ability (θ) estimate distributions using a two-sample Kolmogorov-Smirnov test showed that the differences were not statistically significant for either sample. This suggests that the specified MixIRT model did not find guessing to be a potential cause of observed differences in proficiency classification for this assessment. Furthermore, the influence of a small proportion of guessers in the sample, i.e., less than 5%, did not have much effect on parameter estimation and decisions regarding proficiency level classification. It is possible that, for an assessment in which more examinees engage in guessing, the impact on parameter estimation, as well as on proficiency level classification, could be noticeable.

The impact of guessing was noticeable in the analysis using the MixIRT-A model. This model identified 7% of examinees as guessers in the training sample and 10% in the validation sample. Interestingly, among the 70 examinees in the training sample and 100 examinees in the validation sample that were classified as guessers by this model, 36 and 37 respectively were also classified as guessers by the previous model, i.e., MixIRT-R. This shows that the two models are related.
Naturally, the ability-based guessing model is expected to identify more guessers than the random-guessing model.

As before, the impact of excluding guessers from the sample was evaluated by finding the differences in the proficiency level and ability distributions of examinees before and after removing the guessers from the calibration. Interestingly, the proportion of examinees who were proficient (proficient or advanced) changed from the original to the modified sample for both the training and validation samples. For example, the percent proficient changed from 68.17 to 63.87 in the training sample, showing a noticeable impact of removing guessers from the calibration. Similarly, for the validation sample, Table 4.21 showed that the proportion classified as proficient (proficient or advanced) changed from the original to the modified sample; the change from 68.11% to 61.56% likewise indicates a noticeable impact on proficiency level classification.

In terms of inferential statistics, the tests of the differences between the two distributions in the original and modified samples had mixed findings at the α-level of 0.05. The Kolmogorov-Smirnov Z shows a non-significant result for the training sample but a significant difference for the validation sample. For the chi-square test, the differences in proficiency levels between the original and modified samples were statistically significant for both samples. This may have large ramifications from a policy perspective, because even a few percentage points of change in proficiency levels receive attention from teachers, school administrators, and policy makers. In this context, it may be prudent to locate a cut-score on a region of the continuum where few examinees are situated, so that a change in the cut-score would result in negligible changes in proficiency classification. Changes in students' proficiency estimates are also of interest to a wider audience that includes parents, teachers, school administrators, educational researchers, and policy makers. This places a unique responsibility on educational researchers and psychometricians to answer the question of whether changes in students' proficiency estimates are associated with actual improvement in their ability or with measurement or scaling issues.

5.2 Study Limitations

Like any simulation study, there are questions regarding the generalizability of the findings to real testing situations. Utmost care was taken to ensure that the simulated conditions matched practical settings. However, due to the limited flexibility of modeling in the program used for MCMC sampling in this study, i.e., WinBUGS, it was not possible to take full advantage of Bayesian inference. For example, a user has limited control over the sampling procedures implemented in WinBUGS. It was not possible to compute the DIC value for the mixture model in order to compare model fit statistics; therefore, the model fit evaluation of the mixture model was limited to a likelihood ratio test.

The findings documented in this study were based on 15 replications, which may also raise a question about the generalizability of the findings. However, this decision was made due to the slow performance of WinBUGS. It should be noted that parameter estimation using Gibbs sampling requires a substantial amount of time, especially when estimated using WinBUGS.
In this study, the required computing time for each dataset varied anywhere from 1 hour to 6 hours of computer time (on an Intel Centrino Duo processor at 1.66 GHz with 2 GB RAM), depending upon the sample size and test length. Therefore, this study used a limited number of levels within each of the simulated factors.

The MCMC estimation was performed using WinBUGS. It should therefore be noted that the results obtained may not be due solely to the theoretical differences between the models and study factors, but may also be related to how the software implements the MCMC methods. In other words, use of alternative software or estimation methods could potentially lead to different results. Thus, comparing the results obtained from the Bayesian estimation method implemented in WinBUGS with those from other programs, such as BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 2003) or Mplus (Muthén & Muthén, 1998-2007), may provide an additional perspective on this issue.

In the real-data application, this study classified the examinees into latent classes of guessers and non-guessers based on the probability of each examinee belonging to each class. The recommendation of this study to delete the guessers from the calibration to improve the parameter estimation may not always be realistic in many large scale assessments (e.g., state assessments) because states are required to report scores for each examinee.

5.3 Implications

There are several possible implications of this dissertation. First, this dissertation explored the effectiveness of using MixIRT models to estimate the differential performance of latent classes in a sample, i.e., sample heterogeneity. By studying the impact of various study factors like sample size, test length, and proportion of guessing on parameter recovery, it provides useful information for various testing applications like item pool design, IRT-based test bank development, and pre-equating. This study also provides some direction on identifying aberrant item responses in any large scale assessment and on mapping the profile of guessers. This study focused on an important psychometric issue, namely parameter estimation: if parameters are not well estimated, the proportions in Adequate Yearly Progress (AYP) categories will not be accurately reported. This study has the potential to increase our understanding of the challenges and conditions in modeling complex behavioral phenomena, such as test-taking behaviors. Furthermore, the illustrative use of WinBUGS may encourage researchers and practitioners to utilize Bayesian methods to investigate alternate modeling strategies when their data do not fit a single IRT model. Finally, this dissertation also provided a substantive and policy-relevant illustration of the consequences of ignoring test-taking heterogeneity, especially by showing how simplistic applications of the 2PL and 3PL models could adversely impact not only examinees but also schools, teachers, and policymakers.

5.4 Future Directions

There are several possible future directions for this study. First, a future study could model complex guessing strategies by taking the interaction of ability, item difficulty, and item location into account. Although the 3PL model performed less well than the 2PL in this study, potentially because the guessing simulation favored the 2PL, in practical situations we may not know which model would perform better. Therefore, a comparative study of the 2PL and 3PL using real data may provide some useful findings; the two response functions are contrasted in the sketch below.
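As background for such a comparison, the two response functions differ only in the lower asymptote c, which the 3PL attaches to the item rather than to the person. A small illustrative sketch:

import numpy as np

def p_2pl(theta, a, b):
    # Two-parameter logistic item response function.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def p_3pl(theta, a, b, c):
    # Three-parameter logistic: c is the item's lower asymptote
    # ('pseudo-guessing' parameter).
    return c + (1.0 - c) * p_2pl(theta, a, b)

# For a low-ability examinee the 3PL floor mimics chance-level responding:
# p_2pl(-3.0, 1.0, 0.0) is about 0.05, while p_3pl(-3.0, 1.0, 0.0, 0.2)
# is about 0.24.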
Similarly, the estimates from this study could be compared with those from other estimation programs, such as BILOG-MG (Zimowski et al., 2003), Mplus (Muthén & Muthén, 1998-2007), and mdltm (von Davier, 2005). Although BILOG-MG is not intended to model unobserved test-taking heterogeneity, the recovery of parameters could still be compared, especially for the sample with no guessers (the 0% guessing proportion in this study) and for the sample after removing guessers. Also, comparing the MixIRT model parameter estimation from WinBUGS with that from Mplus or mdltm might provide an additional perspective on the direction that could be taken if the MixIRT model approach is to be realized in practical situations.

In terms of the number of classes in the mixture model, the present study was limited to two latent classes. Future studies could explore such investigations using more than two classes. The complexity of mixture modeling is further increased by simulating mixtures of one- and two-parameter IRT models, or even mixtures of unidimensional and multidimensional IRT models. Such mixture IRT modeling has the potential to provide useful information for applications such as subscore reporting and cognitive diagnostic modeling. Future studies could investigate the practicality of estimating such complex models in the mixture IRT framework.

As indicated in several earlier studies using WinBUGS, the use of lower-level programming languages such as FORTRAN or C++ to implement Gibbs sampling may provide more flexibility in addition to reduced computational time. Some of the limited flexibility of modeling in this study should not be attributed to Bayesian estimation so much as to the estimation tool used in this study, i.e., WinBUGS. Therefore, we may gain some modeling and computational efficiency by moving in the direction of a lower-level programming language.

5.5 Summary of the Findings and Conclusions

In summary, this study shows that the MixIRT model can precisely recover the model parameters. It also found that ignoring unobserved test-taking heterogeneity, such as the presence of guessers in a sample, had a noticeable impact on the precision of recovery of both item and ability parameters. The item parameters were estimated more precisely with the MixIRT model than with the 2PL model. Finally, the mixture IRT model classified examinees into guessers and non-guessers reasonably well.

The impact of guessing on ability estimation was not severe when the percentage of guessers was low, i.e., less than 5%. However, when the proportion of guessers was higher, say 7% to 10%, the impact was noticeable, as indicated by significant changes in the numbers of examinees classified into proficiency levels.

This study investigated an important psychometric issue in large scale assessment, namely modeling unobserved test-taking heterogeneity, using an IRT mixture model. It identified guessers by estimating, from each examinee's response pattern, the probability of membership in the guesser class. This study also documented the impact of excluding the guessers from the calibration to improve the parameter estimation, which has a large bearing on improving the quality of IRT-based item banking and the inferences drawn from tests assembled from the item pool. Since states are required to report scores for each examinee, the recommendation suggested by the results of this study, to delete the guessers from the calibration to improve parameter estimation, may not always be practical in many large scale assessments.
However, this study suggests that the proposed mixture modeling approach can be applied in many large scale assessment contexts, such as IRT-based item banking, to improve the quality of pre-equating and any inferences drawn from the item parameter estimates. This dissertation explored a psychometric perspective of modeling guessing as a person characteristic rather than associating it with an item property, as is commonly done with the three-parameter logistic model. Finally, the use of real data and the illustration of the MixIRT model's usefulness in documenting changes in proficiency level classifications has potential for improving the understanding of issues pertaining to cut-score variation and its policy implications.

APPENDICES

List of Appendices
A. WinBUGS CODE FOR THE MixIRT MODELS
B. FIGURES FOR EVALUATING CONVERGENCE OF THE ESTIMATES
C. ADDITIONAL TABLES
D. ADDITIONAL PLOTS

APPENDIX A

Appendix A.1 WinBUGS code for Model 1 (MixIRT model with random guessing)

# Mixture IRT model
model {
  for (i in 1:N) {                     # N is the number of examinees
    for (j in 1:J) {                   # J is the number of items
      # G[i] = 1 (guesser): chance-level response probability 1/5;
      # G[i] = 2 (non-guesser): 2PL response probability
      P[i,j] <- (2 - G[i])/5 + (G[i] - 1) * 1/(1 + exp(-(a[j]*(theta[i] - b[j]))));
      r[i,j] ~ dbern(P[i,j]);
    }
    G[i] ~ dcat(PI[]);
    pg[i] <- equals(G[i], 1);          # indicator of membership in the guesser class
  }
  # priors
  PI[1:2] ~ ddirch(alpha[]);
  for (j in 1:J) {
    a[j] ~ dnorm(1, 2) I(0,);          # truncated normal
    # a[j] ~ dlnorm(0, 2);             # log-normal alternative
    b[j] ~ dnorm(0, 1);
  }
  for (i in 1:N) {
    theta[i] ~ dnorm(0, taut);         # prior for ability parameter
  }
  taut ~ dgamma(0.01, 0.01);
}

Appendix A.2 WinBUGS code for Model 2 (MixIRT model with ability-based guessing)

model {
  for (i in 1:N) {                     # N is the number of examinees
    for (j in 1:J) {                   # J is the number of items
      logit(P[i,j]) <- (a[j]*(theta[i] - b[j]))
          - (alpha[i] - 1) * step(b[j] - theta[i] - delta[i]) * (a[j]*(theta[i] - b[j]) + 1.4);
      check[i,j] <- (alpha[i] - 1) * step(b[j] - theta[i] - delta[i]);
      # The constant -1.4 is used instead of estimating c, because the
      # inverse logit of -1.4 is approximately 0.20
      # step(b[j] - theta[i] - delta[i]) = 1 if item j is difficult for
      # examinee i relative to the threshold delta[i]
      # alpha[i] estimates the group membership (alpha[i] = 2 for a guesser)
      r[i,j] ~ dbern(P[i,j]);
    }
    sumcheck[i] <- sum(check[i,1:J]);
    check2[i] <- sumcheck[i];
    alpha[i] ~ dcat(PI[]);             # group membership is categorical
  }
  # priors
  for (j in 1:J) {
    b[j] ~ dnorm(0, 1);                # normal prior for difficulty parameter
    a[j] ~ dnorm(1, 2) I(0,);          # truncated normal for discrimination parameter
  }
  for (i in 1:N) {
    theta[i] ~ dnorm(0, taut);         # prior for ability parameter
    delta[i] ~ dnorm(0, 10);           # threshold for differing degrees of guessing
  }
  taut ~ dgamma(0.01, 0.01);           # hyper-parameter for precision
  PI[1] ~ dbeta(1, 1);
  PI[2] <- 1.0 - PI[1];
}

APPENDIX B

FIGURES FOR EVALUATING CONVERGENCE OF THE MCMC METHODS

[Figure B.1 Convergence diagnostic plots for the difficulty parameter of a randomly selected item (true b = 2.08, estimated b = 2.125): B.1a BGR plot, B.1b history plot, B.1c autocorrelation plot, B.1d density plot]

[Figure B.2 Convergence diagnostic plots for the ability parameter of a randomly selected person (true θ = 1.365, estimated θ = 0.862): B.2a BGR plot, B.2b history plot, B.2c autocorrelation plot, B.2d density plot]
APPENDIX B

FIGURES FOR EVALUATING CONVERGENCE OF THE MCMC METHODS

Figure B.1 Convergence diagnostic plots for the difficulty parameter of a randomly selected item [True b = 2.08, Estimated b = 2.125]
Figure B.1a. BGR plot for difficulty parameter estimate b[20], chains 1:3
Figure B.1b. History plot for difficulty parameter estimate b[20], chains 1:3
Figure B.1c. Autocorrelation plot for difficulty parameter estimate b[20], chains 1:3
Figure B.1d. Density plot for difficulty parameter estimate b[20], chains 1:3 (sample: 45000)

Figure B.2 Convergence diagnostic plots for the ability parameter of a randomly selected person [True theta = 1.365, Estimated theta = 0.862]
Figure B.2a. BGR plot for ability parameter estimate theta[13], chains 1:3
Figure B.2b. History plot for ability parameter estimate theta[13], chains 1:3
Figure B.2c. Autocorrelation plot for ability parameter estimate theta[13], chains 1:3
Figure B.2d. Density plot for ability parameter estimate theta[13], chains 1:3 (sample: 45000)

Appendix C. ADDITIONAL TABLES

Table C.1 Simulated Item Parameters (Test Length = 25)

Item     Discrimination   Difficulty
Number        (a)            (b)
  1          0.98            0.17
  2          1.22            0.15
  3          0.97            0.23
  4          1.17           -1.22
  5          1.48           -1.90
  6          0.99           -1.63
  7          1.08            0.98
  8          1.64           -0.29
  9          0.85           -0.51
 10          1.08           -0.18
 11          1.49            0.14
 12          0.78            0.72
 13          0.75           -0.98
 14          0.69            0.10
 15          0.79            0.65
 16          0.99           -0.85
 17          0.59           -1.61
 18          0.63            0.19
 19          0.68            0.12
 20          0.94            2.09
 21          1.35           -0.04
 22          0.99           -0.01
 23          0.87           -0.41
 24          0.98            0.97
 25          1.76            0.09

Table C.2 Simulated Item Parameters (Test Length = 50)

Item     Discrimination   Difficulty     Item     Discrimination   Difficulty
Number        (a)            (b)         Number        (a)            (b)
  1          0.70            0.53          26          1.19            0.19
  2          1.17           -1.11          27          1.15           -0.28
  3          0.72           -1.04          28          1.28            0.71
  4          0.85           -0.48          29          1.37            1.26
  5          0.96           -1.19          30          1.01           -0.89
  6          1.06            0.45          31          0.92           -0.02
  7          0.98           -0.22          32          1.06           -0.89
  8          1.54           -1.03          33          1.28            0.60
  9          1.50           -0.71          34          1.36            0.40
 10          1.17           -0.79          35          0.87           -0.41
 11          1.02           -0.06          36          1.26           -0.45
 12          0.82           -1.93          37          0.96           -0.60
 13          0.89           -0.56          38          0.92            0.74
 14          0.70           -1.11          39          1.14           -0.20
 15          0.91            0.26          40          1.26           -0.12
 16          0.89            0.48          41          0.87           -1.01
 17          1.55            0.12          42          0.69            0.25
 18          1.21           -0.08          43          0.99            2.15
 19          1.02           -2.11          44          1.15            0.23
 20          0.93            0.02          45          1.02           -1.14
 21          1.02           -0.70          46          1.48            0.20
 22          1.21           -0.21          47          0.92           -1.15
 23          1.75           -0.26          48          1.22            0.12
 24          1.09           -0.93          49          0.82           -1.35
 25          1.24            0.46          50          0.73            0.58

Table C.3 Simulated Item Parameters (Test Length = 40)

Item     Discrimination   Difficulty     Item     Discrimination   Difficulty
Number        (a)            (b)         Number        (a)            (b)
  1          0.853           0.575         21          1.298           0.694
  2          1.176          -0.226         22          0.785           1.945
  3          1.227          -1.140         23          0.798           0.088
  4          1.175           0.369         24          0.800           0.021
  5          0.858           0.873         25          0.911           0.776
  6          0.673          -0.835         26          0.633          -0.004
  7          0.833          -2.274         27          1.281           1.128
  8          0.844           0.797         28          0.832           1.406
  9          1.026           0.061         29          1.334           0.708
 10          1.231          -1.493         30          1.807           0.913
 11          1.897          -0.491         31          1.093           0.323
 12          0.999           0.935         32          0.889           0.153
 13          0.974          -0.460         33          1.189          -0.555
 14          0.926          -0.212         34          0.710           0.009
 15          0.769           0.004         35          1.019          -0.884
 16          1.135          -0.032         36          1.004           1.526
 17          0.961          -0.404         37          0.951          -0.132
 18          1.176          -0.926         38          0.814          -0.586
 19          1.300           0.568         39          0.743          -0.793
 20          0.687           0.583         40          0.985           0.715
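For completeness, response data of the kind calibrated in this study can be generated from item parameters such as those in Tables C.1 to C.3. The sketch below is illustrative rather than the study's actual generating code: it assumes five-option items, so a guesser answers every item correctly with probability 0.2 (consistent with the 1/5 term in the Appendix A model), draws abilities from a standard normal for illustration, and takes the proportion of guessers as an input. Function and variable names are illustrative.

import numpy as np

rng = np.random.default_rng(2)

def simulate_mixture_responses(a, b, n_examinees, prop_guessers, c=0.2):
    """Simulate 0/1 responses from a guesser / 2PL mixture.

    a, b          : item discriminations and difficulties
                    (e.g., the Table C.1 values)
    prop_guessers : proportion of examinees answering at chance
    """
    a, b = np.asarray(a), np.asarray(b)
    theta = rng.normal(size=n_examinees)               # abilities ~ N(0,1)
    guesser = rng.random(n_examinees) < prop_guessers  # latent class draw
    p = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))  # 2PL
    p[guesser] = c                                     # chance responding
    return (rng.random(p.shape) < p).astype(int), guesser

# Example with the first five items of Table C.1 and 10% guessers:
a = [0.98, 1.22, 0.97, 1.17, 1.48]
b = [0.17, 0.15, 0.23, -1.22, -1.90]
y, guesser = simulate_mixture_responses(a, b, n_examinees=1000,
                                        prop_guessers=0.10)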
Appendix D. ADDITIONAL PLOTS

Figure D.1 Convergence diagnostic plots for the discrimination parameter of a randomly selected item, a[3], chains 1:3 [True a = 0.966, Estimated a = 1.008]. From top: history plot, density plot (sample: 45000), autocorrelation plot (left), and BGR plot (right).

Figure D.2 Convergence diagnostic plots for the difficulty parameter estimate of a randomly selected item, b[3], chains 1:3 [True b = 0.2278, Estimated b = 0.4627]. From top: history plot, density plot (sample: 45000), autocorrelation plot (left), and BGR plot (right).

Figure D.3 Convergence diagnostic plots for the ability parameter estimate of a randomly selected examinee, theta[87], chains 1:3 [True theta = 2.545, Estimated theta = 2.137]. From top: history plot, density plot (sample: 45000), autocorrelation plot (left), and BGR plot (right).

REFERENCES

Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17, 251-269.

Albert, J. H., & Ghosh, M. (2000). Item response modeling. In D. Dey, S. K. Ghosh, & B. K. Mallick (Eds.), Generalized linear models: A Bayesian perspective (pp. 173-193). New York: Addison-Wesley.

Ansari, A., Jedidi, K., & Dube, L. (2002). Heterogeneous factor analysis models: A Bayesian approach. Psychometrika.

Asparouhov, T., & Muthén, B. (2008). Multilevel mixture models. In G. R. Hancock & K. M. Samuelsen (Eds.), Advances in latent variable mixture models. Charlotte, NC: Information Age Publishing.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques. CRC Press.

Bazan, J. L., Branco, M. D., & Bolfarine, H. (2006). A skew item response model. Bayesian Analysis, 1(4), 861-892.

Béguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541-561.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 433-448). New York: Springer-Verlag.

Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27(6), 395-414.

Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.

Budescu, D., & Bar-Hillel, M. (1993). To guess or not to guess: A decision-theoretic view of formula scoring. Journal of Educational Measurement, 30(4), 277-291.

Cao, J., & Stokes, S. (2008). Bayesian IRT guessing models for partial guessing behaviors. Psychometrika, 73(2), 209-230.

Casella, G., & George, E. (1992). Explaining the Gibbs sampler. The American Statistician, 46(3), 167-174.

Cizek, G. J. (2001). An overview of issues concerning cheating on large-scale tests. Paper presented at the annual meeting of the National Council on Measurement in Education, April 2001, Seattle, WA.

Cohen, A. S., & Bolt, D. M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42(2), 133-148.

Congdon, P. (2005). Markov chain Monte Carlo and Bayesian statistics. In B. Everitt & D. Howell (Eds.), Encyclopedia of statistics in behavioral science (Vol. 3, pp. 1134-1143). Wiley.

Cowles, M. (2004). Review of WinBUGS 1.4. American Statistician, 58(4).

Cowles, M., & Carlin, B. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review.
Journal of the American Statistical Association, 91(434), 883-904.

De Ayala, R. J., Kim, S.-H., Stapleton, L., & Dayton, C. M. (2002). Differential item functioning: A mixture distribution conceptualization. International Journal of Testing, 2(3-4), 243-276.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-38.

Draney, K., Wilson, M., Gluck, J., & Spiel, C. (2008). Mixture models in a developmental context. In G. R. Hancock & K. Samuelsen (Eds.), Advances in latent variable mixture models. Charlotte, NC: Information Age Publishing.

Fox, J.-P., & Glas, C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66(2), 271-288.

Frary, R. B. (1988). Formula scoring of multiple-choice tests (correction for guessing). Retrieved October 21, 2008, from http://www.ncme.org/pubs/items/ITEMS_Mod_4.pdf

Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. Springer.

Gelfand, A., Hills, S., Racine-Poon, A., & Smith, A. (1990). Illustration of Bayesian inference in normal data models using Gibbs sampling. Journal of the American Statistical Association, 85, 972-985.

Gelfand, A., & Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457-511.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Gilks, W., Richardson, S., & Spiegelhalter, D. (Eds.). (1996). Markov chain Monte Carlo in practice. Chapman & Hall/CRC.

Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. Chapman & Hall/CRC.

Goldman, S. H., & Raju, N. S. (1986). Recovery of one- and two-parameter logistic item parameters: An empirical study. Educational and Psychological Measurement, 46(1), 11-21.

Gonzales, P., Calsyn, C., Jocelyn, L., Mak, K., Kastberg, D., Arafeh, S., et al. (2000). Pursuing excellence: Comparisons of international eighth-grade mathematics and science achievement from a U.S. perspective, 1995 and 1999 (No. NCES 2001-028). Washington, DC: National Center for Education Statistics.

Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61, 215-231.

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 147-200). New York: Macmillan.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer.

Hamilton, L. S., Stecher, B. M., & Klein, S. P. (Eds.). (2002). Making sense of test-based accountability in education. Santa Monica, CA: RAND.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.

Hulin, C., Lissak, R., & Drasgow, F. (1982). Recovery of two- and three-parameter logistic item characteristic curves: A Monte Carlo study. Applied Psychological Measurement, 6(3), 249-260.

Johnson, E. G. (1992). The design of the National Assessment of Educational Progress. Journal of Educational Measurement, 29, 95-110.

Johnson, V. E. (1997).
On Bayesian analysis of multirater ordinal data: An application to automated essay grading. Journal of the American Statistical Association, 91, 42-51.

Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling. New York: Springer-Verlag.

Kiefer, J., & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 27(4), 887-906.

Kim, J. S., & Bolt, D. M. (2007). Estimating item response theory models using Markov chain Monte Carlo methods. Educational Measurement: Issues and Practice, 26(4), 38-51.

Kim, S. H., & Cohen, A. S. (1998). An evaluation of a Markov chain Monte Carlo method for the two-parameter logistic models. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Kolmogorov, A. N. (1933). On the empirical determination of a distribution function. Giornale dell'Istituto Italiano degli Attuari, 4, 83-91.

Lambert, P. C., Sutton, A. J., Burton, P. R., Abrams, K. R., & Jones, D. R. (2005). How vague is vague? A simulation study of the impact of the use of vague prior distributions in MCMC using WinBUGS. Statistics in Medicine, 24, 2401-2428.

Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin.

Lemke, M., & Gonzales, P. (2006). U.S. student and adult performance on international assessments of educational achievement: Findings from The Condition of Education 2006 (No. NCES 2006-073). Washington, DC: U.S. Department of Education.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS -- a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325-337.

Maier, K. S. (2002). Modeling incomplete scaled questionnaire data with a partial credit hierarchical measurement model. Journal of Educational and Behavioral Statistics, 27, 271-289.

San Martín, E., del Pino, G., & De Boeck, P. (2006). IRT models for ability-based guessing. Applied Psychological Measurement, 30(3), 183-203.

McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: John Wiley & Sons.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087-1091.

Mislevy, R. J. (1984). Estimating latent distributions. Psychometrika, 49(3), 359-381.

Mislevy, R. J., & Verhelst, N. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika, 55(2), 195-215.

Muthén, B. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54(4), 557-585.

Muthén, B., & Lehman, J. (1985). Multiple group IRT modeling: Applications to item bias analysis. Journal of Educational Statistics, 10(2), 133-142.

Muthén, L., & Muthén, B. (1998-2007). Mplus user's guide. Los Angeles, CA: Muthén & Muthén.

Patz, R. J., & Junker, B. W. (1999a). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24(4), 342-366.

Patz, R. J., & Junker, B. W. (1999b). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24(2), 146-178.

Pearson, K. (1894).
Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, A, 185, 71-110.

Pelton, T. W. (2002). The accuracy of unidimensional measurement models in the presence of deviations from the underlying assumptions. Unpublished doctoral dissertation, Brigham Young University.

Reckase, M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York: Springer-Verlag.

Redner, R. A., & Walker, H. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195-239.

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.

Sahu, S. K. (2002). Bayesian estimation and model choice in item response models. Journal of Statistical Computation and Simulation, 72, 217-232.

Samuelsen, K. (2005). Examining differential item functioning from a latent class perspective. Unpublished doctoral dissertation, University of Maryland, College Park.

Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34(3), 213-232.

Smirnov, N. V. (1939). On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bulletin of Moscow University, 2, 3-16.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64(4), 583-616.

Spiegelhalter, D. J., Thomas, A., Best, N. G., & Lunn, D. (2003). WinBUGS 1.4 user manual. Cambridge: MRC Biostatistics Unit.

Tsutakawa, R. K., & Soltys, M. J. (1988). Approximation for Bayesian ability estimation. Journal of Educational Statistics, 13, 117-130.

von Davier, M. (2005). mdltm: Software for the general diagnostic model and for estimating mixtures of multidimensional discrete latent trait models [Computer software]. Princeton, NJ: Educational Testing Service.

von Davier, M., & Carstensen, C. H. (2007). Multivariate and mixture distribution Rasch models: Extensions and applications. New York, NY: Springer Science.

von Davier, M., & Rost, J. (2007). Mixture distribution item response models. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26). Elsevier.

Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3-PL useful in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245-270). Boston, MA: Kluwer-Nijhoff.

Wise, S. L. (2006). An investigation of the differential effort received by items on a low-stakes computer-based test. Applied Measurement in Education, 19(2), 95-114.

Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10(1), 1-17.

Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19-38.

Yamamoto, K. (1987). A model that combines IRT and latent class models. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.

Yamamoto, K. (1989).
Hybrid model of IRT and latent class models (No. RR-89-41). Princeton, NJ: Educational Testing Service.

Yamamoto, K. (1995). Estimating the effects of test length and test time on parameter estimation using the HYBRID model (No. RR-95-16). Princeton, NJ: Educational Testing Service.

Yang, X. (2007). Methods of identifying individual guessers from item response data. Educational and Psychological Measurement, 67(5), 745-764.

Zimowski, M., Muraki, E., Mislevy, R., & Bock, R. (2003). BILOG-MG 3: Item analysis and test scoring with binary logistic models [Computer software]. Chicago, IL: Scientific Software International.