This is to certify that the dissertation entitled

ESTIMATING THE PARAMETERS FOR MULTIDIMENSIONAL ITEM RESPONSE THEORY MODELS BY MCMC METHODS

presented by Yanlin Jiang has been accepted towards fulfillment of the requirements for the Ph.D. degree in Education.

ESTIMATING PARAMETERS FOR MULTIDIMENSIONAL ITEM RESPONSE THEORY MODELS BY MCMC METHODS

By

Yanlin Jiang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Counselling, Educational Psychology and Special Education

2005

ABSTRACT

Efforts to apply Markov Chain Monte Carlo (MCMC) methods to three-parameter linear logistic multidimensional IRT models are addressed using the Metropolis-Hastings within Gibbs approach. Bayesian modal estimators of both item and proficiency parameters are obtained in a simultaneous process rather than in separate parameter estimation procedures.
The approach is shown to be effective when individual item discrimination and proficiency dimensional parameters are blocked and treated without reference to other item and proficiency parameters. Both simple and complex structures of item dimensions are included. In addition, various proficiency dimensional structures are considered for the three- and five-dimensional cases. The effects of four potential factors on model parameter estimation are investigated. Simulation studies are conducted across different designs for one-, three-, and five-dimensional cases. Results show that the parameter estimators based on MCMC are accurate in terms of correlation and root mean square error. Numeric examples for the estimates of the standard errors demonstrate that the estimation is statistically stable and accurate.

ACKNOWLEDGEMENTS

I am grateful to my dissertation committee: Dr. Mark Reckase (chair), Dr. Kimberly Maier, Dr. Richard Houang, and Dr. James Stapleton, for their constructive comments and valuable suggestions. Without their input, this dissertation would not have been completed.

I would like to express my sincere gratitude to my academic advisor, Dr. Mark Reckase, for his constant support, direction, and encouragement over the past five years.

I would also like to thank the Center for the Study of Curriculum and my supervisor, Dr. Richard Houang, whose final assistance supported the completion of the dissertation research and enabled the completion of my doctoral study. Working with him has been a tremendously rewarding experience for me.

Special thanks go to my husband, Deping Li, for his support, patience, and understanding in my life.

Contents

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Item Response Theory Models
1.1.1 The Uni-dimensional Item Response Theory Models
1.1.2 The Multi-dimensional IRT Models
1.2 Estimation Methods for IRT Models
1.2.1 Commonly Used Estimation Methods and Their Limitations
1.2.2 Applications of MCMC methods to Estimation of IRT-based Models
1.3 The Importance of the Study
2 MCMC Methods for Parameter Estimation for Logistic MIRT Model
2.1 Overview of Markov Chain Monte Carlo Methods
2.2 Likelihood Functions for the Linear Logistic MIRT Models
2.3 M-H within Gibbs for Parameter Estimation for MIRT Models
2.3.1 Complete Conditional Functions for Model Parameters
2.3.2 Modelling the Covariance Structure for Multidimensional Abilities
2.3.3 Random Walk Metropolis Algorithm within Gibbs
2.4 Unbiased and Consistent Estimators of Parameters
3 Simulation Studies and Results
3.1 Prior Distributions for Model Parameters
3.2 Diagnosing the Convergence of Markov Chains
3.3 Initial Values and Iterations
3.4 Estimating the Unidimensional 3PL Model
3.4.1 Assessing Convergence
3.5 Estimating the 3-Dimensional MIRT Model
3.5.1 Generating Proficiency Parameters
3.5.2 The Number of Proficiency Dimension and Sample Size
3.5.3 Proficiency Structure
3.5.4 Generating Item Parameters
3.5.5 The Estimation Accuracy and Stability for the 3-Dimensional MIRT Model
3.6 Estimating the 5-dimensional Model
3.7 Proficiency Structure Estimation
3.8 Computing Time
4 Concluding Remarks and Future Research Directions
BIBLIOGRAPHY

List of Tables

3.1 True Item Parameters for 30-Item Test (Dim = 1)
3.2 True Item Parameters for 45-Item Test (Dim = 1)
3.3 Estimates from three chains for 30-Item Test (Dim = 1, N = 2000)
3.4 Item Parameter Estimates for 30-Item Test (Dim = 1)
3.5 Item Parameter Estimates for 30-Item Test In BILOG-MG3 (Dim = 1)
3.6 Item Parameter Estimates for 45-Item Test (Dim = 1)
3.7 Item Parameter Estimates for 45-Item Test (Dim = 1), cont.
3.8 RMSE for Estimating Uni-dimensional Models (Dim = 1)
3.9 Correlations Between True Proficiency and Estimates (Dim = 1)
3.10 True Item Parameters for 30-Item Test (Dim = 3)
3.11 True Item Parameters for 45-Item Test (Dim = 3)
3.12 RMSE for Multi-dimensional Test (Dim = 3, ρ = .2)
3.13 RMSE for Multi-dimensional Test (Dim = 3, ρ = general)
3.14 Correlations Between True Proficiency and Estimates (Dim = 3, ρ = .2)
3.15 Correlations Between True Proficiency and Estimates (Dim = 3, ρ = general)
3.16 True Item Parameters for 30-Item Test (Dim = 5)
3.17 True Item Parameters for 45-Item Test (Dim = 5)
3.18 True Item Parameters for 45-Item Test (Dim = 5), cont.
3.19 RMSE for Multi-dimensional Test (Dim = 5, ρ = .2)
3.20 RMSE for Multi-dimensional Test (Dim = 5, ρ = general)
3.21 Correlations Between True Proficiency and Estimates (Dim = 5, ρ = .2)
3.22 Correlations Between True Proficiency and Estimates (Dim = 5, ρ = general)
3.23 Estimates of Covariance Matrix, Dim = 3, ρ = .2
3.24 Estimates of Covariance Matrix, Dim = 3, ρ = general
3.25 Estimates of Covariance Matrix, Dim = 5, ρ = general
3.26 Estimates of Covariance Matrix, Dim = 5, ρ = .2
3.27 Computing time for 1-, 3-, and 5-Dimension data
4.1 TESTFACT Item Parameters Estimates for 30-Item Test (Dim = 3)
4.2 TESTFACT Item Parameters Estimates for 30-Item Test (Dim = 5)

List of Figures

3.1 Sample ACF for series of a6, Dim = 1
3.2 Sample draw at first 3000 iterations for series of a, b and c
3.3 True Proficiency Versus Estimates (Dim = 1)
3.4 True a Parameter Versus Estimates (Dim = 1)
3.5 True b Parameter Versus Estimates (Dim = 1)
3.6 True c Parameter Versus Estimates (Dim = 1)
3.7 True Proficiency Versus Estimates (Dim = 3, ρ = general, n = 30, N = 5000)
3.8 True Proficiency Versus Estimates (Dim = 3, ρ = general, n = 45, N = 2000)
3.9 True Proficiency Versus Estimates (Dim = 3, ρ = general, n = 45, N = 2000)
3.10 True a1 Parameter Versus Estimates (Dim = 3, ρ = .2)
3.11 True a2 Parameter Versus Estimates (Dim = 3, ρ = .2)
3.12 True a3 Parameter Versus Estimates (Dim = 3, ρ = .2)
3.13 True d Parameter Versus Estimates (Dim = 3, ρ = .2)

Chapter 1

Introduction

1.1 Item Response Theory Models

Item response theory (IRT) has become increasingly important in psychological and educational testing. This theoretical framework not only provides useful analytical tools (e.g., differential item functioning analysis and test equating), but also provides an effective test design tool. The benefits of the IRT framework cannot be realized unless the model parameters are accurately estimated, the model assumptions are satisfied, and the model fits the observed data adequately.
In this chapter, both uni-dimensional and multi-dimensional logistic IRT models are introduced, then some of the existing estimation methods are reviewed, and finally the importance of a new method for estimating multidimensional IRT models is addressed.

1.1.1 The Uni-dimensional Item Response Theory Models

Classical test theory (CTT) has been the mainstream of educational and psychological testing research and practice for many decades. Gulliksen's "Theory of Mental Tests" (1950) is one of the earliest books on the subject and a milestone of measurement theory. However, CTT suffers from a number of limitations, as is often noted in the literature (e.g., Embretson & Reise, 2000; Hambleton & Swaminathan, 1985). For example, item statistics (e.g., item difficulty) are sample dependent. Reliability and the standard error of measurement, which are fundamental concepts in true score theory, do not take the proficiency differences among examinees into account; hence, only a single reliability estimate is obtained for one test. Furthermore, CTT cannot probabilistically predict examinees' responses to items unless the items have previously been administered to similar individuals. In many testing contexts, such as adaptive testing, it is important to predict the probability of an examinee's response in order to select the next item for that examinee. As Lord states, "we need to describe the items by item parameters and the examinees by examinee parameters in such a way that we can predict probabilistically the response of any examinees to any items, even if similar examinees have never taken similar items before" (p.11, Lord, 1980). Unfortunately, CTT fails to satisfy this property.

Item response theory is a model-based measurement framework. IRT provides a more complete rationale for model-based measurement than CTT and overcomes a number of limitations of CTT (for details, please refer to Embretson & Reise, 2000).
The development of IRT is largely due to the work of Lord (1952, 1953), Birnbaum (1957, 1958a, 1958b), Lord and Novick (1968), and Rasch (1960). Various IRT-based models have been developed in the literature, for example, the normal ogive models (Lord, 1952) and the logistic models (Rasch, 1960; Birnbaum, 1957, 1958a, 1958b, 1968; Wright & Stone, 1979) for binary data, and the graded response model (Samejima, 1969), the partial credit model (Masters, 1982), and the nominal response model (Bock, 1972) for polytomous data. There are other uni-dimensional IRT models (e.g., the continuous response model, Samejima, 1972), but they are not discussed here since this study focuses on applying a new method to the logistic IRT models. One common feature of these models is that they explicitly predict the probability of a correct response to an item given person and item parameters. More comparisons between CTT and IRT can be found in Embretson and Reise (2000).

In the family of IRT models, the three-parameter logistic model (3PL model) is one of the most widely used. It was proposed by Birnbaum in 1968. For a dichotomous item, the item response function (IRF, also called the item characteristic curve, ICC) is the probability of a correct response to the item. This probability can be represented by the function (Lord, 1980)

\[
P_i(\theta_j) \equiv p(U_{ij} = 1 \mid a_i, b_i, c_i, \theta_j) = c_i + (1 - c_i)\,\frac{\exp[1.7 a_i(\theta_j - b_i)]}{1 + \exp[1.7 a_i(\theta_j - b_i)]}, \tag{1.1}
\]

where
$P_i(\theta_j)$ is the probability of a correct answer to item $i$ given the $j$th examinee's proficiency level $\theta_j$;
$U_{ij}$ is the item response, either 0 (incorrect) or 1 (correct), for examinee $j$ on item $i$;
$a_i$ is the discriminating power of the $i$th item, usually a positive number;
$b_i$ is the difficulty of the $i$th item;
$c_i$ is the lower asymptote of the $i$th item, also called the pseudo-guessing parameter; and
1.7 is a scale constant.

If there is no lower asymptote parameter in the above model, i.e., $c_i = 0$, the 3PL model reduces to the 2PL model.
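As a rough illustration of Equation (1.1), the following Python sketch evaluates the 3PL item response function for a single item. The function name `irf_3pl` and the item values (a = 1.2, b = 0.5, c = 0.2) are hypothetical and chosen only for demonstration; they do not come from the study.

```python
import math

def irf_3pl(theta, a, b, c, scale=1.7):
    """Probability of a correct response under the 3PL model, Equation (1.1).

    theta: examinee proficiency; a: discrimination; b: difficulty;
    c: lower asymptote; scale: the scale constant (1.7 in the text).
    """
    z = scale * a * (theta - b)
    logistic = 1.0 / (1.0 + math.exp(-z))  # same as exp(z) / (1 + exp(z))
    return c + (1.0 - c) * logistic

# Hypothetical item parameters, for illustration only.
for theta in (-3.0, 0.5, 3.0):
    print(theta, round(irf_3pl(theta, a=1.2, b=0.5, c=0.2), 3))
```

At theta equal to b, the probability is halfway between the lower asymptote and 1, i.e., c + (1 - c)/2; setting c to 0 gives the 2PL special case.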
Furthermore, if the discriminating power parameter $a_i$ is treated as a constant, the model becomes the 1PL model or Rasch model, because only one item parameter (i.e., item difficulty) remains in the model.

Note that the 3PL, the 2PL, and the 1PL models contain only one proficiency parameter for each examinee. This is an important assumption of these models, which are therefore labelled uni-dimensional IRT models. In addition to unidimensionality, another important assumption for IRT models is local independence. For a single examinee, the responses to the test items are related to each other only through this examinee's proficiency parameter(s). Hence, local independence can be understood as conditional independence: it assumes that an examinee's responses to items are independent of each other after controlling for the examinee's proficiency parameter(s). The mathematical expression of local independence is given by

\[
p(u_1, u_2, \cdots, u_n \mid \theta) = \prod_{i=1}^{n} p_i(u_i \mid \theta), \tag{1.2}
\]

where $u_i$ is the item response on the $i$th item for a single examinee and $i = 1, 2, \cdots, n$. Equation (1.2) implies that, given a fixed proficiency parameter, the joint distribution $p$ of responses to $n$ items is the product of the marginal distributions $p_i$ for all items.

1.1.2 The Multi-dimensional IRT Models

In multi-dimensional item response theory (MIRT), items require multiple abilities to obtain a correct response. Under this circumstance, the uni-dimensional IRT models are not adequate for such response data. A family of IRT models that contain multiple proficiency parameters is needed to reflect the proficiency level on different dimensions for each examinee. MIRT is an extension of uni-dimensional IRT. Like uni-dimensional IRT, MIRT models an examinee's behavior (i.e., item response) given person and item characteristics.
The essential difference between MIRT and uni-dimensional IRT is that MIRT uses multiple proficiency parameters to model person abilities and a vector of item parameters to characterize items.

To describe MIRT-based models, it is necessary to introduce the concept of the complete latent space. Lord and Novick (1968) defined it as the collection of all those latent variables $\theta_k$ that discriminate among groups of examinees, for $k = 1, 2, \cdots, p$, where $p$ is the number of proficiency dimensions. Denote the complete latent space by the vector

\[
\boldsymbol{\theta} \equiv (\theta_1, \theta_2, \cdots, \theta_p)'. \tag{1.3}
\]

These variables can be thought of as "psychological dimensions necessary for the psychological description of individuals" (p.359). For the population of examinees, every single examinee possesses a value for each of the latent variables in the space. For uni-dimensional IRT models, the complete latent space has only one variable. For multi-dimensional IRT models, it is assumed that two or more latent variables are needed to characterize an examinee's proficiency.

There are several MIRT-based models. Early MIRT models for binary data came from the work of McDonald (1967) and Lord and Novick (1968). Other models have also appeared in the literature, for example, the multidimensional Rasch model (Stegelmann, 1983), the multidimensional two-parameter normal ogive IRT model (Bock, Gibbons, & Muraki, 1988), and the multicomponent latent trait model (MLTM; Whitely, 1980). Reckase extended the uni-dimensional three-parameter logistic model to multi-dimensional form (Reckase, 1985, 1996).
Reckase pointed out that

"After reviewing many possible models that include vector parameters for both examinee and item characteristics [see McKinley and Reckase (1982) for a summary], the model given below was selected for further development because it was reasonable given what is known about item response data, consistent with simpler, uni-dimensional item response theory models, and estimable with commonly attainable numbers of examinees and test items (p.272)".

The model is

\[
P_i(\boldsymbol{\theta}_j) \equiv p(U_{ij} = 1 \mid \mathbf{a}_i, d_i, c_i, \boldsymbol{\theta}_j) = c_i + (1 - c_i)\,\frac{\exp(\mathbf{a}_i'\boldsymbol{\theta}_j + d_i)}{1 + \exp(\mathbf{a}_i'\boldsymbol{\theta}_j + d_i)}, \tag{1.4}
\]

where
$p(U_{ij} = 1 \mid \mathbf{a}_i, d_i, c_i, \boldsymbol{\theta}_j)$ is the probability of a correct response (score of 1) for examinee $j$ on test item $i$;
$U_{ij}$ is a dichotomous random variable representing the item response for examinee $j$ on item $i$;
$\boldsymbol{\theta}_j$ is the vector of abilities for examinee $j$, i.e., $\boldsymbol{\theta}_j \equiv (\theta_{j1}, \theta_{j2}, \cdots, \theta_{jp})'$;
$\mathbf{a}_i$ is a vector of parameters related to the discriminating power of test item $i$ (the rate of change of the probability of a correct response with changes in the examinees' trait levels);
$d_i$ is a parameter related to the difficulty of item $i$;
$c_i$ is the probability of a correct response that is approached when the abilities assessed by item $i$ are very low; $c_i$ is usually called the lower asymptote or, less correctly, the guessing parameter.

The unique contribution of the model above, as summarized by Reckase (1997), is that it focuses on the characteristics of the test items and the way they interact with the examinee population. This model has proved to be useful for a variety of applications and has helped in conceptualizing a number of psychometric problems, including the assessment of differential functioning and test parallelism (Ackerman, 1990, 1992).

1.2 Estimation Methods for IRT Models

1.2.1 Commonly Used Estimation Methods and Their Limitations

IRT models contain at least two types of parameters: person parameters (also called latent trait, proficiency, or ability parameters) and item parameters.
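To make the two parameter types concrete, the following Python sketch evaluates the multidimensional response function of Equation (1.4) for one examinee and one item. The function name `irf_mirt_3pl` and all numeric values are hypothetical, chosen only to show how a person vector and the item parameters combine.

```python
import math

def irf_mirt_3pl(theta, a, d, c):
    """Probability of a correct response under Equation (1.4).

    theta: person parameters, one proficiency per dimension;
    a: item discrimination vector; d: difficulty-related intercept;
    c: lower asymptote.
    """
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same number of dimensions")
    z = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d  # a' theta + d
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# Hypothetical two-dimensional item, for illustration only.
a, d, c = (1.0, 1.0), 0.0, 0.2
print(irf_mirt_3pl((0.0, 0.0), a, d, c))
print(irf_mirt_3pl((1.0, -1.0), a, d, c))
```

Both calls give the same probability (about 0.6), because the abilities enter only through the linear combination a'θ: a high proficiency on one dimension can offset a low proficiency on another.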
Estimating person parameters for IRT models is frequently accomplished by using one of three methods: (1) maximum likelihood (ML); (2) maximum a posteriori (MAP); and (3) expected a posteriori (EAP). The ML method estimates person parameters by maximizing the likelihood of an examinee's item responses. One critical problem is that ML cannot estimate person parameters for examinees who have all-correct or all-incorrect response patterns (p.162, Embretson & Reise, 2000). In addition, ML estimates are consistent only as the sample size increases (here sample size refers to the number of test items, or test length), which in reality is not an easy condition to meet because a test is often viewed as a fixed set of items.

Both EAP and MAP come from the Bayesian perspective. The MAP scoring method (also called Bayesian modal estimation) uses prior information about person proficiency in conjunction with the likelihood function to estimate the proficiency level by maximizing a posterior distribution. The advantage of MAP is that proficiency can be estimated for all possible response patterns, including perfect patterns. A perfect pattern could be an all-correct response pattern, an all-incorrect response pattern, or some odd pattern that makes it difficult for the ML procedure to find a solution (e.g., no solution, or multiple solutions). A criticism of Bayesian modal estimation is that the proficiency estimates may depend heavily on the choice of the prior distribution of the proficiency parameters, especially when the sample size (i.e., test length) is small. EAP is a method of finding the mean of a posterior distribution. One advantage of the EAP estimator is that it "has minimum mean square error over the population of ability" (p.439, Bock & Mislevy, 1982). However, the estimates from EAP are biased (Wainer & Thissen, 1987).

Item parameters in IRT models are usually estimated by the maximum likelihood (ML) approach.
The commonly used methods under this approach are (a) joint maximum likelihood (JML), (b) marginal maximum likelihood (MML), and (c) conditional maximum likelihood (CML). It is known that the consistency property of the maximum likelihood estimator holds for person parameters only when the item parameters are known and the number of items increases. Similarly, consistent item parameter estimates can be obtained when the person parameters are known and the number of examinees increases.

The JML procedure simultaneously estimates person and item parameters for all items and examinees by jointly maximizing the likelihood function of the response data. In principle, this procedure is straightforward. However, it has several drawbacks in practice, as some researchers have pointed out. First, the nonlinear (i.e., S-shaped) item characteristic curve (ICC) results in nonlinear likelihood equations, and solving systems of nonlinear equations is often a formidable task (Hambleton & Swaminathan, 1985). Secondly, when JML is used with the 3PL model, large numbers of examinees (e.g., more than 1000) are required for accurate item parameter estimation (e.g., Lord & Novick, 1968; Swaminathan & Gifford, 1979). Thirdly, increasing the number of examinees cannot guarantee improved estimation (Hulin, Lissak, & Drasgow, 1982). That is, the consistency property does not always hold, because the item (structural) and person (incidental) parameters increase simultaneously.

When sufficient statistics are available for person parameters, one may avoid including the person parameters in the likelihood function. For the Rasch model, since the number-correct score (also called the total score) is a sufficient statistic for the proficiency parameter, it is possible to express the likelihood function $L(U \mid \theta, b)$ in terms of the total score instead of the proficiency parameters.
The CML procedure can then be used to estimate item parameters, and the corresponding estimates are consistent (Hambleton & Swaminathan, 1985). However, since CML requires a sufficient statistic for estimating trait level, it is restricted to the Rasch model family. In more complex models such as the 2PL, the 3PL, and the MIRT models, proficiency estimates depend on item characteristics, so the total score is no longer a sufficient statistic for estimating proficiency. In addition, Embretson and Reise (2000) pointed out several other disadvantages of the CML estimation procedure: no estimates for items or persons are available for perfect response patterns (p.218), and numerical problems often occur for long tests, complicated patterns of missing data, or polytomous data.

Item parameters can be estimated if the likelihood function can be expressed without any reference to the person parameters. Assuming the underlying distribution of proficiency is continuous and known, the essence of MML is to integrate over the proficiency distribution and then estimate the item parameters in the marginal distribution (Bock & Lieberman, 1970). This procedure removes the dependency of the item parameter estimates on the proficiency estimates. The advantage of MML is that its estimates possess the consistency property, since increasing the number of examinees does not require estimating additional proficiency parameters (Kiefer & Wolfowitz, 1956). The MML approach is accomplished within the framework of the EM algorithm (p.190, Baker, 1992). Although MML/EM has many attractive features and has become a standard for item parameter estimation, Baker (p.190, Baker, 1992) pointed out that this approach has certain limitations in practice.
For example, items that are answered correctly or incorrectly by all examinees have to be eliminated before calibration, an obvious loss of information; certain data sets can yield item difficulty estimates of large absolute value and other deviant values as item parameter estimates. Once these deviant values are used for proficiency estimation, they will cause the estimation process to fail. In addition, although many researchers have studied accelerated versions of the EM algorithm, the convergence rate of the EM algorithm is slow when estimating high-dimensional models.

If prior information about item parameters is available, Bayesian estimation methods are possible for IRT-based models. Swaminathan and Gifford (1982, 1985, 1986) derived Bayesian estimation procedures for the one-, two-, and three-parameter logistic models, in which item parameter estimation takes place without any marginalization. Mislevy (1986b) and Tsutakawa and Lin (1986) took a different approach, which inherited properties of MML by integrating (i.e., marginalizing) the proficiency parameter out of the likelihood function. Marginal Bayesian modal estimation is also accomplished within the framework of the EM algorithm (Baker, 1992). However, marginalized Bayesian item parameter estimates may depend heavily on the item priors, in particular for small sample sizes, and hence with informative priors the resulting item parameter estimates will be shrunk toward the mode of the corresponding prior distribution.

The frequently used estimation methods and their limitations are summarized in this section.
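As a small numerical illustration of the three person-parameter estimators reviewed in this section (ML, MAP, and EAP), the Python sketch below evaluates them on a grid of proficiency values. Everything in it is an assumption made for the example and not part of the study: the items follow Equation (1.1) with c = 0 (the 2PL case), the item parameters are hypothetical, the prior is standard normal, and a grid search on [-4, 4] stands in for a proper optimizer or quadrature rule.

```python
import math

def p_correct(theta, a, b):
    """2PL response probability: Equation (1.1) with c = 0."""
    return 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of one examinee's 0/1 responses under local independence."""
    total = 0.0
    for u, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

# Evaluation grid for theta on [-4, 4] (an arbitrary illustrative range).
GRID = [-4.0 + 0.01 * k for k in range(801)]

def person_estimates(responses, items):
    """Return grid-based (ML, MAP, EAP) estimates under a N(0, 1) prior."""
    loglik = [log_likelihood(t, responses, items) for t in GRID]
    # Log posterior up to an additive constant: log-likelihood + log prior.
    logpost = [ll - 0.5 * t * t for ll, t in zip(loglik, GRID)]
    ml = GRID[max(range(len(GRID)), key=loglik.__getitem__)]
    map_ = GRID[max(range(len(GRID)), key=logpost.__getitem__)]
    peak = max(logpost)  # subtract the peak before exponentiating, for stability
    weights = [math.exp(lp - peak) for lp in logpost]
    eap = sum(t * w for t, w in zip(GRID, weights)) / sum(weights)
    return ml, map_, eap

# Hypothetical (a, b) values for a five-item test.
ITEMS = [(1.0, -1.0), (1.2, -0.5), (0.8, 0.0), (1.5, 0.5), (1.0, 1.0)]

print(person_estimates([1, 1, 1, 1, 1], ITEMS))  # all-correct pattern
print(person_estimates([1, 1, 0, 1, 0], ITEMS))  # mixed pattern
```

For the all-correct pattern the likelihood keeps increasing with proficiency, so the grid-based ML estimate simply lands on the upper end of the grid, mirroring the infinite ML estimate described above, whereas MAP and EAP remain finite because the prior pulls them toward its center.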
For one-dimensional IRT models, although joint maximum likelihood estimation is available in some programs to estimate item and proficiency parameters simultaneously (e.g., LOGIST uses the joint maximum likelihood estimation paradigm formulated by Allan Birnbaum in 1968), the estimates of the proficiency parameters need not be consistent as the sample size increases (e.g., Neyman & Scott, 1948; Little & Rubin, 1983). In addition, for some extreme response patterns, the maximum likelihood procedure can give positive or negative infinite estimates for proficiency parameters. The MML/EM procedure has become a central methodology for parameter estimation in the IRT framework. However, when test settings become more complex (e.g., in the presence of missing data and polytomously scored data) and IRT models are more complicated (e.g., the MIRT models), application of the EM algorithm becomes less straightforward (Patz & Junker, 1999a). In Section 1.3, the importance of a new method for parameter estimation in linear logistic MIRT models is addressed.

1.2.2 Applications of MCMC methods to Estimation of IRT-based Models

A new estimation approach that avoids some shortcomings of the estimation procedures discussed above is desirable to improve estimation accuracy, in particular for more complicated testing practices and complex IRT models. Markov Chain Monte Carlo (MCMC) methods, which come from a Bayesian perspective, can be applied to estimating parameters for IRT models.

Researchers have been interested in MCMC methods for several decades (e.g., Metropolis et al., 1953). MCMC methods have been successful in many Bayesian applications because they allow one to draw samples from a wide range of posterior distributions of interest, including many for which simulation methods were previously much more difficult to implement (e.g., Gilks, Richardson, & Spiegelhalter, 1996).
MCMC methods have also recently been implemented for parameter estimation and inference through stochastic simulation for IRT models. Patz and Junker (1999a) demonstrate that MCMC techniques are well suited to complex models with IRT assumptions and that the MCMC methodology can be routinely implemented in IRT contexts. They further address strategies and issues in extending the basic MCMC methods for Bayesian inference to complex IRT settings such as non-response, designed missingness, multiple raters, guessing behaviors, and partial credit (i.e., polytomous) test items (Patz & Junker, 1999b). Earlier work can be traced back to Albert (1992), who estimated the two-parameter normal ogive model for augmented data using the Gibbs sampler. Various applications of MCMC methods have also been developed in the literature for item parameter recovery (e.g., Wollack, Bolt, Cohen, & Lee, 2002; Mathews & Hombo, 2001; Kim & Cohen, 1998; de la Torre & Patz, 2001; Maris & Maris, 2002; Fox, 2002; Williamson, Johnson, Sinharay, & Bejar, 2002), for coefficient alpha estimates (Li & Woodruff, 2001), etc. Unlike the Bayesian modal estimates discussed in Section 1.2, the MCMC estimates of parameters will no longer be dependent on the prior distribution, and the parameter estimates are not shrunk to the mean of the prior distribution.

1.3 The Importance of the Study

Recently, Segall (1996, 2001) has advanced multidimensional adaptive testing (MAT) and the measurement of general proficiency using a linear logistic MIRT model. He found that MAT could provide equal or higher reliability with fewer items than are required in one-dimensional adaptive testing. He concludes that in addition to increasing measurement efficiency, MAT can also be used as a tool for ensuring adequate and efficient coverage of content for examinees at different levels of proficiency (Segall, 1996).
However, as he emphasizes, further study is needed before MAT can be routinely applied, and item parameter estimation for MIRT models must be refined.

In estimating parameters for MIRT models, simple structure (i.e., each item measures only one dimension of proficiency) is sometimes assumed (e.g., de la Torre & Patz, 2001). The multi-unidimensional approach, as suggested by Segall (e.g., 1996), is an example of a simple structure. In this approach, several sub-tests measuring different contents are given at one test administration. There are two ways to estimate the model parameters for the multi-unidimensional approach. One is to estimate the model parameters for the tests separately (i.e., independently), which is not realistic since the contents to be measured are usually correlated. The other is to treat each content as one dimension and then estimate the model parameters simultaneously using a multidimensional model. Segall (1996) pointed out that although the multi-unidimensional approach is appealing in terms of its simple structure, it may suffer from at least two undesirable features. One is the possibly poor specification of the elements of the covariance matrix of the proficiency vector, and the other is that the assumption of simple structure may lead to some poorly specified loadings (p.350). In addition, developing a common metric and orientation for item parameter estimates in MIRT models is inconvenient and may even be unachievable. Segall (1996) notes that when developing large item pools with several dimensions, it is often necessary to divide the pools into subsets of items. This design, however, may raise several issues concerning the metric of the latent dimensions. Therefore, a new methodology for the concurrent estimation of item parameters for MIRT models is desirable for building item pools before MAT can be implemented more reliably.
Both item and proficiency parameters in MIRT models can be estimated simultaneously using MCMC methods. Parameter estimation using MCMC methods differs from a number of other approaches for estimating MIRT models (Carlson, 1987; Fraser, 1988; McDonald, 1985; McKinley & Reckase, 1983; Muthen, 1984). Efforts to apply MCMC methods to multidimensional models have been explored in the literature. For example, Beguin and Glas (1998) generalized the Albert (1992) procedure to the unidimensional 3PL normal ogive model and Q-multidimensional normal ogive models. However, that study assumes the underlying covariance matrix for abilities is an identity matrix, which is not realistic since the proficiency dimensions in one test are more likely to be correlated. Moreover, the values of the item parameters in the study are restricted to a small range (e.g., a is from 0 to 1, d is from -1 to 1), which is also not realistic for a general and more complex testing context.

De la Torre and Patz (2001) examine simultaneous proficiency estimation for MIRT models using the MCMC approach. However, their study assumes only simple structure. In addition, to estimate the proficiency parameters, the study assumes the item parameters are known, which is not the case in many applications.

Bolt and Lall (2003) investigate item parameter estimation for compensatory and noncompensatory MIRT models using the MCMC method. In their study, the guessing parameter was not included in the MIRT models and only a two-dimensional model was considered. In addition, the item parameters cover only a small range of values.

Overall, not much attention has been paid to three-parameter MIRT models, which have proven useful for a variety of applications in the literature. It is necessary to study parameter estimation using MCMC methods in more general, complex, and realistic situations.
For example, a guessing parameter is included in the model, complex item dimension structures (i.e., each item measures one or more dimensions of ability) are considered in the test design with an exploratory solution, and the inter-correlations among proficiency dimensions are estimated rather than limited to the identity matrix or a special pattern of covariance matrix (e.g., all off-diagonal elements are the same). Moreover, the current study intends to examine the impact of four factors (the test length, the number of dimensions, the sample size, and the proficiency covariance structure) on the accuracy and stability of parameter estimates for MIRT models.

Chapter 2

MCMC Methods for Parameter Estimation for Logistic MIRT Model

2.1 Overview of Markov Chain Monte Carlo Methods

Statistical inference is a procedure for drawing conclusions about population parameters from the observed sample data. Bayesian statistical conclusions about a parameter are typically made in terms of a probability statement conditioned on the observed data, that is, the posterior distribution of the parameter of interest. A sample generated by MCMC methods can be used for statistical inference, including point estimation, the construction of a marginal density, prediction, estimation of moments, and so on.

Gill (2002) defined a Markov chain as "a stochastic process with the property that any specified state in the series, $\theta^{(t)}$, is only dependent on the previous value of the chain." Or, in a probability expression (p. 302):

\[
p(\theta^{(t)} \in A \mid \theta^{(0)}, \theta^{(1)}, \ldots, \theta^{(t-2)}, \theta^{(t-1)}) = p(\theta^{(t)} \in A \mid \theta^{(t-1)}), \tag{2.1}
\]

where $A$ is an event or range of events in the complete state space; $t$ is a positive number referring to the $t$th time interval; and $\theta$ is a random quantity taking values in some known state space, $\Theta$.

The Monte Carlo method summarizes the theoretical distribution of interest using random samples drawn from that distribution instead of quantities calculated from its analytical form.
Generally speaking, Markov chain Monte Carlo methods involve two steps. First, a chain is produced in which each value depends only on the previous value. Second, once this chain converges to the desired posterior distribution, the Monte Carlo method is used to summarize the distribution of interest.

There are two basic methods in MCMC: (1) the Gibbs sampler and (2) the Metropolis-Hastings algorithm. The Gibbs sampler, named by Geman and Geman (1984), is one of the most widely used MCMC techniques. Let $Q$ be the model parameter vector with $k$ components, and let $q_i$ be the $i$th model parameter in $Q$. Denote $Q \equiv (q_1, q_2, \ldots, q_i, \ldots, q_k)$ and $Q_{-i} \equiv (q_1, q_2, \ldots, q_{i-1}, q_{i+1}, \ldots, q_k)$. Then $Q$ can be expressed as $Q \equiv Q_{-i} \cup q_i$. Denote the complete conditional function of the $i$th parameter by $P(q_i \mid Q_{-i}) \equiv P(q_i \mid q_1, q_2, \ldots, q_{i-1}, q_{i+1}, \ldots, q_k)$.

The Gibbs sampler sequentially samples from the complete conditional distributions $P(q_i \mid Q_{-i}, y)$, $i = 1, \ldots, k$, where $y$ indicates the observed data. The Gibbs sampling algorithm can then be defined as follows:

1. Specify the starting values for the model parameter vector $Q$, i.e., $Q^{(0)} = (q_1^{(0)}, q_2^{(0)}, \ldots, q_k^{(0)})$.

2. At the $(t+1)$th iteration, simulate

   $q_1^{(t+1)}$ from $p(q_1 \mid q_2^{(t)}, q_3^{(t)}, \ldots, q_k^{(t)})$,
   $q_2^{(t+1)}$ from $p(q_2 \mid q_1^{(t+1)}, q_3^{(t)}, \ldots, q_k^{(t)})$,
   ...
   $q_i^{(t+1)}$ from $p(q_i \mid q_1^{(t+1)}, q_2^{(t+1)}, \ldots, q_{i-1}^{(t+1)}, q_{i+1}^{(t)}, \ldots, q_k^{(t)})$,
   ...
   $q_k^{(t+1)}$ from $p(q_k \mid q_1^{(t+1)}, q_2^{(t+1)}, \ldots, q_{k-1}^{(t+1)})$

   sequentially.

3. Set $t = t + 1$ and repeat step 2 until convergence.

The second frequently used method is the Metropolis-Hastings algorithm (M-H algorithm; Metropolis et al., 1953; Hastings, 1970). This method is applied when it is difficult to simulate from the complete conditional distributions by traditional methods (by rejection sampling or by a known generator, for example). A Markov chain using the M-H algorithm can be obtained as follows. For any parameter $\theta$:

1. Assign an initial value for parameter $\theta$.

2. Specify a proposal density $r(\theta^{t}, \theta^{(t+1)})$, which defines the proposal density from state $\theta^{t}$ to state $\theta^{(t+1)}$.

3. Given the current state $\theta^{t}$, the candidate $\theta^{*}$ for the next state $\theta^{(t+1)}$ in the chain is sampled from $r(\theta^{t}, \theta^{(t+1)})$.

4. $\theta^{*}$ is accepted as the next value $\theta^{(t+1)}$, i.e., $\theta^{(t+1)} = \theta^{*}$, with probability $\alpha(\theta^{t}, \theta^{*})$, where

\[
\alpha(\theta^{t}, \theta^{*}) = \min\left\{ \frac{g(\theta^{*})\, r(\theta^{*}, \theta^{t})}{g(\theta^{t})\, r(\theta^{t}, \theta^{*})},\; 1 \right\}, \tag{2.2}
\]

   and $g(\cdot)$ is the density of the target distribution.

5. If $\theta^{*}$ is rejected, then the next value stays at the current state, i.e., $\theta^{(t+1)} = \theta^{t}$.

The M-H algorithm first simulates candidates from a Markov chain whose distribution differs from the desired distribution for the parameter, and then uses the acceptance probability to accept or reject each candidate, so that the resulting Markov chain has the target posterior as its stationary distribution.

It has been shown that the Gibbs sampler is a special case of the M-H algorithm in which the probability of accepting the candidate value is always one (p. 436, Gelman 1992; p. 182, Tanner 1996). The distinction between the two is that the Gibbs sampler requires sampling from the complete conditional distributions and so is more restrictive (p. 166, Gamerman 1997; Besag et al. 1995; Tierney 1991).

The combination of the Gibbs sampler and the M-H algorithm is a hybrid algorithm. One value is generated from the M-H procedure, followed by the next Gibbs step. Like the Gibbs sampler and the M-H algorithm, the M-H within Gibbs algorithm also produces a Markov chain with the correct stationary distribution.

2.2 Likelihood Functions for the Linear Logistic MIRT Models

If pre-calibrated item parameters are available, maximum likelihood estimates or Bayesian modal estimates of the proficiency parameters can be obtained. Suppose the assumption of local independence holds for the MIRT models.
Then the probability of a set of observed responses $u_j = (u_{1j}, u_{2j}, \ldots, u_{ij}, \ldots, u_{nj})$ for the $j$th examinee with proficiency vector $\theta_j$ on $n$ items is equal to the product of the probabilities associated with the response to each item:

\[
L(u_j \mid \theta_j, \Sigma, A, d, c) = p(u_{1j}, u_{2j}, \ldots, u_{ij}, \ldots, u_{nj} \mid \theta_j) \tag{2.3}
\]
\[
= \prod_{i=1}^{n} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}}, \tag{2.4}
\]

where $u_{ij}$ is the response (0 or 1) of the $j$th examinee on the $i$th item; $\theta_j$ is a $p$-dimensional proficiency vector, i.e., $\theta_j = (\theta_{j1}, \theta_{j2}, \ldots, \theta_{jp})$; and $p_i(\theta_j)$ is the probability of the $j$th examinee correctly answering the $i$th item. Similarly, the probability of a set of $N$ observed responses $v_i = (v_{i1}, v_{i2}, \ldots, v_{ij}, \ldots, v_{iN})$ for the $i$th item is given by

\[
L(v_i \mid \Theta, \Sigma, a_i, d_i, c_i) = p(v_{i1}, v_{i2}, \ldots, v_{ij}, \ldots, v_{iN} \mid \Theta, \Sigma, a_i, d_i, c_i)
= \prod_{j=1}^{N} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}}.
\]

According to Bayes' theorem, the posterior density function of $\theta_j$, for $j = 1, 2, \ldots, N$, can be expressed as

\[
f(\theta_j \mid u_j) = \frac{L(u_j \mid \theta_j)\, \pi_\theta(\theta_j)}{m(u_j)} \propto L(u_j \mid \theta_j)\, \pi_\theta(\theta_j), \tag{2.5}
\]

where $L(u_j \mid \theta_j)$ is the likelihood function given by (2.3); $\pi_\theta$ is the prior distribution of $\theta$; $m(u_j)$ is the marginal probability density of $u_j$; and $N$ is the number of examinees.

Assume the prior distribution of $\theta$ is multivariate normal with mean vector $\mu$ and covariance matrix $\Sigma$. Then the density $\pi_\theta(\theta_j)$ is

\[
\pi_\theta(\theta_j) = (2\pi)^{-p/2}\, |\Sigma|^{-1/2} \exp\left[ -\tfrac{1}{2} (\theta_j - \mu)' \Sigma^{-1} (\theta_j - \mu) \right]. \tag{2.6}
\]

Maximizing the posterior density in (2.5), written here as $L(\theta_j \mid u_j)$, yields the Bayesian modal estimate of each individual proficiency parameter vector $\theta_j$, $\forall j = 1, 2, \ldots, N$. That is, one solves the equations

\[
\frac{\partial}{\partial \theta_{jk}} \log L(\theta_j \mid u_j) = 0, \quad \forall k = 1, 2, \ldots, p;\; j = 1, 2, \ldots, N. \tag{2.7}
\]

Nevertheless, in many applications the item parameters are not available, or both item and proficiency parameters must be estimated from the observed data. The following section addresses the simultaneous estimation of the item and proficiency parameters using MCMC methods.
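As a rough illustration of (2.3) through (2.6), the sketch below evaluates the log-likelihood and an unnormalized log-posterior for a single examinee. It is a minimal sketch under stated assumptions, not the program used in this study. The response function `p_correct` assumes the compensatory three-parameter logistic form c + (1 - c) logistic(a'θ + d), because equation (1.4) is not reproduced in this section; the prior mean is taken to be zero; and the function names and the `precision` argument (standing for the inverse of Σ) are hypothetical.

```python
import math

EPS = 1e-12  # keeps log() away from 0 when a probability rounds to 0 or 1


def logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def p_correct(theta, a, d, c):
    """ASSUMED response function: c + (1 - c) * logistic(a'theta + d)."""
    z = sum(ak * tk for ak, tk in zip(a, theta)) + d
    return c + (1.0 - c) * logistic(z)


def log_likelihood(u, theta, A, d, c):
    """Log of (2.4): sum over items of u*log(p) + (1 - u)*log(1 - p)."""
    total = 0.0
    for u_i, a_i, d_i, c_i in zip(u, A, d, c):
        p = min(max(p_correct(theta, a_i, d_i, c_i), EPS), 1.0 - EPS)
        total += math.log(p) if u_i == 1 else math.log(1.0 - p)
    return total


def log_prior(theta, precision):
    """Log of (2.6) with mu = 0, dropping terms that do not involve theta."""
    p = len(theta)
    quad = sum(theta[r] * precision[r][s] * theta[s]
               for r in range(p) for s in range(p))
    return -0.5 * quad


def log_posterior(u, theta, A, d, c, precision):
    """Unnormalized log of (2.5): log-likelihood plus log-prior."""
    return log_likelihood(u, theta, A, d, c) + log_prior(theta, precision)


# Toy usage (made-up values): two dimensions, two identical items.
A = [[1.0, 1.0], [1.0, 1.0]]
d = [0.0, 0.0]
c = [0.2, 0.2]
u = [1, 0]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(log_posterior(u, [0.0, 0.0], A, d, c, identity))
```

Working on the log scale avoids underflow when the product in (2.4) runs over many items; the normalizing constant m(u_j) is omitted because it does not affect the location of the mode.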
2.3 M-H within Gibbs for Parameter Estimation for MIRT Models

2.3.1 Complete Conditional Functions for Model Parameters

Under the assumption of local independence, the overall likelihood function of responses for $N$ examinees on $n$ items can be written as

\[
\begin{aligned}
L(U \mid \Theta, \Sigma, A, d, c) &= \prod_{j=1}^{N} L(u_j \mid \theta_j, \Sigma, A, d, c) \\
&= \prod_{i=1}^{n} L(v_i \mid \Theta, \Sigma, a_i, d_i, c_i) \\
&= \prod_{j=1}^{N} \prod_{i=1}^{n} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}}.
\end{aligned}
\]

It can be shown that the complete conditional distributions for $d_i$ and $c_i$ have the following expressions:

\[
P_d(d_i \mid d_{-i}, \Theta, \Sigma, A, c, U) \propto L(v_i \mid \Theta, a_i, d_i, c_i)\, \pi_d(d_i)
= \prod_{j=1}^{N} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}} \pi_d(d_i),
\]

\[
P_c(c_i \mid c_{-i}, \Theta, \Sigma, A, d, U) \propto L(v_i \mid \Theta, a_i, d_i, c_i)\, \pi_c(c_i)
= \prod_{j=1}^{N} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}} \pi_c(c_i),
\]

where $\pi_a$, $\pi_d$, and $\pi_c$ are the prior distributions for $a$, $d$, and $c$, respectively; $A_{-i}$ is an $(n-1) \times p$ matrix, i.e., $A_{-i} = (a_1, a_2, \ldots, a_{i-1}, a_{i+1}, \ldots, a_n)$; $d_{-i}$ is a vector with $(n-1)$ components, i.e., $d_{-i} = (d_1, d_2, \ldots, d_{i-1}, d_{i+1}, \ldots, d_n)$; $c_{-i}$ is a vector with $(n-1)$ components, i.e., $c_{-i} = (c_1, c_2, \ldots, c_{i-1}, c_{i+1}, \ldots, c_n)$; and $p_i(\theta_j)$ is as previously defined in equation (1.4).

2.3.2 Modelling the Covariance Structure for Multidimensional Abilities

For a test measuring several different proficiency dimensions, it is assumed that each examinee's proficiency follows a $p$-variate normal distribution with mean vector $\mu$ and variance-covariance matrix $\Sigma$. That is, $\theta_j \sim N_p(\mu, \Sigma)$, $\forall j = 1, 2, \ldots, N$. Since there is not much meaning in comparing abilities across dimensions, the mean of each proficiency dimension is set to zero. Thus, the mean vector for proficiency is a $p$-component zero vector.

Modelling the covariance matrix is very important but difficult because (1) there are $p(p+1)/2$ parameters to estimate, where $p$ is the number of dimensions; and (2) the matrix is required to be non-negative definite.
To estimate the variance-covariance matrix $\Sigma$, this study uses the inverse-Wishart ($W^{-1}$) distribution, a multivariate generalization of the scaled inverse-$\chi^2$ distribution, as the prior distribution of $\Sigma$, i.e.,

\[
\Sigma \sim W^{-1}(m, \Psi), \tag{2.8}
\]

as suggested by Gelman, Carlin, Stern, and Rubin (2004). This distribution is the conjugate prior distribution for the covariance matrix of a multivariate normal distribution. Here $m$ and $\Psi$ are the degrees of freedom and the scale matrix of the inverse-Wishart distribution on $\Sigma$. The advantage of using the inverse-Wishart as the prior distribution for $\Sigma$ is that the posterior distribution of $\Sigma$ also follows the $W^{-1}$ distribution (e.g., Gelman, Carlin, Stern, and Rubin, 2004):

\[
\Sigma \mid \theta \sim W^{-1}\left(m + n,\; (n-1)S + \Psi + \frac{nT}{n+T}\, \bar{\theta}\bar{\theta}'\right), \tag{2.9}
\]

where $n$ is the number of examinees and $S$ is the sum of squares and cross product matrix about the sample mean,

\[
(n-1)S = \sum_{j=1}^{N} (\theta_j - \bar{\theta})(\theta_j - \bar{\theta})'; \tag{2.10}
\]

$T$ is the number of prior measurements, $\theta_j$ is a $p$-dimensional vector, and $\bar{\theta}$ is the $p$-dimensional sample mean vector.

Since the posterior distribution of $\Sigma$ is a known distribution, $\Sigma \mid \theta$ can be sampled directly. Let $\Sigma_k$ be the $k$th sample covariance matrix drawn from $W^{-1}\left(m + n, (n-1)S + \Psi + \frac{nT}{n+T}\bar{\theta}\bar{\theta}'\right)$, and let $s_{ijk}$ be the $(i, j)$th component of $\Sigma_k$. Then the estimate of the proficiency structure is the average of the drawn covariance matrices:

\[
\hat{s}_{ij} = \frac{1}{N} \sum_{k=1}^{N} s_{ijk},
\]

where $N$ is the total number of randomly drawn samples and $i, j = 1, 2, \ldots, p$.

There are alternative approaches to modelling the underlying proficiency structure. Another method for estimating the proficiency structure is illustrated through a two-dimensional example. For a two-dimensional IRT model, assume the proficiency parameters come from a bivariate normal distribution $N_2(0, \Sigma)$, where $\Sigma$ is the standardized covariance matrix or correlation matrix, i.e.,

\[
\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.
\]

Assume $\rho$ has a prior density that is the uniform distribution on $(-1, 1)$.
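Then the posterior for $\rho$, $f_\rho(\rho \mid \theta)$, can be expressed as

\[
f_\rho(\rho \mid \theta) \propto p(\theta \mid \rho)\, I_{(-1,1)}, \tag{2.12}
\]

where $p(\theta \mid \rho)$ is the probability function given by

\[
p(\theta \mid \rho) = \prod_{j=1}^{N} \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\left[ -\frac{\theta_{1j}^2 - 2\rho\theta_{1j}\theta_{2j} + \theta_{2j}^2}{2(1-\rho^2)} \right]
\propto (1-\rho^2)^{-N/2} \exp\left[ -\frac{1}{2(1-\rho^2)} \sum_{j=1}^{N} \left(\theta_{1j}^2 - 2\rho\theta_{1j}\theta_{2j} + \theta_{2j}^2\right) \right],
\]

and $I_{(-1,1)}$ is a range indicator function. Therefore, the posterior for $\rho$ is

\[
f_\rho(\rho \mid \theta) \propto (1-\rho^2)^{-N/2} \exp\left[ -\frac{1}{2(1-\rho^2)} \sum_{j=1}^{N} \left(\theta_{1j}^2 - 2\rho\theta_{1j}\theta_{2j} + \theta_{2j}^2\right) \right] I_{(-1,1)}. \tag{2.13}
\]

Let $\rho = \dfrac{e^{\xi} - 1}{e^{\xi} + 1}$. Then $\xi = \log\dfrac{1+\rho}{1-\rho}$, and

\[
f_\xi(\xi \mid \theta) = f_\rho(\rho \mid \theta) \left| \frac{d\rho}{d\xi} \right| = f_\rho(\rho \mid \theta)\, \frac{2e^{\xi}}{(1+e^{\xi})^2}.
\]

Suppose $\hat{\xi}$ is the maximum likelihood estimate of $\xi$, and $\hat{\sigma}^2$ represents the estimated variance of $\hat{\xi}$. $\hat{\xi}$ can be obtained by letting

\[
\hat{\rho} = \arg\max_{\rho}\, p(\theta \mid \rho) = \arg\max_{\rho}\, \log p(\theta \mid \rho),
\]

where

\[
\log p(\theta \mid \rho) = C - \frac{N}{2} \log(1-\rho^2) - \frac{1}{2(1-\rho^2)} \sum_{j=1}^{N} \left(\theta_{1j}^2 - 2\rho\theta_{1j}\theta_{2j} + \theta_{2j}^2\right),
\]

and $C$ is a constant. Solve the likelihood equation

\[
\frac{\partial \log p(\theta \mid \rho)}{\partial \rho} = 0.
\]

The equation above implies

\[
\frac{N\rho}{1-\rho^2} - \frac{\rho}{(1-\rho^2)^2} \sum_{j=1}^{N} \left(\theta_{1j}^2 - 2\rho\theta_{1j}\theta_{2j} + \theta_{2j}^2\right) + \frac{1}{1-\rho^2} \sum_{j=1}^{N} \theta_{1j}\theta_{2j} = 0,
\]

i.e., $\hat{\rho}$ satisfies

\[
N\rho - \frac{\rho}{1-\rho^2} \sum_{j=1}^{N} \left(\theta_{1j}^2 - 2\rho\theta_{1j}\theta_{2j} + \theta_{2j}^2\right) + \sum_{j=1}^{N} \theta_{1j}\theta_{2j} = 0.
\]

Note that here the prior is $\pi_\rho(\rho) = U(-1, 1)$, so $\hat{\rho}_{mle} = \hat{\rho}_{mode}$. Thus $\hat{\xi} = \log\dfrac{1+\hat{\rho}}{1-\hat{\rho}}$. The Fisher information is

\[
I(\rho) = E\left[\left(\frac{\partial}{\partial\rho} \log p(\theta \mid \rho)\right)^2\right] = -E\left[\frac{\partial^2}{\partial\rho^2} \log p(\theta \mid \rho)\right].
\]

Then the asymptotic distribution of $\hat{\rho}_{mle}$ is approximated by $N\left(\rho, \dfrac{1}{N I(\rho)}\right)$ as $N \to \infty$. Therefore, by the delta method, $\hat{\xi}$ has the asymptotic distribution

\[
\hat{\xi} \to N\left( h(\rho),\; \frac{h'(\rho)^2}{N I(\rho)} \right), \quad \text{where } h(\rho) = \log\frac{1+\rho}{1-\rho} = \xi, \quad h'(\rho) = \frac{2}{1-\rho^2}.
\]

Hence a Metropolis-Hastings algorithm can be written to generate $\xi$ from $f_\xi(\xi \mid \theta)$ using $N(\hat{\xi}, \hat{\sigma}^2)$ as the proposal density. Because the target function $f_\xi(\xi \mid \theta)$ is approximately $N(\hat{\xi}, \hat{\sigma}^2)$, the sampling density is $N(\hat{\xi}, \hat{\sigma}^2)$, and the transition function can be expressed as $r_\xi(\cdot) = N(\hat{\xi}, \hat{\sigma}^2)$. The M-H algorithm is as follows. Given $\xi^{t}$, simulate $y$ from $N(\hat{\xi}, \hat{\sigma}^2)$; then $\xi^{t+1} = y$ with probability $\alpha(\xi^{t}, y)$ and $\xi^{t+1} = \xi^{t}$ with probability $1 - \alpha(\xi^{t}, y)$, where

\[
\alpha(\xi^{t}, y) = \min\left\{ \frac{f_\xi(y \mid \theta)\, \phi\!\left(\dfrac{\xi^{t} - \hat{\xi}}{\hat{\sigma}}\right)}{f_\xi(\xi^{t} \mid \theta)\, \phi\!\left(\dfrac{y - \hat{\xi}}{\hat{\sigma}}\right)},\; 1 \right\}.
\]

Repeat this step.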
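The two-dimensional procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in this study. The function names are hypothetical, the sums of squares and cross products in the usage lines are made-up values, the mode is located by a simple grid search instead of solving the likelihood equation, and the proposal variance uses the delta-method value 4 / (N (1 + ρ²)), which assumes the standard Fisher information for a correlation with known unit variances; any positive proposal variance would still give a valid sampler.

```python
import math
import random


def log_post_rho(rho, s11, s12, s22, n):
    """Log of (2.13) up to a constant; -inf outside (-1, 1).

    s11, s22 are sums of squares and s12 the sum of cross products.
    """
    if not -1.0 < rho < 1.0:
        return -math.inf
    one_minus = 1.0 - rho * rho
    q = s11 - 2.0 * rho * s12 + s22
    return -0.5 * n * math.log(one_minus) - q / (2.0 * one_minus)


def log_post_xi(xi, s11, s12, s22, n):
    """Log density of xi = log((1 + rho) / (1 - rho)), Jacobian included."""
    rho = math.tanh(0.5 * xi)  # identical to (e^xi - 1) / (e^xi + 1)
    if not -1.0 < rho < 1.0:
        return -math.inf
    # d rho / d xi = 2 e^xi / (1 + e^xi)^2 = (1 - rho^2) / 2
    return log_post_rho(rho, s11, s12, s22, n) + math.log(0.5 * (1.0 - rho * rho))


def sample_xi(s11, s12, s22, n, xi_hat, sigma, n_iter, rng):
    """Independence M-H: every candidate is drawn from N(xi_hat, sigma^2)."""

    def log_weight(x):
        # log target minus log proposal; normalizing constants cancel
        z = (x - xi_hat) / sigma
        return log_post_xi(x, s11, s12, s22, n) + 0.5 * z * z

    xi = xi_hat
    draws = []
    for _ in range(n_iter):
        y = rng.gauss(xi_hat, sigma)
        log_alpha = log_weight(y) - log_weight(xi)
        if log_alpha >= 0.0 or rng.random() < math.exp(log_alpha):
            xi = y
        draws.append(xi)
    return draws


# Usage with made-up sufficient statistics for n = 200 examinees.
n = 200
s11, s22, s12 = 200.0, 200.0, 120.0
grid = [k / 1000.0 for k in range(-999, 1000)]
rho_hat = max(grid, key=lambda r: log_post_rho(r, s11, s12, s22, n))
xi_hat = math.log((1.0 + rho_hat) / (1.0 - rho_hat))
sigma = math.sqrt(4.0 / (n * (1.0 + rho_hat ** 2)))  # ASSUMED variance
draws = sample_xi(s11, s12, s22, n, xi_hat, sigma, 4000, random.Random(1))
rho_draws = [math.tanh(0.5 * x) for x in draws]
print(rho_hat, sum(rho_draws) / len(rho_draws))
```

Sampling on the ξ scale keeps every candidate inside the admissible range once it is mapped back to ρ, which is the practical reason for the transformation.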
2.3.3 Random Walk Metropolis Algorithm within Gibbs

Since the complete conditional distributions given in Section 2.3.1 are not convenient to sample from directly, a Metropolis step is needed in the sampling process, and a proposal distribution has to be specified for each parameter or block. Patz and Junker (1997) point out that there is much freedom in choosing the proposal distributions. For example, to sample a proposal value for $\theta_j$ at step $t + 1$, a multivariate normal distribution can be chosen as a convenient proposal distribution.

The random walk algorithm chooses the candidate state via a random walk mechanism. The candidate state is not chosen independently of the current state, and, unlike in the Gibbs sampler, it is not always accepted. Specifically, let $\mathbb{R}^p$ be the $p$-dimensional Euclidean space, and let $r$ be a density on $\mathbb{R}^p$ so that the transition function is defined as $R(y, B) = \int_B r(z - y)\, dz$. Define the acceptance probability $\alpha$ by

\[
\alpha(y, z) = \min\left\{ \frac{g(z)\, r(y - z)}{g(y)\, r(z - y)},\; 1 \right\}, \tag{2.14}
\]

where $g(\cdot)$ is the density of the target distribution (e.g., the posterior given above for each examinee's proficiency parameter, $P_\theta(\theta_j \mid \theta_{-j}, \Sigma, A, d, c, U)$, or the complete conditional distribution for each item parameter, $P_a(a_i \mid A_{-i}, \Theta, \Sigma, d, c, U)$, $P_d(d_i \mid d_{-i}, \Theta, \Sigma, A, c, U)$, and $P_c(c_i \mid c_{-i}, \Theta, \Sigma, A, d, U)$). If the denominator is zero, set $\alpha = 1$.

Suppose $Y_t = y$. Generate a "candidate" observation $z$ from the distribution $R(y, \cdot)$ and accept this observation (set $Y_{t+1} = z$) with probability $\alpha(y, z)$. Otherwise, reject this observation (set $Y_{t+1} = Y_t = y$). Another way to describe the procedure is as follows. Start at $y$. Generate a candidate step $w$ from the distribution $R$ defined by $R(B) = \int_B r(x)\, dx$; with probability $\alpha(y, y + w)$ move forward to $y + w$; otherwise stay at $y$.
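The second description (start at y, propose a step w, move with probability α(y, y + w)) can be sketched for a scalar parameter as below. This is a generic sketch, not the study's implementation: the function name and arguments are hypothetical, the step density r is assumed to be a zero-mean normal, and the target is supplied on the log scale. Because a zero-mean normal is symmetric, r(y - z) = r(z - y) and the ratio in (2.14) reduces to g(z) / g(y).

```python
import math
import random


def random_walk_metropolis(log_g, y0, step_sd, n_iter, rng):
    """Random walk Metropolis for a scalar parameter.

    log_g   : log of the (possibly unnormalized) target density g
    y0      : starting state
    step_sd : standard deviation of the ASSUMED normal step density r
    """
    y = y0
    log_gy = log_g(y)
    draws = []
    for _ in range(n_iter):
        w = rng.gauss(0.0, step_sd)      # candidate step w
        z = y + w                        # candidate state y + w
        log_gz = log_g(z)
        # Symmetric r: acceptance probability is min{g(z) / g(y), 1}.
        if log_gz >= log_gy or rng.random() < math.exp(log_gz - log_gy):
            y, log_gy = z, log_gz        # move forward to y + w
        draws.append(y)                  # otherwise the chain stays at y
    return draws


# Usage: a standard normal target, used only to check the sampler.
draws = random_walk_metropolis(lambda x: -0.5 * x * x, 0.0, 2.0, 20000,
                               random.Random(0))
kept = draws[2000:]                      # drop an arbitrary burn-in
mean = sum(kept) / len(kept)
var = sum((x - mean) ** 2 for x in kept) / len(kept)
print(round(mean, 2), round(var, 2))
```

Within a Gibbs scan, a function of this kind would be called with one update per parameter or block, with `log_g` set to the log of the corresponding complete conditional distribution.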
In the MIRT context, for instance, denote by $r_\theta(\theta_j^{t}, \theta_j^{t+1})$ the transition function of the Markov chain constructed for sampling the $j$th examinee's abilities. For the random walk Metropolis algorithm, the transition kernel can have the form

\[
r_\theta(\theta_j^{t}, \theta_j^{t+1}) = \exp\left\{ -\tfrac{1}{2} (\theta_j^{t} - \theta_j^{t+1})' \Sigma^{-1} (\theta_j^{t} - \theta_j^{t+1}) \right\}. \tag{2.15}
\]

Then the acceptance probability for the new candidate $\theta_j^{*}$, $j = 1, 2, \ldots, N$, from the transition kernel $r_\theta(\theta_j^{t}, \theta_j^{t+1})$ is

\[
\alpha(\theta_j^{t}, \theta_j^{*}) = \min\left\{ \frac{g_\theta(\theta_j^{*})\, r_\theta(\theta_j^{*}, \theta_j^{t})}{g_\theta(\theta_j^{t})\, r_\theta(\theta_j^{t}, \theta_j^{*})},\; 1 \right\}. \tag{2.16}
\]

Note that the target distribution $g_\theta(\cdot)$ is the complete conditional distribution defined previously, i.e.,

\[
g_\theta(\theta_j) = P_\theta(\theta_j \mid \theta_{-j}, A, d, c, U) \propto L(u_j \mid \theta_j, A, d, c)\, \pi_\theta(\theta_j \mid \Sigma)
= \prod_{i=1}^{n} p_i(\theta_j)^{u_{ij}} \left(1 - p_i(\theta_j)\right)^{1 - u_{ij}} \pi_\theta(\theta_j).
\]

Similarly, the acceptance probability for a new candidate $a_i^{*}$ of the item parameters for item $i$, $i = 1, 2, \ldots, n$, from the transition kernel $r_a(a_i^{t}, a_i^{(t+1)})$ is

\[
\alpha(a_i^{t}, a_i^{*}) = \min\left\{ \frac{g_a(a_i^{*})\, r_a(a_i^{*}, a_i^{t})}{g_a(a_i^{t})\, r_a(a_i^{t}, a_i^{*})},\; 1 \right\}, \tag{2.17}
\]

where $g_a(\cdot)$ is the complete conditional distribution for $a_i$, i.e., $g_a(a_i^{*}) \propto L(v_i \mid \Theta, a_i^{*}, d_i, c_i)\, \pi_a(a_i^{*})$. In the same way, we can find $\alpha(d_i^{t}, d_i^{*})$ and $\alpha(c_i^{t}, c_i^{*})$.

The following proposal densities for the person and item parameters are chosen for convenience and efficiency:

The proposal density for $\theta_j^{t+1}$ is $N_p(\theta_j^{t}, \Sigma_\theta^{t})$.
The proposal density for each component $a_{ik}^{t+1}$ of $a_i^{t+1}$ is $U(a_{ik}^{t} - h, a_{ik}^{t} + h)$.
The proposal density for $d_i^{t+1}$ is $N(d_i^{t}, \sigma^2)$.
The proposal density for $c_i^{t+1}$ is $U(c_i^{t} - \delta, c_i^{t} + \delta)$.

Here $h$, $\delta$, and $\sigma^2$ are constants. In this study, $h = 0.3$, $\delta = .03$, and $\sigma^2 = 1$.

Once the complete conditional distribution for each parameter in the multidimensional model has been derived, the corresponding acceptance probabilities can be calculated. Once the proposal densities are specified, parameter samples can be drawn. The steps for drawing parameter samples for the MIRT model are:

1. Draw $\theta_j^{*} \sim N_p(\theta_j^{t}, \Sigma_\theta^{t})$, $\forall j = 1, 2, \ldots, N$. Set $\theta_j^{t+1} = \theta_j^{*}$ with acceptance probability $\alpha(\theta_j^{t}, \theta_j^{*})$.

2. Draw $\Sigma \mid \theta \sim W^{-1}\left(m + n, (n-1)S + \Psi + \frac{nT}{n+T}\bar{\theta}\bar{\theta}'\right)$.

3. Draw each $a_{ik}^{*} \sim U(a_{ik}^{t} - h, a_{ik}^{t} + h)$ and set $a_{ik}^{t+1} = a_{ik}^{*}$ with probability $\alpha(a_{ik}^{t}, a_{ik}^{*})$, $\forall k = 1, 2, \ldots, p$ and $i = 1, 2, \ldots, n$, where $p$ is the total number of dimensions.

4. Draw $d_i^{*} \sim N(d_i^{t}, \sigma^2)$ with acceptance probability $\alpha(d_i^{t}, d_i^{*})$, $\forall i = 1, 2, \ldots, n$.

5. Draw $c_i^{*} \sim U(c_i^{t} - \delta, c_i^{t} + \delta)$ with acceptance probability $\alpha(c_i^{t}, c_i^{*})$, $\forall i = 1, 2, \ldots, n$. Here $h$ and $\delta$ are known constants.

2.4 Unbiased and Consistent Estimators of Parameters

Let $\hat{\theta}_{jk}$, $\hat{a}_{ik}$, $\hat{d}_i$, and $\hat{c}_i$ be the model estimators, $\forall j = 1, 2, \ldots, N$, $i = 1, 2, \ldots, n$, $k = 1, 2, \ldots, p$. For example, if the samples from the complete conditional distributions of $\theta_j$, $a_i$, $d_i$, and $c_i$ are drawn from the constructed Markov chain, then

\[
\hat{\theta}_{jk} = \frac{1}{M} \sum_{m=1}^{M} \theta_{jk}^{m}, \quad
\hat{a}_{ik} = \frac{1}{M} \sum_{m=1}^{M} a_{ik}^{m}, \quad
\hat{d}_i = \frac{1}{M} \sum_{m=1}^{M} d_i^{m}, \quad
\hat{c}_i = \frac{1}{M} \sum_{m=1}^{M} c_i^{m},
\]

where $M$ is the sample size used for the estimates after a burn-in period of a certain length.

Obviously, $E(\hat{\theta}_j) = \theta_j$, $E(\hat{a}_i) = a_i$, $E(\hat{d}_i) = d_i$, and $E(\hat{c}_i) = c_i$, since $E\theta_{jk}^{m} = \theta_{jk}$, $Ea_{ik}^{m} = a_{ik}$, $Ed_i^{m} = d_i$, and $Ec_i^{m} = c_i$. That is, the estimators are unbiased.

The variance of the estimates satisfies $Var(\hat{\theta}_{jk}) = \dfrac{Var(\theta_{jk})}{M} \to 0$, $\forall k = 1, 2, \ldots, p$, as $M \to \infty$. Likewise, $Var(\hat{a}_{ik}) = \dfrac{Var(a_{ik})}{M} \to 0$, $Var(\hat{d}_i) = \dfrac{Var(d_i)}{M} \to 0$, and $Var(\hat{c}_i) = \dfrac{Var(c_i)}{M} \to 0$ as $M \to \infty$. By the law of large numbers, $\hat{\theta}_j \to \theta_j$, $\hat{a}_i \to a_i$, $\hat{d}_i \to d_i$, and $\hat{c}_i \to c_i$ in probability. Therefore, the estimates are consistent.

By the central limit theorem,

\[
\frac{\sqrt{M}\,(\hat{\theta}_{jk} - \theta_{jk})}{\sqrt{Var(\theta_{jk})}} \Rightarrow N(0, 1), \tag{2.18}
\]

as $M \to \infty$, for $j = 1, 2, \ldots, N$. This gives a confidence interval for the estimates of the proficiency parameters. Similar results hold for the item parameter estimates.

Chapter 3

Simulation Studies and Results

The derivations for applying MCMC methods to the 3PL linear logistic multidimensional IRT model are presented in Chapter 2.
This approach is implemented in a C++ program, which provides an efficient computational tool for parameter estimation of MIRT models. The parameter estimates obtained by applying the program are reported in this chapter. The accuracy and stability of the MCMC estimates are examined by simulating various testing situations for the one-, three-, and five-dimensional MIRT models, respectively.

Various simulation studies are presented in this chapter in an attempt to examine the effects of four potential factors on the recovery of item and underlying proficiency parameters. These factors are: the number of proficiency dimensions, the proficiency structure (i.e., the covariance matrix for the proficiency distribution), the test length (i.e., the number of test items), and the sample size (i.e., the number of examinees). Using simulated data to investigate parameter estimation has at least two advantages: (1) since the true person and item parameters are available, they can be used to assess the accuracy of parameter estimates, with smaller root mean square errors (RMSE) between the true parameters and the parameter estimates indicating more accurate estimation; (2) the number of dimensions is known from the simulated data, much as in confirmatory factor analysis, where the factor structure is known before the data are analyzed. Knowing the number of dimensions, researchers do not have to carry out additional analysis to determine how many dimensions each item measures and what these dimensions are about, a strategy that helps separate dimensionality analysis from the issue of parameter estimation. It is necessary to point out that determining the statistical dimension from the observed data itself is a complex and active research area.
For example, researchers suggest detecting the underlying dimension structure by parametric approaches (e.g., Reckase, Ackerman, & Carlson, 1988; Miller & Hirsh, 1992) and nonparametric approaches (e.g., Roussos, 1995). The topic of detecting dimension structure from the observed data is outside the scope of this research. Therefore, controlling the dimensional structure in the simulated data, instead of diagnosing it, facilitates an effective examination of the MCMC estimation approach.

In addition, the performance of parameter estimation by the MCMC approach is examined in this research through simulation experiments only, because (1) real data analysis raises the issue of model-data fit, which is often confounded with the issue of parameter estimation and is not the focus of this study; and (2) it is more difficult to evaluate the accuracy of estimation without the true parameter values.

3.1 Prior Distributions for Model Parameters

The MCMC approach to parameter estimation takes a Bayesian perspective. The item and proficiency parameters are treated not as fixed values but as random variables with probability distributions. The role of the prior distributions for both item and proficiency parameters is to provide additional information on the parameters before data collection and parameter estimation. In this study, the prior distribution for the proficiency vector is $N_p(0, \Sigma_\theta)$. That is, the group of examinees is assumed to come from the multivariate normal population $N_p(0, \Sigma_\theta)$, where $p$ is the number of dimensions. The prior distribution for each component of each a parameter is the uniform distribution, the prior distribution for each d parameter is the standard normal distribution, and the prior for each c parameter is also the uniform distribution.

3.2 Diagnosing the Convergence of Markov Chains

There are many approaches to diagnosing the convergence of a Markov chain.
The purpose of this analysis is to ensure that the Markov chains constructed for the posterior distributions of both item and proficiency parameters through the Metropolis-Hastings within Gibbs algorithm have reached the target stationary distributions before samples are taken for Monte Carlo estimation. Reliable estimation requires that the chain for each parameter converge to its stationary distribution.

Gelfand and Smith (1990) suggested several approaches to checking convergence based on graphical techniques. For m parallel chains, plot a histogram for n values of the kth iteration, after skipping a certain number of iterations (say p iterations), and plot a histogram for n values of the (k + p)th iteration. Convergence is assumed if the histograms have very similar patterns.

Gelman, Carlin, Stern, and Rubin (p. 294, 2004) recommended an approach to inference and convergence assessment based on several independent parallel chains. First, simulate several independent sequences with over-dispersed initial values. If multiple chains with different starting values are well mixed after a certain number of iterations, then one can conclude that the chains have converged.

3.3 Initial Values and Iterations

The choice of initial values should not affect the item and proficiency estimates, because the final estimates rely on the sample from the posterior distributions for the parameters after the chains reach stationarity. The early draws, which depend on the initial values, are often discarded before Monte Carlo estimates for the parameters are computed. However, the initial values may affect the convergence speed of each chain. Thus, carefully selected starting values will accelerate convergence and yield an effective Markov chain. For example, Beguin and Glas (1998) suggested using a = 1, d = 0, and the true c parameter or its estimates from BILOG as starting values and concluded that 1000 burn-in iterations were sufficient.
In this study, random initial values are used each time for the estimation. To ensure the convergence of each chain, a large number of iterations (for example, 10,000) is taken. Moreover, multiple chains (e.g., 3 chains) are constructed for each data set to assess the convergence of each chain and to evaluate the accuracy and stability of the estimates by comparing the estimates from chains with different random initial values. Hence, the starting values used for estimating proficiency parameters in this study are randomly drawn from $N_p(0, I)$, and the initial values for item parameters are randomly sampled from uniform distributions.

Since three independent replications of Markov chains are constructed with different initial values for each data set, the final estimates for the parameters are the mean of the estimates from the three independent chains. For each independent chain, the parameter estimate $\hat{H}$ is the average of the sample from the posterior distribution, i.e.,

\[
\hat{H} = \frac{1}{n} \sum_{i=1}^{n} H_i, \tag{3.1}
\]

where $n$ is the number of samples drawn from the stationary Markov chain for the posterior distribution. Thus the final estimate $\bar{H}$ of a parameter for each data set is the average of the estimates from the multiple independent chains,

\[
\bar{H} = \frac{1}{m} \sum_{i=1}^{m} \hat{H}_i, \tag{3.2}
\]

where $m$ is the number of replications, i.e., $m = 3$ in this study.

All of the data sets are randomly sampled from the linear logistic multidimensional IRT model under various conditions (e.g., test length, the sample size of examinees, the number of dimensions, and different proficiency covariance matrices). To minimize the sampling effects on parameter estimation, three replications are simulated for each condition. The four factors considered in the simulation studies result in a total of 60 dichotomous response data sets. Therefore, the precision of parameter estimates can be compared across the sample sizes, the test lengths, and the proficiency structures.
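The averaging in (3.1) and (3.2) can be sketched in a few lines. This is only an illustration with made-up draws; the function names and the `burn_in` argument are hypothetical, and the chains are assumed to be stored as plain lists of sampled values for a single parameter.

```python
def chain_estimate(draws, burn_in):
    """Equation (3.1): mean of the draws kept after the burn-in period."""
    kept = draws[burn_in:]
    if not kept:
        raise ValueError("burn_in leaves no draws to average")
    return sum(kept) / len(kept)


def final_estimate(chains, burn_in):
    """Equation (3.2): mean of the per-chain estimates."""
    per_chain = [chain_estimate(draws, burn_in) for draws in chains]
    return sum(per_chain) / len(per_chain), per_chain


# Usage with three short made-up chains and a burn-in of one draw.
toy_chains = [[9.0, 1.0, 3.0], [7.0, 2.0, 2.0], [0.0, 4.0, 2.0]]
final, per_chain = final_estimate(toy_chains, 1)
print(per_chain, final)
```

Returning the per-chain estimates alongside their mean makes it easy to inspect how much the chains disagree, which is the informal stability check used later in this chapter.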
3.4 Estimating the Unidimensional 3PL Model

The form of the unidimensional 3PL model is given in equation (1.1) in the first section of Chapter 1. This section discusses parameter estimation using dichotomous response data simulated from the unidimensional 3PL model. One major difference between estimating unidimensional and multidimensional model parameters is that no underlying proficiency dimension structure needs to be estimated in the unidimensional case. To address the model indeterminacy problem and establish a fixed metric for both item and proficiency parameter estimates, the sample from the posterior distributions for the proficiency parameters is standardized at each step of the sample draw. Therefore, the final proficiency parameter estimates are placed on the (0, 1) metric.

For the simulation study in this section, the underlying proficiency parameters and item difficulty parameters are generated from the standard normal distribution N(0, 1); the discriminating power and asymptote item parameters are generated from uniform distributions. Two tests with 30 and 45 items were simulated. Each test is administered to 2000 and 5000 examinees, respectively. The combination of the test length, the sample size, and the replications yields 12 (i.e., 2 x 2 x 3) data sets. Table 3.1 and Table 3.2 give the true item parameters for the two tests.

It can be seen from Table 3.1 and Table 3.2 that both tests contain a wide variety of item parameter values. For example, in the 30-item test, the discriminating power (a) parameters range from .54 to 2.43, the difficulty parameters from -1.64 to 1.6, and the asymptote parameters from 0 to .25. In the 45-item test, the discriminating parameters range from .5 to 2.45, the difficulty parameters from -1.78 to 2.85, and the asymptote parameters from 0 to .25.
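The data-generation design described above can be sketched as follows. This is a hedged illustration, not the generator used in this study: equation (1.1) is not reproduced in this section, so the response function is assumed to be the common form c + (1 - c) / (1 + exp(-a(θ - b))) without a scaling constant, and the uniform ranges for a and c are assumptions chosen to roughly match the ranges reported for the two tests. The function names are hypothetical.

```python
import math
import random


def p_3pl(theta, a, b, c):
    """ASSUMED unidimensional 3PL response probability."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_3pl(n_examinees, n_items, rng):
    """Draw item and person parameters, then 0/1 responses."""
    items = [(rng.uniform(0.5, 2.5),     # a: ASSUMED uniform range
              rng.gauss(0.0, 1.0),       # b: standard normal
              rng.uniform(0.0, 0.25))    # c: ASSUMED uniform range
             for _ in range(n_items)]
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]
    responses = [[1 if rng.random() < p_3pl(t, a, b, c) else 0
                  for (a, b, c) in items]
                 for t in thetas]
    return items, thetas, responses


# Usage: a small data set (the study itself uses 2000 and 5000 examinees).
items, thetas, responses = simulate_3pl(200, 30, random.Random(1))
print(len(responses), len(responses[0]))
```

Each response is a Bernoulli draw with success probability given by the response function, so repeated calls with different seeds would give independent replications of the same design.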
3.4.1 Assessing Convergence

Table 3.3 shows the three independent estimates, one from each chain replication with different initial values, for the data set generated by administering the 30-item test to 2000 examinees. The final item parameter estimates are the mean of the three independent estimates. Clearly, the estimates from the three independent chains are very stable and consistent. For example, item 28 has the same estimates of the a and c parameters over the three chains and a difference of only .01 in the b parameter estimates across the three independent chains. The largest change in the a parameter estimates over the three independent chains is on item 1: 1.82 for the first chain, 1.67 for the second chain (a difference of .15), and 1.72 for the third chain. The slight change in the estimates of each item parameter across the three independent chains indicates that the MCMC estimates are stable. More importantly, one can assess the convergence of the posterior distributions by the stability of the estimates over multiple chains, as suggested by Gelman, Carlin, Stern, and Rubin (p. 294, 2004). Table 3.3 provides numeric evidence that the chain has converged to its stationary distribution. Similar results are obtained for the sample size of 5000 and for the 45-item test but are omitted here.

Table 3.1: True Item Parameters for 30-Item Test (Dim = 1)

Item   Discriminating (a)   Difficulty (b)   Asymptote (c)
  1          1.67               -1.17             0.14
  2          0.89                0.28             0.09
  3          0.55               -1.64             0.14
  4          1.85               -0.72             0.16
  5          2.07                0.50             0.08
  6          1.40                0.46             0.18
  7          2.43                1.37             0.21
  8          0.85               -0.04             0.14
  9          1.39                0.91             0.09
 10          1.25                0.14             0.09
 11          1.52               -0.19             0.21
 12          1.34               -0.80             0.25
 13          1.64               -0.44             0.05
 14          0.99                0.57             0.12
 15          1.48               -1.11             0.16
 16          0.54                0.48             0.09
 17          1.78                1.60             0.03
 18          1.10                0.21             0.14
 19          2.09               -0.31             0.01
 20          2.26                1.10             0.04
 21          1.53                0.65             0.24
 22          0.79               -0.41             0.11
 23          2.40                0.57             0.11
 24          0.73               -1.21             0.25
 25          0.56                0.62             0.02
 26          0.56               -1.43             0.19
 27          1.01                1.51             0.04
 28          2.07                1.31             0.18
 29          2.05               -0.25             0.17
 30          1.48               -1.62             0.10

Table 3.2: True Item Parameters for 45-Item Test (Dim = 1)

Item    a       b      c      Item    a       b      c
  1    1.73    0.03   .09      24    1.98   -0.88   .09
  2    2.45    0.00   .16      25    0.83    0.20   .02
  3    2.35    0.45   .02      26    0.84    0.98   .07
  4    1.04    0.15   .22      27    1.95    0.90   .01
  5    2.37    0.27   .23      28    1.99   -0.51   .19
  6    0.95   -1.78   .05      29    0.92   -1.80   .20
  7    2.06    1.08   .08      30    1.26    2.85   .00
  8    1.43   -0.59   .11      31    1.77   -1.19   .02
  9    1.63   -0.67   .11      32    0.50   -0.44   .17
 10    2.00    0.54   .18      33    1.78   -0.62   .08
 11    2.13    0.33   .25      34    0.61    0.64   .00
 12    1.27   -0.56   .17      35    2.21   -0.57   .19
 13    1.45   -0.64   .22      36    2.31    0.54   .09
 14    2.04   -1.31   .05      37    2.30    0.27   .07
 15    0.53    1.16   .19      38    1.51    1.48   .07
 16    1.51   -1.53   .13      39    2.26    0.45   .10
 17    2.29    0.70   .10      40    0.85   -1.05   .14
 18    0.62   -0.18   .07      41    1.33   -0.33   .15
 19    2.10   -1.08   .11      42    1.02   -1.24   .02
 20    1.69    0.64   .23      43    0.73    1.74   .11
 21    1.55   -1.32   .08      44    1.21    1.53   .07
 22    1.34    0.03   .18      45    1.99    1.31   .19
 23    1.50    1.21   .03       -      -       -     -

Figure 3.1: Sample ACF for series of a6, Dim = 1 [plot of ACF against Lag]

Figure 3.1 describes the estimated autocorrelation function (ACF) in the series of discriminating power for the 5th item after the burn-in draws are discarded. The autocorrelation becomes negligible at lags greater than 28. Figure 3.2 illustrates the behavior of the Markov chains constructed by the M-H within Gibbs algorithm for item 5 in the 30-item test.
The upper panel shows the first 2000 draws from the posterior distribution of the a parameter, the middle panel shows the 2000 draws from the posterior distribution of the b parameter, and the lower panel gives the first 2000 draws from the posterior distribution of the asymptote parameter. The path plot in Figure 3.2 shows that the chains for the fifth item parameters mixed well even in the first 2000 draws. The path plots for the other items in the 30-item and 45-item tests are similar and are not shown.

[Figure 3.2: Sample draw at first 3000 iterations for series of a, b and c. Three panels (a = 1.4, b = .46, c); x-axis: Iteration (0 to 3000); y-axis: Sample Value.]

Table 3.3: Estimates from three chains for 30-Item Test (Dim = 1, N = 2000)

Item   a1    a2    a3     b1     b2     b3    c1   c2   c3
  1   1.82  1.67  1.72  -1.00  -1.07  -1.05  .23  .18  .20
  2   0.87  0.88  0.88   0.23   0.27   0.27  .07  .08  .08
  3   0.46  0.45  0.43  -1.78  -1.80  -1.97  .20  .20  .17
  4   1.66  1.58  1.60  -0.71  -0.75  -0.73  .16  .13  .14
  5   2.09  2.12  2.12   0.56   0.58   0.57  .08  .09  .09
  6   1.37  1.37  1.38   0.56   0.59   0.58  .20  .20  .20
  7   2.09  2.11  2.10   1.43   1.44   1.44  .21  .21  .21
  8   0.99  0.97  0.96   0.26   0.26   0.24  .26  .25  .25
  9   1.50  1.49  1.51   0.90   0.91   0.90  .07  .07  .07
 10   1.45  1.42  1.45   0.18   0.19   0.20  .08  .08  .09
 11   1.70  1.71  1.71  -0.12  -0.10  -0.11  .22  .23  .23
 12   1.17  1.13  1.17  -0.83  -0.87  -0.82  .23  .20  .24
 13   1.68  1.68  1.63  -0.33  -0.31  -0.34  .06  .07  .05
 14   1.01  1.00  0.99   0.64   0.65   0.64  .12  .12  .11
 15   1.51  1.54  1.53  -1.16  -1.12  -1.14  .05  .07  .06
 16   0.64  0.68  0.66   0.83   0.90   0.89  .18  .20  .19
 17   1.82  1.80  1.81   1.68   1.70   1.69  .04  .04  .04
 18   1.19  1.20  1.21   0.17   0.20   0.20  .12  .13  .13
 19   2.06  2.06  2.05  -0.31  -0.29  -0.30  .00  .00  .00
 20   2.35  2.35  2.36   1.17   1.19   1.18  .04  .04  .04
 21   1.60  1.60  1.59   0.76   0.77   0.77  .23  .23  .22
 22   0.79  0.80  0.75  -0.31  -0.26  -0.37  .14  .16  .12
 23   2.24  2.24  2.22   0.60   0.63   0.62  .12  .12  .12
 24   0.67  0.65  0.67  -1.52  -1.55  -1.50  .07  .06  .08
 25   0.58  0.59  0.59   0.78   0.79   0.81  .04  .04  .04
 26   0.59  0.57  0.60  -1.33  -1.37  -1.30  .20  .20  .22
 27   1.14  1.18  1.14   1.45   1.45   1.46  .04  .04  .04
 28   2.22  2.22  2.22   1.36   1.37   1.37  .19  .19  .19
 29   2.34  2.31  2.34  -0.20  -0.20  -0.19  .18  .18  .18
 30   1.72  1.72  1.69  -1.41  -1.39  -1.43  .23  .24  .22

Columns 2 through 7 of Table 3.4 (denoted as $\hat{a}$, $\hat{b}$, $\hat{c}$, S(a), S(b), S(c)) are the item parameter estimates and the corresponding standard errors of the estimates for the 30-item test with sample size 2000 from the first replication of response data. The last six columns are the values for the sample size 5000. Table 3.5 shows the item parameter estimates from BILOG-MG3 using the MML procedure for the same data set, a standard procedure of item parameter estimation in most IRT calibration software. Comparing the item parameter estimates from these two procedures, one can see that the results are very close to each other and close to the true item parameters, indicating that the two estimation methods are comparable. Tables 3.6 and 3.7 show the item parameter estimates and the corresponding standard errors for the 45-item test.

Table 3.4: Item Parameter Estimates for 30-Item Test (Dim = 1)

                    N = 2000                              N = 5000
Item    a      b     c   S(a)  S(b)  S(c)     a      b     c   S(a)  S(b)  S(c)
  1   1.74  -1.04  .20   .16   .07   .06    1.70  -1.14  .13   .10   .07   .05
  2   0.88   0.26  .08   .09   .09   .04    0.83   0.22  .04   .04   .04   .02
  3   0.45  -1.85  .19   .05   .29   .09    0.58  -1.35  .23   .04   .17   .06
  4   1.61  -0.73  .14   .18   .10   .07    1.88  -0.65  .17   .12   .05   .03
  5   2.11   0.57  .09   .19   .04   .01    2.07   0.55  .07   .11   .02   .01
  6   1.37   0.58  .20   .15   .05   .02    1.41   0.56  .18   .09   .03   .01
  7   2.10   1.44  .21   .27   .05   .01    2.17   1.42  .21   .20   .03   .03
  8   0.97   0.25  .25   .09   .08   .03    0.75  -0.10  .09   .05   .07   .03
  9   1.50   0.90  .07   .14   .04   .01    1.38   0.98  .08   .10   .02   .01
 10   1.44   0.19  .08   .10   .04   .02    1.20   0.20  .09   .06   .03   .02
 11   1.71  -0.11  .23   .15   .05   .03    1.64  -0.08  .24   .10   .04   .02
 12   1.16  -0.84  .22   .10   .10   .06    1.29  -0.78  .22   .08   .07   .04
 13   1.66  -0.33  .06   .12   .04   .03    1.55  -0.42  .01   .07   .02   .01
 14   1.00   0.64  .12   .12   .08   .03    0.93   0.63  .12   .07   .06   .02
 15   1.53  -1.14  .06   .13   .08   .06    1.62  -0.95  .18   .11   .06   .04
 16   0.66   0.87  .19   .10   .13   .04    0.55   0.57  .11   .04   .08   .03
 17   1.81   1.69  .04   .25   .07   .01    1.83   1.66  .03   .14   .03   .00
 18   1.20   0.19  .13   .12   .07   .03    1.16   0.23  .13   .07   .04   .02
 19   2.06  -0.30  .00   .13   .03   .01    2.04  -0.27  .01   .09   .01   .01
 20   2.35   1.18  .04   .15   .04   .01    2.24   1.15  .05   .15   .02   .00
 21   1.60   0.77  .23   .18   .05   .02    1.60   0.69  .24   .12   .02   .01
 22   0.78  -0.31  .14   .06   .12   .05    0.71  -0.51  .04   .04   .09   .04
 23   2.23   0.62  .12   .18   .04   .01    2.21   0.62  .10   .14   .02   .01
 24   0.66  -1.52  .07   .04   .14   .06    0.70  -1.41  .10   .03   .10   .06
 25   0.59   0.79  .04   .06   .10   .03    0.58   0.74  .03   .04   .09   .03
 26   0.59  -1.33  .21   .06   .26   .09    0.55  -1.69  .04   .03   .11   .04
 27   1.15   1.45  .04   .13   .07   .01    1.09   1.54  .05   .08   .04   .01
 28   2.22   1.37  .19   .21   .05   .01    2.35   1.35  .18   .14   .03   .01
 29   2.33  -0.20  .18   .16   .04   .02    2.14  -0.14  .20   .13   .02   .02
 30   1.71  -1.41  .23   .17   .08   .06    1.79  -1.35  .23   .15   .08   .06

Table 3.5: Item Parameter Estimates for 30-Item Test in BILOG-MG3 (Dim = 1)

           N = 2000             N = 5000
Item    a      b     c       a      b     c
  1   1.83  -1.00  .25     1.71  -1.17  .15
  2   0.88   0.21  .08     0.81   0.18  .04
  3   0.61  -0.65  .50     0.58  -1.44  .21
  4   1.67  -0.73  .16     1.90  -0.68  .18
  5   2.09   0.52  .08     2.04   0.51  .07
  6   1.39   0.54  .20     1.39   0.53  .18
  7   2.20   1.38  .21     2.14   1.39  .21
  8   0.95   0.17  .24     0.75  -0.12  .11
  9   1.49   0.85  .07     1.38   0.94  .09
 10   1.45   0.15  .08     1.17   0.14  .08
 11   1.66  -0.17  .22     1.62  -0.13  .24
 12   1.18  -0.83  .25     1.27  -0.83  .21
 13   1.65  -0.37  .05     1.53  -0.47  .01
 14   1.00   0.58  .11     0.93   0.60  .12
 15   1.46  -1.23  .00     1.64  -0.97  .19
 16   0.67   0.84  .19     0.57   0.58  .12
 17   1.82   1.62  .04     1.81   1.63  .03
 18   1.23   0.17  .13     1.15   0.18  .12
 19   2.02  -0.33  .00     2.02  -0.31  .01
 20   2.52   1.12  .04     2.21   1.12  .04
 21   1.60   0.72  .23     1.57   0.66  .24
 22   0.78  -0.35  .14     0.66  -0.61  .00
 23   2.28   0.56  .12     2.19   0.59  .10
 24   0.64  -1.67  .00     0.72  -1.31  .17
 25   0.58   0.73  .03     0.55   0.64  .01
 26   0.57  -1.52  .15     0.51  -1.87  .00
 27   1.16   1.40  .04     1.08   1.51  .05
 28   2.41   1.31  .19     2.47   1.31  .18
 29   2.51  -0.21  .19     2.09  -0.19  .20
 30   1.72  -1.39  .28     1.82  -1.36  .26

Table 3.6: Item Parameter Estimates for 45-Item Test (Dim = 1)

                    N = 2000                              N = 5000
Item    a      b     c   S(a)  S(b)  S(c)     a      b     c   S(a)  S(b)  S(c)
  1   1.54  -0.01  .08   .10   .03   .02    1.59   0.01  .07   .07   .03   .01
  2   2.42   0.06  .17   .11   .03   .02    2.31   0.08  .17   .13   .02   .01
  3   2.31   0.46  .02   .15   .02   .01    2.41   0.50  .02   .10   .02   .00
  4   1.00   0.20  .24   .10   .07   .03    0.95   0.20  .20   .05   .03   .02
  5   2.37   0.31  .26   .14   .04   .02    2.44   0.35  .24   .09   .03   .01
  6   0.96  -1.65  .09   .08   .14   .07    0.99  -1.55  .11   .05   .08   .05
  7   2.10   1.13  .07   .20   .03   .01    2.06   1.16  .08   .16   .02   .01
  8   1.35  -0.60  .14   .14   .11   .06    1.41  -0.49  .15   .07   .04   .03
  9   1.55  -0.69  .11   .12   .06   .04    1.66  -0.62  .12   .08   .03   .02
 10   1.93   0.55  .16   .18   .04   .02    2.08   0.58  .17   .12   .02   .01
 11   1.87   0.33  .24   .17   .05   .02    2.03   0.37  .24   .12   .03   .01
 12   1.39  -0.42  .23   .14   .10   .06    1.33  -0.47  .18   .08   .04   .03
 13   1.28  -0.71  .18   .15   .12   .06    1.36  -0.63  .18   .06   .04   .03
 14   2.29  -1.24  .08   .17   .06   .05    2.23  -1.19  .05   .12   .03   .03
 15   0.50   1.01  .15   .08   .17   .05    0.43   1.02  .12   .05   .13   .04
 16   1.53  -1.53  .21   .15   .11   .06    1.57  -1.43  .20   .18   .15   .11
 17   1.89   0.76  .09   .17   .04   .01    2.16   0.75  .09   .12   .02   .01
 18   0.61  -0.13  .10   .06   .16   .06    0.59  -0.25  .04   .03   .10   .04
 19   2.28  -1.07  .14   .15   .05   .04    1.91  -1.07  .07   .13   .05   .03
 20   1.62   0.70  .23   .17   .06   .02    1.77   0.71  .25   .13   .03   .01
 21   1.60  -1.30  .09   .13   .08   .06    1.43  -1.35  .02   .07   .04   .02
 22   1.38   0.04  .19   .12   .06   .03    1.45   0.11  .19   .07   .02   .01

Table 3.7: Item Parameter Estimates for 45-Item Test (Dim = 1), cont.

                    N = 2000                              N = 5000
Item    a      b     c   S(a)  S(b)  S(c)     a      b     c   S(a)  S(b)  S(c)
 23   1.40   1.25  .02   .12   .05   .01    1.51   1.25  .03   .08   .03   .00
 24   1.84  -0.96  .04   .14   .05   .03    2.08  -0.78  .12   .09   .02   .02
 25   0.85   0.24  .02   .06   .06   .02    0.83   0.19  .00   .03   .03   .01
 26   1.03   1.19  .12   .14   .07   .02    0.87   1.13  .09   .06   .04   .01
 27   1.69   1.01  .01   .16   .04   .01    1.85   0.98  .01   .10   .02   .00
 28   1.80  -0.52  .24   .15   .04   .03    1.95  -0.45  .21   .12   .05   .03
 29   0.87  -1.95  .13   .10   .22   .10    0.85  -1.94  .07   .05   .09   .05
 30   1.19   3.17  .00   .20   .29   .00    1.26   2.98  .00   .09   .10   .00
 31   1.72  -1.22  .02   .11   .05   .03    1.69  -1.12  .04   .12   .07   .04
 32   0.44  -0.72  .11   .04   .21   .06    0.50  -0.67  .05   .02   .10   .04
 33   1.66  -0.63  .06   .13   .05   .03    1.76  -0.56  .05   .08   .03   .02
 34   0.64   0.69  .02   .06   .09   .02    0.66   0.72  .03   .05   .05   .02
 35   2.22  -0.58  .20   .20   .06   .04    2.40  -0.48  .21   .11   .03   .02
 36   2.39   0.55  .08   .12   .03   .01    2.40   0.58  .09   .11   .02   .01
 37   2.37   0.32  .08   .14   .03   .01    2.46   0.33  .08   .07   .02   .01
 38   1.37   1.56  .07   .19   .06   .01    1.47   1.54  .07   .11   .04   .01
 39   2.33   0.50  .10   .14   .03   .01    2.37   0.49  .09   .11   .02   .01
 40   0.76  -1.24  .05   .05   .09   .05    0.80  -1.13  .07   .04   .10   .05
 41   1.18  -0.43  .06   .08   .05   .03    1.37  -0.24  .16   .06   .03   .02
 42   1.32  -0.94  .21   .10   .08   .06    1.21  -1.01  .14   .05   .05   .04
 43   0.77   1.76  .11   .13   .09   .02    0.77   1.78  .11   .08   .08   .01
 44   1.19   1.59  .06   .13   .06   .01    1.27   1.60  .08   .09   .04   .01
 45   1.84   1.36  .19   .24   .05   .01    1.94   1.34  .19   .16   .03   .01

As is true in many estimation programs in IRT, item parameter estimates contain estimation errors even if the data and the mathematical models have perfect fit.
To examine the accuracy of the item parameter estimates, root mean square errors (RMSE) of the item parameter estimates are calculated from each data replication and each chain. In this study, three data replications are used for both tests (i.e., the 30-item test and the 45-item test). Here a data replication means, for example, that the 30-item test is administered to three groups of different examinees who come from the same population, $N(0, 1)$. Therefore, there are three sets of item parameter estimates corresponding to the three groups of examinees. For each data set, the computer program runs three different chains with three different sets of initial values to make sure that the MCMC approach provides stable parameter estimates. Each chain independently gives estimates of the item parameters. Therefore, combining three data replications and three chains for each data set yields nine sets of item parameter estimates. For each data set, the final item parameter estimates are the averages of the estimates from the three chains. RMSE is defined as the square root of the mean squared difference between the item parameter estimates and the true item parameters over $r$ data replications and across $n$ items ($r$ in this example is 3, and $n$ is 30 or 45). Let $\eta$ denote an item parameter (e.g., the discriminating power parameter a, the difficulty parameter b, or the asymptote parameter c) and $\hat{\eta}$ the item parameter estimate. Then RMSE can be calculated by

$$\mathrm{RMSE}(\eta) = \sqrt{\frac{\sum_{i=1}^{n}\sum_{j=1}^{r}\left(\hat{\eta}_{ij} - \eta_{i}\right)^{2}}{r \times n}}.$$

RMSE gives a summary index for assessing the accuracy of the item parameter estimates. The larger the RMSE of the item parameter estimates for a data set, the worse the item parameter estimates.

Table 3.8: RMSE for Estimating Uni-dimensional Models (Dim = 1)

      30 x 2000   30 x 5000   45 x 2000   45 x 5000
a        .15         .07         .11         .07
b        .11         .08         .08         .08
c        .05         .04         .04         .03
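The RMSE definition above can be sketched in a few lines of NumPy. This is only an illustration of the formula, not the program used in the study; the function names `final_estimates` and `rmse` and the array layout (replications by chains by items) are assumptions made for the example.

```python
import numpy as np


def final_estimates(chain_estimates):
    """Average the per-chain estimates within each data replication.

    chain_estimates: array of shape (r, k, n) for r replications,
    k chains per replication, and n items (layout assumed here).
    Returns an array of shape (r, n).
    """
    return np.asarray(chain_estimates, dtype=float).mean(axis=1)


def rmse(estimates, true_values):
    """RMSE over r replications and n items.

    estimates: shape (r, n), one row of final estimates per replication.
    true_values: shape (n,), the generating item parameters.
    """
    estimates = np.asarray(estimates, dtype=float)
    true_values = np.asarray(true_values, dtype=float)
    diff = estimates - true_values  # true values broadcast over replications
    # mean over all r * n squared differences, then the square root
    return float(np.sqrt(np.mean(diff ** 2)))
```

With three replications and three chains, `rmse(final_estimates(x), true_a)` would give one entry of a table such as Table 3.8, where `x` has shape (3, 3, n).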
For a simulation study, a perfect fit between the model and the data is assumed, and thus the difference between the true item parameters and their estimates may depend on the estimation procedure and some other factors (e.g., the sample size of examinees). Table 3.8 contains the RMSE for the item parameters. It shows that, for the same test, the larger sample size gives an RMSE that is smaller or, for b in the 45-item test, equal, and hence no more estimation error. The largest RMSE for a is .15 in the 30-item test with 2000 examinees. The smallest RMSE for a is .07 in both tests when the sample size is 5000. The largest RMSE for b is .11 in the 30-item test with 2000 examinees. It also shows that the RMSE for c is generally smaller than the RMSE for a and b, with the largest value, .05, in the 30-item test with 2000 examinees.

Table 3.9: Correlations Between True Proficiency and Estimates (Dim = 1)

Tests       N = 2000   N = 5000
30-items     .9546      .9554
45-items     .9712      .9718

Table 3.9 shows the correlation between the true proficiency and the estimates from the MCMC approach. For the 30-item test, the correlations are about .955. The correlations in the 45-item test are about .97, slightly higher than those in the 30-item test. That is, the longer test gives a higher correlation between the true values and the estimates, implying better proficiency parameter estimation.

Figure 3.3 shows the plots of true proficiency versus estimates corresponding to the four correlations in Table 3.9. One can see that the proficiency estimates from the longer test (i.e., the 45-item test) lie more closely around the reference line y = x, representing a higher correlation between the true values and the estimates. Figures 3.4 through 3.6 plot the true item parameters versus the estimates for parameters a, b, and c, respectively. Most of the points are close to the reference line y = x. In the panels with the larger sample size, the points are closer to the reference line, implying better item parameter estimates.
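The two convergence checks used in Section 3.4.1 (autocorrelation within a chain, and agreement among chains started from different initial values) can be sketched as follows. The sample ACF mirrors what Figure 3.1 plots. The potential scale reduction factor is a standard formalization of the multi-chain comparison described by Gelman et al. (2004); the text itself compares chain means informally and does not report this statistic, so it is included here only as an assumed illustration. The function names are hypothetical.

```python
import numpy as np


def sample_acf(draws, max_lag):
    """Sample autocorrelation of one post-burn-in chain at lags 0..max_lag."""
    x = np.asarray(draws, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    acf = [1.0]
    for k in range(1, max_lag + 1):
        # lag-k autocovariance divided by the lag-0 autocovariance
        acf.append(np.dot(x[:-k], x[k:]) / denom)
    return np.array(acf)


def potential_scale_reduction(chains):
    """Gelman-Rubin statistic for one parameter.

    chains: array of shape (m, T), m chains with T post-burn-in draws each.
    Values near 1 suggest the chains agree with one another.
    """
    chains = np.asarray(chains, dtype=float)
    m, T = chains.shape
    within = chains.var(axis=1, ddof=1).mean()      # W: mean within-chain variance
    between = T * chains.mean(axis=1).var(ddof=1)   # B: between-chain variance
    var_plus = (T - 1) / T * within + between / T   # pooled variance estimate
    return float(np.sqrt(var_plus / within))
```

A thinning interval or effective sample size could be read off the lag at which `sample_acf` drops near zero, in the spirit of the lag-28 observation for Figure 3.1.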
[Figure 3.3: True Proficiency Versus Estimates (Dim = 1). Panels: n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 2000; n = 45, N = 5000. x-axis: True Ability; y-axis: Estimate.]

[Figure 3.4: True a Parameter Versus Estimates (Dim = 1). Panels: n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 2000; n = 45, N = 5000. x-axis: Parameter a; y-axis: Estimates.]

[Figure 3.5: True b Parameter Versus Estimates (Dim = 1). Panels: n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 2000; n = 45, N = 5000. x-axis: Parameter b; y-axis: Estimates.]

[Figure 3.6: True c Parameter Versus Estimates (Dim = 1). Panels: n = 30, N = 2000; n = 30, N = 5000; n = 45, N = 2000; n = 45, N = 5000. x-axis: Parameter c; y-axis: Estimates.]

3.5 Estimating the 3-Dimensional MIRT Model

This section discusses the simulation studies of parameter estimation for the 3-dimensional model, which differs slightly from parameter estimation for the unidimensional model because the number of parameters in the multidimensional model is much greater than that in the unidimensional model. In addition, new parameters (e.g., proficiency structure parameters that appear as the components of the covariance matrix of the underlying proficiency distribution) need to be estimated at the same time as the item and proficiency parameters. One more concern for MIRT model parameter estimation is the issue of indeterminacy that is inherent in the form of the MIRT model. Basically, one needs to impose some constraints to ensure that the MIRT model parameters have fixed solutions.
The following sections discuss the design of the simulation studies: how to generate the item and proficiency parameters and the response data, how to specify the underlying proficiency covariance, how to put constraints on the items in a test to establish a fixed scale for the parameter estimates, and how to assess the accuracy and stability of the parameter estimation.

3.5.1 Generating Proficiency Parameters

Assume that the proficiency vector of each examinee follows the multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma_\theta$. That is, $\theta_j \sim N_p(\mu, \Sigma_\theta)$, where $j = 1, 2, \ldots, N$. Proficiency parameters for each examinee are randomly drawn from $N_p(0, \Sigma_\theta)$, where $p$ is the number of dimensions and $\Sigma_\theta$ is the generating covariance matrix, which corresponds to the dimensional structure and is discussed further in Section 3.5.3. The mean vector $\mu$ is set to 0 because each dimension represents one hypothetical construct and comparison among dimensions does not seem necessary.

3.5.2 The Number of Proficiency Dimensions and Sample Size

One factor that might indirectly affect the parameter estimation in the MIRT model is the number of proficiency dimensions (i.e., the number of latent variables in the complete latent space). As is known, the unidimensional IRT model (dim = 1) has 3 parameters for each item and one parameter for each examinee's proficiency. For a test with n items and N examinees, the total number of parameters to be estimated is 3n + N. But in the case of the 3-dimensional MIRT model, there are 5 parameters for each item (i.e., three a parameters plus the d and c parameters), 3 parameters for each examinee's proficiency, and 3 more parameters for the components of the proficiency covariance matrix. Therefore, for a test with n items and N examinees, the total number of model parameters to be estimated is 5n + 3N + 3, much more than in the unidimensional model.
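The two counts above, 3n + N and 5n + 3N + 3, are instances of one pattern: each item carries p discrimination parameters plus d and c, each examinee carries p proficiencies, and a standardized covariance matrix has p(p - 1)/2 free off-diagonal entries. The general formula below is an assumption that generalizes the two stated cases (it also reproduces the (5 + 2)n + 5N + 10 count given for the five-dimensional model in Section 3.6); the function name is hypothetical.

```python
def n_parameters(n_items, n_examinees, n_dims):
    """Total number of model parameters for a p-dimensional 3PL MIRT model,
    assuming unit variances so that only correlations are free."""
    item_params = (n_dims + 2) * n_items          # p a-parameters, plus d and c
    person_params = n_dims * n_examinees          # p proficiencies per examinee
    cov_params = n_dims * (n_dims - 1) // 2       # free off-diagonal entries
    return item_params + person_params + cov_params


for p in (1, 3, 5):
    print(p, n_parameters(30, 2000, p))
```

For the 30-item test with 2000 examinees this gives 2090, 6153, and 10220 parameters for one, three, and five dimensions, which matches the counts quoted in the text.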
The increasing number of parameters in the MIRT model makes estimation more difficult for a given test length n and examinee sample size N, since more information is required to achieve the same level of estimation precision.

The simulation studies here consider two different numbers of dimensions for estimating MIRT models: three and five proficiency dimensions. That is, three or five dimensions of proficiency are required to determine the correct answers in the simulation studies.

The stability of the Monte Carlo estimates may depend on the sample size (this would also be the case for maximum likelihood and Bayesian modal estimation). To investigate the effect of the sample size on the accuracy and stability of the estimation, response data with sample sizes of 2000 and 5000 examinees are independently generated from the multivariate normal population. The sample size 2000 is considered moderate, and 5000 large.

3.5.3 Proficiency Structure

For multivariate analysis, estimating the covariance matrix is an important step, because the covariance structure can reveal helpful information about the interrelations among the variables of interest. Since comparisons among proficiency dimensions are not useful in testing practice, one can standardize the set of proficiency components and thus make the variance of each proficiency dimension equal to 1, which reduces the number of parameters in the proficiency covariance. For example, if a test requires 3-dimensional proficiency, three additional parameters are needed to describe the proficiency covariance. However, the off-diagonal components represent the interrelations among the required proficiency dimensions, and the pairwise correlations in the matrix may vary. For the MIRT model, the generating covariance matrices used are of the form

$$\begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}$$

for simplicity, where $\rho$ in the proficiency structure matrix equals .2, which is denoted as

$$\Sigma_{\theta.2} \equiv \begin{pmatrix} 1 & .2 & \cdots & .2 \\ .2 & 1 & \cdots & .2 \\ \vdots & \vdots & \ddots & \vdots \\ .2 & .2 & \cdots & 1 \end{pmatrix}.$$

For a more general case, $\rho$ takes different values for the off-diagonal components. For example, the generating covariance matrix for the 3-dimensional model has off-diagonal components from .2 to .7, denoted as

$$\Sigma_{\theta g} \equiv \begin{pmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{12} & 1 & \rho_{23} \\ \rho_{13} & \rho_{23} & 1 \end{pmatrix}, \qquad .2 \le \rho_{kl} \le .7.$$

3.5.4 Generating Item Parameters

It is natural to assume that some items in a test measure only one proficiency dimension (call such items uni-items) and that some items measure two or more dimensions (call such items multi-items). A test can be composed of both uni-items and multi-items. Two tests that include both uni-items and multi-items, with 30 and 45 items, respectively, are generated in this simulation study on estimation for the 3-dimensional MIRT model.

Table 3.10 contains the true item parameters for the 30-item test. The first 15 items measure only one proficiency dimension and the remaining 15 items measure three proficiency dimensions. The components of the parameter vector a range from 0 to 2.45. Note that for the items that measure 3-dimensional abilities, some components of the a parameter are dominant over the other dimensions (e.g., items 20, 21, 24), and some items have very close values of the a parameters on two or three dimensions (e.g., items 19, 25, 26, 27). The values of the d parameters are simulated from the standard normal distribution $N(0, 1)$. The lowest d value is -1.63 and the highest is 2.38, indicating that a wide range of d values is included in the test. Asymptote parameters c are drawn from the uniform distribution $U(0, .25)$. High guessing parameters are not expected for good test items, as is the case in this example. Combining the number of items (e.g., 30 and 45 items), the sample size (e.g., 2000 and 5000), and the underlying proficiency structure (e.g., $\Sigma_{\theta.2}$ and $\Sigma_{\theta g}$), there are in all 24 dichotomous response data sets generated.
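The data-generation design of Sections 3.5.1 through 3.5.4 can be sketched as below. This is a minimal illustration rather than the program used in the study. It assumes the compensatory three-parameter logistic form P = c + (1 - c) / (1 + exp(-(a'θ + d))) with no scaling constant, since the model equation is not restated in this section; the function names `equicorrelation` and `simulate_responses` are hypothetical.

```python
import numpy as np


def equicorrelation(p, rho):
    """p x p covariance matrix with unit variances and common correlation rho."""
    return np.full((p, p), rho) + (1.0 - rho) * np.eye(p)


def simulate_responses(a, d, c, sigma, n_examinees, rng):
    """Draw proficiencies from N_p(0, sigma) and dichotomous responses.

    a: (n_items, p) discriminations; d, c: (n_items,) intercepts and asymptotes.
    Returns theta with shape (N, p) and a 0/1 response matrix with shape (N, n_items).
    """
    a = np.asarray(a, dtype=float)
    d = np.asarray(d, dtype=float)
    c = np.asarray(c, dtype=float)
    p = a.shape[1]
    theta = rng.multivariate_normal(np.zeros(p), sigma, size=n_examinees)
    logit = theta @ a.T + d                           # (N, n_items)
    prob = c + (1.0 - c) / (1.0 + np.exp(-logit))     # assumed 3PL MIRT form
    responses = (rng.random(prob.shape) < prob).astype(int)
    return theta, responses


# Usage with the first three (anchor) items of Table 3.10 and rho = .2
rng = np.random.default_rng(2005)
a = [[1.30, 0, 0], [0, 0.50, 0], [0, 0, 2.10]]
d = [-0.23, 0.02, -1.00]
c = [0.21, 0.00, 0.24]
theta, y = simulate_responses(a, d, c, equicorrelation(3, 0.2), 2000, rng)
```

Passing a different `sigma` with unequal off-diagonal entries would correspond to the general proficiency structure, and repeating the call with fresh random draws would give the data replications.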
To solve the indeterminacy problem and establish a fixed scale for the model parameter estimates, the first three items are chosen to be unidimensional items, which are assumed to measure only the first, the second, and the third dimension, respectively. More specifically, the a values of the first item are zero on the second and third dimensions, the a values of the second item are zero on the first and third dimensions, and similarly the a values of the third item are zero on the first and second dimensions. These three items are viewed as anchor items because they are placed at the first three positions in the test and all are unidimensional items; this is treated as a constraint in order to settle the metric issue, or the indeterminacy problem, that is inherent in MIRT models. It is argued that the model can be identified by setting the mean vector of the proficiency parameters equal to zero and standardizing the covariance matrix, plus the above constraints, which are also used in the exploratory option of NOHARM (Fraser, 1988).

Table 3.11 contains the true parameters for the 45-item test. The first thirty items measure only one proficiency dimension. Item 1 and items 4 through 12 load only on the first dimension, item 2 and items 13 through 21 measure the second dimension, and item 3 and items 22 through 30 load only on the third dimension. The remaining 15 items of the test, items 31 through 45, measure all three dimensions. The a parameters in the 45-item test also cover a wide range, from 0 (item 2) to 2.43 (item 41). The minimum value of parameter d is -2.06 (item 26) and the maximum is 2.07 (item 6). The c parameters are within the range of .01 to .24 in this test.
The first three items in the 45-item test are also unidimensional items placed in the first three positions of the test, on the assumption that these three items measure the first, the second, and the third dimension well, respectively. The purpose of placing the three unidimensional items in the first three positions of the test is to settle the indeterminacy problem and establish a fixed scale for the item and proficiency parameter estimates.

Table 3.10: True Item Parameters for 30-Item Test (Dim = 3)

Item   a1    a2    a3     d     c
  1   1.30   0     0    -0.23  .21
  2   0     0.50   0     0.02  .00
  3   0      0    2.10  -1.00  .24
  4   1.93   0     0     0.61  .06
  5   0.81   0     0     0.31  .24
  6   1.62   0     0     1.76  .00
  7   0.59   0     0     1.56  .06
  8   0     2.45   0    -0.38  .08
  9   0     1.88   0    -0.86  .14
 10   0     0.57   0    -0.51  .02
 11   0     1.15   0     1.25  .03
 12   0      0    1.35  -0.29  .25
 13   0      0    0.98   2.38  .09
 14   0      0    1.46  -1.45  .12
 15   0      0    1.49  -0.30  .21
 16   1.34  2.23  1.98  -1.24  .05
 17   1.84  2.34  0.90   0.08  .00
 18   0.86  1.04  1.76   1.13  .03
 19   1.93  1.65  1.96   0.61  .11
 20   0.56  0.87  1.97   1.23  .09
 21   2.20  0.96  1.16  -1.01  .05
 22   1.58  1.48  2.29  -1.58  .19
 23   1.26  1.68  1.45  -0.07  .16
 24   2.37  0.75  0.52  -1.37  .02
 25   1.94  1.99  1.16  -1.63  .04
 26   0.89  1.32  0.92   0.35  .20
 27   1.25  1.56  1.64   1.08  .06
 28   2.07  1.71  2.43   0.79  .06
 29   1.41  0.96  2.12   0.46  .20
 30   0.98  2.30  1.64  -0.43  .08

Table 3.11: True Item Parameters for 45-Item Test (Dim = 3)

Item   a1    a2    a3     d     c     Item   a1    a2    a3     d     c
  1   1.12   0     0     0.18  .09     24    0     0    0.24   0.56  .13
  2   0     1.51   0     1.28  .08     25    0     0    1.96  -0.23  .14
  3   0      0    1.24  -0.46  .19     26    0     0    0.44  -2.06  .04
  4   2.03   0     0    -1.74  .05     27    0     0    0.91   0.24  .07
  5   1.92   0     0     1.24  .14     28    0     0    2.46   0.05  .07
  6   1.84   0     0     2.07  .19     29    0     0    0.77   1.06  .01
  7   2.43   0     0     0.42  .11     30    0     0    1.57   0.51  .02
  8   0.94   0     0     1.04  .03     31   2.26  0.52  1.85  -2.05  .06
  9   0.89   0     0     0.27  .13     32   1.20  1.05  0.79   0.28  .04
 10   0.52   0     0    -0.69  .22     33   2.31  1.25  1.98  -0.31  .13
 11   0.30   0     0    -0.75  .23     34   1.38  0.64  1.62   0.80  .02
 12   0.94   0     0     0.65  .12     35   0.53  2.21  1.23  -0.55  .02
 13   0     0.53   0    -0.92  .22     36   0.95  1.09  1.02  -0.99  .23
 14   0     0.91   0     1.28  .03     37   0.22  0.92  0.80   0.04  .23
 15   0     0.22   0     0.02  .17     38   1.77  2.50  0.78   1.33  .01
 16   0     1.03   0    -1.64  .02     39   1.32  2.19  1.32   1.30  .13
 17   0     1.87   0    -1.69  .02     40   1.85  1.08  1.22  -0.45  .09
 18   0     0.79   0    -1.11  .03     41   1.27  2.21  2.43   1.98  .10
 19   0     1.81   0    -0.47  .17     42   0.22  1.15  2.00   0.50  .24
 20   0     1.75   0     1.31  .12     43   0.36  1.25  0.21  -0.43  .16
 21   0     0.73   0    -1.07  .21     44   1.95  1.60  1.35  -0.90  .03
 22   0      0    2.05  -1.22  .04     45   1.11  2.21  1.07  -0.40  .07
 23   0      0    2.05   0.47  .05      -     -     -     -      -    -

3.5.5 The Estimation Accuracy and Stability for the 3-Dimensional MIRT Model

Table 3.12 contains the RMSE for the item parameters in the 3-dimensional model for both tests with sample sizes 2000 and 5000, under the condition that all of the off-diagonal components of the proficiency covariance are equal to .2. Note that the item parameter estimates are the means of the three individual estimates of the item parameters, which are based on three chains with different random initial values. By taking the means of the individual estimates based on multiple chains for the same data set, one can expect the final estimates to be more stable and accurate because the fluctuation of the item parameter estimates induced by the initial values and sampling errors is taken into account.

Table 3.12: RMSE for Multi-dimensional Test (Dim = 3, ρ = .2)

Estimates   30 x 2000   30 x 5000   45 x 2000   45 x 5000
a1             .15         .08         .15         .06
a2             .12         .08         .16         .04
a3             .18         .08         .11         .05
d              .19         .10         .22         .11
c              .07         .03         .06         .03

Table 3.12 shows that, for a given test (e.g., the 30-item or 45-item test), the larger the sample size, the smaller the RMSE. For the 30-item test, the largest RMSE for a is .18 when the sample size is 2000, but is .08 when the sample size is 5000. The RMSE for the d parameter is .19 when the sample size is 2000 and .10 for the sample size 5000. The RMSE for the c parameter is .07 for sample size 2000 but .03 for 5000. Similar results can also be found in the 45-item test. The smallest RMSE is .04, for an a parameter in the 45-item test with sample size 5000.
Note that within the same test and with the same sample size, the RMSE for $a_i$, $i = 1, 2, 3$, are close to each other, which implies that the estimation can achieve the same level of precision across dimensions. It also shows that the RMSE for c is generally smaller than the RMSE for a and d, with the largest value, .07, in the 30-item test with 2000 examinees.

Table 3.13 gives the RMSE for the situation in which the underlying proficiency covariance is a general one, that is, it does not follow a special pattern (e.g., all off-diagonal components of the proficiency covariance matrix being the same). The results of the parameter estimation for this condition are very similar to the case in which the off-diagonal components of the covariance matrix are equal to .2. This implies that the underlying proficiency covariance does not affect the item parameter estimates, which is expected because the estimation of the item and proficiency parameters are independent.

Table 3.13: RMSE for Multi-dimensional Test (Dim = 3, ρ = general)

Estimates   30 x 2000   30 x 5000   45 x 2000   45 x 5000
a1             .10         .12         .13         .06
a2             .14         .06         .11         .07
a3             .13         .12         .13         .06
d              .13         .11         .21         .15
c              .06         .03         .06         .04

Table 3.14: Correlations Between True Proficiency and Estimates (Dim = 3, ρ = .2)

                     30 x 2000   30 x 5000   45 x 2000   45 x 5000
corr(θ1, θ̂1)          .8765       .8737       .9144       .9136
corr(θ2, θ̂2)          .8677       .8703       .9125       .9121
corr(θ3, θ̂3)          .8531       .8649       .9109       .9146

Compared to the RMSE for the unidimensional model in Table 3.8, the RMSE for the item parameter estimates for the 3-dimensional MIRT model in Tables 3.12 and 3.13 are generally higher. It is clear that, given the same amount of data, the more parameters to be estimated, the larger the estimation errors. It can be seen that for the same test, a larger sample size gives a smaller RMSE in most cases (the exception in Table 3.13 is $a_1$ in the 30-item test, .10 versus .12).
The RMSE for the a parameters across dimensions are close to each other, ranging from .10 to .14 for the sample size 2000 and from .06 to .12 for the sample size 5000. The largest RMSE for d is .21, which occurs in the 45-item test with 2000 examinees; the smallest is .11, in the 30-item test with sample size 5000. Generally speaking, the RMSE for parameter c are smaller than those for parameters a and d, varying from .03 to .06, because c is restricted to a very small range. The RMSE of c for the sample size 5000 are about half of those for 2000 examinees.

The correlations between the true abilities and the estimates are presented in Tables 3.14 and 3.15 for ρ = .2 and for varying ρ, respectively. Table 3.14 shows that for the 30-item test the correlations between the true values and the estimates are around .87, within a very small range from .8531 to .8765. Also, the correlations for the 45-item test differ only slightly, from .9109 to .9146. The 45-item test in general has higher correlations (around .91) between the true and the estimated abilities than the 30-item test. This implies that the proficiency estimates improve for the longer test; that is, the estimation precision for proficiency in the longer test is better than that in the shorter test (i.e., the 30-item test).

Table 3.15: Correlations Between True Proficiency and Estimates (Dim = 3, ρ = general)

                     30 x 2000   30 x 5000   45 x 2000   45 x 5000
corr(θ1, θ̂1)          .8876       .8943       .9198       .9259
corr(θ2, θ̂2)          .8878       .8966       .9211       .9255
corr(θ3, θ̂3)          .8474       .8602       .9111       .9101

Table 3.15 presents the correlations between the true proficiency ($\theta$) and the estimates ($\hat{\theta}$) for the situation in which the off-diagonal components of the proficiency covariance matrix take different values. The 30-item test gives correlations from .8474 to .8966. Higher correlations are found in the 45-item test, ranging from .9101 to .9259. No noticeable difference in correlations is found across dimensions.
For example, for the 30-item test with 2000 examinees, the correlation between the first proficiency dimension and its estimates is $\mathrm{corr}(\theta_1, \hat{\theta}_1) = .8876$, the correlation between the second proficiency dimension and its estimates is $\mathrm{corr}(\theta_2, \hat{\theta}_2) = .8878$, and the correlation for the third dimension is $\mathrm{corr}(\theta_3, \hat{\theta}_3) = .8474$. Comparing Table 3.14 to Table 3.15, slightly higher correlations appear when ρ takes different values than under the fixed ρ = .2 condition, but the difference is negligible.

In general, the correlations for the unidimensional model in Table 3.9 are higher than those for the 3-dimensional model in Tables 3.14 and 3.15. As the number of dimensions increases from 1 to 3, the number of parameters to be estimated increases from 2090 to 6153 for the 30-item test with 2000 examinees. Therefore, more estimation error appears in the item and proficiency estimates for the 3-dimensional model.

Figures 3.7 through 3.9 show the plots of the true proficiency versus the estimates for the 30-item and the 45-item tests across different sample sizes. The plots in these three figures demonstrate that the true values and the estimates are closer to the reference line y = x for the longer test (45 items), which is consistent with the findings on the correlations in Tables 3.14 and 3.15. Figures 3.10 through 3.13 are the plots of the true item parameters versus their estimates; the points all lie tightly around the reference line, showing that stable and accurate estimates are obtained in the various simulation conditions regarding the test length, the examinee sample size, and the underlying proficiency covariance. It is worth pointing out that, in Figures 3.10, 3.11, and 3.12, for a parameters with true value 0, the estimates are close to zero. The estimates in the tests with the larger sample size (e.g., N = 5000) are even closer to zero, with the biggest difference between the true parameters and the estimates less than .2.
Figure 3.7: True Proficiency Versus Estimates (Dim = 3, $\rho$ = general). [Plot not reproduced. Panels by test length and sample size (n = 30 or 45, N = 2000 or 5000); horizontal axis: Ability 1; vertical axis: Estimate.]

Figure 3.8: True Proficiency Versus Estimates (Dim = 3, $\rho$ = general). [Plot not reproduced. Same panels; horizontal axis: Ability 2; vertical axis: Estimate.]

Figure 3.9: True Proficiency Versus Estimates (Dim = 3, $\rho$ = general). [Plot not reproduced. Same panels; horizontal axis: Ability 3; vertical axis: Estimate.]

Figure 3.10: True $a_1$ Parameter Versus Estimates (Dim = 3, $\rho$ = .2). [Plot not reproduced. Same panels; horizontal axis: Parameter $a_1$; vertical axis: Estimates.]

Figure 3.11: True $a_2$ Parameter Versus Estimates (Dim = 3, $\rho$ = .2). [Plot not reproduced. Same panels; horizontal axis: Parameter $a_2$; vertical axis: Estimates.]

Figure 3.12: True $a_3$ Parameter Versus Estimates (Dim = 3, $\rho$ = .2). [Plot not reproduced. Same panels; horizontal axis: Parameter $a_3$; vertical axis: Estimates.]

Figure 3.13: True d Parameter Versus Estimates (Dim = 3, $\rho$ = .2). [Plot not reproduced. Same panels; horizontal axis: Parameter d; vertical axis: Estimates.]

Note that the plots are for the situation in which the underlying proficiency covariance matrix is $\Sigma_{\theta,s}$. Similar results are also obtained when the proficiency covariance is $\Sigma_{\theta,g}$, in which the pairwise correlations vary, but the plots are omitted here.

3.6 Estimating the 5-dimensional Model

The two tests for the simulation studies in this section have the same numbers of items as before (n = 30 and n = 45) and are again administered to groups of examinees of size N = 2000 and N = 5000, respectively. The difference is that both tests are assumed to require five dimensions of proficiency to answer the items correctly. Since the tests measure five dimensions of ability, the total number of parameters to be estimated is (5 + 2)n + 5N + 10, where n stands for the test length and N for the examinee sample size. For the 30-item test administered to 2000 examinees, for example, the total number of model parameters to be estimated from the observed data is 10220, which is much greater than the sample size of 2000. If this test is administered to a group of 5000 examinees, the number of model parameters is 25220. Similarly, for a 45-item test administered to a group of 2000 examinees, the total number of model parameters is 10325, and it is 25325 if the test is administered to a sample of 5000 examinees.

The design for the 30-item test that is assumed to measure five dimensions of ability follows the same pattern as that of the three-dimensional tests. To impose constraints for model identification and to establish a fixed scale for the parameter estimates, the first five items are unidimensional and are placed in the first five positions in the test, with each item measuring only one dimension of proficiency. These items are also called anchor items: the first item measures only the first dimension of proficiency, the second item measures only the second dimension, and so on. Table 3.16 and 3.17 contain the true item parameters for the 30-item test and the 45-item test, respectively. It can be seen that the anchor items have a wide range of values on the a
parameters (e.g., from .65 to 2.04 for the 30-item test, and from 1.38 to 2.32 for the 45-item test).

In the 30-item test, there are two additional unidimensional items for each dimension (item 6 through item 15), and the rest of the items (item 16 through item 30) are assumed to measure all five dimensions of ability. In the 45-item test, only one additional unidimensional item is present for each dimension (item 6 through item 10). The rest of the items in this test are supposed to measure all five dimensions. By the item layout in Table 3.16, each dimension of proficiency in the 30-item test is therefore measured by only 18 items (three unidimensional items and the 15 five-dimensional items). In the 45-item test (Table 3.17 and 3.18), each dimension is measured by 37 items (two unidimensional items and the 35 five-dimensional items), many more than in the 30-item test. Given this design, one would reasonably expect the proficiency estimates to improve in the 45-item test, since more items are designed to measure each dimension of proficiency.

Note that the true item parameters for both tests in Table 3.16, 3.17, and 3.18 cover a wide range of values on each item parameter. For example, the largest value of an a parameter is 2.44 and the lowest is 0 in the 30-item test, and the largest and lowest a values in the 45-item test are 2.32 and 0, respectively. The d parameters for both tests cover a reasonable range; in both tests they are drawn from a standard normal distribution. All the asymptote parameters are kept within the range between 0 and .3.

The five-dimensional proficiency parameters are randomly generated from a multivariate normal distribution with mean vector 0 and covariance matrix $\Sigma_\theta$ (i.e., $N(0, \Sigma_\theta)$). As in the case of the three-dimensional tests in Section 3.5, the mean vector of the underlying proficiency distribution is set to 0 to establish the same scale for each proficiency dimension.
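As an illustration of how proficiency vectors can be drawn from $N(0, \Sigma_\theta)$, the following sketch uses a Cholesky factor of the covariance matrix and only the Python standard library. It is not the C++ implementation used in this study; the function names, the seed, and the choice of an equicorrelation matrix with $\rho$ = .2 as the example are assumptions made for demonstration.

```python
import math
import random


def equicorrelation(dim, rho):
    """Correlation matrix with ones on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(dim)] for i in range(dim)]


def cholesky(matrix):
    """Lower-triangular L with L * L^T = matrix (matrix must be positive definite)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def sample_proficiencies(n_examinees, sigma, rng):
    """Draw n_examinees vectors from N(0, sigma) as theta = L z, z ~ N(0, I)."""
    lower = cholesky(sigma)
    dim = len(sigma)
    draws = []
    for _ in range(n_examinees):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        draws.append([sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(dim)])
    return draws


if __name__ == "__main__":
    sigma = equicorrelation(5, 0.2)          # illustrative covariance structure
    theta = sample_proficiencies(2000, sigma, random.Random(2005))
    print(len(theta), len(theta[0]))
```

Because the covariance matrix has a unit diagonal, each simulated dimension has unit variance, so all dimensions share the same scale.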
In the same way, the covariance matrix $\Sigma_\theta$ is standardized and thus is actually the correlation matrix among these dimensions of ability. The pairwise correlations among the five dimensions (the off-diagonal components of $\Sigma_\theta$) can be the same or can differ from each other. In this section, two covariance matrices are used, denoted $\Sigma_{\theta,s}$ and $\Sigma_{\theta,g}$, respectively. As the notation suggests, all the off-diagonal components of the former take the same value (.2), whereas the off-diagonal components of the latter vary from .2 to .6:

$\Sigma_{\theta,g}$ = [5 x 5 matrix with unit diagonal and off-diagonal entries between .2 and .6; the individual entries are not legible in this copy]

Table 3.16: True Item Parameters for 30-Item Test (Dim = 5)

Item    a1     a2     a3     a4     a5      d      c
  1    0.65   0      0      0      0      1.76   .20
  2    0      1.74   0      0      0     -0.69   .23
  3    0      0      2.04   0      0      0.13   .15
  4    0      0      0      1.38   0      1.13   .24
  5    0      0      0      0      0.98  -0.64   .14
  6    1.14   0      0      0      0      0.30   .07
  7    1.64   0      0      0      0     -0.11   .10
  8    0      0.67   0      0      0     -0.62   .23
  9    0      1.21   0      0      0      0.73   .25
 10    0      0      1.49   0      0     -1.12   .12
 11    0      0      0.99   0      0     -1.10   .12
 12    0      0      0      1.18   0      1.34   .04
 13    0      0      0      1.41   0      2.02   .09
 14    0      0      0      0      1.91   0.49   .16
 15    0      0      0      0      0.88  -1.28   .24
 16    2.44   1.24   2.18   1.88   0.85   0.85   .03
 17    1.81   1.85   2.28   1.21   2.44  -1.64   .13
 18    1.02   2.14   1.77   1.80   2.02   0.91   .07
 19    0.60   1.75   2.14   2.19   2.35   2.73   .03
 20    0.94   1.23   2.07   1.91   1.42   1.43   .19
 21    1.01   1.39   2.17   2.26   0.98   0.95   .12
 22    1.13   1.47   2.50   1.08   1.84   2.30   .08
 23    1.32   1.29   1.59   2.20   0.80   0.48   .22
 24    0.73   2.28   2.00   0.86   0.87   0.51   .15
 25    2.43   1.08   1.84   1.15   2.03   0.20   .15
 26    1.73   1.30   2.42   1.29   1.15   0.21   .00
 27    1.98   1.69   1.50   2.28   1.46  -0.71   .15
 28    2.00   1.39   2.15   0.59   1.10  -0.86   .09
 29    1.62   1.92   1.56   2.07   1.91  -0.09   .10
 30    0.81   1.70   2.13   1.39   1.28   0.75   .06

Table 3.17: True Item Parameters for 45-Item Test (Dim = 5)

Item    a1     a2     a3     a4     a5      d      c
  1    2.32   0      0      0      0     -0.17   .18
  2    0      1.94   0      0      0      0.16   .22
  3    0      0      1.53   0      0      0.36   .13
  4    0      0      0      1.38   0      0.30   .19
  5    0      0      0      0      1.71   0.47   .12
  6    1.51   0      0      0      0      0.71   .23
  7    0      1.74   0      0      0     -1.61   .21
  8    0      0      1.90   0      0     -0.88   .03
  9    0      0      0      2.14   0     -1.15   .16
 10    0      0      0      0      1.34  -0.13   .18
 11    1.30   1.61   1.93   2.05   0.83  -0.73   .00
 12    1.03   2.24   0.73   2.20   1.94   2.12   .23
 13    2.05   1.56   1.09   0.92   1.83  -0.75   .04
 14    1.36   0.93   0.90   1.89   1.45   1.12   .17
 15    1.42   2.11   0.88   1.22   0.80  -0.07   .20
 16    1.10   0.95   1.83   0.80   1.34   0.00   .23
 17    1.65   1.52   2.15   1.09   1.38   1.01   .15
 18    1.48   1.25   1.00   1.19   1.85   2.17   .14
 19    0.82   1.49   0.62   2.01   1.84  -0.58   .21
 20    0.87   1.79   1.61   1.10   1.31  -0.92   .02
 21    0.93   2.07   1.49   1.11   1.85   0.80   .03
 22    1.03   2.14   1.76   2.33   1.49   0.01   .01

Table 3.18: True Item Parameters for 45-Item Test (Dim = 5), cont.

Item    a1     a2     a3     a4     a5      d      c
 23    1.97   2.17   2.32   2.10   1.57  -0.44   .08
 24    1.79   1.25   1.93   1.87   2.34   0.17   .24
 25    1.29   0.76   2.20   1.70   1.60  -1.35   .10
 26    1.50   1.90   2.03   1.31   1.07  -0.74   .17
 27    1.65   0.90   1.42   1.81   0.69  -0.31   .12
 28    2.31   0.82   1.91   1.50   1.75  -2.08   .19
 29    0.93   2.35   2.34   1.70   1.12   0.36   .08
 30    1.99   0.73   1.58   1.68   1.04  -1.36   .08
 31    1.34   1.20   1.88   2.18   1.60  -0.81   .18
 32    1.49   1.50   1.76   2.00   1.63  -0.25   .12
 33    1.95   2.22   1.39   1.59   1.09  -0.29   .11
 34    0.64   1.26   0.80   1.21   0.95  -1.55   .23
 35    1.06   1.51   1.69   1.64   1.17  -0.60   .09
 36    1.45   0.82   1.92   1.66   0.49   0.50   .13
 37    1.52   2.22   0.87   1.70   0.71   0.82   .13
 38    2.04   1.45   0.97   2.28   1.81   0.96   .24
 39    0.90   2.06   1.27   1.55   1.25   1.83   .00
 40    1.93   2.09   1.65   1.25   0.80   0.78   .04
 41    1.44   1.01   0.81   2.13   1.22   0.19   .25
 42    0.74   1.78   1.94   0.92   2.07  -1.01   .04
 43    1.93   1.81   0.69   0.90   1.79   0.08   .09
 44    1.11   1.91   1.83   0.86   1.06   1.60   .22
 45    2.05   1.25   1.55   0.89   1.79   0.89   .02

Combining the test length (30 and 45), the sample size (2000 and 5000), the proficiency covariance ($\Sigma_{\theta,s}$ and $\Sigma_{\theta,g}$), and the replications yields 24 response data sets for the simulation studies on the five-dimensional case. For each data set, multiple chains (three chains per data set) are constructed. To give more stable and accurate estimates, the final estimates of the item parameters are the means of the three individual estimates, one from each chain, obtained with different initial values.
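The simulation design just described can be enumerated explicitly. The sketch below is a bookkeeping illustration only; the labels "same" and "general" for the two covariance structures are shorthand, and the number of replications per condition (three) is inferred from the 24 data sets divided by the eight conditions rather than stated directly.

```python
from itertools import product

TEST_LENGTHS = (30, 45)
SAMPLE_SIZES = (2000, 5000)
COVARIANCES = ("same", "general")   # shorthand labels for the two covariance structures
REPLICATIONS = (1, 2, 3)            # assumed: 24 data sets / 8 conditions
CHAINS = (1, 2, 3)                  # three chains per data set

# Every combination of test length, sample size, and covariance structure
conditions = list(product(TEST_LENGTHS, SAMPLE_SIZES, COVARIANCES))
# One data set per condition and replication
data_sets = list(product(conditions, REPLICATIONS))
# One MCMC run per data set and chain
runs = list(product(data_sets, CHAINS))

print(len(conditions), len(data_sets), len(runs))  # prints: 8 24 72
```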
Therefore, there are 72 runs in all for the parameter estimates in this section.

Table 3.19 and Table 3.20 give the RMSE of each item parameter estimate for the eight simulation conditions. The two tables differ in the underlying proficiency covariance: the results in Table 3.19 are based on $\Sigma_{\theta,s}$ and those in Table 3.20 on $\Sigma_{\theta,g}$. Most of the RMSE values in the tables are less than .2. The highest RMSE value (.29) is for the d parameter in the condition of 5000 examinees on the 45-item test with covariance $\Sigma_{\theta,g}$.

The two tables show that the precision of the item parameter estimates does not change with the proficiency covariance. In other words, the underlying proficiency covariance is not a factor that affects the item parameter estimates, which is expected because the sampling of item parameters and the sampling of proficiency parameters are independent. It is also clear that the RMSE values are generally smaller when the sample size is 5000 than when it is 2000, which is also expected since more examinees provide more information for item parameter estimation. However, the estimation seems better on the 30-item test: regardless of the sample size, the RMSE values are in general slightly higher in the 45-item test, which is not expected. One possible reason is that the dimensional structure of the 30-item test (in Table 3.16, only 15 items measure all 5 dimensions) is much simpler than that of the 45-item test (in Table 3.17 and 3.18, 35 items measure all 5 dimensions). In addition, more items with extreme values, which are difficult to estimate, might appear in the 45-item tests.

Compared to the RMSE for item parameter estimates in the unidimensional model (Table 3.8) and the 3-dimensional model (Table 3.12 and 3.13), the RMSE for the item parameter estimates in the 5-dimensional model (Table 3.19 and 3.20) is generally higher.
Again, this implies that for the same amount of data, the more parameters there are to estimate as the number of dimensions increases, the larger the estimation error.

Table 3.19: RMSE for Multi-dimensional Test (Dim = 5, $\rho$ = .2)

Estimates        30 x 2000   30 x 5000   45 x 2000   45 x 5000
$\hat{a}_1$         .15         .13         .21         .20
$\hat{a}_2$         .24         .16         .20         .14
$\hat{a}_3$         .16         .11         .23         .15
$\hat{a}_4$         .20         .14         .22         .22
$\hat{a}_5$         .18         .15         .20         .14
$\hat{d}$           .21         .16         .22         .24
$\hat{c}$           .05         .05         .03         .03

Table 3.20: RMSE for Multi-dimensional Test (Dim = 5, $\rho$ = general)

Estimates        30 x 2000   30 x 5000   45 x 2000   45 x 5000
$\hat{a}_1$         .17         .15         .26         .17
$\hat{a}_2$         .18         .16         .27         .18
$\hat{a}_3$         .16         .15         .20         .19
$\hat{a}_4$         .18         .16         .25         .21
$\hat{a}_5$         .21         .18         .24         .25
$\hat{d}$           .25         .17         .28         .29
$\hat{c}$           .06         .04         .03         .03

Table 3.21 shows the correlations between the true proficiency parameters and their estimates when the underlying covariance matrix is $\Sigma_{\theta,s}$, that is, when the off-diagonal components of the covariance matrix of the proficiency distribution are all equal to .2.

Table 3.21: Correlations Between True Proficiency and Estimates (Dim = 5, $\rho$ = .2)

                                      30 x 2000   30 x 5000   45 x 2000   45 x 5000
corr($\theta_1$, $\hat{\theta}_1$)      .7899       .7829       .7935       .7976
corr($\theta_2$, $\hat{\theta}_2$)      .7508       .7499       .7984       .8006
corr($\theta_3$, $\hat{\theta}_3$)      .8038       .8067       .8088       .8195
corr($\theta_4$, $\hat{\theta}_4$)      .7606       .7641       .7934       .7818
corr($\theta_5$, $\hat{\theta}_5$)      .7594       .7559       .8010       .7915

In general, the correlation for each dimension in this study is around .8, and the correlations are close between the two proficiency covariance conditions, indicating that the proficiency covariance does not affect the proficiency estimates. Compared to the correlations for the unidimensional model (Table 3.9) and the 3-dimensional model (Table 3.14 and 3.15), the correlations for the 5-dimensional model in Table 3.21 and 3.22 are generally smaller, which is expected because more parameters are to be estimated as the number of dimensions increases. The lowest correlations are for the short test (the 30-item test), which is expected because, by the item layout in Table 3.16, each dimension of proficiency is measured by only 18 items.
The longer test (the 45-item test) has slightly higher correlation coefficients. Low correlations indicate large estimation errors in the proficiency estimates. Nevertheless, the estimation is not significantly improved in the 45-item test, although each dimension is measured by 37 items there (two unidimensional items and the 35 five-dimensional items in Table 3.17 and 3.18). One possible interpretation is that the number of parameters to be estimated increases substantially as the number of dimensions increases to five.

Table 3.22: Correlations Between True Proficiency and Estimates (Dim = 5, $\rho$ = general)

                                      30 x 2000   30 x 5000   45 x 2000   45 x 5000
corr($\theta_1$, $\hat{\theta}_1$)      .7835       .7935       .8076       .7999
corr($\theta_2$, $\hat{\theta}_2$)      .7548       .7617       .7882       .7983
corr($\theta_3$, $\hat{\theta}_3$)      .8098       .8034       .8221       .8139
corr($\theta_4$, $\hat{\theta}_4$)      .8171       .8241       .8264       .8326
corr($\theta_5$, $\hat{\theta}_5$)      .7745       .7971       .8244       .8333

3.7 Proficiency Structure Estimation

The estimates of the underlying proficiency structure have potential effects on the convergence speed, since at each sampling step the proficiency samples are drawn from a multivariate normal distribution with mean vector 0 and a covariance matrix that is itself drawn from an inverse Wishart distribution based on the sample covariance of the abilities. Good recovery of the covariance structure makes for an effective Markov chain.

The components of the underlying proficiency covariance are estimated along with the item and proficiency parameters by the MCMC procedure. For each data set, one estimate of the covariance is obtained from each chain, and the chains start from different initial values. The final covariance matrix estimate is the mean of the three estimates from the independent chains. Note that for each chain, the proficiency covariance estimate is the mean of the 1000 covariance matrices sampled from the inverse Wishart distribution, which is also based on the sample covariance. Good estimates of the covariance matrix better recover the interrelations among the proficiency dimensions.
Table 3.23 gives the estimates for each chain of the 30-item test in the three-dimensional case, with $\rho$ taking the same value of .2 for all off-diagonal components.

Table 3.23: Estimates of Covariance Matrix, Dim = 3, $\rho$ = .2 (upper triangle shown)

Data      30 x 2000               30 x 5000
Rep 1     1.02  0.21  0.15        1.01  0.15  0.13
                1.01  0.14              1.03  0.13
                      0.97                    0.98

Rep 2     1.04  0.18  0.17        0.99  0.18  0.18
                0.99  0.12              1.00  0.18
                      0.96                    1.01

Rep 3     0.99  0.13  0.17        1.00  0.12  0.14
                1.03  0.21              1.01  0.17
                      1.01                    1.00

Table 3.24: Estimates of Covariance Matrix, Dim = 3, $\rho$ = general (upper triangle shown)

Data      45 x 2000               45 x 5000
Rep 1      .95   .58   .15         .99   .69   .16
                 .94   .29              1.00   .25
                       .98                    1.01

Rep 2     1.04   .65   .13        1.01   .68   .14
                1.02   .24              1.01   .25
                       .99                    1.01

Rep 3      .99   .60   .18         .98   .70   .19
                 .97   .22              1.02   .27
                      1.05                     .99

Table 3.23 shows that the diagonal elements are all close to 1, ranging from .96 to 1.04. The off-diagonal elements range from .12 to .21. Similarly, Table 3.24 shows the covariance estimates for the 45-item test in the three-dimensional case with true covariance $\Sigma_{\theta,g}$. Clearly, the estimate of each component is close to its true value. The results for the five-dimensional case in Table 3.25 and 3.26 also indicate reasonably good recovery of the proficiency structure.
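The averaging of chain-level covariance estimates described in this section can be illustrated with the three 30 x 2000 matrices from Table 3.23. This is a minimal sketch rather than the program used in the study; the function name `mean_upper_triangle` and the storage of each matrix as a list of upper-triangular rows are choices made here for illustration.

```python
def mean_upper_triangle(matrices):
    """Average several matrices stored as upper-triangular rows, element by element."""
    n_chains = len(matrices)
    first = matrices[0]
    return [
        [sum(m[i][j] for m in matrices) / n_chains for j in range(len(first[i]))]
        for i in range(len(first))
    ]


# Upper triangles of the three chain estimates for the 30 x 2000 condition (Table 3.23).
# Row i holds the entries for columns i, i + 1, ..., so each row is one element shorter.
chain_estimates = [
    [[1.02, 0.21, 0.15], [1.01, 0.14], [0.97]],
    [[1.04, 0.18, 0.17], [0.99, 0.12], [0.96]],
    [[0.99, 0.13, 0.17], [1.03, 0.21], [1.01]],
]

final_estimate = mean_upper_triangle(chain_estimates)
for row in final_estimate:
    print(["%.3f" % value for value in row])
```

Averaging over chains smooths chain-to-chain variation in the individual components while keeping the diagonal close to 1.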
Table 3.25: Estimates of Covariance Matrix, Dim = 5, $\rho$ = general (upper triangle shown)

Data      30 x 2000                           30 x 5000
Rep 1     1.03   .14   .24   .23   .47        1.00   .26   .26   .21   .47
                1.05   .18   .39   .23               .99   .22   .46   .25
                      1.00   .29   .19                    1.00   .32   .31
                             .99   .54                           .99   .43
                                  1.05                                 .97

Rep 2     1.01   .27   .30   .28   .44        1.01   .20   .29   .26   .39
                1.03   .28   .55   .29              1.03   .07   .54   .20
                      1.02   .39   .25                     .98   .35   .24
                            1.07   .44                          1.03   .52
                                  1.02                                1.00

Rep 3     1.03   .10   .17   .22   .55        1.01   .20   .30   .22   .53
                1.01   .25   .45   .21              1.03   .21   .44   .20
                      1.03   .37   .30                     .98   .45   .29
                            1.00   .50                          1.03   .41
                                  1.03                                1.00

Table 3.26: Estimates of Covariance Matrix, Dim = 5, $\rho$ = .2 (upper triangle shown)

Data      45 x 2000                           45 x 5000
Rep 1      .99   .18   .23   .38   .21        1.02   .21   .20   .33   .18
                1.05   .22   .29   .06               .99   .21   .24   .22
                      1.00   .31   .21                    1.01   .28   .15
                             .98   .25                           .99   .14
                                  1.05                                 .97

Rep 2     1.04   .17   .21   .17   .21        1.02   .16   .18   .33   .15
                1.02   .15   .28   .21               .99   .24   .27   .18
                      1.04   .30   .18                    1.01   .26   .18
                            1.00   .27                           .99   .17
                                  1.00                                 .97

Rep 3     1.03   .14   .24   .28   .35        1.01   .19   .20   .33   .17
                 .98   .32   .22   .13              1.03   .24   .26   .17
                      1.03   .35   .26                     .98   .25   .18
                            1.04   .33                          1.04   .17
                                  1.03                                1.00

3.8 Computing Time

One open criticism of the MCMC approach is its extensive computation, which depends on the program efficiency, the size of the data, the convergence speed, and the computer equipment. The program efficiency includes the design and the algorithms in the source code. Many researchers now use application software (e.g., WinBUGS, BUGS, SAS, S-PLUS, MATLAB) to run MCMC procedures (e.g., Patz and Junker use S-PLUS, 1999a; Bolt uses WinBUGS, 2004). Some researchers use programming languages (e.g., S, R, FORTRAN, Java) to code their own programs. In this study, the code is written in C++ with an efficient algorithm that uses MCMC to estimate the IRT model parameters.

The size of the data involves the test length, the examinee sample size, and the number of dimensions and parameters to be estimated. In general, the longer the test, the more time is needed. Similarly, the larger the number of examinees and the number of proficiency dimensions, the longer the computing time.
For a given data set, the more parameters there are to estimate, the longer the computing time. The convergence speed is associated with the priors chosen for each item and proficiency parameter, and also with the data structure. If a chain is diagnosed as not well mixed, or as not having converged to the target posterior distribution, more iterations are required, and thus more time. Finally, a better equipped computer system gives faster computation for the same program. The computing time for 11000 iterations of the C++ program is given in Table 3.27; it was measured on a computer with 512 MB RAM and a 3300 AMD Athlon 64 processor.

Table 3.27: Computing time for 1-, 3-, and 5-Dimension data

Data           30 x 2000      30 x 5000      45 x 2000      45 x 5000
1-dimension    37 min         1 hr 17 min    42 min         1 hr 33 min
3-dimension    59 min         2 hr 30 min    1 hr 20 min    3 hr 35 min
5-dimension    1 hr 35 min    4 hr 5 min     2 hr 8 min     5 hr 17 min

The shortest time, 37 minutes, is for the parameter estimation of the unidimensional model with 30 items and 2000 examinees. The longest time is for the case with 45 items, 5000 examinees, and 5 dimensions of proficiency, which takes 5 hours and 17 minutes to finish the 11000 iterations. The times required for the other conditions fall between the shortest and the longest.

Chapter 4

Concluding Remarks and Future Research Directions

This research involves extensive simulation studies of parameter estimation for multidimensional IRT models using the MCMC approach, under various conditions in terms of the test length, the examinee sample size, the number of dimensions, and the underlying proficiency structure. The parameter estimates from these conditions are compared to investigate the influence of the potential factors on the accuracy and stability of the estimation.
This study is an extensive examination of the MCMC approach to parameter estimation in terms of the test length, the examinee sample size, the number of dimensions, the proficiency covariance, the range of item parameters, and the dimensional structure of each simulated test. For example, the study includes both unidimensional and multidimensional items in a test, and it covers a wide variety of parameter values (not limited to a certain range of values for the parameters). Moreover, the study does not focus only on simple structure but also considers complex structure.

The MCMC approach provides a convenient and flexible framework for parameter estimation of complex IRT models, as shown in Chapter 3 for the multidimensional models. The C++ program is used to estimate not only the simple IRT model (e.g., unidimensional) but also some complex models (e.g., multidimensional IRT models). The framework covers the estimation of every type of parameter in the IRT models (i.e., item parameters, proficiency parameters, and the proficiency covariance). One can use the framework to estimate both item and proficiency parameters simultaneously, or one can obtain estimates of the proficiency covariance matrix to infer the interrelations among the proficiency dimensions. In some simple situations, for example when only the item parameter estimates, only the proficiency estimates, or only the interrelations among the proficiency dimensions are required, the program can run the required estimation procedures and skip the estimation of the other parameters without loss of generality. In this case, the MCMC approach is faster because fewer parameters are to be estimated, and thus less computing time is needed.
In addition, under this framework, one can obtain the item parameter estimates first and then treat them as true values to yield the proficiency parameter estimates and the proficiency covariance estimates (even by other procedures, for example, the ML procedure). Alternatively, one can estimate all the model parameters simultaneously, as is done in this study. Furthermore, the framework is not restricted to short tests or lower dimensional tests. It is particularly useful for estimating higher dimensional and long tests with large numbers of examinees, and for contexts in which the IRT model is so complicated that other estimation approaches become infeasible.

The MCMC approach is effective and the computation is efficient. For parameter estimation in unidimensional models, about half an hour (37 minutes in Table 3.27) is enough for 11000 iterations on a test with 30 items administered to 2000 examinees. About one and a half hours are needed for a longer test and a larger sample, for example a 45-item test administered to 5000 examinees. The path plots for the posterior distributions of the item parameters shown in Figure 3.2 imply that the constructed chains are well mixed even in the first 3000 iterations. If some parameters are not required, less time is needed for the estimation, and the resulting estimates are not affected by skipping the estimation of the other parameters. For example, the item parameter estimation takes less time if no proficiency parameter estimation is involved, and the item parameter estimates are not affected, because the estimation of the item parameters and that of the proficiency parameters are independent. Moreover, a better equipped computer system gives faster computation for the parameter estimation.

An important aspect of the MCMC approach to parameter estimation of IRT models is the reasonable accuracy and stability of the estimates. A simulation study allows a straightforward comparison between the estimates and the true parameters, which are available before the estimation.
The accuracy of the item parameter estimates increases as the sample size increases, but decreases as the number of dimensions increases. The estimation accuracy for the item parameters can be seen directly from the comparison between the true values and the estimates, which is presented in the RMSE tables (e.g., Table 3.8, 3.12, 3.13, 3.19, and 3.20) and in the plots (e.g., Figure 3.4 through 3.6 and Figure 3.10 through 3.13) in Chapter 3 for the various simulation conditions. For the unidimensional case, the item parameter estimates for both tests (the 30-item test and the 45-item test) are listed in Table 3.4, Table 3.6, and Table 3.7 for sample sizes 2000 and 5000, along with the standard errors. For the multidimensional models, the individual item parameter estimates are not listed in a table but are plotted against the corresponding true parameters.

A small difference between the true values and the estimates of the item parameters indicates reasonable estimation. In Table 3.4, Table 3.6, and Table 3.7 for the unidimensional case, most of the absolute differences between the true values and the estimates are less than .1, and many of the standard errors of the estimates are also less than .1. More results are found in the summary statistic, the RMSE. For the unidimensional case, the RMSE for the a parameters is less than .15 and falls to .07 when the sample size increases to 5000 (Table 3.8). The RMSE for the b parameter is less than .11 and that for the c parameter is less than .05. For parameter estimation in the multidimensional case, the RMSE is generally higher than in the unidimensional models. For example, in Table 3.12 and 3.13 for the 3-dimensional model, the RMSE for each a parameter estimate is generally higher than the RMSE for the a parameter in the unidimensional case; the RMSE values for the 5-dimensional item parameter estimates (Table 3.19 and 3.20) are in general higher than those for both the unidimensional and the 3-dimensional cases.
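For reference, the two summary measures used throughout these comparisons can be written out as a short sketch. It assumes the usual definitions (the square root of the mean squared difference between estimates and true values, and the Pearson product-moment correlation); the exact formulas used in the study are not restated in this section, and the numbers in the example are made-up values for illustration, not results from the simulations.

```python
import math


def rmse(true_values, estimates):
    """Root mean square error between true parameters and their estimates."""
    squared = [(est - true) ** 2 for true, est in zip(true_values, estimates)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical true values and estimates, for illustration only
    true_a = [0.65, 1.74, 2.04, 1.38, 0.98]
    est_a = [0.70, 1.60, 2.10, 1.45, 0.90]
    print(round(rmse(true_a, est_a), 3), round(pearson(true_a, est_a), 3))
```

A small RMSE indicates that the estimates are close to the true values on the parameter scale, whereas a high correlation indicates only that their ordering and linear relation are recovered, which is why both measures are reported.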
One can conclude that as the number of proficiency dimensions in the model increases, the RMSE of the item parameter estimates becomes larger, indicating poorer item parameter recovery. One simple interpretation of this observation is that the number of parameters to be estimated increases substantially as the number of proficiency dimensions increases. Given the same data structure and information, the more parameters that need to be estimated (as in the 3-dimensional and 5-dimensional models), the less information the data contain for each parameter, and thus the less accurate the item parameter estimates. It is therefore expected that the RMSE values are larger in the 5-dimensional models than in the 3-dimensional or unidimensional models. The good recovery of the item parameters can also be seen in the plots of the true item parameters versus the estimates (e.g., Figure 3.4 through 3.6 and Figure 3.10 through 3.13). In these figures the points lie closely around the reference line, indicating that good estimates are obtained.

The precision of the proficiency estimates is assessed in terms of the correlations and the plots of the true proficiency parameters versus the estimates. Higher correlations are obtained for the longer test (the 45-item test), whereas lower correlations are associated with higher dimensional tests (e.g., the 5-dimensional test). The proficiency covariance matrix has negligible effects on the proficiency parameter estimation.

The correlation tables (Table 3.9, 3.14, 3.15, 3.21, and 3.22) show the correlations between the true proficiency parameters and the estimates in terms of the number of dimensions, the sample size, and the test length. The correlations for the unidimensional case are the highest, more than .95 for every condition in the simulation studies (Table 3.9). The correlations for the multidimensional cases (3 dimensions and 5 dimensions) are generally lower than those in the unidimensional models, roughly .8 to .93 for each proficiency dimension. The plot of the true proficiency versus the estimates for the unidimensional model (Figure 3.3) shows the estimates lying closely around the reference line. However, the plots in Figure 3.7 through 3.9 for the multidimensional proficiency case show the estimates spread out further from the line. A possible explanation for the relation between the correlations and the number of proficiency dimensions concerns the information contained in the data. One can expect better proficiency estimates, or higher correlations, for the lower dimensional models, in particular for the unidimensional model, because fewer parameters have to be estimated from data of the same size and more of the information in the data is available for the proficiency estimation. Better proficiency estimates are expected for longer tests if the higher dimensional model is used.

The estimation accuracy for both the item and the proficiency parameter estimates from the MCMC approach is also seen by comparing the results with those from other procedures. For example, for the unidimensional case, the item parameters of the 30-item test are also calibrated with the standard procedure, MML/EM in BILOG-MG 3, as shown in Table 3.5. The results from the two approaches are comparable. However, the MCMC procedure, although it takes a Bayesian perspective, is flexible and convenient for much more complex IRT models. Furthermore, as Patz and Junker point out, one advantage of the MCMC procedure over traditional methods is that it is capable of estimating the exact joint posterior distribution of the parameters (Patz and Junker, 1999a).

The accuracy of the estimation by MCMC is also seen in the agreement of the estimates across replicated data sets and across multiple chains. This agreement is likewise an aspect of the stability of the parameter estimation of the MCMC approach.
As seen in Table 3.3 for the unidimensional case, the three independent chains yield very stable estimates of the item parameters for the 30-item tests. Similar results are obtained for the 45-item tests and for the higher dimensional models. For the same data set, the parameter estimates are stable across three independent chains with different initial values, indicating that the posteriors of the model parameters have reached the stationary state. That is why the parameter estimates do not depend on the initial values. The item parameter estimates are stable not only across the multiple chains but also across data sets (e.g., Table 3.8, 3.9).

It seems difficult to increase the estimation precision for both the item and the proficiency parameter estimates in IRT models at the same time. When the sample size increases for a fixed number of items in a test, the item parameter estimates are expected to improve. For a fixed group of examinees, the proficiency parameter estimates are expected to improve as the number of items in the test increases. One can argue that for a fixed number of items, the number of item parameters to be estimated is fixed, and increasing the examinee sample size provides more information for estimating the item parameters. Therefore, the standard errors of the estimates decrease. Likewise, for a fixed number of examinees, the number of proficiency parameters to be estimated is fixed, and the proficiency estimates improve as the number of items increases, because the test provides more information for estimating the proficiency parameters. The same holds for parameter estimation using the MCMC procedures. It is seen from the RMSE tables for the 3-dimensional model (Table 3.12 and 3.13) that for a fixed test (e.g., the 30-item
test or the 45-item test), the item parameter estimates improve in terms of RMSE when the sample size increases from 2000 to 5000.

The proficiency covariance is well recovered by the MCMC procedure, and the estimation of the proficiency covariance matrix does not affect the item parameter estimates.

The relations between the estimates and the design variables of a test (e.g., the test length, the examinee sample size, and the number of dimensions) help to suggest a general guideline for parameter estimation. For example, to obtain accurate item parameter estimates for the unidimensional model, assuming perfect model-data fit, 2000 examinees are enough if the test consists of 30 items. With the same 30 items, however, estimating item parameters for the 3-dimensional model would need more than 2000 examinees (e.g., 5000) to achieve the same estimation precision. Similarly, for the 5-dimensional model, more than 5000 examinees (e.g., 8000 or more) could help to reach the same precision. For proficiency estimates under the unidimensional model, 30 items can provide reasonably good estimation, as seen in the correlations in Table 3.9 and the plot in Figure 3.3. For the 3-dimensional test, 45 items could provide reasonably good proficiency estimates, as seen in Table 3.14 and 3.15 and Figure 3.7 through 3.9. For the 5-dimensional model, more than 45 items (e.g., 60 items) could help to obtain reasonably good proficiency estimation.

One limitation of the MCMC approach to estimating multidimensional IRT model parameters, apart from the extensive computation, is that the number of dimensions must be given. The number of dimensions, however, is generally not known in real data analysis. How would the MCMC approach perform if the specified number of dimensions were smaller or larger than the number of dimensions the test actually requires? This is an interesting practical issue and is worth further research.
This issue is in fact a model-data fit issue, or an issue of the sensitivity of parameter estimation under the MCMC approach, rather than a parameter estimation issue (the focus of this research). The estimates are acceptable only on the basis of adequate model-data fit. However, the MCMC approach does not provide any mechanism for diagnosing whether or not the data fit the estimating model. How much additional error would be introduced when the model does not adequately fit the data? This practical issue poses a challenge to MCMC estimation.
In the simulation studies, the proficiency covariance matrix varied from a special pattern (e.g., all off-diagonal elements equal) to a general one, and the effects of the proficiency covariance matrix on parameter estimation were carefully examined. The proficiency population was assumed to be multivariate normal or standard normal. If the examinee groups do not come from a normal distribution, does the approach still yield accurate and stable estimates? This issue also deserves further research, because in many applications the examinees might not come exactly from a normal population.
In addition, the metric for both item and proficiency parameters is established by a well-defined set of anchor items, which are often placed in the first positions in the tests. The anchor items help to resolve the indeterminacy problems that are inherent in many IRT models. However, the choice of anchor items is often subjective and may therefore influence the establishment of the proficiency scales. Further research is needed to investigate the effects of the anchor items on parameter estimation using the MCMC approach. In real data applications, how can one choose a useful set of anchor items that helps with model identification while ensuring accurate parameter estimation?
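The indeterminacy that the anchor items are meant to resolve can be made explicit with a short derivation. It is a sketch under an assumption: that the response probability depends on the proficiency vector only through the linear term a'θ + d, as the a and d columns of Tables 4.1 and 4.2 suggest. The matrix T and the vector b are illustrative symbols introduced here, not notation from this dissertation.

```latex
% Assumed notation: a = discrimination vector, d = intercept,
% \theta = proficiency vector; T is any nonsingular matrix, b any vector.
\begin{align*}
\theta^{*} &= T\theta + b, \qquad
a^{*} = (T^{-1})^{\top} a, \qquad
d^{*} = d - a^{*\top} b, \\
a^{*\top}\theta^{*} + d^{*}
  &= a^{\top} T^{-1}(T\theta + b) + d - a^{\top} T^{-1} b
   = a^{\top}\theta + d .
\end{align*}
```

Because the transformed and the original parameters give the same value of a'θ + d for every examinee and item, they give the same likelihood. Some constraint, such as fixing the structure of a set of anchor items, is therefore needed to select one solution, which is why the choice of anchor items can affect the resulting proficiency scales.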
Finally, the item parameter estimates from the MCMC methods are compared with the estimates from TESTFACT, and the results show that the MCMC estimates are better than those from TESTFACT. Table 4.1 shows the item parameter estimates by TESTFACT for the 30-item test with 3 dimensions administered to 2000 examinees (i.e., the first replication of the data). Table 4.2 shows the item parameter estimates by TESTFACT for the 30-item test with 5 dimensions administered to 2000 examinees (i.e., also the first replication of the data). The pseudo-guessing parameters supplied as input to TESTFACT are the true values of the c parameters. Compared with the true item parameters (Tables 3.10 and 3.16) and the MCMC estimates (Tables 3.13, 3.19, and 3.20), the TESTFACT item parameter estimates in Tables 4.1 and 4.2 in general seem somewhat worse. In addition, the TESTFACT results contain some deviant values (e.g., items 11, 17, 18, 19, and 21 in Table 4.2 for the 5-dimensional case).
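The deviant values mentioned above can be located with a simple screening rule. The sketch below is only an illustration: the function name and the cutoff of 5 on the absolute discrimination are assumptions made here, not criteria taken from this study. The rows are copied from Table 4.2 for a subset of items.

```python
# Selected rows copied from Table 4.2: item -> (a1, a2, a3, a4, a5, d).
TABLE_4_2_ROWS = {
    1: (0.88, -0.07, -0.29, -0.11, -0.02, 1.59),
    2: (-0.54, 3.67, -0.59, -0.43, -0.66, -0.76),
    3: (-0.05, -1.67, 7.62, -0.88, -1.03, -1.84),
    11: (-22.14, -5.89, 98.05, -15.65, -22.07, -76.86),
    14: (-1.17, -2.15, -1.93, -1.48, 10.28, 2.83),
    17: (15.39, 1.78, 26.26, 7.73, 26.4, -35.36),
    18: (-0.62, 6.8, 0.43, 6.35, 10.72, 7.27),
    19: (-4.07, 3.32, 5.28, 8.76, 9.89, 14.02),
    21: (-5, 2.5, 15.72, 43.88, 8.1, 24.18),
    30: (0.3, 0.28, 0.74, 0.52, 0.5, 0.5),
}


def flag_deviant_items(rows, max_abs_a=5.0):
    """Return items whose largest |a_k| exceeds max_abs_a.

    max_abs_a is an illustrative cutoff (an assumption), not a value
    taken from the study; the intercept d is ignored by this rule.
    """
    flagged = []
    for item, (*a_values, _d) in rows.items():
        if max(abs(a) for a in a_values) > max_abs_a:
            flagged.append(item)
    return sorted(flagged)
```

With this cutoff the rule flags the five items named in the text and also items 3 and 14, whose largest discrimination estimates are 7.62 and 10.28. The list in the text is introduced by "e.g.", so it need not be exhaustive.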
Table 4.1: TESTFACT Item Parameter Estimates for the 30-Item Test (Dim = 3)
Item      a1      a2      a3       d
   1    1.25    -0.2   -0.17   -0.27
   2   -0.09    0.65   -0.16    0.02
   3   -0.45   -0.37    2.28    -1.2
   4    1.97   -0.33   -0.35    0.53
   5    0.86   -0.21   -0.13    0.22
   6    1.37   -0.22   -0.21    1.47
   7    0.65   -0.22   -0.08    1.57
   8   -0.55    2.64   -0.41   -0.53
   9   -0.27    1.86   -0.41      -1
  10   -0.11    0.67   -0.15    -0.6
  11   -0.16    1.08   -0.15    1.16
  12   -0.29   -0.34    1.47   -0.33
  13   -0.24    0.04    0.89    2.22
  14   -0.03   -0.32    1.31   -1.46
  15   -0.28   -0.36    1.83   -0.56
  16    0.44    1.54    1.26   -1.48
  17    1.11    1.78    0.19   -0.24
  18     0.3    0.71    1.19    0.86
  19    1.28       1    1.17    0.21
  20    0.04    0.48     1.5    0.95
  21    2.06     0.4    0.53   -1.39
  22    1.29    0.75    1.96    -2.2
  23     0.8    1.16    0.78   -0.42
  24    2.27    0.19   -0.09   -1.55
  25    1.35    1.45    0.45   -1.91
  26    0.58    0.98    0.44    0.21
  27    0.61    0.91    1.02    0.66
  28    1.13    0.77    1.31    0.37
  29    0.95    0.41    1.62    0.16
  30    0.47    1.99    1.09   -0.87

Table 4.2: TESTFACT Item Parameter Estimates for the 30-Item Test (Dim = 5)
Item      a1      a2      a3      a4      a5       d
   1    0.88   -0.07   -0.29   -0.11   -0.02    1.59
   2   -0.54    3.67   -0.59   -0.43   -0.66   -0.76
   3   -0.05   -1.67    7.62   -0.88   -1.03   -1.84
   4   -0.16    -0.4    -0.4    1.71   -0.25    1.48
   5   -0.21       0   -0.25   -0.38    1.47   -0.58
   6    1.96   -0.27   -0.39   -0.35   -0.24   -0.06
   7    2.67   -0.31   -0.33   -0.54   -0.51   -0.75
   8   -0.06    0.83   -0.14    0.03   -0.21    -0.5
   9   -0.17    2.43   -0.38   -0.42   -0.42    1.17
  10    0.14    -0.2     1.3   -0.16   -0.25   -1.24
  11  -22.14   -5.89   98.05  -15.65  -22.07  -76.86
  12   -0.27   -0.22   -0.31    1.89   -0.41    1.66
  13   -0.15   -0.16   -0.25    1.44   -0.31    2.13
  14   -1.17   -2.15   -1.93   -1.48   10.28    2.83
  15    0.01   -0.11   -0.14   -0.08    0.45   -0.42
  16    3.19   -0.12    1.26    1.25   -0.51     0.3
  17   15.39    1.78   26.26    7.73    26.4  -35.36
  18   -0.62     6.8    0.43    6.35   10.72    7.27
  19   -4.07    3.32    5.28    8.76    9.89   14.02
  20    0.36   -0.06    1.26    1.74    1.24    1.57
  21      -5     2.5   15.72   43.88     8.1   24.18
  22    1.23    0.66    2.01     0.1    1.68    1.94
  23    0.78     0.2    0.07    0.63    -0.1    0.72
  24    0.81    2.59    1.17   -0.29   -0.03    0.21
  25    2.02   -0.18    0.36    0.15    0.82     0.1
  26    1.59    0.07    1.14    0.42    0.56   -0.26
  27     0.7    0.21    0.05    0.45    0.03    0.12
  28    1.13    0.25    0.55   -0.09   -0.02   -0.48
  29    0.87    0.35   -0.05    0.48    0.21    0.25
  30     0.3    0.28    0.74    0.52     0.5     0.5

Bibliography
[1] Ackerman, T. A.
(1990). An evaluation of the multidimensional parallelism of the EAAP Mathematics Test. Paper presented at the Meeting of the American Educational Research Association, Boston, MA.
[2] Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29(1), 67-91.
[3] Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17, 251-269.
[4] Baker, F. B. (1990). Some observations on the metric of BILOG results. Applied Psychological Measurement, 14, 139-150.
[5] Beguin, A. A., & Glas, C. A. W. (1998). ED428100. MCMC Estimation of Multidimensional IRT models. Research Report 98-14.
[6] Besag, J., Green, P. J., Higdon, D. M., and Mengersen, K. L. (1995). Bayesian Computation and Stochastic Systems (with discussion). Statistical Science, 10, 3-66.
[7] Birnbaum, A. (1957). Efficient design and use of tests of a mental ability for various decision-making problems. (Series Report No. 58-16. Project No. 7755-23). USAF School of Aviation Medicine, Randolph Air Force Base, Texas.
[8] Birnbaum, A. (1958a). Further considerations of efficiency in tests of a mental ability. Technical Report No. 17. Project No. 7755-23, USAF School of Aviation Medicine, Randolph Air Force Base, Texas.
[9] Birnbaum, A. (1958b). On the estimation of mental ability. Series Report No. 15. Project No. 7755-23, USAF School of Aviation Medicine, Randolph Air Force Base, Texas.
[10] Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord and M. R. Novick (Eds.), Statistical Theories of Mental Test Scores (pp. 397-472). Reading, MA: Addison-Wesley.
[11] Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika.
[12] Bock, R. D., Gibbons, R., & Muraki, E. J. (1988).
Full information item factor analysis. Applied Psychological Measurement, 12, 261-280.
[13] Carlson, J. E. (1987). Multidimensional item response theory estimation: A computer program (Research Report ONR 87-2). Iowa City, IA: The American College Testing Program.
[14] Bock, R. D. and Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-197.
[15] Bolt, D. M. and Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov Chain Monte Carlo. Applied Psychological Measurement, 27(6), 395-414.
[16] De-la-Torre, J., & Patz, R. J. (2001). ED 464143. Item Response Theory Equating Using Bayesian Informative Priors. Paper presented at the Annual Meeting of the National Council on Measurement in Education (Seattle, WA, April 11-13, 2001).
[17] Embretson, S. E. and Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates.
[18] Fox, J. P. (2002). Multilevel IRT Using Dichotomous and Polytomous Response Data. Research Report.
[19] Fraser, C. (1988). NOHARM II. A Fortran program for fitting unidimensional and multidimensional normal ogive models of latent trait theory. Armidale, Australia: The University of New England, Center for Behavioral Studies.
[20] Gamerman, D. (1997). Markov Chain Monte Carlo. New York: Chapman & Hall.
[21] Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analysis. Second Edition. Chapman & Hall.
[22] Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457-472.
[23] Geman, S. and Geman, D. (1984). Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
[24] Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (Eds.). (1996). Markov Chain Monte Carlo in Practice. London: Chapman and Hall.
[25] Gill, J.
(2002). Bayesian methods for the social and behavioral sciences. Chapman & Hall/CRC.
[26] Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
[27] Hambleton, R. K. and Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Kluwer Nijhoff Publishing.
[28] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.
[29] Hulin, C. L., Lissak, R. L., and Drasgow, F. (1982). Recovery of two and three parameter logistic item characteristic curves: A Monte Carlo study. Applied Psychological Measurement, 6, 249-260.
[30] Kiefer, J., and Wolfowitz, J. (1956). Consistency of maximum likelihood estimates in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887-890.
[31] Kim, S. H., & Cohen, A. S. (1998). ED420689. An Evaluation of a Markov Chain Monte Carlo Method for the Two-parameter Logistic Model.
[32] Lehmann, E. L., & Casella, G. (1998). Theory of point estimation. Second edition. Springer-Verlag New York, Inc.
[33] Li, J. C., & Woodruff, D. J. (2001). ED 462419. Bayesian Statistical Inference for Coefficient Alpha. ACT Research Report Series.
[34] Little, R. J. A., and Rubin, D. B. (1983). On jointly estimating parameters and missing data by maximizing the complete-data likelihood. The American Statistician, 37, 218-220.
[35] Lord, F. (1952). A theory of test scores. Psychometric Monograph, No. 7.
[36] Lord, F. (1953). The relation of test score to the trait underlying the test. Educational and Psychological Measurement, 13, 517-548.
[37] Lord, F., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
[38] Lord, F. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
[39] Maris, G. & Maris, E. (2002). A MCMC-Method for Models with Continuous Latent Responses. Psychometrika, 67(3), 335-350.
[40] Matthews-Lopez, J.
L., & Hombo, C. M. (2001). ED 454268. Modeling the Hyperdistribution of Item Parameter to Improve the Accuracy of Recovery in Estimation Procedures.
[41] McDonald, R. P. (1967). Nonlinear factor analysis. Psychometric Monographs, No. 15.
[42] McDonald, R. P. (1985). Unidimensional and multidimensional models for item response theory. In D. J. Weiss (Ed.), Proceedings of the 1982 Computerized Adaptive Testing Conference (pp. 127-148). Minneapolis: University of Minnesota, Department of Psychology, Psychometrics Methods Program.
[43] McKinley, R. L., & Reckase, M. D. (1983). MAXLOG: A computer program for the estimation of the parameters of a multidimensional logistic model. Behavior Research Methods & Instrumentation, 15, 389-390.
[44] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087-1092.
[45] Muthen, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115-132.
[46] Neyman, J., and Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16(1), 1-32.
[47] Patz, R. J., & Junker, B. W. (1999a). A Straightforward Approach to Markov Chain Monte Carlo Methods for Item Response Models. Journal of Educational and Behavioral Statistics, 24(2), 146-178.
[48] Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT: multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24(4), 342-366.
[49] Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
[50] Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests. Chicago: The University of Chicago Press.
[51] Reckase, M. (1996).
A linear logistic multidimensional model for dichotomous item response data. In W. van der Linden & R. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York: Springer-Verlag.
[52] Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1988). Unidimensional data from multidimensional tests and multidimensional data from unidimensional test.
[53] Reckase, M. D. & Hirsh, T. M. (1991). Interpretation of number-correct scores when the true number of dimensions assessed by a test is greater than two. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.
[54] Roussos, L. A. (1995). A new dimensionality estimation tool for multiple-item tests and a new DIF analysis paradigm based on multidimensionality and construct validity. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.
[55] Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Monograph, No. 17.
[56] Samejima, F. (1972). A general model for free-response data. Psychometric Monograph, No. 18.
[57] Segall, D. O. (2001). General ability measurement: an application of multidimensional item response theory. Psychometrika, 66(1), 79-97.
[58] Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61(2), 331-354.
[59] Stegelmann, W. (1983). Expanding the Rasch model to a general model having more than one dimension. Psychometrika, 48, 259-267.
[60] Tierney, L. (1991). Exploring Posterior Distributions Using Markov Chains. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface. E. M. Keramidas (Ed.). Fairfax Station, VA: Interface Foundation. pp. 563-570.
[61] van der Linden, W. J., & Hambleton, R. K. (1996). Handbook of modern item response theory. New York: Springer.
[62] Whitely, S. E. (1980). Multicomponent latent trait models for ability tests. Psychometrika, 45, 479-494.
[63] Williamson, D.
M., Johnson, M. S., Sinharay, S., & Bejar, I. I. (2002). ED 464948. Hierarchical IRT Examination of Isomorphic Equivalence of Complex Constructed Response Tasks.
[64] Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA.
[65] Wollack, J. A., Bolt, D. M., Cohen, A. S., & Lee, Y. S. Recovery of Item Parameter in the Nominal Response Model: A Comparison of Marginal Maximum Likelihood Estimation and Markov Chain Monte Carlo Estimation. Applied Psychological Measurement, 26(3), 339-352.