THE NEW GOODNESS-OF-FIT INDEX FOR THE MULTIDIMENSIONAL ITEM RESPONSE MODEL

By

Shu-chuan Kao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

2007

ABSTRACT

THE NEW GOODNESS-OF-FIT INDEX FOR THE MULTIDIMENSIONAL ITEM RESPONSE MODEL

By Shu-chuan Kao

The current research is concerned with the goodness-of-fit of the multidimensional item response theory (MIRT) model to binary test data. Based on the R² analog proposed by Estrella (1998) for the dichotomous dependent variable model, a new goodness-of-fit index, the RLR index (Ratio of Log Residuals), was proposed to reflect the ratio of error reduction when adding dimensions to the MIRT model. The RLR index demonstrated desirable statistical properties in terms of the results of two simulation studies. Compared to the G² test and the G² difference test from TESTFACT, the RLR index could identify true dimensionality with Type I error rates less than .05 and demonstrated high statistical power to reject wrong models in most cases. The findings also indicated that the RLR index was sensitive to different levels of item discrimination, the variation of item difficulties, inter-factor correlation, and item-factor structure. It was also found that a large sample size and a long test generated more accurate dimensionality decisions. Regarding the analysis of real data, one statistical dimension was suggested to describe the Grade 4 Mathematics Test of the Michigan Educational Assessment Program (MEAP) testing program. The unidimensional finding was supplemented with discussion of the test item content, the representativeness of the content-related dimensions, the definition of dimensionality, and the assumptions of the compensatory MIRT model.

ACKNOWLEDGEMENTS

There are a number of people who have helped me to this point, and whom I would like to acknowledge. First, I wish to express my sincere gratitude to my advisor and dissertation chair, Professor Mark Reckase. He has held my hand throughout my pursuit of the Ph.D., serving as both a good friend and an advisor. His guidance and support not only made this dissertation possible, but also enabled me to become a psychometrician. I would also like to thank my dissertation committee members: Professor Richard Houang, Professor Yeow Meng Thum, and Professor Alexander von Eye. Their comments, suggestions, and encouragement not only made for a stronger dissertation but also helped me get a better view of research.
Special thanks should go to my friends: Jane Lin, who has been my C language tutor for the past three years, gave me unwavering support and helped me correct my malfunctioning pointers; Jules and Andy, who have been my good friends and mathematics tutors, patiently taught me the essence of mathematics; Kang-Hung, who has always been willing to share his expertise in econometrics with friends, helped me find a different approach to investigating dimensionality; Yun-Jia, who has been a source of encouragement for the past three years, helped me collect laptops to run TESTFACT and find the resources to finish this study. I also have to thank all the friends who have been around me and brought light to my life. Finally, my deepest appreciation goes to my family, who are my lifelong support and have given me the freedom to fulfill my dreams. My experience of doing this dissertation has been quite rewarding. Without the pain, there would be no harvest. Without conducting this research, I would not have had the chance to learn that these professors are so smart and warm, and that my friends are so lovely and supportive.

TABLE OF CONTENTS

LIST OF TABLES ..... vi
LIST OF FIGURES ..... viii
CHAPTER 1 INTRODUCTION ..... 1
1.1 Different Perspectives to Investigate Data Dimensionality ..... 1
1.2 Dimensionality and Multidimensional Item Response Theory ..... 4
1.3 Purpose of the Study ..... 6
CHAPTER 2 LITERATURE REVIEW ..... 7
2.1 Multidimensional Item Response Theory ..... 7
2.2 Review of Goodness-of-Fit Indices for Multidimensional Item Response Models ..... 11
2.2.1 Exploratory Linear Factor Analysis ..... 12
2.2.2 Confirmatory Linear Factor Analysis ..... 15
2.2.3 Bivariate-Information Nonlinear Factor Analysis (NOHARM) ..... 17
2.2.4 Full-Information Item Factor Analysis (TESTFACT) ..... 20
2.3 The Development of the Goodness-of-Fit Index for the MIRT Model ..... 23
2.3.1 The R² in the Ordinary Least Squares Model ..... 23
2.3.2 The R² Analog in the Dichotomous Dependent Variable Model ..... 27
2.3.3 The RLR in the Multidimensional Item Response Model ..... 32
CHAPTER 3 METHOD ..... 43
3.1 Simulation Study I (Unidimensional Data Sets) ..... 43
3.1.1 Research Design ..... 43
3.1.2 Generation of Item Parameters and Response Patterns ..... 45
3.1.3 Analysis Procedures and Computer Programs ..... 47
3.1.4 Evaluation Criterion ..... 48
3.2 Simulation Study II (Multidimensional Data Sets) ..... 49
3.2.1 Research Design ..... 50
3.2.2 Generation of Item Parameters and Response Patterns ..... 59
3.2.3 Analysis Procedures and Computer Programs ..... 59
3.2.4 Evaluation Criterion ..... 59
3.3 Real Data Analysis ..... 60
CHAPTER 4 RESULTS ..... 63
4.1 Simulation Study I (Unidimensional Data Sets) ..... 63
4.1.1 Results of the Summary Statistics ..... 64
4.1.2 Results of Multivariate Analysis of Variance for Study I ..... 70
4.1.3 Comparisons of the Numbers of Rejections ..... 76
4.2 Simulation Study II (Multidimensional Data Sets) ..... 79
4.2.1 Results of the Summary Statistics ..... 80
4.2.2 Results of Multivariate Analysis of Variance for Study II ..... 87
4.2.3 Comparisons of the Numbers of Rejections ..... 106
4.3 Real Data Analysis ..... 110
CHAPTER 5 SUMMARY, DISCUSSION, AND CONCLUSION ..... 112
5.1 Summary of the Research ..... 112
5.2 Discussion ..... 113
5.3 Conclusion ..... 125
5.4 Limitations, Implications, and Suggestions for Future Research ..... 126
APPENDIX A: Mathematical Derivation of Estrella's (1998) R² Analog ..... 131
APPENDIX B: The Conditional Distributions of the RLR Values in Simulation Study I ..... 132
APPENDIX C: The Conditional Distributions of the RLR Values in Simulation Study II ..... 156
REFERENCES ..... 174

LIST OF TABLES

Table 3.1.1. Simulation tests for Study I ..... 46
Table 3.2.1. Levels of the item-factor structure ..... 56
Table 3.2.2. Simulated tests for Study II ..... 58
Table 4.1.1. The number of unsuccessful TESTFACT runs for long tests in Study I ..... 64
Table 4.1.2. Summary statistics of the RLR index for short tests ..... 66
Table 4.1.3. Summary statistics of the RLR index for long tests ..... 67
Table 4.1.4. The multivariate test for Study I ..... 71
Table 4.1.5. The univariate test for Study I ..... 72
Table 4.1.6. The number of rejections in 100 replications for unidimensional data ..... 78
Table 4.2.1. The number of unsuccessful TESTFACT runs in Study II ..... 80
Table 4.2.2. Summary statistics of the RLR index for two-dimensional data sets ..... 82
Table 4.2.3. Summary statistics of the RLR index for three-dimensional data sets ..... 83
Table 4.2.4. The multivariate test for Study II ..... 87
Table 4.2.5. The univariate test for Study II ..... 88
Table 4.2.6. The multivariate test for the two-dimensional data ..... 90
Table 4.2.7. The multivariate test for the three-dimensional data ..... 90
Table 4.2.8. Univariate test for two-dimensional data ..... 91
Table 4.2.9. Univariate test for three-dimensional data ..... 92
Table 4.2.10. The number of rejections in 100 replications for two-dimensional data ..... 103
Table 4.2.11. The number of rejections in 100 replications for three-dimensional data ..... 110
Table 4.3.1. The RLR indices for the MEAP Grade 4 Mathematics Test data ..... 111
Table 4.3.2. Item parameter estimates and the test of unidimensionality ..... 111

LIST OF FIGURES

Figure 2.1.1. Item vector plot (a1 = 1, a2 = 0.6, d = −0.5) ..... 10
Figure 2.3.1. The cumulative density function F(x) ..... 30
Figure 2.3.1. The observed distribution of the LR statistic from the data generated by the constrained MIRT model ..... 35
Figure 2.3.2. The distributions of R1², R2², and RLR1 for the constrained-model data ..... 38
Figure 2.3.3. The distributions of R² and RLR for the three-dimensional data ..... 39
Figure 2.3.4. The distributions of R1², R2², and RLR1 for the 25-dimensional model data ..... 40
Figure 2.3.5. The distributions of R1², R2², and RLR1 for the random data ..... 41
Figure 3.2.1. The scree plot of matrices M1, M2, and M3 ..... 51
Figure 3.2.2. The relationship between the slope of eigenvalues and the determinant ..... 53
Figure 3.2.3. Selecting correlation matrices in terms of the slope of eigenvalues and the determinant of the correlation matrix ..... 55
Figure 4.1.1. The change of RLR with dimensionality for a 25-item test and 2000 examinees ..... 68
Figure 4.1.2. The change of RLR with dimensionality for a 25-item test and 6000 examinees ..... 68
Figure 4.1.3. The change of RLR with dimensionality for a 50-item test and 2000 examinees ..... 69
Figure 4.1.4. The change of RLR with dimensionality for a 50-item test and 6000 examinees ..... 69
Figure 4.1.5. The interaction of A, D, and S in RLR1 for the 25-item test ..... 73
Figure 4.1.6. The interaction of A, D, and S in RLR1 for the 50-item test ..... 73
Figure 4.1.7. The interaction of A, D, and S in RLR2 for the 25-item test ..... 74
Figure 4.1.8. The interaction of A, D, and S in RLR2 for the 50-item test ..... 74
Figure 4.1.9. The interaction of A, D, and S in RLR3 for the 25-item test ..... 75
Figure 4.1.10. The interaction of A, D, and S in RLR3 for the 50-item test ..... 75
Figure 4.2.1. The change of RLR with dimensionality for the correlation matrix C1 ..... 84
Figure 4.2.2. The change of RLR with dimensionality for the correlation matrix C2 ..... 84
Figure 4.2.3. The change of RLR with dimensionality for the correlation matrix C3 ..... 85
Figure 4.2.4. The change of RLR with dimensionality for the correlation matrix C4 ..... 85
Figure 4.2.5. The change of RLR with dimensionality for the correlation matrix C5 ..... 86
Figure 4.2.6. The change of RLR with dimensionality for the correlation matrix C6 ..... 86
Figure 4.2.7. The interaction of A and I in RLR1 given correlation matrix C1 ..... 94
Figure 4.2.8. The interaction of A and I in RLR1 given correlation matrix C2 ..... 94
Figure 4.2.9. The interaction of A and I in RLR1 given correlation matrix C3 ..... 95
Figure 4.2.10. The interaction of A and I in RLR1 given correlation matrix C4 ..... 95
Figure 4.2.11. The interaction of A and I in RLR1 given correlation matrix C5 ..... 96
Figure 4.2.12. The interaction of A and I in RLR1 given correlation matrix C6 ..... 96
Figure 4.2.13. The interaction of A and I in RLR2 given correlation matrix C1 ..... 97
Figure 4.2.14. The interaction of A and I in RLR2 given correlation matrix C2 ..... 97
Figure 4.2.15. The interaction of A and I in RLR2 given correlation matrix C3 ..... 98
Figure 4.2.16. The interaction of A and I in RLR2 given correlation matrix C4 ..... 98
Figure 4.2.17. The interaction of A and I in RLR2 given correlation matrix C5 ..... 99
Figure 4.2.18. The interaction of A and I in RLR2 given correlation matrix C6 ..... 99
Figure 4.2.19. The interaction of A and I in RLR3 given correlation matrix C1 ..... 100
Figure 4.2.20. The interaction of A and I in RLR3 given correlation matrix C2 ..... 100
Figure 4.2.21. The interaction of A and I in RLR3 given correlation matrix C3 ..... 101
Figure 4.2.22. The interaction of A and I in RLR3 given correlation matrix C4 ..... 101
Figure 4.2.23. The interaction of A and I in RLR3 given correlation matrix C5 ..... 102
Figure 4.2.24. The interaction of A and I in RLR3 given correlation matrix C6 ..... 102
Figure 4.2.25. The interaction of A and I in RLR4 given correlation matrix C1 ..... 103
Figure 4.2.26. The interaction of A and I in RLR4 given correlation matrix C2 ..... 103
Figure 4.2.27. The interaction of A and I in RLR4 given correlation matrix C3 ..... 104
Figure 4.2.28. The interaction of A and I in RLR4 given correlation matrix C4 ..... 104
Figure 4.2.29. The interaction of A and I in RLR4 given correlation matrix C5 ..... 105
Figure 4.2.30. The interaction of A and I in RLR4 given correlation matrix C6 ..... 105

CHAPTER 1
INTRODUCTION

Dimensionality plays an important role in test score interpretation and in the validity of inferences made from tests, and it is one of the critical issues in educational measurement. For many testing practitioners, it seems unreasonable to use common data analysis procedures that assume the data are unidimensional when the assessment tools, especially achievement tests, are designed to measure multiple areas of content knowledge and skill. When tests are planned to measure different cognitive abilities or content knowledge, and examinees are required to demonstrate more than one ability to answer items correctly, the properties of the resulting test response data are difficult to describe. For instance, a mathematics test may contain "story-type" questions. From the psychological point of view, examinees will have to use mathematical skills and reading abilities to correctly answer such questions.
From the statistical point of view, psychometricians may need more than one statistical variable to represent each person in order to sufficiently model the interaction between test items and examinees. Describing the statistical characteristics of potentially multidimensional data with traditional procedures that assume unidimensionality may not only cause measurement problems but also lead to inaccurate score interpretation.

1.1 Different Perspectives to Investigate Data Dimensionality

With the intention of investigating the likely multidimensional nature embedded in item response data, psychometricians have developed different perspectives on how to interpret dimensionality. Based on Embretson's (1985) definition, dimensionality indicates the number of hypothesized psychological constructs required for successful performance on a test. This definition of dimensionality can be referred to as "psychological dimensionality." In psychological measurement, the number of dimensions in the model is often based on cognitive theories, and each dimension represents a specific latent trait being modeled. In educational testing, the psychological constructs are often attributed to content domains of interest, reflecting the purpose of the test. However, in real testing situations, the sources of multidimensionality are still unclear. Besides the desired psychological traits or content knowledge, other undesirable factors that may cause multidimensionality include different item formats (Tate, 2002); test speededness (Bock, Gibbons, & Muraki, 1988; Douglas, Kim, Habing, & Gao, 1998); item dependency from testlet items (Ferrara, Huynh, & Michaels, 1999; Thissen, Steinberg, & Mooney, 1989); and inappropriate design of test administration conditions (Tate, 2002). Determining the number of psychological dimensions to model test data, or deciding how well the model fits the data, requires validity studies to supplement the statistical index. This implies that even if a test is known to require examinees to demonstrate two different cognitive abilities to answer the items correctly, validation studies are needed to verify that the two psychological dimensions in the model match the hypothesized constructs.

Another definition of dimensionality is based on the statistical properties of the test data. According to Lord and Novick's (1968) definition, dimensionality is the total number of abilities required to satisfy the assumption of local independence. This assumption indicates that an examinee's responses to the items in a test are statistically independent if the examinee's ability level is taken into account; the probability of any particular item response pattern for an examinee is the product of the individual item probabilities. When the assumption of local independence is satisfied, the complete latent space is defined and, at the same time, the number of dimensions needed to summarize the data is specified. In terms of these explanations, this kind of definition of dimensionality can be referred to as "statistical dimensionality." Unlike the psychological dimension, determination of the number of statistical dimensions depends on the mathematical properties of the data under the assumptions of local independence and monotonicity.¹ Harrison (1986) and Tate (2002) concluded that every set of test responses is multidimensional to some degree.
To decide the data dimensionality, many researchers (Berger & Knol, 1990; Junker & Stout, 1994) suggested that the latent traits that underlie test data can be classified as major (i.e., dominant) and minor factors. Humphreys (1985) argued that the construction of tests that are valid for their intended purposes requires tests that are sensitive to differences on a dominant trait and numerous minor factors. In order to measure the dominant factor of interest (e.g., computation ability), the inclusion of numerous minor factors is inevitable. Wainer and Thissen (1996) suggested that item responses will always reflect either random or fixed multidimensionality. Random multidimensionality is caused by the presence of minor or nuisance dimensions other than those planned to determine the responses. Fixed multidimensionality corresponds to the number of dimensions the test is designed to measure. Concerning the unidimensionality assumption of IRT, Ackerman (1994) pointed out that unidimensionality should never be assumed but should be verified. It would be considered problematic to analyze multidimensional data with statistical procedures that assume the data are unidimensional.

¹ Suppes and Zanotti (1981) proved that all data can be modeled unidimensionally when the restriction of monotonicity is relaxed. In this case, dimensionality is no longer an issue in data modeling. However, the explanation of the relationship between ability and item response will be obscure.

To clarify the connections and distinctions between psychological and statistical dimensions, researchers (Reckase, 1990; Reckase, Ackerman, & Carlson, 1988) defined dimensionality as the minimum number of mathematical variables needed to summarize a matrix of item response data. In other words, to fully describe all the differences related to the test for the examinees in the population, the minimum number of statistical abilities required in the model is considered the test dimensionality. Reckase (1990) indicated that, for a test to be modeled unidimensionally by statistical procedures that assume unidimensionality, the test does not have to measure a narrowly defined, pure psychological trait. Test items that measure the same combination of traits will likely generate unidimensional data when examinees interact with them. Therefore, it is possible to have statistically unidimensional item response data even though the number of psychological dimensions needed to correctly answer the questions is greater than one.

1.2 Dimensionality and Multidimensional Item Response Theory

Determining the number of dimensions needed to explain item response data is often of substantive or methodological interest not only for educational measurement, but also for psychological studies. Spearman (1904) first argued that performance on sets of tests could be explained by individuals' levels on general and specific traits. Since then, determining the number of dimensions needed to summarize a set of data has been an important research question. The study of test dimensionality is an essential issue for the investigation of test construction, test validity, reliability, fairness, and the interpretation and use of test scores (Choi, 1997; Tate, 2002). Over the past decades, a number of studies have been conducted to explain test data while relaxing the restriction of the unidimensionality assumption, and the methodology of Multidimensional Item Response Theory (MIRT) has become more widely accepted.
MIRT offers a new methodology to analyze test data in such an elaborate way that item characteristics are independent of the sample and the examinees' ability estimates are not test-dependent. However, the appropriate use of any MIRT model depends upon a good fit between model and data. All the MIRT-related testing techniques, such as multidimensional parallelism, multidimensional equating, and multidimensionality-based computerized adaptive testing, can be performed only when the data dimensionality is specified. Thus, it can be concluded that the applicability of MIRT rests on the availability of an appropriate model-data-fit index. Beyond generating different mathematical MIRT models, researchers have also proposed various model-data-fit indices to help determine the appropriate number of dimensions used in MIRT models. However, no procedure for MIRT model selection has been universally accepted so far. Even though MIRT calibration computer programs, such as TESTFACT (Wilson, Wood, Gibbons, Schilling, Muraki, & Bock, 2003) and NOHARM (Fraser, 1988), are available, the problem of deciding the number of dimensions needed to model the data is still very much a topic of investigation. The current goodness-of-fit indices (e.g., the G² test provided by TESTFACT and the indices based on residual analysis) do not demonstrate good statistical properties in dimensionality detection (Berger & Knol, 1990; De Champlain & Gessaroli, 1991; Hambleton & Rovinelli, 1986; Mislevy, 1986). In order to correctly analyze test data with MIRT, the development of a valid model-data-fit statistic is not only desirable, but necessary.

1.3 Purpose of the Study

The main purpose of this study is to propose and assess the use of a new goodness-of-fit index for MIRT model selection. Specifically, the degree to which minor factors should be considered significant was evaluated in terms of the proposed index. Based on the results of simulation studies, the research demonstrated the accuracy and stability of the proposed goodness-of-fit index in detecting the true dimensionality of test data under various testing conditions. The statistical characteristics of the proposed index were compared with those of the traditional χ² tests. Besides demonstrating the statistical properties for simulated data, real test data were used to show the applicability of the proposed index in a real testing situation. The significance of the study is to offer a more reliable and testable goodness-of-fit index with which to determine the number of dimensions needed for the MIRT model to properly calibrate test data. The procedure proposed in this study offers a theoretical base and empirical evidence for deciding the goodness-of-fit of MIRT models. The results of this work have potential use for both theoretical researchers and those who work in applied measurement. With this information, MIRT users would have a better reference for deciding the minimum number of dimensions needed to model test data and could make more valid use of test theories.

CHAPTER 2
LITERATURE REVIEW

To begin this chapter, the MIRT model used in this study is elucidated in detail. The chapter then provides a review of model-fit studies concerning MIRT. Next, a new goodness-of-fit index is proposed along with its theoretical background. Finally, evidence is presented to demonstrate the feasibility of applying the index to describe model-data-fit for the MIRT model.
2.1 Multidimensional Item Response Theory

Psychometricians have developed a number of MIRT models (see Reckase & McKinley, 1982; van der Linden & Hambleton, 1997) that assume a specific form of the item-examinee interaction on the basis of more than one ability dimension and that attempt to decide the number of dimensions and which items measure which dimensions. Classified by their mathematical forms, these models can be distinguished as compensatory or partially compensatory, that is, by whether or not high ability on one trait can compensate for low abilities on other traits. For the compensatory models (e.g., McDonald, 1967; Reckase, 1985; Reckase & McKinley, 1991), performance on an item is determined by a linear combination of the multiple abilities, so that high ability on one dimension can compensate for low abilities on other dimensions. By having high abilities on some dimensions, a probability of 1 for a correct response can be obtained even with very low abilities on other dimensions (Reckase, 1997b). Concerning the partially compensatory models (Sympson, 1978; Whitely, 1980)², the probability of a correct response decreases with an increase in the number of dimensions (Reckase, 1997b). The multiplicative nature of the model allows an examinee to partially compensate for low ability on one dimension by being high on other dimensions.

² For example, Sympson's (1978) model can be expressed as
$$P(X_i = 1 \mid \boldsymbol{\theta}_j, \mathbf{a}_i, \mathbf{b}_i) = \prod_{k=1}^{m} \frac{\exp[a_{ik}(\theta_{jk} - b_{ik})]}{1 + \exp[a_{ik}(\theta_{jk} - b_{ik})]},$$
where k indicates the dimension, and a_ik and b_ik are the discrimination and difficulty parameters, respectively. The root of the second derivative of this equation does not define a difficulty function but gives a single value for each dimension; that is, there is a b parameter for each dimension.

Because most of the research on dimensionality has been done using compensatory models, and because calibration computer programs are currently available only for that type of model, the logistic multidimensional compensatory two-parameter IRT model (Reckase, 1985; Reckase & McKinley, 1991) was employed in this study. In this model, the probability of a correct response to item i can be expressed as
$$P(u_{ij} = 1 \mid \mathbf{a}_i, d_i, \boldsymbol{\theta}_j) = \frac{\exp(\mathbf{a}_i'\boldsymbol{\theta}_j + d_i)}{1 + \exp(\mathbf{a}_i'\boldsymbol{\theta}_j + d_i)}, \qquad (1)$$
where P(u_ij = 1 | a_i, d_i, θ_j) is the probability of a correct response of person j on item i in the k-dimensional ability space, u_ij represents the item response of person j on item i, a_i is a vector of parameters representing the discriminating power of item i, d_i is a parameter related to the difficulty of item i, θ_j is the vector of abilities for examinee j, and e is the mathematical constant 2.7183.

Under this framework, each examinee is represented as a data point in the k-dimensional latent space. This equation defines a surface indicating that the probability of a correct response for a test item is a function of an examinee's location in the ability space specified by the θ-vector. The elements of the θ-vector are statistical constructs that may or may not correspond to any particular psychological traits or educational achievement domains (Reckase, 1997a). Moreover, there is nothing in the model that requires the θ-coordinates to be uncorrelated. The θ-coordinates refer to orthogonal axes, but the coordinates may be correlated. If the correlations among the θ-coordinates are constrained to be 0.0, then the observed correlations among the item scores will be accounted for solely by the discrimination parameters (Reckase, 1997a).
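To make the compensatory structure of equation (1) concrete, the short sketch below (not part of the original text; it assumes only NumPy) computes the model probability for a two-dimensional item and shows how a high ability on one coordinate can offset a low ability on the other.

```python
import numpy as np

def mirt_probability(a, d, theta):
    """Equation (1): probability of a correct response under the
    compensatory two-parameter logistic MIRT model."""
    a = np.asarray(a, dtype=float)           # item discrimination vector a_i
    theta = np.asarray(theta, dtype=float)   # examinee ability vector theta_j
    z = a @ theta + d                        # linear combination of the abilities
    return 1.0 / (1.0 + np.exp(-z))

# Two-dimensional illustration with a = (1.0, 0.6) and d = -0.5
# (the same item that is plotted in Figure 2.1.1 below).
a, d = [1.0, 0.6], -0.5
print(round(mirt_probability(a, d, [2.0, -1.0]), 3))    # 0.711
print(round(mirt_probability(a, d, [-1.0, 2.0]), 3))    # 0.426
```

Because the abilities enter only through the linear combination a_i'θ_j, all ability vectors on the same line a_i'θ_j + d_i = constant yield the same probability, which is what the "compensatory" label refers to.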
The interpretations of the model parameters are somewhat different from those in the UIRT model. The item discrimination parameter for the MIRT model, assuming orthogonal axes, is represented by Reckase and McKinley (1991) as the length of the discrimination vector. This length, MDISC_i, as shown in equation (2), indicates the maximum overall discrimination of item i for the best combination of abilities. The computation of MDISC_i can be expressed as
$$\mathrm{MDISC}_i = \sqrt{a_{i1}^2 + a_{i2}^2 + \cdots + a_{ik}^2}, \qquad (2)$$
where k is the number of dimensions in the θ space, and the a_ik are elements of the vector a_i given in equation (1). The discrimination of an item is a function of the slope at the steepest point of the surface and is greatest in a particular direction in the multidimensional space. The direction of greatest discrimination in the multidimensional space is given by
$$\cos \alpha_{ik} = \frac{a_{ik}}{\mathrm{MDISC}_i}, \qquad (3)$$
where α_ik is the angle from the k-th dimension. The item difficulty parameter, MDIFF_i, is defined as
$$\mathrm{MDIFF}_i = \frac{-d_i}{\mathrm{MDISC}_i}. \qquad (4)$$
This value indicates the distance from the point of best discrimination to the origin. MDIFF_i can be interpreted much like the b-parameter in UIRT: a negative MDIFF_i value suggests an easier item, whereas a positive value indicates a more difficult one.

Graphically, test items can be summarized by a vector plot, so that the geometrical characteristics of MDISC and MDIFF can be clearly represented. A two-dimensional example, shown in Figure 2.1.1, illustrates that the distance from the vector's base to the origin is MDIFF, and the length of the vector is MDISC. The extension of the vector goes through the origin, and the base of the vector is located on the line where examinees have a .50 probability of answering the item correctly. The vector plot allows plotting more than one item on one graph. Item vectors pointing in the same direction measure the same combination of θ1 and θ2. By examining the directions of the item vectors, the similarities among items and the dimensional structure can be identified.

Figure 2.1.1. Item vector plot (a1 = 1, a2 = 0.6, d = −0.5)
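As a worked companion to equations (2)-(4), the following sketch (not part of the original text; it assumes NumPy) computes MDISC, the direction angles, and MDIFF for the item shown in Figure 2.1.1.

```python
import numpy as np

def item_summary(a, d):
    """MDISC, direction angles, and MDIFF for a compensatory MIRT item,
    following equations (2)-(4)."""
    a = np.asarray(a, dtype=float)
    mdisc = np.sqrt(np.sum(a ** 2))            # equation (2): length of a_i
    cosines = a / mdisc                        # equation (3): cos(alpha_ik)
    angles = np.degrees(np.arccos(cosines))    # direction of best discrimination
    mdiff = -d / mdisc                         # equation (4): signed distance to origin
    return mdisc, angles, mdiff

# The item plotted in Figure 2.1.1: a = (1, 0.6), d = -0.5
mdisc, angles, mdiff = item_summary([1.0, 0.6], -0.5)
print(round(mdisc, 3))        # 1.166
print(np.round(angles, 1))    # [31. 59.] degrees from the two axes
print(round(mdiff, 3))        # 0.429 (positive: a relatively difficult item)
```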
2.2 Review of Goodness-of-Fit Indices for Multidimensional Item Response Models

The dimensionality of test data is difficult to assess, and the assessment is often based on personal judgment. Several studies (Berger & Knol, 1990; De Ayala & Hertzog, 1991; De Champlain & Gessaroli, 1996; Douglas, Kim, Roussos, Stout, & Zhang, 1995; Hambleton & Rovinelli, 1986; Hattie, 1984, 1985; Nandakumar, 1994; Roznowski, Tucker, & Humphreys, 1991; Stone & Yeh, 2006; Tate, 2003) have compared the relative effectiveness of statistical procedures for detecting the dimensionality of test data. The available methods for assessing dimensionality can be divided into two types: parametric and nonparametric procedures. The parametric procedures include methods based on the mathematical equivalence between factor analysis models and MIRT models (Knol & Berger, 1991; McDonald, 1967, 1985, 1989a). These studies suggested that the problem of assessing dimensionality in MIRT models for dichotomous data can be approached from a factor-analytic point of view; an interpretation of the multidimensional data structure is derived from the estimated factor loadings of the model. Conversely, the nonparametric procedures involve a collection of methods that avoid the problem of fitting an assumed parametric model.³ The item covariance-based methods only assume that the item response function is monotonic, and assessing dimensionality involves evaluating the conditional item associations. However, to perform goodness-of-fit studies, McDonald and Mok (1995) emphasized that latent trait dimensionality should be assessed on the basis of the misfit of a latent trait model, not by indices that are not based on the model to be fit. Since this study focuses only on the compensatory logistic MIRT model, only the fit indices based on the parametric procedures, which can be classified into four types, are included in the following sections. Even though different methods have been proposed in the past, the focus of the problem is the same: to decide whether the minor factors are large enough to represent significant dimensions, or whether they are merely nuisance in the data.

³ The item covariance-based methods include: Stout's essential unidimensionality procedure (Nandakumar & Stout, 1993; W. F. Stout, 1987) implemented in DIMTEST (W. Stout, Habing, Kim, Roussos, & Zhang, 1993); assessment of multidimensional approximate simple structure with DETECT (Kim, 1994; Zhang & Stout, 1995, 1996); hierarchical cluster analysis HCA/CCPROX (Roussos, 1992, 1993; Roussos, Stout, & Marden, 1998) based on proximity measures; Holland and Rosenbaum's tests of unidimensionality, conditional independence, and monotonicity (Holland & Rosenbaum, 1986; Rosenbaum, 1984); Bejar's dimensionality assessment procedure (Bejar, 1980, 1988); and Tucker and Humphreys's methods based on the principle of local independence and second factor loadings (Roznowski et al., 1991).

2.2.1 Exploratory Linear Factor Analysis

Principal Component Analysis (PCA) and common Linear Factor Analysis (LFA) have been popular methods for exploring the dimensionality of dichotomous test data. In studies using PCA or LFA, determining the number of components is often based on the amount of explained variance from phi or tetrachoric correlation matrices. Among the procedures are the well-known eigenvalue-greater-than-1.0 rule (Kaiser, 1960) and the scree plot test (Cattell, 1966). Phi correlation coefficients generally produce a positive definite correlation matrix and tend to avoid the problem of Heywood cases (Berger & Knol, 1990). However, LFA of the phi correlation matrix was found to overestimate the number of underlying dimensions in any data (Hambleton & Rovinelli, 1986). The identification of spurious difficulty factors is related to the characteristics of the items rather than to true underlying relationships (Guilford, 1941). That is, the choice of cut points affects the values of the expected phi correlation coefficients. Factor analysis of phi correlation matrices of binary variables produced by the same underlying correlation structure but dichotomized at different cut points can conform to factor models with different structures and different numbers of factors (Mislevy, 1986).
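The eigenvalue-based rules of thumb just mentioned are easy to compute. The sketch below (not part of the original text; it assumes NumPy and uses simulated placeholder scores where real item responses would go) applies the eigenvalue-greater-than-1.0 rule to a phi correlation matrix and compares the observed eigenvalues with those from randomly generated data of the same size, the kind of baseline comparison discussed later in this section.

```python
import numpy as np

def phi_eigenvalues(responses):
    """Eigenvalues of the phi (Pearson) correlation matrix of 0/1 item scores."""
    r = np.corrcoef(responses, rowvar=False)
    return np.sort(np.linalg.eigvalsh(r))[::-1]

rng = np.random.default_rng(0)

# Placeholder "observed" data: 2000 examinees x 25 binary items
# (replace with a real response matrix).
observed = rng.binomial(1, 0.6, size=(2000, 25))

# Random baseline with the same dimensions and the same item p-values.
baseline = rng.binomial(1, observed.mean(axis=0), size=observed.shape)

obs_eig, base_eig = phi_eigenvalues(observed), phi_eigenvalues(baseline)
print("eigenvalues > 1.0:", int(np.sum(obs_eig > 1.0)))            # Kaiser rule
print("larger than random baseline:", int(np.sum(obs_eig > base_eig)))
```

For 0/1 scored items the Pearson correlation is the phi coefficient, so no special routine is needed; a tetrachoric version would change only how the correlation matrix is built.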
LFA of the tetrachoric correlation matrix can, in theory, avoid the problem of "difficulty" factors for dichotomous free-response items. Tetrachoric correlation coefficients can produce better estimates of the underlying correlations than phi coefficients, but the required assumptions, such as the latent variables being bivariate normal and measured at the interval level, should be satisfied (De Ayala & Hertzog, 1991). However, when ability distributions are not normal and the item response function is not a normal ogive, the use of tetrachoric correlations is inappropriate (Lord, 1980). Furthermore, tetrachoric correlation coefficients become unstable when extreme values are reached. The tetrachoric correlation matrix will often not be positive definite and is more likely to produce Heywood cases (Berger & Knol, 1990). Although the criticism of the use of tetrachoric correlations in LFA is clear, some researchers still found them useful when applied appropriately. Knol and Berger (1991) considered various common factor analysis methods and concluded that, for large-scale applications, an unweighted common factor analysis of tetrachoric correlations performed as well as other techniques (e.g., full-information factor analysis). Drasgow and Lissak (1983) suggested that interpretation of data dimensionality could be enhanced by comparing the scree plot created from real data to that created from a factor analysis of randomly generated test data containing the same number of items. Ackerman (1994) concluded that these methods may sometimes be inconclusive and lead to spurious counting of dimensions, but that the size of the eigenvalues, in conjunction with a substantive review of the items, can lead to a conclusion about how many essential traits are being measured.

2.2.2 Confirmatory Linear Factor Analysis

McDonald (1981) suggested that the factor analytic models of item response data can be tested with CFA, a technique often considered to be a special case of Structural Equation Modeling (SEM). McDonald and Mok (1995) asserted that the indices developed for SEM under the assumption of continuous variables could be applied to the assessment of dimensionality for tests with dichotomous items.

Akaike Information Criterion (AIC)

To determine data dimensionality, it would be convenient to formulate a criterion to compare the likelihood of a k-factor model against that of the saturated model (Berger & Knol, 1990). Given Bock and Aitkin's (1981) ogive model, the probability of a correct response for ability vector θ_j and item i is
$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_j) = \Phi\!\left(\sum_{k=1}^{m} \lambda_{ik}\theta_{jk} - \gamma_i\right), \qquad (5)$$
where γ_i is a threshold value for item i, θ_jk is the ability of person j on ability dimension k, and λ_ik is the loading of item i on dimension k. Akaike (1974) developed an information-theoretic criterion for identifying optimal and parsimonious models in data analysis. Akaike's information criterion is defined as
$$\mathrm{AIC}(m) = -2\ln L_m + 2K_m, \qquad (6)$$
where L_m is the maximized likelihood of the m-factor model and K_m is the number of independent parameters in the model. The term 2K_m is a penalty term that corrects for over-fitting due to the increasing bias in the first term as the number of parameters in the model increases. AIC(m) is a measure of badness-of-fit, and the minimum value of AIC(m) indicates the "true" dimensionality (Berger & Knol, 1990). The critical value of the AIC statistic is embodied in the penalty for over-fitting, and the Type I error rate decreases exponentially with increased sample size (McKinley, 1989). The AIC index has been recommended as a criterion for model selection because, when computed for a series of models of increasing dimensionality, it attains an optimum value for a model of intermediate dimensionality, thus allowing objective model selection (Berger & Knol, 1990; McDonald & Mok, 1995).
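In practice, the criterion in equation (6) is just a penalized log-likelihood computed for each candidate dimensionality and then minimized. The sketch below (not part of the original text; the log-likelihoods and parameter counts are hypothetical) illustrates the bookkeeping.

```python
def aic(log_likelihood, n_params):
    """Akaike's information criterion, equation (6): AIC(m) = -2 ln L_m + 2 K_m."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical maximized log-likelihoods and parameter counts for
# 1-, 2-, and 3-factor solutions of the same 25-item data set.
fits = {1: (-31650.2, 50), 2: (-31390.8, 74), 3: (-31381.5, 97)}

aic_values = {m: aic(logL, k) for m, (logL, k) in fits.items()}
best = min(aic_values, key=aic_values.get)
print(aic_values)                            # minimum AIC marks the retained model
print("retained number of factors:", best)   # 2 for these hypothetical values
```

Here the two-factor solution would be retained: adding a third factor improves the log-likelihood by too little to offset the additional 2 × 23 penalty.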
The practical performance of AIC with test data has not been conclusive. Berger and Knol (1990) found that the AIC seemed to somewhat outperform the asymptotic χ² statistic, but these results were based on a small number of computer runs with sample sizes of 250 and 500. McKinley (1989) applied the AIC to artificial data fitting a confirmatory multidimensional item response model with a sample size of 1000 and found that AIC outperformed the likelihood ratio χ² test. McDonald (1989b) pointed out, however, that in applications, for a sufficiently small sample size the optimum value must be attained by the unidimensional model, and for a sufficiently large sample size it must be attained by the saturated model. He concluded that AIC behaves just like the χ² significance test itself and cannot be recommended for use with real data.

Muthén's Robust Weighted Least Squares (Mplus)

Muthén proposed a probit function and a robust Weighted Least Squares (WLS) estimation procedure to assess dimensionality. This method was implemented in the computer program LISCOMP (B. Muthén, 1987), which was later replaced by Mplus (L. K. Muthén & Muthén, 1998). According to Muthén (1978), the parameters of the factor analytic model for dichotomous variables can be estimated by minimizing the weighted least-squares fit function
$$F = \tfrac{1}{2}(s - \sigma)'W^{-1}(s - \sigma), \qquad (7)$$
where σ contains the population threshold and tetrachoric correlation values; s includes the sample estimates of the thresholds and the sample tetrachoric correlations; and W is a consistent estimator of the asymptotic covariance matrix of s, multiplied by the total sample size. The F function minimized in the WLS solution asymptotically follows a χ² distribution with df = k(k−1)/2 − t, where k is the number of items and t is the number of parameters estimated in the model. If the null hypothesis is not true, the discrepancy function is distributed asymptotically as a non-central chi-square. With the WLS method, determining dimensionality is based on failing to reject a hypothesized model. That is, the hypothesis testing starts with the unidimensional model and stops when the hypothesized dimensionality is not rejected. In application, Stone and Yeh (2006) found that Mplus worked as well as NOHARM and TESTFACT when guessing was not modeled in the data. Tate (2003) also found that the WLS procedure worked very well for data with no guessing, using an admittedly crude fit index equal to the ratio of χ² to degrees of freedom (χ²/df). However, for data generated with guessing, this procedure produced distortions in the recovery of the true structure (Stone & Yeh, 2006).

2.2.3 Bivariate-Information Nonlinear Factor Analysis (NOHARM)

Starting from Spearman's common factor model, McDonald (1982) showed that IRT models are a special case of Nonlinear Factor Analysis (NLFA). He provided a general framework with a variety of models, including unidimensional/multidimensional, linear/nonlinear, and dichotomous/polytomous models. The NOHARM program (Fraser, 1988) employs McDonald's (1981, 1982, 1985) NLFA, which uses a reparameterization of latent trait theory and "nonlinear harmonic" approximations to the normal ogive (Fraser & McDonald, 1988). In this process, the model is fit by unweighted least squares, which minimizes the squared differences between the observed frequencies of correctly answering items i and j and the predicted frequencies of the joint occurrence of the pair of correct responses.
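The unweighted least-squares idea can be made concrete with a rough sketch. The code below (not part of the original text; it assumes NumPy) evaluates a NOHARM-style residual criterion, the sum of squared differences between observed and model-implied proportions of jointly correct responses over item pairs, except that the model-implied proportions are approximated by Monte Carlo integration over θ with the logistic MIRT model of equation (1) rather than by NOHARM's harmonic approximation to the normal ogive.

```python
import numpy as np

def uls_residual(responses, a, d, n_draws=20000, seed=0):
    """Sum of squared differences between observed and model-implied
    proportions of jointly correct responses, over item pairs."""
    rng = np.random.default_rng(seed)
    a, d = np.asarray(a, dtype=float), np.asarray(d, dtype=float)
    theta = rng.standard_normal((n_draws, a.shape[1]))          # draws from N(0, I)
    p = 1.0 / (1.0 + np.exp(-(theta @ a.T + d)))                # draws x items
    implied = (p.T @ p) / n_draws                               # E[P_i(theta) P_j(theta)]
    observed = (responses.T @ responses) / responses.shape[0]   # observed joint p-values
    upper = np.triu_indices(len(d), k=1)                        # off-diagonal pairs only
    return np.sum((observed - implied)[upper] ** 2)

# Hypothetical use: compare candidate solutions of the same data set.
# X = np.loadtxt("item_scores.txt")          # N x n matrix of 0/1 responses
# print(uls_residual(X, a_two_factor, d_two_factor))
```

A solution that reproduces the pairwise (bivariate) information well drives this criterion toward zero, which is why this family of methods is described as bivariate-information factor analysis.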
Using McDonald's NLFA, researchers have developed various goodness-of-fit indices to decide the dimensionality of test data.

Approximate χ² Test of a Fitted NOHARM Model

Gessaroli and De Champlain (1996) proposed an approximate χ² test to assess dimensionality based on the estimates from NLFA. This approximate χ² statistic, originally proposed by Bartlett (1950) and outlined in Steiger (1980a, 1980b), tests whether all of the off-diagonal elements of the residual correlation matrix are equal to zero after fitting a k-factor NLFA model.

In the ordinary least squares (OLS) model, the Wald statistic W for the null hypothesis that all slope coefficients are zero can be written in terms of R² as
$$W = \frac{nR^2}{1 - R^2}. \qquad (20)$$
In addition, Magee (1990) also showed that the LR statistic for the same null hypothesis is
$$LR = -2\log\!\left(\frac{L_C}{L_U}\right) = -n\log\!\left(\frac{SSE}{SST}\right) = n\log\!\left(\frac{SST}{SSE}\right), \qquad (21)$$
where log L_C = constant − (n/2) log SST is the log-likelihood of the fully constrained model, and log L_U = constant − (n/2) log SSE is the log-likelihood of an unconstrained model.

The model containing predictors is referred to as an unconstrained model because adding a predictor means relaxing a restriction in the maximization of the log-likelihood. Nagelkerke (1991) explained that the value of −2 log L_C indicates the "error variation" of the model with only the intercept term; it is equivalent to the SST in the OLS model. The value of −2 log L_U is similar to the "error variation" of a model with predictors, analogous to the SSE in the OLS model (Menard, 2000). Under the null hypothesis that all the slopes in the population are 0, the LR test follows a χ² distribution with k degrees of freedom, where k is the number of predictors in the model. In the standard linear model with normally distributed errors, there is a simple relationship between R² and the LR statistic because LR is related to W (Vandaele, 1981) such that
$$LR = n \times \log\!\left(1 + \frac{W}{n}\right). \qquad (22)$$
From equations (20) and (22), the relationship between R² and LR can be formulated as
$$R^2 = 1 - \exp\!\left(-\frac{LR}{n}\right). \qquad (23)$$
Just as R² in the OLS model in equation (16) can be interpreted as the proportion of reduction in the error sum of squares, the likelihood-based R² in equation (23) can also be interpreted as the proportion of reduction in the −2 log-likelihood statistic (Menard, 2000). Moreover, Estrella (1998) demonstrated that the relationship between R² and the LR statistic can also be expressed in terms of the LR statistic per observation,
$$\Lambda_{LR} = \frac{LR}{n} = -\frac{2}{n}\log\!\left(\frac{L_C}{L_U}\right), \qquad (24)$$
which takes on values between 0 (misfit) and infinity (perfect fit). According to Estrella, equation (23) can be rewritten as
$$R^2 = 1 - \left(\frac{L_C}{L_U}\right)^{2/n} = 1 - \exp(-\Lambda_{LR}). \qquad (25)$$
The R² in equation (25) may be considered a nonlinear rescaling of the LR statistic per observation (Estrella, 1998). The endpoints of the scale are still interpretable in a straightforward way, indicating a "misfit" and a "perfect fit", respectively. Estrella (1998) also indicated that the difference in the likelihood statistic per observation is related to the difference in R² in an intuitive way, such that
$$\frac{dR^2}{1 - R^2} = d\Lambda_{LR}. \qquad (26)$$
The left side of this equation can be considered a marginal R². This function specifies that the change in Λ_LR can be represented by the change in R². The marginal increment of fit, shown to be consistent with the formal properties of R² in OLS, provides consistently accurate information about goodness-of-fit (Estrella, 1998).
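The identity running through equations (21)-(25) is easy to verify numerically. The sketch below (not part of the original text; it assumes NumPy and simulates an arbitrary regression data set) fits an OLS model, computes R² from the sums of squares, and confirms that it equals 1 − exp(−LR/n).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 3
X = rng.standard_normal((n, k))
y = X @ np.array([0.5, -0.3, 0.2]) + rng.standard_normal(n)

# Unconstrained model: intercept plus k predictors, fit by least squares.
Xc = np.column_stack([np.ones(n), X])
beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
sse = np.sum((y - Xc @ beta) ** 2)       # error sum of squares
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares

r2 = 1.0 - sse / sst                     # equation (16)
lr = n * np.log(sst / sse)               # equation (21)
print(round(r2, 6), round(1.0 - np.exp(-lr / n), 6))   # identical, as in (23)/(25)
```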
2.3.2 The R² Analog in the Dichotomous Dependent Variable Model

In the OLS model, the common assumption is that the error term of the model, ε, consists of iid variates with a mean of zero and a fixed variance. This assumption is violated when the dependent variable in the regression model is dichotomously scored. In this case, a different regression model should be used to describe the relationship between the predictors and the dichotomized dependent variable. A Dichotomous Dependent Variable (DDV) model can be defined in the form of a linear regression
$$y^* = \beta'x + \varepsilon, \qquad (27)$$
where y* is an unobservable variable, β is a vector of k coefficients (the first term is the intercept), and x is a vector of the values of k independent variables. In equation (27), y* is linear in its parameters and may range from −∞ to +∞, depending on the range of x. There is also an observable variable y, which takes only two possible values and is related to y* in the following way: y = 1 if y* > threshold; y = 0 otherwise. With dichotomous data, the outcome must be bounded between 0 and 1. The form of the estimation equation is P(y = 1 | x) = F(β'x), where F is the cumulative distribution function of ε. In practice, F is usually specified as normal or logistic, but any other continuous distribution function whose first two derivatives exist and are well-behaved may be used (Estrella, 1998, p. 198). For a DDV model, the model parameters are estimated by maximum likelihood estimation, and the likelihood can be defined as
$$L = \prod_{y_i = 1} F(\beta'x_i) \prod_{y_i = 0} \left[1 - F(\beta'x_i)\right]. \qquad (28)$$
The likelihood function yields maximum likelihood estimators for the unknown parameters by maximizing the probability of obtaining the observed data. The resulting estimators are those that agree most closely with the observed data.

In the OLS model, there is only one reasonable residual variation criterion for the continuous dependent variable, but there are several possible variation criteria for DDV models (Efron, 1978). Based on their conceptual and mathematical similarity to the familiar R², many R² analogs have been developed for use with models having a DDV (see Estrella, 1998; Kvalseth, 1985; Menard, 2000). In this study, the index proposed by Estrella (1998) was used to assess model-data-fit for test data because of its desirable statistical properties. Estrella's measure of model fit possesses the basic requirements of an R² and has been used mainly in the areas of economics (Estrella, Rodrigues, & Schich, 2003; Herath & Takeya, 2003; Moneta, 2005; Shin & Moore, 2003; Stratmann, 2002) and medical research (Zheng & Agresti, 2000). Based on Estrella's (1998) assertions, this goodness-of-fit index has some important statistical properties that other measures lack:

This measure is constructed by imposing certain restrictions on its relationship with the underlying likelihood ratio statistics. These restrictions, including one expressed in terms of marginal increments in fit, are shown to be consistent with the formal properties of R² in the linear case and to provide consistently accurate signals as to statistical significance. This measure may be interpreted intuitively in a similar way to R² in the linear regression context, even away from the endpoints of its range values (Estrella, 1998, p. 198).

In the standard linear model with normally distributed errors, the relationship between R² and LR is clear. Suppose there are n observations, of which n₁ are cases with y = 1. According to Estrella (1998), under the condition that H₀ is true (all the k−1 slopes are zero), equation (28) is maximized where F(β₀) = ȳ = n₁/n, and the likelihood of the constrained model can be simplified as L_C = ȳ^{n₁}(1 − ȳ)^{n−n₁}. Furthermore, he pointed out that the log-likelihood per observation has a particularly simple form that depends only on ȳ:
$$\Lambda_C(\bar{y}) \equiv \frac{\ln L_C}{n} = \bar{y}\ln(\bar{y}) + (1 - \bar{y})\ln(1 - \bar{y}). \qquad (29)$$
The hypothesis H₀ may be tested using the LR statistic. When H₀ is true, the value of the LR statistic is asymptotically distributed as a χ² with k−1 degrees of freedom. With a dichotomous dependent variable, the approach using equation (25) fails because the LR statistic per observation is bounded (Estrella, 1998). Let Λ be the LR statistic per observation for the DDV model; then Λ can be expressed as
$$\Lambda = \frac{2}{n}\ln\!\left(\frac{L_U}{L_C}\right) = \frac{2}{n}\left(\ln L_U - \ln L_C\right). \qquad (30)$$
When the model fits the data perfectly, the cumulative density function F can be represented as in Figure 2.3.1. In this case, when L_U = 1, Λ reaches its upper bound.

Figure 2.3.1. The cumulative density function F(x)

Estrella (1998) indicated that the upper bound of Λ can be expressed as B = −(2/n) ln L_C = −2Λ_C(ȳ), where Λ_C is defined in equation (29). Based on this formula, the upper bound B is only a function of the log-likelihood per observation. When ȳ approaches either 0 or 1, B approaches 0. The derivation of the R² analog is a differential equation, based primarily on an analogy with the relationship between the marginal R² and the Lagrange Multiplier (LM) statistic in the linear case (Estrella, 1998).
The marginal R² in the linear case may be expressed in terms of the average LM statistic as (Estrella, 1998)
$$\frac{dR^2}{1 - R^2} = \frac{d\Lambda_{LM}}{1 - \Lambda_{LM}}. \qquad (31)$$
The marginal R² increases at a rate inversely proportional to the distance between the current value of the statistic and its upper bound. In the DDV case, as Estrella (1998) explained, a measure based on the statistic Λ may be constructed using the fact that 0 ≤ Λ/B ≤ 1. The index can be designed so that the marginal increase in fit is inversely proportional to 1 − Λ/B, which is the fraction of the "information content" of y that is still unexplained. The goodness-of-fit index, φ, can be defined by solving the differential equation (Estrella, 1998)
$$\frac{d\phi}{1 - \phi} = \frac{d\Lambda}{1 - \Lambda/B}. \qquad (32)$$
With the initial condition φ(0) = 0, the solution of equation (32) is
$$\phi = 1 - \left(1 - \frac{\Lambda}{B}\right)^{B} = 1 - \left(\frac{\ln L_U}{\ln L_C}\right)^{-\frac{2}{n}\ln L_C}. \qquad (33)$$
To demonstrate the derivation of the fit index, the mathematical proof of equation (33) is shown in Appendix A. When Λ = B, φ = 1, so the solution also satisfies the boundary conditions φ(0) = 0 and φ(B) = 1 (Estrella, 1998). Moreover, Estrella (1998) pointed out that if B is replaced by infinity in formula (33), then
$$\lim_{B \to \infty} 1 - \left(1 - \frac{\Lambda}{B}\right)^{B} = 1 - \exp(-\Lambda), \qquad (34)$$
which is exactly the expression for R² in the linear case given in equation (25).

According to Estrella (1998), the goodness-of-fit index φ has several features desired in a measure of model-data-fit. First, the measure takes on values on the unit interval and has a straightforward interpretation at the endpoints; that is, 0 corresponds to no fit and 1 corresponds to a perfect fit. Second, the goodness-of-fit index is based on the maximum likelihood method, which is also the common method used to calibrate test data in the field of educational measurement. Third, this likelihood-based measure can be transformed into an F statistic as described in equation (18). Moreover, the index works well for both dichotomous and continuous dependent variables.
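The closed form in equation (33) needs only the two log-likelihoods. The sketch below (not part of the original text; it assumes NumPy, and for simplicity it plugs in the data-generating probabilities where the fitted probabilities of a logistic regression would normally go) computes Estrella's φ for simulated dichotomous data.

```python
import numpy as np

def estrella_phi(y, p_hat):
    """Estrella's (1998) R-squared analog, equation (33):
    phi = 1 - (ln L_U / ln L_C) ** (-(2/n) * ln L_C)."""
    y = np.asarray(y, dtype=float)
    n = y.size
    ln_lu = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    ybar = y.mean()
    ln_lc = n * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))  # n * equation (29)
    return 1.0 - (ln_lu / ln_lc) ** (-(2.0 / n) * ln_lc)

# Illustration with a hypothetical logistic model.
rng = np.random.default_rng(2)
x = rng.standard_normal(1000)
p_true = 1.0 / (1.0 + np.exp(-(0.3 + 1.2 * x)))
y = rng.binomial(1, p_true)
print(round(estrella_phi(y, p_true), 3))   # a moderate value, well between 0 and 1
```

In an actual application, p_hat would be the fitted probabilities from the maximum likelihood solution, so that ln L_U is the maximized log-likelihood that appears in equation (33).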
Employing the likelihood-based R² analog with the MIRT model, the constrained MIRT model can be simplified as

P(U_{ij} = 1 \mid d_i) = \frac{\exp(d_i)}{1 + \exp(d_i)}.    (37)

This equation indicates that the probability of a correct response on item i depends only on d_i. Under this constrained model, the probability of a correct response to item i is estimated by n_i/n, where n_i is the number of examinees answering the item correctly and n is the sample size. In this case, d_i in equation (37) can be considered a nonlinear transformation of the classical item difficulty, also known as the p-value. The probability of correctly answering an item then depends only on the item difficulty and has nothing to do with the examinees' abilities. For the constrained model, the likelihood function can be expressed as

L_C = L(U \mid d_i) = \prod_{j=1}^{M}\prod_{i=1}^{n} P_i^{\,u_{ij}} (1 - P_i)^{1 - u_{ij}},    (38)

where u_{ij} takes on the value of 1 or 0, indicating a correct or an incorrect response, respectively, and the products run over the M examinees and n items. The likelihood function for the unconstrained model (the MIRT model) is

L_U = L(U \mid \bar{a}_i, d_i, \bar{\theta}_j) = \prod_{j=1}^{M}\prod_{i=1}^{n} P_{ij}^{\,u_{ij}} (1 - P_{ij})^{1 - u_{ij}},    (39)

where u_{ij} again takes on the value of 1 or 0. The probability in equation (39) carries two subscripts, representing a correct response of person j on item i. With Estrella's R² analog method, one can use the likelihood of the constrained model (L_C) and the likelihood of the unconstrained MIRT model (L_U) to express the proportion of the total variance explained by the MIRT model.

The feasibility of applying the R² analog to the MIRT model was first evaluated by examining the distribution of the LR statistic. One of the well-known characteristics of the DDV model is that, when the null hypothesis (all the slopes in the model are 0 in the population) is true, the LR statistic is \chi^2 distributed. With the constrained model in equation (37), 1000 sets of item response data were generated for 25 items and 2000 examinees and then calibrated with the unidimensional MIRT model. The resulting distribution of the LR statistic, as shown in Figure 2.3.1, has a mean of 38.47 and a variance of 70.605. When sampling variation is taken into account, this distribution approximates a \chi^2 distribution, since \sigma^2 \approx 2\mu = 2\nu, where \nu is the degrees of freedom. This LR distribution demonstrates that the MIRT model shares this characteristic with the DDV model and thus can be considered a special form of a DDV model.

Figure 2.3.1. The observed distribution of the LR statistic from the data generated by the constrained MIRT model (horizontal axis: LR; vertical axis: frequency).
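The constrained likelihood in equations (37) and (38) depends only on the item p-values, so its logarithm can be computed directly from a 0/1 response matrix. A minimal sketch with examinees in rows and items in columns (Python, illustrative; in the study this step was carried out with a MATLAB program written by the author, and the MIRT likelihoods came from TESTFACT):

    import numpy as np

    def constrained_loglik(u):
        """ln of equation (38): every examinee gets the same probability for item i,
        namely the observed p-value n_i / n."""
        n = u.shape[0]                          # number of examinees
        n_correct = u.sum(axis=0)               # n_i for each item
        p = np.clip(n_correct / n, 1e-10, 1 - 1e-10)
        return float(np.sum(n_correct * np.log(p) + (n - n_correct) * np.log(1 - p)))

    rng = np.random.default_rng(0)
    u = rng.integers(0, 2, size=(2000, 25))     # placeholder 0/1 data
    print(constrained_loglik(u))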
The R² analog can be used to represent how well the MIRT model fits the test data, but the most critical issue is to indicate whether or not the increase in fit obtained by adding one more dimension to the model is important. In other words, it is useful to have an index reflecting the marginal effect of the "added" dimension on the overall model fit. Given a test data set, two successive MIRT models, the k-dimensional model and the (k+1)-dimensional model, are considered to describe the data. In order to indicate the marginal effect of the (k+1)-th dimension on the overall model fit, the new index is defined as follows.

Let ln L_U^{(k)} be the log-likelihood of the k-dimensional MIRT model, ln L_U^{(k+1)} be the log-likelihood of the (k+1)-dimensional MIRT model, and ln L_C be the log-likelihood of the constrained MIRT model. Then the R² analog for the two models can be expressed as

R_k^2 = 1 - \left(\frac{\ln L_U^{(k)}}{\ln L_C}\right)^{-\frac{2}{n}\ln L_C}   and   R_{k+1}^2 = 1 - \left(\frac{\ln L_U^{(k+1)}}{\ln L_C}\right)^{-\frac{2}{n}\ln L_C}.

Based on equation (16), the percentage of unexplained variance is 1 - R^2 = SSE/SST. Taking the logarithm of both sides, the equation becomes \ln(1 - R^2) = \ln(SSE/SST). Then the ratio of the log residuals (RLR) is defined as

RLR_k = \frac{\ln(1 - R_k^2)}{\ln(1 - R_{k+1}^2)} = \frac{\ln(SSE_k/SST)}{\ln(SSE_{k+1}/SST)} = \frac{\ln\!\left(\ln L_U^{(k)} / \ln L_C\right)}{\ln\!\left(\ln L_U^{(k+1)} / \ln L_C\right)}.    (40)

This index shows whether the percentage of unexplained variance in the (k+1)-dimensional MIRT model is smaller than that in the k-dimensional MIRT model. The k-th dimension in equation (40) can be considered the target dimension; the successive (k+1)-th dimension can be viewed as the reference dimension. Equation (40) focuses on the relative gain in overall model fit obtained by comparing the residuals of the two models. If the k-dimensional model fits the data well, the reduction in SSE due to adding the (k+1)-th dimension should be minor. In this case, the numerator and denominator in equation (40) are close to each other, so the RLR approaches 1. Since the RLR index always compares the SSEs of two successive models, for convenience of discussion only the target dimension will be appended to the index to show the level of dimensionality. For instance, RLR1 stands for the RLR index comparing the SSE of a one-factor model with that of a two-factor model.

The feasibility of using the R² analog and the RLR index to determine dimensionality is demonstrated by showing their empirical distributions in some basic cases. In all of the following examples, 100 sets of item responses were generated for a 25-item test with 2000 examinees. For the different situations, different models were used to generate the desired data. When the data were generated by the constrained model, which has only the intercept term, no dimensionality underlies the data. When such data are explained by the MIRT model, the corresponding model-data fit is reported in Figure 2.3.2. As Panel (A) shows, the distribution of R1² has a mean of 0.0211 and an SD of 0.0031, and the distribution of R2² has a mean of 0.0387 and an SD of 0.0044. The small values of R1² indicate that the unidimensional MIRT model explains little variance in the data. After a second dimension is added to the model, the value of R2² shows little increment, indicating a limited increase in explained variance. The resulting distribution of RLR1 has a mean of 0.5391 and an SD of 0.0412.

Figure 2.3.2. The distributions of R1², R2², and RLR1 for the constrained-model data. Panel (A): the distributions of R1² and R2²; Panel (B): the distribution of RLR1.

Another case offered here is three-dimensional data. Item responses were generated assuming that the three dimensions were independent of each other and that all item discriminations were equal to 1. As shown in Panel (A) of Figure 2.3.3, the mean of R1² is 0.6972 and the SD is 0.0183; the mean of R2² is 0.9084 and the SD is 0.0183; the mean of R3² is 0.9687 and the SD is 0.0033; and the mean of R4² is 0.97 and the SD is 0.003.
Just as in the OLS model, the R² analog rises as the number of dimensions in the model increases. Regarding the distribution of RLR3, when the model fits the data well the index approaches 1. In addition, the distributions of R3² and R4² have a substantial overlapping area, indicating the similarity of the two distributions. Thus, given that the model already fits the data well, the increase in fit from adding another dimension to the model is limited. Concerning the improvement of fit shown in Panel (B), RLR1 has a mean of 0.4995 and an SD of 0.0315; RLR2 has a mean of 0.6925 and an SD of 0.0410; and RLR3 has a mean of 0.996 and an SD of 0.004. When the model under-fits the data, the RLR is low and its distribution is located on the left side of the scale. Conversely, the index shifts to the right end of the scale, with little variation, when the model captures the true dimensionality. The information from these distributions suggests that the RLR index offers clear and useful information about dimensionality.

Figure 2.3.3. The distributions of R² and RLR for the three-dimensional data. Panel (A): the distributions of R1², R2², R3², and R4²; Panel (B): the distributions of RLR1, RLR2, and RLR3.

An example of high-dimensional data was also examined to show the statistical characteristics of the proposed indices in an extreme situation. The item response data were generated with a 25-dimensional MIRT model, assuming that all the dimensions were independent of each other. In addition, the item discriminations were all fixed at 1.0. In this case, one item represented one distinct dimension in the data, and all 25 dimensions had equal dominance. The results, as shown in Figure 2.3.4, indicated that the mean of R1² is 0.0208 with an SD of 0.0034; the mean of R2² is 0.0374 with an SD of 0.0047; and RLR1 has a distribution with a mean of 0.5487 and an SD of 0.0505. The distributions of R1², R2², and RLR1 are similar to those for the constrained model. The values of R1² and R2² indicate that the unidimensional and two-dimensional models explain only a little of the variance in the data. These findings suggest that high-dimensional data have properties similar to those of the constrained-model data. Because of the lack of a dominant factor, the increment in model-data fit from adding dimensions to the model is limited. To explain such data well, complicated high-dimensional models would need to be employed.

Figure 2.3.4. The distributions of R1², R2², and RLR1 for the 25-dimensional model data. Panel (A): the distributions of R1² and R2²; Panel (B): the distribution of RLR1.

The last example offered here shows how the R² analog and the RLR index react to random data. For the distributions shown in Figure 2.3.5, R1² has a mean of 0.0146 and an SD of 0.0056; R2² has a mean of 0.0259 and an SD of 0.0074; and RLR1 has a mean of 0.5762 and an SD of 0.2098. Again, the means of R1² and R2² are as small as those for the constrained model and the 25-dimensional model, but the variation is large. With random data, RLR1 may take any value along the scale.

Figure 2.3.5. The distributions of R1², R2², and RLR1 for the random data. Panel (A): the distributions of R1² and R2²; Panel (B): the distribution of RLR1.
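The quantities behind these illustrations reduce to the log-likelihoods of the calibrated models. A minimal sketch of equation (40), assuming the constrained, k-dimensional, and (k+1)-dimensional log-likelihoods are already available (Python, illustrative; in the study the MIRT log-likelihoods came from TESTFACT and the example values below are hypothetical):

    import math

    def rlr(loglik_k, loglik_k1, loglik_c):
        """Ratio of log residuals, equation (40)."""
        return math.log(loglik_k / loglik_c) / math.log(loglik_k1 / loglik_c)

    # Hypothetical log-likelihoods for one data set
    lnL_C, lnL_1, lnL_2 = -30000.0, -24000.0, -23500.0
    print(round(rlr(lnL_1, lnL_2, lnL_C), 4))   # RLR_1 approaches 1 as lnL_2 approaches lnL_1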
To summarize this chapter, the RLR index has several advantages compared to other statistics.

(1) The calculation of RLR is based on maximum likelihood estimation, which has a strong theoretical foundation, especially with a large sample size.
(2) The index has a sound mathematical background. The derivation of the RLR index is based on the R² analog for the DDV model, which is in accordance with the R² in the linear regression model.
(3) The LR statistic in the MIRT model is \chi^2 distributed, which is consistent with the DDV model when the null hypothesis (all the slopes are zero) is true.
(4) With the RLR index, dimensionality is assessed on the basis of the improvement in model-data fit.
(5) The interpretation of the RLR index is straightforward. The RLR index is the ratio of the log transformations of the unexplained percentages of variance from two regression models. As shown in the preliminary simulations, the RLR index has a lower bound of around .50. When the fit is good, the index approaches 1, indicating that the target dimension should be of use for describing the data.
(6) Furthermore, this statistic has the desirable property of showing the improvement in fit from adding dimensions to the model. Based on this procedure, researchers have a rule of thumb for deciding when an increase in fit is important.
(7) Unlike the \chi^2 test, the index is sensitive to sample size in a way that allows a large sample size to increase the accuracy of identifying the correct dimensionality. Within the limits of the simulations, the index is not inflated by sample size and demonstrates the desired statistical properties.

CHAPTER 3 METHOD

This chapter describes the research designs for exploring the statistical characteristics of the RLR index. Many researchers (Davey, Nering, & Thompson, 1997; Harwell, Stone, Hsu, & Kirisci, 1996) have recommended the use of simulation studies because they offer an opportunity to confirm theoretical results in practice. By manipulating a variety of testing conditions, it is possible to learn the statistical characteristics and the limits of the index of interest. With known dimensionality, two simulation studies representing some basic testing situations were conducted in order to explore the statistical properties of the RLR index. Furthermore, based on the procedures developed in the simulation studies, an analysis of real test data is presented to demonstrate the feasibility of applying the fit index in a real testing situation.

3.1 Simulation Study I (Unidimensional Data Sets)

The focus of Study I is to explore the relationship between the RLR index and item characteristics for different unidimensional data sets. Correspondingly, the effects of test length and sample size on the RLR index are explored as well.

3.1.1 Research Design

Four variables were selected in Study I to simulate different testing conditions.

(1) Item discrimination (A)

When the MIRT model in equation (1) reduces to a unidimensional model, the value of MDISC is the same as the value of the a-parameter. In this study, the unidimensional data were generated in the fashion of the unidimensional Rasch model by setting all a-parameters equal within a test. The values of the a-parameters were fixed at four levels (0.2, 0.4, 0.6, and 0.8), with no variation within each data set.
Low a-parameters imply that the test items were poorly designed, so that the items could not differentiate examinees' abilities well. Consequently, the signal in the test data may be weak, and it would be difficult to identify the true dimensionality of the test data. High a-parameters indicate good items that can differentiate well among examinees with different levels of ability. In this case, it is expected that the goodness-of-fit index can function well in recovering the true dimensionality. Originally, an a-parameter level of 1.0 was included in the pilot study. When calibrated by multidimensional models, however, the simulated data with a-parameters equal to 1.0 consistently generated a singular correlation matrix in TESTFACT. Because the calibrations for multidimensional models never succeeded, the 1.0 level was excluded from Study I. This phenomenon implies that it is unlikely to obtain multidimensional solutions with full-information factor analysis when the item discriminations for unidimensional data are high; the procedure itself can detect the impossibility of getting a multidimensional solution when the data are strongly unidimensional.

(2) Item difficulty (D)

The variation in the distribution of item difficulty affects the sampling variability of tetrachoric correlations (Roznowski et al., 1991). When the spread of item difficulties increases, the tetrachoric correlation matrix tends to be non-Gramian, which causes computational difficulty in maximum likelihood factor analysis (McDonald, 1985). In order to explore how the variation of item difficulty affects full-information factor analysis and the RLR index, the d-parameters were sampled from a normal distribution with a mean of 0 and three levels (0, 0.5, and 1) of standard deviation.

(3) Test length (T)

To explore the possible effect of test length on the value of RLR, short test forms with 25 items and long test forms with 50 items were created. A short test was generated by selecting 25 a- and d-parameters from the predefined item distributions. A 50-item test was generated by adding parallel items to the original 25-item test. It is expected that, as the number of items increases, the unidimensionality of the data should be more accurately identified by the RLR index.

(4) Sample size (S)

According to the literature (Ackerman, 1994; R. L. Turner, Miller, Reckase, Davey, & Ackerman, 1996), 2000 or more examinees are usually suggested for MIRT calibration. In this study, random samples of 2000 and 6000 examinees were drawn from a normal distribution with a mean of 0 and a standard deviation of 1. It is expected that the accuracy of the dimensionality index should vary as a function of sample size.

3.1.2 Generation of Item Parameters and Response Patterns

Given the design of a-parameters (4), d-parameters (3), and test lengths (2), twenty-four combinations of simulated tests were generated. Table 3.1.1 tabulates the label and characteristics of each test. The numbers in the test label represent the levels of the a-parameters, d-parameters, and test length, in that order. Test 321, for example, represents the test having the third level of the a-parameters (0.6), the second level of the SD of the d-parameters (0.5), and the first level of test length (25).
Table 3.1.1. Simulation tests for Study I

a-parameters   SD of d-parameters   Short test form (25 items)   Long test form (50 items)
0.2            0                    Test 111                     Test 112
0.2            0.5                  Test 121                     Test 122
0.2            1                    Test 131                     Test 132
0.4            0                    Test 211                     Test 212
0.4            0.5                  Test 221                     Test 222
0.4            1                    Test 231                     Test 232
0.6            0                    Test 311                     Test 312
0.6            0.5                  Test 321                     Test 322
0.6            1                    Test 331                     Test 332
0.8            0                    Test 411                     Test 412
0.8            0.5                  Test 421                     Test 422
0.8            1                    Test 431                     Test 432

When the simulated tests (24) were combined with the sample sizes (2), forty-eight combinations of testing conditions were generated. In order to explore the consistency of the results in this study, replications were needed. For IRT-based studies, at least 25 replications have been recommended (Harwell et al., 1996). In this study, 100 sets of item response patterns were produced for each combination. Thus, the overall number of observations in Study I is 4800.

Dichotomous item responses were generated by implementing the known item parameters and ability parameters in the model in equation (1). The computed probability was then compared to a random number drawn from a uniform distribution ranging from 0 to 1. If the computed probability was greater than the random number, a response of 1 was generated; if not, a response of 0 was produced. The data simulation was completed with GENDATS, developed by Thompson (undated). This Fortran-based computer program takes as input the MIRT item parameters and an inter-factor correlation matrix, which is used to generate ability vectors based on the standardized normal distribution. The program can simulate multidimensional test data for up to 60 dimensions and can generate ability vectors even when factors are completely correlated in the correlation matrix.

3.1.3 Analysis Procedures and Computer Programs

The calculation of the RLR index depends on being able to compute the maximum likelihood of the constrained model and that of the MIRT model. The likelihood of the constrained model was computed by a MATLAB program written by the author based on equation (38), and the likelihood of the MIRT model was calculated by TESTFACT (Wilson et al., 2003). The values of the likelihood of the constrained model and the MIRT model were then entered into equation (40) to obtain the corresponding RLR value. To decide data dimensionality, MIRT models with different levels of dimensionality were employed to analyze each data set. The test calibration started from the unidimensional MIRT model and continued to the four-dimensional model. For each level of dimensionality, the value of RLR was computed to reflect the increase in model-data fit.

After collecting the RLR values for all 4800 observations, the statistical package SPSS version 12.0 was employed to perform further statistical analyses. A multivariate analysis of variance (MANOVA) was conducted to explore the influence of the manipulated factors on the RLR index at different levels of dimensionality. Furthermore, a regression model was built to decide whether an observed RLR index reflected a good fit between the model and the data.
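The generation rule in Section 3.1.2, comparing the model probability with a uniform random draw, is straightforward to script. A minimal sketch for one unidimensional condition (Python, illustrative; the study itself used the Fortran program GENDATS, and the parameter values below are just examples from the design):

    import numpy as np

    def generate_responses(a, d, theta, rng):
        """0/1 responses from the compensatory MIRT model: a response of 1 is
        produced when the model probability exceeds a uniform random number."""
        logits = theta @ a.T + d                      # examinees x items
        prob = 1.0 / (1.0 + np.exp(-logits))
        return (prob > rng.uniform(size=prob.shape)).astype(int)

    rng = np.random.default_rng(123)
    n_items, n_examinees = 25, 2000
    a = np.full((n_items, 1), 0.6)                    # equal a-parameters, third level
    d = rng.normal(0.0, 0.5, size=n_items)            # d-parameters with SD = 0.5
    theta = rng.normal(0.0, 1.0, size=(n_examinees, 1))
    u = generate_responses(a, d, theta, rng)
    print(u.shape, round(u.mean(), 3))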
3.1.4 Evaluation Criterion

The main purpose of Study I is to determine the level of accuracy of the RLR index in correctly identifying unidimensionality. As shown in Figure 2.3.3, the distributions of the RLR index indicate that the RLR index is low and located on the left side of the scale when the model under-fits the data; when the fit is good, the RLR index shifts to the right side of the scale and approaches 1. The theoretical conditional distribution of RLR_k can be represented as in Figure 4.1.1. When the null hypothesis is true (H0: d = k), the distribution of RLR_k approaches 1 with small variation. Whenever the model under-fits the data, the distribution of RLR_k falls toward the lower end of the scale.

Figure 4.1.1. The theoretical distribution of RLR_k (H0: d = k versus H1: d > k, with the 5% rejection area in the lower tail of the H0 distribution).

In order to decide whether an RLR value shows a good fit between the data and the model, the 5% rejection criterion was set on the lower tail of the RLR distribution obtained when the model captures the true dimensionality. If the observed RLR_k is smaller than the lower bound of a good fit, the null hypothesis, H0: d = k, is rejected. The significance testing starts with the unidimensional model. If the observed RLR1 index is less than the 5% lower bound, then the null hypothesis (H0: d = 1) is rejected. The next significance test then examines whether the observed RLR2 index shows a good fit. Once a given RLR value is greater than the lower bound of a good fit, the null hypothesis is not rejected and the dimensionality can be decided.

To decide the lower bound of a good fit between the model and data, a regression analysis was conducted. Given the information on sample size, test length, the estimated a-parameters, and the estimated d-parameters, the predicted value of the RLR index can be estimated by the regression model. For each testing condition, the number of rejections obtained from the RLR index was compared with those from the G² test in equation (13) and the G² difference test (G²diff) in equation (14). The accuracy of these indices was deemed acceptable if the number of rejections in 100 replications was less than 5 for the true model. In Study I, it is expected that the RLR index should demonstrate a lower Type I error rate than the G² test and the G²diff test for the unidimensional data.

3.2 Simulation Study II (Multidimensional Data Sets)

The goal of the second simulation is to investigate how the RLR index detects dimensionality for different kinds of multidimensional test data. In this study, two- and three-dimensional test data were generated under different conditions.

3.2.1 Research Design

The levels of multidimensionality were manipulated using three essential variables, as follows.

(1) Inter-factor correlation (C)

In order to simulate examinees' multidimensional ability distributions, the correlations between factors (abilities) need to be defined. Indices of dimensionality have long depended on relations among the successive eigenvalues obtained from factor analysis (see Hutten, 1980; Kaiser, 1970; Lord, 1980; Lumsden, 1957). The assumption of the scree test, for example, is that when the eigenvalues are displayed in decreasing order, there will be a clear separation in the fraction of total variance at the point where the unimportant factors begin to be extracted. Using information about the distribution of eigenvalues, Roznowski et al. (1991) proposed the ratio difference index, representing the ratio of the difference between the first two eigenvalues to their subsequent differences, in order to identify data unidimensionality. In this study, a different procedure was proposed: dimensionality was manipulated by sampling correlation matrices in terms of the slope of the eigenvalues and the determinant of the correlation matrix. For a correlation matrix, the slope of the eigenvalues reflects the magnitude and pattern of the inter-factor correlations. By working with the inter-factor correlations, the dimensional structure of the latent trait can be manipulated, and the level of dimensionality can be mapped onto an arbitrary scale.
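Multidimensional ability vectors with a prescribed inter-factor correlation matrix can be drawn from the standardized multivariate normal distribution, and the eigenvalue slope and determinant of a candidate matrix are easy to inspect. A minimal sketch (Python, illustrative; the matrix below is a hypothetical example rather than one of the matrices used in the study, which generated abilities with GENDATS):

    import numpy as np

    # Hypothetical three-factor inter-factor correlation matrix
    C = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])

    print(np.linalg.eigvalsh(C))        # slope of the eigenvalues
    print(np.linalg.det(C))             # determinant of the correlation matrix

    rng = np.random.default_rng(7)
    theta = rng.multivariate_normal(mean=np.zeros(3), cov=C, size=2000)
    print(np.round(np.corrcoef(theta, rowvar=False), 2))   # approximates C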
Combining the six inter-factor correlation matrices, the three levels of item-factor structure (I), and the two levels of item discrimination (A) yielded thirty-six (6 × 3 × 2) combinations of simulated tests. Again, the levels of inter-factor correlation, item-factor structure, and item discrimination were labeled in order as the numbers in the form name. Form 321, for example, represents the test having the third level of the inter-factor correlation (C3), the second level of the item-factor structure (16:16:16), and the first level of item discrimination (M).

Table 3.2.2. Simulated tests for Study II

                                        Item-factor structure
Inter-factor correlation                12:12:24     16:16:16     36:6:6

Two-dimension design
C1 = [1 1 0.7; 1 1 0.7; 0.7 0.7 1]
  Discrimination M                      Form 111     Form 121     Form 131
  Discrimination H                      Form 112     Form 122     Form 132
  (items per dimension)                 (50:50)      (67:33)      (88:12)
C2 = [1 1 0.4; 1 1 0.4; 0.4 0.4 1]
  Discrimination M                      Form 211     Form 221     Form 231
  Discrimination H                      Form 212     Form 222     Form 232
  (items per dimension)                 (50:50)      (67:33)      (88:12)
C3 = [1 1 0; 1 1 0; 0 0 1]
  Discrimination M                      Form 311     Form 321     Form 331
  Discrimination H                      Form 312     Form 322     Form 332
  (items per dimension)                 (50:50)      (67:33)      (88:12)

Three-dimension design
C4 = [1 0.5 0.6; 0.5 1 0.4; 0.6 0.4 1]
  Discrimination M                      Form 411     Form 421     Form 431
  Discrimination H                      Form 412     Form 422     Form 432
  (items per dimension)                 (25:25:50)   (33:33:33)   (76:12:12)
C5 = [1 0.5 0.2; 0.5 1 0.3; 0.2 0.3 1]
  Discrimination M                      Form 511     Form 521     Form 531
  Discrimination H                      Form 512     Form 522     Form 532
  (items per dimension)                 (25:25:50)   (33:33:33)   (76:12:12)
C6 = [1 0 0; 0 1 0; 0 0 1]
  Discrimination M                      Form 611     Form 621     Form 631
  Discrimination H                      Form 612     Form 622     Form 632
  (items per dimension)                 (25:25:50)   (33:33:33)   (76:12:12)

Under each test label, the numbers in parentheses specify the percentage of items per dimension in the data. With the correlation matrices C1, C2, and C3, two-dimensional data were generated because the first two factors converged into one factor; the items originally sensitive to the first and second factors therefore converged into a larger item cluster. With structure 1, 50% of the items loaded on the converged first dimension and the remaining 50% of the items loaded on the other dimension. With structure 2, 67% of the items were grouped into the first dimension and the remaining 33% into the second dimension. With structure 3, 88% of the items were clustered into one dimension and the remaining 12% formed the second dimension. For the correlation matrices C4, C5, and C6, three-dimensional data were generated, and the percentage of items per dimension was consistent with the original item-factor structure.

3.2.2 Generation of Item Response Patterns

The d-parameters were randomly generated from a normal distribution N(0, 1) for all 48 test items. The multidimensional ability distributions were generated from the standardized multidimensional normal distribution with the pre-selected inter-factor correlation matrices. Again, the sample size used in Study II was 2000. The procedures for generating item response patterns were the same as those described in Section 3.1.2. For each cell of the thirty-six combinations, 100 replications were performed, so a total of 3600 multidimensional data sets was produced.

3.2.3 Procedures and Computer Programs

The procedures for computing the RLR index were the same as those described in Section 3.1.3. In Study II, the test calibration started from the unidimensional model and continued to the five-dimensional model. For each level of dimensionality, the RLR index was computed to show the improvement in model-data fit.

3.2.4 Evaluation Criterion

Again, the statistical properties of the RLR index were explored and compared with those of the G² test and the G²diff test.
To test whether the data can be well fit by the unidimensional model, the unidimensional regression model generated in Study I was used in conjunction with the sample size, test length, and estimated unidimensional item parameters. If the observed RLR1 is smaller than the predicted lower bound, the null hypothesis (H0: d = 1) is rejected, indicating that a higher-dimensional model is needed. To test whether the null hypothesis (H0: d = 2) is true for a given data set, a two-dimensional regression model was constructed based on the two-dimensional data. Again, given that the model captures the true two-dimensional data, the regression model sets the 5% rejection area at the lower end of the predicted RLR2 distribution. If the observed RLR2 value is smaller than the predicted lower bound, the null hypothesis is rejected and the data should be modeled with a higher dimension. Using the same procedure, a three-dimensional regression model was constructed to test the null hypothesis (H0: d = 3) based on the three-dimensional data. If the observed RLR3 value is smaller than the predicted lower bound, then the data should be modeled with a higher dimension. It is expected that the number of false rejections should be lower than 5 in 100 replications when the regression model captures the true dimensionality. Conversely, when the wrong models are tested, the RLR index should generate a large number of rejections, indicating high statistical power.

3.3 Real Data Analysis

Along with the simulation studies, statewide test data from the Mathematics Test of the Michigan Educational Assessment Program (MEAP) testing program were analyzed. Under the No Child Left Behind (NCLB) Act of 2001, federal approval depends on strict alignment of state assessments with state content standards. Michigan's Mathematics Test, which was developed to match the mathematics content standards, was designed to measure what Michigan educators believe all students should learn and be able to achieve at each grade level (Michigan Department of Education, 2004). In this study, the test data from the Grade 4 Mathematics Test were used. The Mathematics Test contained 57 items covering content knowledge in data and probability, geometry, measurement, and numbers and operations. More precisely, students were asked to demonstrate their academic proficiency in (1) fluency with operations and estimation; (2) geometric shape, properties, and mathematical arguments; (3) meaning, notation, place value, and comparisons; (4) number relationships and meaning of operations; (5) problem solving involving measurement; (6) data representation; (7) spatial reasoning and geometric modeling; and (8) units and systems of measurement (Michigan Department of Education, 2006). Students who score high on the test have documented substantial achievement in mathematics at the Grade 4 level. Given the hierarchical ability structure in the blueprint of the Mathematics Test, it was suspected that the resulting test data might be explained by a multidimensional model.

Test data from 10000 examinees were requested from the testing program. The sample was then divided into five smaller data sets of 2000 examinees each by random selection. The MIRT model parameters for different levels of dimensionality were estimated using TESTFACT. For each level of dimensionality, the corresponding RLR index was computed to determine the increment in model-data fit.
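The stepwise decision rule used in Sections 3.1.4 and 3.2.4, and again for the real data below, can be written compactly. A sketch assuming the observed RLR values and the regression-predicted 5% lower bounds are already available for each level of dimensionality (Python, illustrative; the numbers shown are hypothetical):

    def decide_dimensionality(observed_rlr, lower_bounds):
        """Sequential test: starting at d = 1, stop at the first level k whose
        observed RLR_k is not below the predicted 5% lower bound of a good fit."""
        for k, (rlr_k, bound_k) in enumerate(zip(observed_rlr, lower_bounds), start=1):
            if rlr_k >= bound_k:          # H0: d = k is retained
                return k
        return len(observed_rlr) + 1      # every tested model was rejected

    # Hypothetical observed RLR_1..RLR_3 and their predicted lower bounds
    print(decide_dimensionality([0.981, 0.995, 0.996], [0.988, 0.990, 0.991]))   # -> 2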
To decide the dimensionality of the MEAP data, the regression models developed from the simulation studies were used to determine whether the observed RLR index showed a good fit between the model and data. If the observed RLR index fell in the 5% rejection area at the lower end, the null hypothesis was rejected, and the next higher-dimensional model was tested in turn. The significance testing started from the unidimensional model and stopped when the null hypothesis was not rejected. Instead of making judgments from a single test, the results from the different sample data sets provide a basis for cross-validation and offer a more dependable decision.

CHAPTER 4 RESULTS

Based on the research designs described in the previous chapter, the main results of the three studies are provided along with initial interpretations.

4.1 Simulation Study I (Unidimensional Data Sets)

The focus of Study I was to explore the effects of item discrimination (A), item difficulty (D), sample size (S), and test length (T) on the RLR index. However, when the unidimensional data were analyzed by multidimensional models, some of the TESTFACT analyses failed. When T was short (25 items), all TESTFACT runs were successful regardless of the levels of A, D, and S. When T was long (50 items), some tests generated a singular tetrachoric correlation matrix, causing a serious estimation problem in full-information factor analysis. Table 4.1.1 shows the number of unsuccessful cases out of 100 replications for the long-test data. Given that T was long, when D was high the probability of getting a singular tetrachoric correlation matrix was high, especially when S was small (2000). For these data sets, the rate of obtaining a singular tetrachoric correlation matrix increased with the number of factors in the estimation model. The highest rate of unsuccessful TESTFACT runs occurred when the unidimensional data were analyzed by the four-dimensional MIRT model.

Table 4.1.1. The number of unsuccessful TESTFACT runs for long tests in Study I

Sample size   Test       1 Factor   2 Factor   3 Factor   4 Factor
2000          Test 112   0          0          0          0
              Test 122   0          0          0          2
              Test 132   0          0          4          15
              Test 212   0          0          0          0
              Test 222   0          0          0          0
              Test 232   0          1          3          35
              Test 312   0          0          0          0
              Test 322   0          0          0          0
              Test 332   0          0          3          18
              Test 412   0          0          0          0
              Test 422   0          0          0          0
              Test 432   0          0          2          7
6000          Test 112   0          0          0          0
              Test 122   0          0          0          0
              Test 132   0          0          0          4
              Test 212   0          0          0          0
              Test 222   0          0          0          0
              Test 232   0          0          3          7
              Test 312   0          0          0          0
              Test 322   0          0          0          0
              Test 332   0          0          0          1
              Test 412   0          0          0          0
              Test 422   0          0          0          0
              Test 432   0          0          0          0
Note: The results for short tests are not listed because all TESTFACT runs were successful.

4.1.1 Results of the Summary Statistics

For the successful TESTFACT runs, no outliers were found in the preliminary analysis. Table 4.1.2 and Table 4.1.3 display the summary statistics of the RLR values in each condition. The changes in RLR values associated with dimensionality are plotted in Figure 4.1.1 to Figure 4.1.4. The conditional distributions of the RLR values are presented in Appendix B as a supplement to the summary statistics.

By and large, the SD of the RLR values in each condition was small. Given the same levels of S and T, the SD of the RLR values was small when A was high. Conditioned on A and D, the SD of the RLR values decreased when T was long or S was large. For most data sets, the SD of the RLR values for a higher-factor model was smaller than that for a lower-factor model. The decrease in the variation of the RLR values was more noticeable when A was low.
The RLR index for the unidimensional model was particularly sensitive to the item parameters. The increase of A was proportional to RLR1, but the increase of D was inversely proportional to RLR1. The effects of A and D on RLR1 were similar across the different combinations of S and T. When the RLR values were plotted against dimensionality, the lines indicated the change of the RLR values as a function of dimensionality. As shown in Figure 4.1.1 to Figure 4.1.4, the color of the lines denotes the different levels of A, and the shape of the lines represents the different levels of D. For the tests with A higher than 0.2, the RLR values were all centered at 1 and formed horizontal lines. The change in the RLR values was limited when more factors were added to the model. Since the increase in the RLR values due to adding factors to the model was trivial, this pattern implies that the unidimensional model was good enough to explain the test data. Conversely, for the tests with A equal to 0.2, the RLR values showed a noticeable increase associated with dimensionality, especially when D was large, S was small, and T was short. This pattern implies that higher-factor models fit those data better than the unidimensional model.

Table 4.1.2. Summary statistics of the RLR index for short tests (25 items)

                     2000 examinees                    6000 examinees
Test      RLR        Mean     SD      N     SE         Mean     SD      N     SE
Test 111  RLR1       0.8713   0.0224  100   0.0022     0.9506   0.0085  100   0.0008
          RLR2       0.9046   0.0156  100   0.0016     0.9614   0.0059  100   0.0006
          RLR3       0.9225   0.0110  100   0.0011     0.9679   0.0042  100   0.0004
Test 121  RLR1       0.8533   0.0254  100   0.0025     0.9429   0.0093  100   0.0009
          RLR2       0.8942   0.0171  100   0.0017     0.9542   0.0080  100   0.0008
          RLR3       0.9152   0.0143  100   0.0014     0.9639   0.0061  100   0.0006
Test 131  RLR1       0.8086   0.0356  100   0.0036     0.9245   0.0143  100   0.0014
          RLR2       0.8695   0.0231  100   0.0023     0.9398   0.0115  100   0.0011
          RLR3       0.8933   0.0182  100   0.0018     0.9508   0.0099  100   0.0010
Test 211  RLR1       0.9809   0.0034  100   0.0003     0.9935   0.0012  100   0.0001
          RLR2       0.9843   0.0024  100   0.0002     0.9947   0.0008  100   0.0001
          RLR3       0.9862   0.0020  100   0.0002     0.9954   0.0008  100   0.0001
Test 221  RLR1       0.9783   0.0039  100   0.0004     0.9925   0.0012  100   0.0001
          RLR2       0.9823   0.0029  100   0.0003     0.9940   0.0010  100   0.0001
          RLR3       0.9844   0.0023  100   0.0002     0.9949   0.0009  100   0.0001
Test 231  RLR1       0.9717   0.0050  100   0.0005     0.9901   0.0017  100   0.0002
          RLR2       0.9771   0.0042  100   0.0004     0.9921   0.0018  100   0.0002
          RLR3       0.9791   0.0039  100   0.0004     0.9930   0.0016  100   0.0002
Test 311  RLR1       0.9924   0.0011  100   0.0001     0.9975   0.0005  100   0.0000
          RLR2       0.9937   0.0009  100   0.0001     0.9979   0.0003  100   0.0000
          RLR3       0.9944   0.0009  100   0.0001     0.9984   0.0003  100   0.0000
Test 321  RLR1       0.9917   0.0012  100   0.0001     0.9972   0.0005  100   0.0000
          RLR2       0.9932   0.0009  100   0.0001     0.9977   0.0003  100   0.0000
          RLR3       0.9939   0.0011  100   0.0001     0.9982   0.0003  100   0.0000
Test 331  RLR1       0.9898   0.0018  100   0.0002     0.9966   0.0005  100   0.0001
          RLR2       0.9915   0.0014  100   0.0001     0.9971   0.0006  100   0.0001
          RLR3       0.9920   0.0017  100   0.0002     0.9975   0.0006  100   0.0001
Test 411  RLR1       0.9955   0.0007  100   0.0001     0.9984   0.0003  100   0.0000
          RLR2       0.9963   0.0006  100   0.0001     0.9990   0.0002  100   0.0000
          RLR3       0.9969   0.0006  100   0.0001     0.9994   0.0003  100   0.0000
Test 421  RLR1       0.9952   0.0008  100   0.0001     0.9984   0.0003  100   0.0000
          RLR2       0.9960   0.0006  100   0.0001     0.9988   0.0002  100   0.0000
          RLR3       0.9967   0.0008  100   0.0001     0.9993   0.0003  100   0.0000
Test 431  RLR1       0.9942   0.0010  100   0.0001     0.9981   0.0003  100   0.0000
          RLR2       0.9951   0.0008  100   0.0001     0.9984   0.0003  100   0.0000
          RLR3       0.9956   0.0009  100   0.0001     0.9989   0.0003  100   0.0000
Table 4.1.3. Summary statistics of the RLR index for long tests (50 items)

                     2000 examinees                    6000 examinees
Test      RLR        Mean     SD      N     SE         Mean     SD      N     SE
Test 112  RLR1       0.9096   0.0117  100   0.0012     0.9673   0.0039  100   0.0004
          RLR2       0.9257   0.0087  100   0.0009     0.9718   0.0027  100   0.0003
          RLR3       0.9353   0.0063  100   0.0006     0.9750   0.0024  100   0.0002
Test 122  RLR1       0.8982   0.0127  100   0.0013     0.9623   0.0044  100   0.0004
          RLR2       0.9177   0.0087  100   0.0009     0.9668   0.0037  100   0.0004
          RLR3       0.9270   0.0070  98    0.0007     0.9707   0.0027  100   0.0003
Test 132  RLR1       0.8766   0.0159  100   0.0016     0.9536   0.0051  100   0.0005
          RLR2       0.9004   0.0114  96    0.0012     0.9595   0.0044  100   0.0004
          RLR3       0.9133   0.0097  83    0.0011     0.9639   0.0046  96    0.0005
Test 212  RLR1       0.9844   0.0019  100   0.0002     0.9948   0.0006  100   0.0001
          RLR2       0.9867   0.0012  100   0.0001     0.9954   0.0004  100   0.0000
          RLR3       0.9871   0.0011  100   0.0001     0.9957   0.0004  100   0.0000
Test 222  RLR1       0.9827   0.0020  100   0.0002     0.9941   0.0007  100   0.0001
          RLR2       0.9848   0.0017  100   0.0002     0.9948   0.0006  100   0.0001
          RLR3       0.9857   0.0015  100   0.0001     0.9952   0.0005  100   0.0001
Test 232  RLR1       0.9793   0.0025  99    0.0003     0.9929   0.0009  100   0.0001
          RLR2       0.9817   0.0019  97    0.0002     0.9939   0.0007  97    0.0001
          RLR3       0.9828   0.0029  63    0.0004     0.9942   0.0007  90    0.0001
Test 312  RLR1       0.9931   0.0007  100   0.0001     0.9976   0.0003  100   0.0000
          RLR2       0.9942   0.0007  100   0.0001     0.9983   0.0002  100   0.0000
          RLR3       0.9943   0.0007  100   0.0001     0.9983   0.0003  100   0.0000
Test 322  RLR1       0.9925   0.0008  100   0.0001     0.9974   0.0003  100   0.0000
          RLR2       0.9936   0.0007  100   0.0001     0.9980   0.0003  100   0.0000
          RLR3       0.9936   0.0008  100   0.0001     0.9981   0.0004  100   0.0000
Test 332  RLR1       0.9914   0.0010  100   0.0001     0.9971   0.0003  100   0.0000
          RLR2       0.9925   0.0008  97    0.0001     0.9976   0.0003  100   0.0000
          RLR3       0.9934   0.0008  77    0.0001     0.9978   0.0004  99    0.0000
Test 412  RLR1       0.9946   0.0006  100   0.0001     0.9973   0.0003  100   0.0000
          RLR2       0.9963   0.0005  100   0.0001     0.9987   0.0002  100   0.0000
          RLR3       0.9968   0.0006  100   0.0001     0.9994   0.0003  100   0.0000
Test 422  RLR1       0.9945   0.0006  100   0.0001     0.9975   0.0002  100   0.0000
          RLR2       0.9959   0.0005  100   0.0001     0.9987   0.0002  100   0.0000
          RLR3       0.9965   0.0007  100   0.0001     0.9993   0.0002  100   0.0000
Test 432  RLR1       0.9942   0.0007  100   0.0001     0.9975   0.0003  100   0.0000
          RLR2       0.9954   0.0006  98    0.0001     0.9985   0.0002  100   0.0000
          RLR3       0.9959   0.0007  92    0.0001     0.9991   0.0003  100   0.0000

Figure 4.1.1. The change of RLR with dimensionality for a 25-item test and 2000 examinees (one line per simulated test, Test 111 through Test 431).

Figure 4.1.2. The change of RLR with dimensionality for a 25-item test and 6000 examinees.

Figure 4.1.3. The change of RLR with dimensionality for a 50-item test and 2000 examinees (one line per simulated test, Test 112 through Test 432).

Figure 4.1.4. The change of RLR with dimensionality for a 50-item test and 6000 examinees.
4.1.2 Results of Multivariate Analysis of Variance for Study I

To explore the influence of the manipulated factors on the RLR index, a multivariate analysis of variance (MANOVA) was conducted. The dependent variables in the MANOVA model were the RLR indices representing three levels of dimensionality (RLR1, RLR2, and RLR3), and the independent variables were A, D, S, and T. To test whether the overall multivariate difference was significant, Pillai's Trace was employed because it is more robust than the other statistics (Wilks' Λ, Hotelling's T², and Roy's greatest characteristic root) when assumptions are not met (Olson, 1976). As Table 4.1.4 shows, the main effects of A, D, S, and T and the interactions were all significant, so the hypothesis that there was no between-group difference was rejected. Several of these significant factors had substantial effect sizes, such as A (F(9, 13950) = 757.18, p < .01, η² = .328), D (F(6, 9298) = 274.68, p < .01, η² = .151), S (F(3, 4648) = 6230.61, p < .01, η² = .801), T (F(3, 4648) = 580.61, p < .01, η² = .273), A×D (F(18, 13950) = 124.11, p < .01, η² = .138), A×S (F(9, 13950) = 613.79, p < .01, η² = .284), and A×T (F(9, 13950) = 284.63, p < .01, η² = .155). These should be considered as having important effects on the RLR indices. The interactions D×S, D×T, S×T, A×D×S, A×D×T, A×S×T, D×S×T, and A×D×S×T were significant, but their effect sizes were small. Because the total number of simulated data sets was 4800, the significance of the interaction terms with small effect sizes may be due to the large sample size in the MANOVA. Even though these interactions were significant, they may not have an important influence on the dependent variables.

Table 4.1.4. The multivariate test for Study I

Effect        Value   F          Hypothesis df   Error df   η²
A             .985    757.18*    9               13950      .328
D             .301    274.68*    6               9298       .151
S             .801    6230.61*   3               4648       .801
T             .273    580.61*    3               4648       .273
A×D           .414    124.11*    18              13950      .138
A×S           .851    613.79*    9               13950      .284
A×T           .465    284.63*    9               13950      .155
D×S           .056    44.99*     6               9298       .028
D×T           .027    21.04*     6               9298       .013
S×T           .054    89.03*     3               4648       .054
A×D×S         .081    21.40*     18              13950      .027
A×D×T         .046    11.95*     18              13950      .015
A×S×T         .106    56.82*     9               13950      .035
D×S×T         .004    3.44*      6               9298       .002
A×D×S×T       .010    2.53*      18              13950      .003
* p < .01

Given that the overall difference was significant, univariate tests for each dependent variable were conducted. First, Levene's tests of equality of error variances were all significant (RLR1: F(47, 4751) = 128.803, p < .01; RLR2: F(47, 4737) = 133.710, p < .01; RLR3: F(47, 4650) = 129.233, p < .01), indicating that the variances in the different design groups were not homogeneous for each separate ANOVA test. However, Lindman (1974, p. 33) and Box (1954) reported that the F statistic is quite robust against violation of the homogeneity assumption. Since the assumption of equal variances was violated at the .01 level, special caution should be taken when interpreting the results of these separate ANOVA analyses.

Table 4.1.5 summarizes the ANOVA tests for RLR1, RLR2, and RLR3. The effect sizes of A, D, S, T, and the interactions were similar for RLR1, RLR2, and RLR3. Again, A, D, S, T, and the interactions A×D, A×S, and A×T can be considered as having important effects on RLR1, RLR2, and RLR3. A has the largest effect on all the RLR indices. Moreover, D, S, and T each had a smaller effect size than their two-way interaction with A. In RLR1, for example, D (η² = .199) < A×D (η² = .317); S (η² = .695) < A×S (η² = .781); T (η² = .250) < A×T (η² = .442).
These patterns indicate that A was the main variable influencing the RLR indices. To further explore the nature of the interactions, the simple effects are shown in Figure 4.1.5 to Figure 4.1.10.

Table 4.1.5. The univariate tests for Study I

                      RLR1                        RLR2                        RLR3
Source      df        MS      F          η²       MS      F          η²       MS      F          η²
A           3         2.023   27764.26*  .947     1.194   33861.11*  .956     .833    36971.23*  .960
D           2         .042    579.11*    .199     .022    629.51*    .213     .017    739.77*    .241
S           1         .773    10610.07*  .695     .420    11912.50*  .719     .313    13893.09*  .749
T           1         .113    1547.67*   .250     .036    1019.57*   .180     .013    575.13*    .110
A×D         6         .026    359.70*    .317     .012    339.41*    .305     .008    368.70*    .322
A×S         3         .402    5512.18*   .781     .193    5464.55*   .779     .130    5792.86*   .789
A×T         3         .089    1226.21*   .442     .026    739.64*    .323     .010    433.68*    .219
D×S         2         .008    109.81*    .045     .002    69.44*     .029     .002    81.34*     .034
D×T         2         .004    53.32*     .022     .001    25.14*     .011     .001    31.85*     .014
S×T         1         .019    265.19*    .054     .003    90.14*     .019     .001    43.42*     .009
A×D×S       6         .004    61.55*     .074     .001    26.73*     .033     .001    27.88*     .035
A×D×T       6         .002    32.82*     .041     .000    13.55*     .017     .000    13.74*     .017
A×S×T       3         .013    182.13*    .105     .002    53.85*     .034     .001    25.57*     .016
D×S×T       2         .001    7.98*      .003     .000    .09        .000     .000    1.35       .001
A×D×S×T     6         .000    5.00*      .006     .000    .05        .000     .000    .26        .000
Error       4652      .000                        .000                        .000
Total       4699
* p < .01

Figure 4.1.5. The interaction of A, D, and S in RLR1 for the 25-item tests (D = 0, 0.5, 1; solid lines: S = 2000; dotted lines: S = 6000).
Figure 4.1.6. The interaction of A, D, and S in RLR1 for the 50-item tests.
Figure 4.1.7. The interaction of A, D, and S in RLR2 for the 25-item tests.
Figure 4.1.8. The interaction of A, D, and S in RLR2 for the 50-item tests.
Figure 4.1.9. The interaction of A, D, and S in RLR3 for the 25-item tests.
Figure 4.1.10. The interaction of A, D, and S in RLR3 for the 50-item tests.

Conditioned on T, the patterns of the interactions of A, D, and S were similar across RLR1, RLR2, and RLR3. When the model fit the data, as shown in Figure 4.1.5 and Figure 4.1.6, D had a noticeably negative effect on RLR1 when A = 0.2. However, the effect of D varied depending on S: when S was 6000, the decrease in RLR1 due to the increase of D was small; when S was 2000, the drop in RLR1 due to the increase of D was large. When A = 0.4, the negative effect of D was not obvious, especially when S = 6000. When A = 0.6 or 0.8, the effect of D was minor, so only the effect of S could be identified. When the model over-fit the data, as shown in Figure 4.1.7 to Figure 4.1.10, it is clear that D had an effect on RLR2 and RLR3, but the effect varied depending on A. As long as A was greater than 0.2, the effect of D was minor. Moreover, S still had an effect on RLR2 and RLR3, but it varied depending on the level of A: the effect of S was large when A was small, but small when A was large.
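For reference, a multivariate analysis of this kind can be reproduced with standard statistical software. The sketch below uses Python's statsmodels and a small randomly generated placeholder data frame purely to make the call runnable; it is illustrative only, and the study itself ran the analysis in SPSS 12.0:

    import numpy as np
    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    rng = np.random.default_rng(1)
    n = 480                                  # placeholder rows, not the study's 4800 data sets
    df = pd.DataFrame({
        "A": rng.choice([0.2, 0.4, 0.6, 0.8], n),
        "D": rng.choice([0.0, 0.5, 1.0], n),
        "S": rng.choice([2000, 6000], n),
        "T": rng.choice([25, 50], n),
    })
    for dv in ("RLR1", "RLR2", "RLR3"):
        df[dv] = 0.90 + 0.10 * df["A"] + rng.normal(0, 0.01, n)

    manova = MANOVA.from_formula("RLR1 + RLR2 + RLR3 ~ C(A) * C(D) * C(S) * C(T)", data=df)
    print(manova.mv_test())                  # reports Pillai's trace for each effect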
4.1.3 Comparisons of the Numbers of Rejections

This part of the analysis compared the empirical Type I error rate of the RLR index with those of the G² test and the G²diff test under the different testing conditions. The nominal α used for the G² test and the G²diff test was .05. The results in Table 4.1.6 show that the G² test rejected the true model every time, regardless of the levels of A, D, S, and T. The results of the G²diff test were not satisfactory either. With a short T and a small S, the minimum number of rejections was 68 out of 100. Given the same levels of A, D, and T, a large S did not help to decrease the number of false rejections. When T was long, the minimum number of rejections was 98 out of 100, regardless of S. A large T tended to inflate the number of rejections more severely than a large S. These results are indicative of a severely inflated Type I error rate when the G² test and the G²diff test are used to determine whether test data are unidimensional.

The number of rejections for the RLR index was computed based on the linear regression technique. With the information on the estimated a-parameters (EA), estimated d-parameters (ED), sample size (S), and test length (T), the lower bound of a good fit for the unidimensional model can be predicted. With the item parameter estimates obtained in Study I, the unidimensional regression model (adjusted R² equal to .709) can be expressed as

RLR1 = 0.817509 + 0.000021(S) + 0.001251(T) - 0.020432(ED) + 0.050065(EA) + 0.000023(EA×S) - 0.001083(EA×T) + 0.067449(EA×ED) + 0.000000166(EA×S×T).    (41)

If the observed RLR1 was smaller than the lower bound, the null hypothesis H0: d = 1 was rejected. As shown in Table 4.1.6, when S = 2000 the numbers of false rejections were high for Test 111, Test 121, Test 131, and Test 132, indicating that the low level of A inflated the Type I error rate. Given that A = 0.2 and S = 2000, the number of false rejections increased with the increase of D. When A = 0.2 and S = 6000, all the numbers of false rejections were less than 5, regardless of the levels of D and T. Conversely, for the cases in which A was equal to or greater than 0.4, the numbers of rejections were low regardless of the levels of D, S, and T.

Comparing the numbers of false rejections for the three indices under the different testing conditions, the RLR index outperformed the G² test and the G²diff test. A large sample size and a long test both inflated the Type I error rates of the G² test and the G²diff test, but they helped to reduce the Type I error rates of the RLR index.

Table 4.1.6. The number of rejections in 100 replications for unidimensional data

                       2000 examinees               6000 examinees
Data sets              RLR    G²     G²diff         RLR    G²     G²diff
25-item tests
  Test 111             29     100    74             0      100    83
  Test 121             33     100    80             0      100    80
  Test 131             76     100    82             3      100    67
  Test 211             0      100    71             0      100    73
  Test 221             0      100    72             0      100    80
  Test 231             0      100    75             0      100    74
  Test 311             0      100    80             0      100    69
  Test 321             0      100    75             0      100    75
  Test 331             0      100    68             0      100    67
  Test 411             0      100    79             0      100    87
  Test 421             0      100    74             0      100    75
  Test 431             0      100    75             0      100    68
50-item tests
  Test 112             4      100    100            0      100    100
  Test 122             2      100    100            0      100    98
  Test 132             18     100    100            0      100    100
  Test 212             0      100    100            0      100    100
  Test 222             0      100    98             0      100    98
  Test 232             0      100    100            0      100    100
  Test 312             0      100    100            0      100    100
  Test 322             0      100    100            0      100    100
  Test 332             0      100    100            0      100    97
  Test 412             0      100    100            0      100    100
  Test 422             0      100    100            0      100    100
  Test 432             0      100    100            0      100    100
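Equation (41) is simple to evaluate for a given testing condition. A minimal sketch (Python, illustrative; the function name is not from the study, the example inputs are hypothetical, and the 5% lower bound used for the rejection decision is then derived from this prediction as described in Section 3.1.4):

    def predicted_rlr1(S, T, EA, ED):
        """Predicted RLR_1 from the Study I regression model, equation (41)."""
        return (0.817509 + 0.000021 * S + 0.001251 * T - 0.020432 * ED
                + 0.050065 * EA + 0.000023 * EA * S - 0.001083 * EA * T
                + 0.067449 * EA * ED + 0.000000166 * EA * S * T)

    # Hypothetical condition: 2000 examinees, 25 items, EA = 0.6, ED = 0.5
    print(round(predicted_rlr1(S=2000, T=25, EA=0.6, ED=0.5), 4))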
4.2 Simulation Study II (Multidimensional Data Sets)

The purpose of Study II was to investigate how well the RLR index determined the dimensionality of multidimensional data. Again, when the simulated data were analyzed by MIRT models with different numbers of dimensions, some of the TESTFACT runs failed because the data generated a singular tetrachoric correlation matrix. Table 4.2.1 shows the number of unsuccessful runs out of 100 replications for each condition. The two-dimensional data had higher rates of unsuccessful TESTFACT runs for the four-dimensional model than for the five-dimensional model, whereas the three-dimensional data had higher rates of unsuccessful TESTFACT runs for the five-dimensional model than for the four-dimensional model. Given the same levels of C and I, the rate of obtaining a singular tetrachoric correlation matrix was high when A was moderate. Conditioned on A and C, the third level of I (36:6:6) generated a singular tetrachoric correlation matrix at lower rates than the first level (12:12:24) and the second level (16:16:16) of I.

Table 4.2.1. The number of unsuccessful TESTFACT runs in Study II

Correlation matrix   Form        1 Factor   2 Factor   3 Factor   4 Factor   5 Factor
C1                   Form 111    0          0          0          32         4
                     Form 112    0          0          0          0          2
                     Form 121    0          0          0          21         8
                     Form 122    0          0          0          1          0
                     Form 131    0          0          0          3          1
                     Form 132    0          0          0          0          0
C2                   Form 211    0          0          0          29
                     Form 212    0          0          0          0          1
                     Form 221    0          0          0          24         12
                     Form 222    0          0          0          2          0
                     Form 231    0          0          0          6          2
                     Form 232    0          0          0          1          0
C3                   Form 311    0          0          0          29         9
                     Form 312    0          0          0          2          4
                     Form 321    0          0          0          30         16
                     Form 322    0          0          0          2          1
                     Form 331    0          0          0          11         4
                     Form 332    0          0          0          2          3
C4                   Form 411    0          0          0          0          17
                     Form 412    0          0          0          0          0
                     Form 421    0          0          0          0          17
                     Form 422    0          0          0          0          3
                     Form 431    0          0          0          0          3
                     Form 432    0          0          0          0          0
C5                   Form 511    0          0          0          0          24
                     Form 512    0          0          0          0          0
                     Form 521    0          0          0          0          19
                     Form 522    0          0          0          0          2
                     Form 531    0          0          0          0          3
                     Form 532    0          0          0          0          2
C6                   Form 611    0          0          0          0          20
                     Form 612    0          0          0          0          0
                     Form 621    0          0          0          0          17
                     Form 622    0          0          0          0          0
                     Form 631    0          0          0          0          1
                     Form 632    0          0          0          0          0

4.2.1 Results of the Summary Statistics

Table 4.2.2 and Table 4.2.3 tabulate the summary statistics of the RLR values for each combination.
A negative value of the GET/f statistic indicated that the discrepancy between the predicted frequency and the observed frequency for the lower-factor model is smaller than that of the higher-factor model. The discussion of the occurrence of the unexpected values for the RLR index and the G3,,” test was provided in Chapter 5. Again, the SD of the RLR values in each condition was small. Conditioned on A, C, and I, the SD of the RLR values was great when the model under-fit the data. Given the same levels ofC and I, RLR, was low when A was high. With the same levels ofC and A, RLR) was high when the dominant factor was strong. For the data generated with the correlation matrices C 1, C 2, and C3, the RLR values approached ] for the two-dimensional model, and did not obviously increase for the higher-factor models. For the data generated with the correlation matrices C 4. C 5, and C 6, the RLR values approached l for the three-dimensional model, and did not increase for the four-dimensional model. In general. the patterns of the RLR values reflected the simulated dimensionality. 8] Table 4.2.2. Summary statistics of the RLR index for two-dimensional data sets Form RLR Form 1 l l RLR] RLR2 RLR3 RLR: Form 121 RLR. RLR2 RLR3 RLR4 Form 13] RLRI RLR2 RLR3 RLR: Form 21 l RLR. RLR2 RLR3 RLR4 Form 22] RLR] RLR2 RLR3 RLR4 Form 231 RLR; RLR2 RLR3 RLR4 Form 31 l RLR] RLR2 RLR3 RLR: Form 32] RLR; RLR2 RLR3 RLR4 Form 331 RLR1 RLR2 RLR3 RLR4 Descriptive statistics Descriptive statistics Test Mean SD N SE Mean SD N SE Form 112 0.8904 0.0073 100 0.0007 0.8380 0.0079 100 0.0008 0.9951 0.001] 100 0.0001 1.0003 0.0009 100 0.0001 0.9954 0.0015 68 0.0002 0.9990 0.0013 100 0.000] 0.9930 0.0012 66 0.0001 0.9924 0.0012 99 0.0001 Form 122 0.8954 0.0077 100 0.0008 0.8727 0.006] 100 0.0006 0.9940 0.0018 100 0.0002 0.9990 0.0008 100 0.000] 0.9938 0.0019 76 0.0002 0.9984 0.0012 99 0.000] 0.9926 0.0015 74 0.0002 0.9932 0.0012 99 0.0001 Form 132 0.9725 0.0032 100 0.0003 0.9547 0.0027 100 0.0003 0.9933 0.0012 100 0.000] 0.9980 0.0007 100 0.0001 0.9935 0.0013 97 0.0001 0.9990 0.0010 100 0.000] 0.9939 0.0010 96 0.000] 0.9950 0.0010 100 0.0001 Form 212 0.7276 0.0122 100 0.0012 0.6453 0.0119 100 0.0012 0.9959 0.0015 100 0.0001 1.0015 0.0016 100 0.0002 0.9955 0.0018 71 0.0002 0.9993 0.0021 100 0.0002 0.992] 0.0017 67 0.0002 0.9904 0.0013 99 0.0001 Form 222 0.7305 0.0136 100 0.0014 0.7357 0.0092 100 0.0009 0.9944 0.0016 100 0.0002 1.0000 0.0014 100 0.0001 0.9941 0.0018 76 0.0002 0.9989 0.0016 98 0.0002 0.9914 0.0017 69 0.0002 0.9915 0.0013 98 0.0001 Form 232 0.9413 0.0046 100 0.0005 0.9170 0.0039 100 0.0004 0.9930 0.0016 100 0.0002 0.9982 0.0008 l00 0.0001 0.9940 0.0018 94 0.0002 0.9997 0.0010 99 0.0001 0.9931 0.0011 92 0.0001 0.9938 0.0011 99 0.0001 Form 312 0.6049 0.0118 100 0.0012 0.5011 0.013] 100 0.0013 0.9952 0.0013 100 0.0001 l.00|2 0.0012 100 0.0001 0.9963 0.0015 71 0.0002 1.0001 0.0017 98 0.0002 0.9919 0.0016 65 0.0002 0.9893 0.0015 95 0.0002 Form 322 0.6025 0.0121 100 0.0012 0.6586 0.0088 100 0.0009 0.9936 0.0015 100 0.0002 0.9988 0.0012 100 0.0001 0.9952 0.0017 70 0.0002 0.9999 0.0013 98 0.000] 0.9910 0.0017 58 0.0002 0.9905 0.0013 97 0.0001 Form 332 0.9240 0.0055 100 0.0006 0.8988 0.0040 100 0.0004 0.9930 0.0013 100 0.0001 0.9978 0.0013 100 0.0001 0.9943 0.0015 89 0.0002 0.9996 0.0012 98 0.0001 0.9924 0.0012 85 0.000] 0.9932 0.0012 95 0.000] Table 4.2.3. Summary statistics of the RLR index for three-dimensional data sets Form RLR Form 4] l RLR1 RLR2 RLR3 RLRa Form 42] RLRI RLR2 RLR3 RLRa Form 43] RLR. 
Table 4.2.3. Summary statistics of the RLR index for three-dimensional data sets
(left block of columns: first form of each pair; right block: second form)

Forms     RLR      Mean      SD      N      SE        Mean      SD      N      SE
411/412   RLR1    0.8382   0.0101   100   0.0010     0.8050   0.0094   100   0.0009
          RLR2    0.9574   0.0050   100   0.0005     0.9144   0.0060   100   0.0006
          RLR3    0.9961   0.0016   100   0.0002     0.9990   0.0018   100   0.0002
          RLR4    0.9904   0.0017    83   0.0002     0.9853   0.0020   100   0.0002
421/422   RLR1    0.8177   0.0124   100   0.0012     0.7542   0.0116   100   0.0012
          RLR2    0.9092   0.0081   100   0.0008     0.8864   0.0082   100   0.0008
          RLR3    0.9959   0.0025   100   0.0003     0.9994   0.0021   100   0.0002
          RLR4    0.9888   0.0018    83   0.0002     0.9848   0.0021    97   0.0002
431/432   RLR1    0.9558   0.0050   100   0.0005     0.9252   0.0045   100   0.0005
          RLR2    0.9676   0.0042   100   0.0004     0.9512   0.0037   100   0.0004
          RLR3    0.9932   0.0015   100   0.0002     0.9987   0.0014   100   0.0001
          RLR4    0.9919   0.0014    97   0.0001     0.9900   0.0014   100   0.0001
511/512   RLR1    0.7540   0.0124   100   0.0012     0.6690   0.0123   100   0.0012
          RLR2    0.9435   0.0062   100   0.0006     0.9106   0.0065   100   0.0006
          RLR3    0.9962   0.0022   100   0.0002     0.9978   0.0025   100   0.0002
          RLR4    0.9883   0.0019    76   0.0002     0.9818   0.0019   100   0.0002
521/522   RLR1    0.6366   0.0186   100   0.0019     0.6318   0.0151   100   0.0015
          RLR2    0.8902   0.0081   100   0.0008     0.8456   0.0086   100   0.0009
          RLR3    0.9956   0.0032   100   0.0003     0.9981   0.0023   100   0.0002
          RLR4    0.9870   0.0022    81   0.0002     0.9823   0.0019    98   0.0002
531/532   RLR1    0.9138   0.0065   100   0.0007     0.8838   0.0058   100   0.0006
          RLR2    0.9698   0.0041   100   0.0004     0.9505   0.0045   100   0.0005
          RLR3    0.9932   0.0018   100   0.0002     0.9985   0.0023   100   0.0002
          RLR4    0.9913   0.0015    97   0.0002     0.9876   0.0019    98   0.0002
611/612   RLR1    0.7681   0.0102   100   0.0010     0.7041   0.0098   100   0.0010
          RLR2    0.8838   0.0077   100   0.0008     0.8144   0.0071   100   0.0007
          RLR3    0.9957   0.0034   100   0.0003     0.9946   0.0043   100   0.0004
          RLR4    0.9847   0.0023    80   0.0003     0.9778   0.0019   100   0.0002
621/622   RLR1    0.5915   0.0154   100   0.0015     0.4847   0.0218   100   0.0022
          RLR2    0.7398   0.0120   100   0.0012     0.6688   0.0222   100   0.0022
          RLR3    0.9934   0.0051   100   0.0005     0.9929   0.0036   100   0.0004
          RLR4    0.9845   0.0025    83   0.0003     0.9783   0.0031   100   0.0003
631/632   RLR1    0.9126   0.0088   100   0.0009     0.8865   0.0065   100   0.0007
          RLR2    0.9450   0.0082   100   0.0008     0.9121   0.0060   100   0.0006
          RLR3    0.9948   0.0023    99   0.0002     1.0002   0.0025   100   0.0002
          RLR4    0.9887   0.0023    99   0.0002     0.9838   0.0018   100   0.0002

Figure 4.2.1. The change of RLR with dimensionality for the correlation matrix C1
Figure 4.2.2. The change of RLR with dimensionality for the correlation matrix C2
Figure 4.2.3. The change of RLR with dimensionality for the correlation matrix C3
Figure 4.2.4. The change of RLR with dimensionality for the correlation matrix C4
Figure 4.2.5. The change of RLR with dimensionality for the correlation matrix C5
Figure 4.2.6. The change of RLR with dimensionality for the correlation matrix C6
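As a small clarification of how the tabled values relate to one another, the sketch below reproduces the Mean, SD, N, and SE columns for a single hypothetical condition, assuming that SE is the usual SD/sqrt(N) computed over the N replications with successful TESTFACT runs. That reading matches the tabled values but is an assumption on my part, and the RLR values in the example are invented.

```python
# Illustrative computation of the Mean, SD, N, and SE columns in Tables 4.2.2
# and 4.2.3 for one condition, assuming SE = SD / sqrt(N) over the replications
# with successful TESTFACT runs. The RLR values below are invented.
import numpy as np

rlr = np.array([0.9951, 0.9946, 0.9958, 0.9949, 0.9953])   # hypothetical replications
n = rlr.size
mean = rlr.mean()
sd = rlr.std(ddof=1)                                        # sample standard deviation
se = sd / np.sqrt(n)
print(f"Mean = {mean:.4f}, SD = {sd:.4f}, N = {n}, SE = {se:.4f}")
```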
4.2.2 Results of Multivariate Analysis of Variance for Study II

Again, a MANOVA analysis was conducted to explore the influence of the manipulated factors on the RLR index. The dependent variables in the MANOVA model were the RLR indices representing four levels of dimensionality (RLR1, RLR2, RLR3, and RLR4), and the independent variables were A, C, and I. Again, Pillai's Trace was employed to test the overall multivariate difference because of its robustness to violation of the assumption of homogeneity of variance.

Table 4.2.4. The multivariate test for Study II

Effect       Value          F       Hypothesis df   Error df     η2
A            0.870     5310.849*          4            3182     0.870
C            2.079      689.733*         20           12740     0.520
I            1.799     7116.803*          8            6366     0.899
A × C                     47.435*                               0.124
A × I                     46.238*                               0.127
C × I                     11.962*                               0.034
A × C × I                  2.926*                               0.009
Error                                                 3564
Total                                                 3600
* p < .01

Based on the results shown in Table 4.2.5, all the main effects and interactions were significant, but their effects on RLR1, RLR2, RLR3, and RLR4 differed. The effect size of A decreased from RLR1 to RLR4. However, C and I had large effect sizes for RLR1 and RLR2, a small effect size for RLR4, and the smallest effect size for RLR3. Concerning the interaction A × ...

Unexpected values of the RLR index (RLR > 1) were found in the multidimensional simulation. These unexpected values occurred when the estimation model recovered the true dimensionality or over-fit the data. Theoretically, the RLR index should not be greater than 1, because the SSE should not increase when more factors are added to the model. However, whenever the RLR index exceeded 1, the corresponding G2diff test produced a negative value, which is not reasonable for a chi-square-distributed statistic. The exact cause of these unexpected values is not clear, but a possible explanation can be offered.

The R2 for the OLS model has the property of never decreasing when more predictors are added to the model. This is not always the case for the MIRT model, where both the a-parameters and the ability parameters must be estimated simultaneously. Adding one more factor to the MIRT model increases the degrees of freedom, but it also requires m + n - 2 additional parameters to be estimated (n is the number of items and m is the number of examinees). It is possible that, when the model-data fit is already close to perfect, adding more factors to the MIRT model increases fit but at the same time generates larger estimation errors. When the model over-fits the data, the increase in fit from adding more factors may not compensate for the increase in estimation error.

By definition, the RLR index is the ratio of the log transformation of the unexplained percentage of variance for the k-dimensional model to that for the (k+1)-dimensional model. When the unexplained percentage of variance for the k-dimensional model is smaller than that for the (k+1)-dimensional model, the value of the RLR index becomes greater than one. The same rationale explains the negative values of the G2diff test. The G2 test in equation (13) is a discrepancy function based on the ratio of the likelihood under the fitted model to the likelihood of the empirical frequencies. The G2diff test, as shown in equation (14), compares the discrepancies between model and data for a lower-factor model and the successive higher-factor model. The formula assumes that the discrepancy between model and data for the lower-factor model is always greater than that for the higher-factor model. In this study, however, the results showed that this assumption does not always hold.
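For readability, one consistent reading of the quantities discussed in this section is restated below. The formal definitions are given by equations (13) and (14) and the R2 analog in Chapter 2, so this should be taken as a sketch of their form rather than a verbatim reproduction: R2_k denotes the R2 analog for the k-dimensional model, r_l the observed frequency of response pattern l, P_l the model-based probability of that pattern, and N the number of examinees.

$$
\mathrm{RLR}_k=\frac{\ln\left(1-R^2_k\right)}{\ln\left(1-R^2_{k+1}\right)},\qquad
G^2=2\sum_{l} r_l\,\ln\frac{r_l}{N\hat{P}_l},\qquad
G^2_{\mathrm{diff}}=G^2_{(k)}-G^2_{(k+1)}.
$$

Written this way, RLR_k exceeds 1 exactly when the k-dimensional model leaves a smaller unexplained proportion of variance than the (k+1)-dimensional model, and the same circumstance makes G2diff negative, which is the pattern reported above.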
When the model already fits the data well, over-fitting the data by adding one more factor may increase the discrepancy between the model and the data and thus generate a negative value for the G2diff statistic. Because both the RLR index and the G2diff test compare the fit of two successive MIRT models, the over-fitting problem occurs when the lower-factor model already fits well and the higher-factor model fits relatively poorly. Thus, when the over-fitting problem arises, the RLR index may exceed 1 and, at the same time, the G2diff statistic may be negative.

The Patterns of the RLR Values and Dimensionality

From the results of the unidimensional simulation, RLR1 reached .99 when the a-parameters were higher than 0.2. In such cases, because all RLR1 values were close to the upper bound, adding more dimensions to the model increased the values of RLR2 and RLR3 only at the third decimal place. Conversely, for the tests with a-parameters equal to 0.2, adding factors to the model did obviously increase the RLR values.

The simulation of two-dimensional data based on the three-dimensional inter-factor correlations and the three-dimensional item parameters was successful. The patterns of the RLR values for the multidimensional data sets, shown in Figures 4.2.1 to 4.2.6, were as expected. For the two-dimensional data, the values of RLR1 were small, but the values of RLR2 approached 1. When more factors were added to the model, the values of RLR3 and RLR4 remained close to 1. For the three-dimensional data, the values of RLR1 were small; when a second factor was added to the model, the values of RLR2 increased, but not to the level of a good fit. For the three-dimensional solution, all the values of RLR3 approached 1, suggesting a good fit. When the model over-fit the data, the values of RLR4 were still close to 1, although sometimes less than the values of RLR3.

Based on the results of the unidimensional and multidimensional simulation studies, it is clear that the change of the RLR values with dimensionality reflected the simulated dimensionality underlying the data. Once the RLR index stops increasing, the minimum number of statistical dimensions can be specified; a small sketch of this decision rule follows.
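A minimal Python sketch of that decision rule is given below. The .99 cutoff is purely illustrative (the study reads the pattern of the RLR values rather than applying a single published cutoff), and the function name is mine.

```python
# Sketch of the dimensionality decision described above: report the smallest k for
# which RLR_k signals essentially no further error reduction from adding dimension
# k+1. The 0.99 cutoff is illustrative only, not a value proposed by the study.
def suggested_dimensionality(rlr_values, cutoff=0.99):
    """rlr_values[k-1] holds RLR_k, which compares the k- and (k+1)-dimensional models."""
    for k, value in enumerate(rlr_values, start=1):
        if value >= cutoff:
            return k
    return len(rlr_values) + 1   # no comparison leveled off; more dimensions may be needed

# Values shaped like a two-dimensional condition in Table 4.2.2 (illustrative only)
print(suggested_dimensionality([0.73, 0.995, 0.996, 0.992]))   # prints 2
```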
The Variables Associated with the RLR Index and Dimensionality

The results of the MANOVA analysis in Study I showed that item discrimination, item difficulty, sample size, and test length collectively had an effect on the RLR index. Sample size affected the RLR index, but the effect varied with the level of item discrimination. A large sample size helped reduce sampling variation and offered better estimates of the model parameters, especially when item discrimination was low. With a larger sample size the RLR index became more stable; that is, when item discrimination was low, the problem of falsely rejecting true unidimensionality was circumvented. The effect of item difficulty also depended on the level of item discrimination; as long as item discrimination was greater than 0.2, the effect of item difficulty was minor.

The results of the MANOVA analysis in Study II indicated that inter-factor correlation, item-factor structure, and item discrimination together influenced the RLR index. Because the interactions were significant and some had effect sizes of substantive magnitude, the simple effects rather than the main effects should be discussed. Given the same level of inter-factor correlation and item-factor structure, high item discrimination increased the change of RLR associated with dimensionality when the model under-fit the data; thus, the judgment of dimensionality based on the RLR index is easier when item discrimination is high. Given the same level of inter-factor correlation and item discrimination, the change of RLR with dimensionality was greatest when the items were evenly sensitive to the factors. In other words, when there was no clear dominant factor in the data, the change of RLR with dimensionality was obvious. On the contrary, when the data had a strong dominant factor and some weak minor factors to which only a small number of items were sensitive, the change of RLR with dimensionality became small, which increased the difficulty of identifying the minor factors. However, when the model fit the data, the effects of item discrimination, inter-factor correlation, and item-factor structure became minor.

The RLR Index and the Magnitude of the Dominant Factor

In factor analysis, the dominant factor is always identified first by the factor-analytic model; minor factors are then extracted in order of the amount of variance they explain, and the first extracted factor always explains more variance than the subsequent factors. The R2 technique is designed to represent the percentage of explained variance in the data. In the MIRT model, the R2 analog for the unidimensional model shows the percentage of variance explained by that model, and the R2 analog for the two-dimensional model reflects the percentage of variance explained when a second dimension is included. Based on the equivalence between the MIRT model and the factor-analytic model, RLR1 can therefore be used to show the relative size of the dominant factor in contrast to the second factor.

Based on the results from the unidimensional simulation, it is clear that the magnitude of RLR1 was related to the size of item discrimination. RLR1 reached .99 when item discrimination was 0.4 or higher. Even when item discrimination was as low as 0.2 with a short test and a small sample size, the minimum value of RLR1 was .80. For unidimensional data with higher item discrimination, the dominant factor explained more variance in the data and thus could be identified more easily by the statistical model.

The determination of the size of the dominant factor is more complex in the multidimensional data. When inter-factor correlation and item discrimination were held constant, RLR1 increased with the number of items sensitive to the dominant factor. The two-dimensional data, the data related to the correlation matrices C1, C2, and C3, were generated from a three-dimensional inter-factor correlation matrix and item-factor structure by combining the first two groups of items into one larger item cluster. Thus, the first level of the item-factor structure (12:12:24) generated the weakest dominant factor, to which 50% of the items in a test were sensitive. The second level of the item-factor structure (16:16:16) produced a dominant factor sensitive to 67% of the items in a test. With 88% of the items sensitive to one factor, the third level of the item-factor structure (36:6:6) generated the largest dominant factor and at the same time had the greatest value of RLR1.
With regard to the three-dimensional data, the data sets related to C4, C5, and C6, the percentage of items related to one factor was consistent with the level of the item-factor structure. For the second level of the item-factor structure (16:16:16), each of the three dimensions had 33% of the items; not surprisingly, this level generated lower RLR1 than the first level (12:12:24) and the third level (36:6:6). With 75% of the items related to the main factor, the third level of the item-factor structure had the largest dominant factor and generated the greatest value of RLR1.

Given the same level of item-factor structure and item discrimination, RLR1 decreased as the inter-factor correlations decreased. In factor analysis, when the factors are completely independent, the dominant factor tends to explain less variance than when the factors are correlated. Thus, it is not surprising that, with the level of item-factor structure and item discrimination held constant, C3 generated the lowest value of RLR1 in the two-dimensional data and C6 generated the lowest value of RLR1 in the three-dimensional data.

In short, RLR1 in the multidimensional data reflected the size of the dominant factor. A low RLR1 suggested that the items were more evenly distributed across factors and that the factors tended to be independent of each other. Correspondingly, a lower value of RLR1 also implied that the data were less likely to be unidimensional.

The Statistical Characteristics of the RLR Index, G2 Test, and G2diff Test

The results of the G2 test and the G2diff test indicated that these statistics could not accurately identify dimensionality. Even though they demonstrated high statistical power in rejecting wrong models, they tended to reject right models with high Type I error rates. These findings are consistent with earlier studies (Berger & Knol, 1990; De Champlain & Gessaroli, 1998; DeMars, 2003; McDonald, 1989b) concluding that these G2 tests should not be used to assess the dimensionality of test data.

On the contrary, the RLR index demonstrated low Type I error rates and high statistical power for most data sets. In the unidimensional simulation, the RLR index generated low Type I error rates except for the extreme cases in which item discrimination was 0.2 and sample size was 2000. When item discrimination is low and sample size is limited, the test data are close to random data, so the signal in the data is barely noticeable; accordingly, it is reasonable that the RLR index cannot function well for such data. From a practical test-development perspective, a test composed of such items can be considered useless because the items do not discriminate among examinees' abilities. Such poor tests are unlikely to be developed under real testing conditions, so the failure of the RLR index to detect the true unidimensionality for these data should not be an issue in practice. It can be concluded that the RLR index demonstrated low Type I error rates for common tests; only when the data were close to random did the index tend to falsely reject the true unidimensional model.

With regard to the multidimensional data, the RLR index performed well in rejecting the wrong unidimensional model except for the two-dimensional data having two highly correlated factors, a strong dominant factor, and moderate item discrimination.
For this kind of test data, the RLR index cannot detect the weak second factor and tends to underestimate the data dimensionality. Apart from this special case, the RLR index had high statistical power and low Type I error rates. The results of the simulation studies indicated that the RLR index outperforms the G2 test and the G2diff test in detecting the true dimensionality.

Real Data Analysis

The RLR indices for the five random samples consistently indicated that the Grade 4 Mathematics Test data from the MEAP testing program can be modeled unidimensionally. As described earlier, this test was designed to measure different ability domains and skills in mathematics at the grade-4 level. The results based on the RLR index suggested that these content domains may be described under the umbrella of a general factor that might be called "basic mathematics skills." The unidimensional finding is supplemented with discussions in terms of the test item content, the representativeness of the content-related dimensions, the definition of dimensionality, and the assumptions of the compensatory MIRT model.

The mathematics knowledge taught in grade 4 consists of basic mathematics concepts and skills, and the differences among the content knowledge and skills may not be as great as the test developers expected. For example, if students can do multiplication, they need the prerequisite knowledge of addition. When responding to fraction questions, students have to think about how fractions are related to a unit whole, compare fractional parts of a whole, and find equivalent fractions to give a correct response; the processes for answering these questions are in fact related to counting and addition. As a whole, the items in the Grade 4 Mathematics Test may cover several distinct content domains, but these content-related abilities may indeed be highly correlated with each other. As shown in the second simulation study, when two of the three factors are highly correlated, the dimensions converge, so that a two-dimensional model can explain the truly three-dimensional data well. When the content-related abilities are highly correlated, similar to the multicollinearity problem in multiple regression, it is difficult to identify the net contribution of the minor factors when the dominant factor already accounts for most of it.

In addition, how well the minor factors were measured by the Mathematics Test is another important issue. The Mathematics Test contained 57 items: 6 items for data and probability, 6 items for geometry, 18 items for measurement, and 27 items for numbers and operations. For the 6 items in data and probability, the mean item discrimination is only 0.5347; for the 6 items in geometry, the mean item discrimination is 0.5413. Given that the content-related abilities are highly correlated, weak dimensions represented by only 6 moderately discriminating items are not easily identified by a mathematical model.

Another explanation for the findings from the real data analysis goes back to the definition of dimensionality. There appears to be a common misconception that a set of items on a test measures a distinct number of dimensions regardless of the characteristics of the examinees taking the test (R. L. Turner et al., 1996). However, statistical dimensionality is a characteristic of the data matrix, not of the test or the examinee population (Reckase, 1990). Researchers (Ackerman, 1994; Reckase, 1997a; R. L.
Turner et al., 1996) have pointed out that dimensionality is a function of both the skills being measured by the items and the multivariate ability distributions of the examinees. The dimensional structure of the data from a test could therefore differ across subgroups of an examinee population. Ackerman (1994) indicated that if the items collectively are capable of distinguishing between levels of several skills, and examinees differ in their proficiency on more than one of these skills, the interaction needs to be described by a multidimensional model. Based on this rationale, the findings for the Grade 4 MEAP Mathematics Test data may indicate that the test items indeed covered several distinct content domains and should be described by more than one content-related ability, but that the target examinees, i.e., the grade-4 students in the state of Michigan, were heterogeneous with respect to the main content-related ability and homogeneous with respect to the minor content-related abilities. When the variation of examinees' proficiencies on the minor content-related abilities is limited, it is difficult for a mathematical model to capture those dimensions.

Another possible explanation can be offered based on the assumption of the compensatory logistic MIRT model. This model assumes that abilities can be linearly combined and compensated. It is possible that the content-related dimensions for the Mathematics Test data are multidimensional, but that the items were sensitive to the same combination of the content-related dimensions; consequently, the statistical dimension needed for the model to describe the item-person interaction was one. Given the unclear nature of the ability structure in the mathematics test data, it is uncertain whether the unidimensional model would still fit the data well if a different model, such as a partially compensatory model, were used to analyze the same data. To conclude these possible explanations for the real data analysis, one statistical dimension was sufficient to explain the MEAP Grade 4 Mathematics Test data when the compensatory logistic MIRT model was used.

5.3 Conclusion

Based on the findings in the simulation studies and the real data analysis, the RLR index is a promising goodness-of-fit index for the MIRT model. The dimensionality index varied in accuracy as a function of sample size and could more accurately identify unidimensionality as the number of items increased. The RLR index demonstrated low Type I error rates except for tests composed of poor items with item discrimination values of 0.2 combined with a short test and a small sample size. The RLR index also showed high statistical power in rejecting wrong models except for the two-dimensional data with highly correlated factors, moderate item discrimination, and one weak minor factor.

The change of the RLR index with dimensionality reflected the decrease of error in the data when factors were added to the model. Moreover, the RLR index for the initial unidimensional model reflected the size of the dominant factor: when that value was low, the data had a weak dominant factor and were less likely to be unidimensional. Based on the RLR index, the Grade 4 Mathematics Test data from the MEAP testing program can be well explained by the unidimensional model.
Even though the test was developed by selecting items representing different knowledge domains and skills, one statistical dimension is enough to explain the interaction between items and examinees.

5.4 Limitations, Implications, and Suggestions for Future Research

The purpose of this study was to offer an index that can be used as a rule of thumb in selecting the most appropriate dimensionality for the MIRT model to explain test data. Instead of relying on subjective judgments, the proposed index provides objective and useful information for deciding dimensionality based on the compensatory logistic MIRT model. Once the dimensionality is identified, the dimensional structure can be explored further to identify the relationships between dimensions, and validity studies (to identify what domains or dimensions are measured) can proceed to provide evidence supporting hypothesized multidimensionality and to identify construct-irrelevant variance.

It is important to emphasize that these findings are preliminary, and caution should be taken when interpreting and generalizing the results to other conditions. It is therefore important to highlight the limitations associated with this investigation and to offer suggestions for future research on assessing MIRT goodness-of-fit.

First, as introduced in Chapter 2, the parametric MIRT models provide full dimensionality estimation, specifying the number of dimensions and which item measures which dimension, but these benefits all rest on specific assumptions about the item responses. Tate (2002) pointed out that any mathematical model with a limited number of parameters provides a relatively efficient summary of data, but it also brings in the strong assumption that the phenomenon of interest can be accurately explained by the assumed model. Based on this rationale, data dimensionality can be determined by the model-data-fit procedure only when the proposed model is appropriate. Since the RLR index was derived from the logistic compensatory MIRT model (Reckase, 1985; Reckase & McKinley, 1991), the index can work well only when that model is the appropriate model for the data.

The logistic compensatory MIRT model used in this study is only one of the MIRT models proposed in the literature. It explicitly assumes that abilities can be linearly combined, so that a high level on one ability can compensate for a low level on a second ability. For real test data, however, it is unclear whether abilities can be linearly combined or compensated. Sympson's model, for example, assumes that the ability structure underlying the test data is partially compensatory (cited in Reckase & McKinley, 1982); a correct item response requires examinees to demonstrate high abilities on all dimensions. If the underlying dimensional structure in the data differs from the model assumption, using the model to explain the data may not generate a good fit unless an extremely high-dimensional model is used. As explained by Tate (2002), attempting to fit a partially compensatory function with a compensatory model is similar to the unwise attempt to use an additive regression model to represent an interactive relationship. However, the robustness of the compensatory MIRT models to violation of the assumption of ability compensation is still unclear. It is worth noting that the MIRT model used in this study is only one option for describing test data; the two functional forms are contrasted below.
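To make the contrast concrete, the two kinds of models discussed here are commonly written in the following forms. This is a standard textbook presentation rather than a reproduction of the equations in Chapter 2, and the guessing parameter and any scaling constant are omitted.

$$
P(U_{ij}=1\mid\boldsymbol{\theta}_j)=\frac{\exp(\mathbf{a}_i'\boldsymbol{\theta}_j+d_i)}{1+\exp(\mathbf{a}_i'\boldsymbol{\theta}_j+d_i)}
\qquad\text{(compensatory)}
$$

$$
P(U_{ij}=1\mid\boldsymbol{\theta}_j)=\prod_{k=1}^{m}\frac{\exp\big(a_{ik}(\theta_{jk}-b_{ik})\big)}{1+\exp\big(a_{ik}(\theta_{jk}-b_{ik})\big)}
\qquad\text{(partially compensatory)}
$$

In the compensatory form, a high value on one ability can offset a low value on another through the linear combination of the abilities, whereas in the partially compensatory form the product keeps the response probability low whenever any single ability is low, which is the sense in which a correct response requires adequate standing on every dimension.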
If the inherent ability dimensions in the data do not match the model assumption, using the compensatory MIRT model to describe the data may result in essential misfit, and consequently the statistical power of the RLR index would be limited.

Second, since the RLR index compares the residuals of two successive MIRT models, the degrees of freedom for the RLR index need further investigation. In the OLS model, R2 is not an unbiased estimate of the corresponding population parameter, and the degree of bias depends on the relative size of the number of observations (N) and the number of parameters (P) (Howell, 2001, p. 546). In the OLS regression model, the number of parameters is usually independent of the number of observations, and R2 becomes perfect (R2 = 1) when N = P + 1 regardless of the true relationship between the dependent variable and the predictors in the population. For the MIRT models, however, the total number of parameters to be estimated is always large, and as the number of examinees increases, the number of parameters increases proportionally. For example, in a unidimensional MIRT model, if 2000 examinees take a test that has 40 items, the total number of parameters to be estimated is 2078 (2000 + 2 x 40 - 2). When a second dimension is added to the MIRT model, there are 2038 (2000 + 40 - 2) more parameters to be estimated for the same data set. It is uncertain how the R2 analog for the MIRT model reacts to this huge number of degrees of freedom, and it is also unclear how the RLR index reflects the potential inflation problem for the R2 analogs of two successive models. Even though the current findings are positive, succeeding research should examine the degrees of freedom of the RLR index and the possible inflation problem. A small sketch of this parameter counting follows.
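Below is a minimal sketch of the parameter counting used in the example above, under the bookkeeping stated there (one ability estimate per examinee per dimension, one slope per item per dimension, one intercept per item, and two scale constraints per dimension); the function name is mine.

```python
# Parameter counting for a k-dimensional compensatory MIRT calibration, following
# the bookkeeping in the text: k*m abilities + k*n slopes + n intercepts - 2k
# scale constraints, with m examinees and n items.
def n_parameters(k, m, n):
    return k * m + (k + 1) * n - 2 * k

m, n = 2000, 40
print(n_parameters(1, m, n))                          # 2078, as in the text
print(n_parameters(2, m, n) - n_parameters(1, m, n))  # 2038 additional parameters
```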
Third, simulation studies offer a means to verify theoretical statistical properties in practice, but simulation scenarios always have less than real complexity. It is important to point out that all the simulated data sets in this research were based on simple structure and therefore represent only the simplest cases. Future studies should employ mixed structure to explore how well the RLR index identifies the true dimensionality under those conditions. Furthermore, although the two simulation studies manipulated the important variables related to dimensionality, other potential variables, such as the effect of the guessing parameter on model-data fit and the interaction between item-factor structure and item discrimination (with different item discriminations for each factor), may be appealing topics for future research. Comparisons between the RLR index and the non-parametric indices for detecting dimensionality would also be worth investigating. To probe the limits of the RLR index, it would further be of interest to determine the minimum number of items and the minimum level of item discrimination needed to represent one identifiable dimension.

Last, it is not surprising that the choice of an appropriate dimensionality-assessment method is constrained by the limitations of estimation theory and of the computer program (Tate, 2002). When full-information factor analysis (TESTFACT) is used, the number of factors should not exceed five in order to ensure the accuracy of the results (Bock et al., 1988). In order to demonstrate how the RLR index functions under under-fit, good fit, and over-fit, the maximum data dimensionality simulated in this research was three. It is expected that the investigation of higher-dimensional data may become possible when a more powerful mathematical algorithm or computer program is developed.

Hopefully, the results presented in this research will offer useful information to practitioners interested in using the MIRT model. It is hoped that these findings will promote future research in this area and lead to helpful guidelines with respect to the assessment of data dimensionality.

APPENDIX A

Mathematical Derivation of Estrella's (1998) R2 Analog

Solve
$$\frac{d\phi}{1-\phi}=\frac{dA}{1-\frac{A}{B}},\qquad \phi(0)=0.$$

Solution:
$$\int\frac{d\phi}{1-\phi}=\int\frac{dA}{1-\frac{A}{B}}
\;\Longrightarrow\;
-\ln(1-\phi)=-B\ln\Big(1-\frac{A}{B}\Big)+C,\ \text{where } C \text{ is a constant,}$$
$$\Longrightarrow\;
\ln(1-\phi)=\ln\Big(1-\frac{A}{B}\Big)^{B}-C
\;\Longrightarrow\;
1-\phi=\Big(1-\frac{A}{B}\Big)^{B}\exp(-C).$$
Given that $\phi(0)=0$, that is, $\phi=0$ when $A=0$:
$$1-0=(1-0)^{B}\exp(-C)\;\Longrightarrow\;\exp(-C)=1\;\Longrightarrow\;C=0.$$
Thus
$$1-\phi=\Big(1-\frac{A}{B}\Big)^{B}\;\Longrightarrow\;\phi=1-\Big(1-\frac{A}{B}\Big)^{B}.$$
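As a quick check on this derivation (added here; it is not part of the original appendix), the solution can be verified by differentiating it:
$$
\phi=1-\Big(1-\frac{A}{B}\Big)^{B}
\;\Longrightarrow\;
\frac{d\phi}{dA}=\Big(1-\frac{A}{B}\Big)^{B-1},
\qquad
\frac{d\phi}{1-\phi}=\frac{\big(1-\frac{A}{B}\big)^{B-1}\,dA}{\big(1-\frac{A}{B}\big)^{B}}=\frac{dA}{1-\frac{A}{B}},
$$
and $\phi(0)=1-1=0$, so both the differential equation and the boundary condition are satisfied.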
APPENDIX B

The Conditional Distributions of the RLR Values in Simulation Study I

Figure B.1. The conditional distributions of the RLR values for Test 111 with 2000 examinees
Figure B.2. The conditional distributions of the RLR values for Test 111 with 6000 examinees
Figure B.3. The conditional distributions of the RLR values for Test 121 with 2000 examinees
Figure B.4. The conditional distributions of the RLR values for Test 121 with 6000 examinees
Figure B.5. The conditional distributions of the RLR values for Test 131 with 2000 examinees
Figure B.6. The conditional distributions of the RLR values for Test 131 with 6000 examinees
Figure B.7. The conditional distributions of the RLR values for Test 211 with 2000 examinees
Figure B.8. The conditional distributions of the RLR values for Test 211 with 6000 examinees
Figure B.9. The conditional distributions of the RLR values for Test 221 with 2000 examinees
Figure B.10. The conditional distributions of the RLR values for Test 221 with 6000 examinees
Figure B.11. The conditional distributions of the RLR values for Test 231 with 2000 examinees
Figure B.12. The conditional distributions of the RLR values for Test 231 with 6000 examinees
Figure B.13. The conditional distributions of the RLR values for Test 311 with 2000 examinees
Figure B.14. The conditional distributions of the RLR values for Test 311 with 6000 examinees
Figure B.15. The conditional distributions of the RLR values for Test 321 with 2000 examinees
Figure B.16. The conditional distributions of the RLR values for Test 321 with 6000 examinees
Figure B.17. The conditional distributions of the RLR values for Test 331 with 2000 examinees
Figure B.18. The conditional distributions of the RLR values for Test 331 with 6000 examinees
Figure B.19. The conditional distributions of the RLR values for Test 411 with 2000 examinees
Figure B.20. The conditional distributions of the RLR values for Test 411 with 6000 examinees
Figure B.21. The conditional distributions of the RLR values for Test 421 with 2000 examinees
Figure B.22. The conditional distributions of the RLR values for Test 421 with 6000 examinees
Figure B.23. The conditional distributions of the RLR values for Test 431 with 2000 examinees
Figure B.24. The conditional distributions of the RLR values for Test 431 with 6000 examinees
Figure B.25. The conditional distributions of the RLR values for Test 112 with 2000 examinees
Figure B.26. The conditional distributions of the RLR values for Test 112 with 6000 examinees
Figure B.27. The conditional distributions of the RLR values for Test 122 with 2000 examinees
Figure B.28. The conditional distributions of the RLR values for Test 122 with 6000 examinees
Figure B.29. The conditional distributions of the RLR values for Test 132 with 2000 examinees
Figure B.30. The conditional distributions of the RLR values for Test 132 with 6000 examinees
Figure B.31. The conditional distributions of the RLR values for Test 212 with 2000 examinees
Figure B.32. The conditional distributions of the RLR values for Test 212 with 6000 examinees
Figure B.33. The conditional distributions of the RLR values for Test 222 with 2000 examinees
Figure B.34. The conditional distributions of the RLR values for Test 222 with 6000 examinees
Figure B.35. The conditional distributions of the RLR values for Test 232 with 2000 examinees
Figure B.36. The conditional distributions of the RLR values for Test 232 with 6000 examinees
Figure B.37. The conditional distributions of the RLR values for Test 312 with 2000 examinees
Figure B.38. The conditional distributions of the RLR values for Test 312 with 6000 examinees
Figure B.39. The conditional distributions of the RLR values for Test 322 with 2000 examinees
Figure B.40. The conditional distributions of the RLR values for Test 322 with 6000 examinees
Figure B.41. The conditional distributions of the RLR values for Test 332 with 2000 examinees
Figure B.42. The conditional distributions of the RLR values for Test 332 with 6000 examinees
Figure B.43. The conditional distributions of the RLR values for Test 412 with 2000 examinees
Figure B.44. The conditional distributions of the RLR values for Test 412 with 6000 examinees
Figure B.45. The conditional distributions of the RLR values for Test 422 with 2000 examinees
Figure B.46. The conditional distributions of the RLR values for Test 422 with 6000 examinees
Figure B.47. The conditional distributions of the RLR values for Test 432 with 2000 examinees
Figure B.48. The conditional distributions of the RLR values for Test 432 with 6000 examinees
APPENDIX C

The Conditional Distributions of the RLR Values in Simulation Study II

Figure C.1. The conditional distributions of the RLR values for Form 111
Figure C.2. The conditional distributions of the RLR values for Form 112
Figure C.3. The conditional distributions of the RLR values for Form 121
Figure C.4. The conditional distributions of the RLR values for Form 122
Figure C.5. The conditional distributions of the RLR values for Form 131
Figure C.6. The conditional distributions of the RLR values for Form 132
Figure C.7. The conditional distributions of the RLR values for Form 211
Figure C.8. The conditional distributions of the RLR values for Form 212
Figure C.9. The conditional distributions of the RLR values for Form 221
Figure C.10. The conditional distributions of the RLR values for Form 222
Figure C.11. The conditional distributions of the RLR values for Form 231
Figure C.12. The conditional distributions of the RLR values for Form 232
Figure C.13. The conditional distributions of the RLR values for Form 311
Figure C.14. The conditional distributions of the RLR values for Form 312
Figure C.15. The conditional distributions of the RLR values for Form 321
Figure C.16. The conditional distributions of the RLR values for Form 322
Figure C.17. The conditional distributions of the RLR values for Form 331
Figure C.18. The conditional distributions of the RLR values for Form 332
Figure C.19. The conditional distributions of the RLR values for Form 411
Figure C.20. The conditional distributions of the RLR values for Form 412
Figure C.21. The conditional distributions of the RLR values for Form 421
Figure C.22. The conditional distributions of the RLR values for Form 422
Figure C.23. The conditional distributions of the RLR values for Form 431
Figure C.24. The conditional distributions of the RLR values for Form 432
Figure C.25. The conditional distributions of the RLR values for Form 511
Figure C.26. The conditional distributions of the RLR values for Form 512
Figure C.27. The conditional distributions of the RLR values for Form 521
Figure C.28. The conditional distributions of the RLR values for Form 522
Figure C.29. The conditional distributions of the RLR values for Form 531
Figure C.30. The conditional distributions of the RLR values for Form 532
Figure C.31. The conditional distributions of the RLR values for Form 611
Figure C.32. The conditional distributions of the RLR values for Form 612
Figure C.33. The conditional distributions of the RLR values for Form 621
Figure C.34. The conditional distributions of the RLR values for Form 622
Figure C.35. The conditional distributions of the RLR values for Form 631
Figure C.36. The conditional distributions of the RLR values for Form 632
REFERENCES

Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and tests are measuring. Applied Measurement in Education, 7(4), 255-278.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.
Bartlett, M. S. (1950). Tests of significance in factor analysis. British Journal of Psychology, 3, 77-85.

Bejar, I. I. (1980). A procedure for investigating the unidimensionality of achievement tests based on item parameter estimates. Journal of Educational Measurement, 17, 283-296.

Bejar, I. I. (1988). An approach to assessing unidimensionality revisited. Applied Psychological Measurement, 12, 377-379.

Berger, M. P. F., & Knol, D. L. (1990). On the assessment of dimensionality in multidimensional item response theory models (Research Report 90-8). Enschede, The Netherlands: University of Twente, Department of Education.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.

Bock, R. D., Gibbons, R. D., & Muraki, E. (1988). Full-information item factor analysis. Applied Psychological Measurement, 12(3), 261-280.

Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems, II. Effect of inequality of variance and correlation between errors in the two-way classification. Annals of Mathematical Statistics, 25, 484-498.

Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1, 245-276.

Choi. (1997). A response dichotomization technique for item parameter estimation of the multidimensional graded response model. Unpublished doctoral dissertation, University of Texas at Austin.

Davey, T., Nering, M. L., & Thompson, T. (1997). Realistic simulation of item response data (No. 97-4). Iowa City, IA: College Admission Testing Program.

De Ayala, R. J., & Hertzog, M. A. (1991). The assessment of dimensionality for use in item response theory. Multivariate Behavioral Research, 26(4), 765-792.

De Champlain, A., & Gessaroli, M. E. (1991). Assessing test dimensionality using an index based on non-linear factor analysis. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

De Champlain, A., & Gessaroli, M. E. (1996). Assessing the dimensionality of item response matrices with small sample sizes and short test lengths. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.

De Champlain, A., & Gessaroli, M. E. (1998). Assessing the dimensionality of item response matrices with small sample sizes and short test lengths. Applied Measurement in Education, 11(3), 231-253.

DeMars, C. E. (2003). Detecting multidimensionality due to curriculum differences. Journal of Educational Measurement, 40(1), 29-51.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Douglas, J., Kim, H. R., Habing, B., & Gao, F. (1998). Investigating local dependence with conditional covariance functions. Journal of Educational and Behavioral Statistics, 23(2), 129-151.

Douglas, J., Kim, H. R., Roussos, L., Stout, W., & Zhang, J. (1995). LSAT dimensionality analysis for the December 1991, June 1992, and October 1992 administrations (LSAC Research Report Series, LSAC-R-95-05).

Drasgow, F., & Lissak, R. I. (1983). Modified parallel analysis: A procedure for examining the latent dimensionality of dichotomously scored item responses. Journal of Applied Psychology, 68(3), 363-373.

Efron, B. (1978). Regression and ANOVA with zero-one data: Measures of residual variation. Journal of the American Statistical Association, 73, 113-121.
Embretson, S. E. (1985). Multicomponent latent trait models for test design. In S. E. Embretson (Ed.), Test design: Developments in psychology and psychometrics. New York, NY: Academic Press.

Estrella, A. (1998). A new measure of fit for equations with dichotomous dependent variables. Journal of Business & Economic Statistics, 16(2), 198-205.

Estrella, A., Rodrigues, A. P., & Schich, S. (2003). How stable is the predictive power of the yield curve? Evidence from Germany and the United States. The Review of Economics and Statistics, 85(3), 629-644.

Ferrara, S., Huynh, H., & Michaels, H. (1999). Contextual explanations of local dependence in item clusters in a large scale hands-on science performance assessment. Journal of Educational Measurement, 36(2), 119-140.

Fraser, C. (1988). NOHARM: An IBM PC computer program for fitting both unidimensional and multidimensional normal ogive models of latent trait theory. Armidale, New South Wales, Australia: Center for Behavioral Studies, The University of New England.

Fraser, C., & McDonald, R. P. (1988). NOHARM: Least squares item factor analysis. Multivariate Behavioral Research, 23, 267-269.

Gessaroli, M. E., & De Champlain, A. (1996). Using an approximate chi-square statistic to test the number of dimensions underlying the responses to a set of items. Journal of Educational Measurement, 33, 157-179.

Guilford, J. P. (1941). The difficulty of a test and its factor composition. Psychometrika, 6, 67-77.

Haberman, S. J. (1977). Log-linear models and frequency tables with small expected cell counts. Annals of Statistics, 5, 1148-1169.

Hambleton, R. K., & Rovinelli, R. J. (1986). Assessing the dimensionality of a set of test items. Applied Psychological Measurement, 10, 287-302.

Harrison, D. A. (1986). Robustness of IRT parameter estimation to violations of the unidimensionality assumption. Journal of Educational Statistics, 11(2), 91-115.

Harwell, M., Stone, C. A., Hsu, T. C., & Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20, 101-125.

Hattie, J. (1984). An empirical study of various indices for determining unidimensionality. Multivariate Behavioral Research, 19(1), 49-78.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139-164.

Herath, P. H. M. U., & Takeya, H. (2003). Factors determining intercropping by rubber smallholders in Sri Lanka: A logit analysis. Agricultural Economics, 29(2), 159-168.

Holland, P. W., & Rosenbaum, P. R. (1986). Conditional association and unidimensionality in monotone latent variable models. Annals of Statistics, 14, 1523-1543.

Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression. New York, NY: John Wiley & Sons.

Howell, D. C. (2001). Statistical methods for psychology (5th ed.). CA: Duxbury.

Humphreys, L. G. (1985). General intelligence: An integration of factor, test, and simplex theory. In B. B. Wolman (Ed.), Handbook of intelligence (pp. 201-224). New York, NY: Wiley.

Hutten, L. (1980). Some empirical evidence for latent trait model selection. Paper presented at the annual meeting of the American Educational Research Association, Boston, MA.

Junker, B., & Stout, W. F. (1994). Robustness of ability estimation when multiple traits are presented with one trait dominant. In D. Laveault, B. D. Zumbo, M. E. Gessaroli, & M. W. Boss (Eds.), Modern theories of measurement: Problems and issues (pp. 31-61). Ottawa, Canada: Edumetrics Research Group, University of Ottawa.

Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141-151.
Kaiser, H. F. (1970). A second generation little jiffy. Psychometrika, 35, 401-415.

Kendall, M. G. (1977). Multivariate contingency tables and further problems in multivariate analysis. In P. R. Krishnaiah (Ed.), Multivariate analysis IV. Amsterdam: North Holland.

Kim, H. R. (1994). New techniques for the dimensionality assessment of standardized test data. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.

Knol, D. L., & Berger, P. F. (1991). Empirical comparison between factor analysis and multidimensional item response models. Multivariate Behavioral Research, 26, 457-477.

Kvalseth, T. O. (1985). Cautionary note about R2. The American Statistician, 39(4), 279-285.

Lindman, H. R. (1974). Analysis of variance in complex experimental designs. New York, NY: W. H. Freeman.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Boston, MA: Addison-Wesley.

Lumsden, J. (1957). A factorial approach to unidimensionality. Australian Journal of Psychology, 9, 105-111.

Magee, L. (1990). R2 measures based on Wald and likelihood ratio joint significance tests. The American Statistician, 44(3), 250-253.

McDonald, R. P. (1967). Non-linear factor analysis. Psychometric Monographs, No. 15.

McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34, 100-117.

McDonald, R. P. (1982). Linear versus nonlinear models in item response theory. Applied Psychological Measurement, 6, 379-396.

McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Lawrence Erlbaum.

McDonald, R. P. (1989a). Future directions of item response theory. International Journal of Educational Research, 13, 205-220.

McDonald, R. P. (1989b). An index of goodness-of-fit based on noncentrality. Journal of Classification, 6, 97-103.

McDonald, R. P., & Mok, M. M. C. (1995). Goodness of fit in item response models. Multivariate Behavioral Research, 30(1), 23-40.

McKinley, R. L. (1989). Confirmatory analysis of test structure using multidimensional item response theory (ETS-RR-89-31). Princeton, NJ: Educational Testing Service.

McKinley, R. L., & Way, W. D. (1992). The feasibility of modeling secondary TOEFL ability dimensions using multidimensional IRT models (ETS-RR-92-16). Princeton, NJ: Educational Testing Service.

Menard, S. (2000). Coefficients of determination for multiple logistic regression analysis. The American Statistician, 54(1), 17-24.

Michigan Department of Education. (2004). Mathematics grade level content expectations. Retrieved November 30, 2006, from http://www.michigan.gov/documents/ELA K-8_87340 7.pdf

Michigan Department of Education. (2006). Mathematics field review. Retrieved November 30, 2006, from http://www.michigan.gov/documents/mde/MATHEMATICS Spreadsheet 092206_with_DUTCHER_BACKGROUND AND INSTRUCTIONS 174138 7.pdf

Mislevy, R. J. (1986). Recent developments in the factor analysis of categorical variables. Journal of Educational Statistics, 11, 3-31.

Moneta, F. (2005). Does the yield spread predict recessions in the Euro area? International Finance, 8(2), 263-301.

Muthen, B. (1987). LISCOMP: Analysis of linear structural equations using a comprehensive measurement model. Mooresville, IN: Scientific Software.
Muthen, L. K., & Muthen, B. (1998). Mplus: The comprehensive modeling program for applied researchers. User's guide. Los Angeles, CA: Muthen & Muthen.

Nandakumar, R. (1994). Assessing dimensionality of a set of responses: Comparison of different approaches. Journal of Educational Measurement, 31(1), 17-35.

Nandakumar, R., & Stout, W. F. (1993). Refinements of Stout's procedure for assessing latent trait unidimensionality. Journal of Educational Statistics, 18(1), 41-68.

Olson, C. L. (1976). On choosing a test statistic in multivariate analyses of variance. Psychological Bulletin, 83, 579-586.

Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9(4), 401-412.

Reckase, M. D. (1990). Unidimensional data from multidimensional tests and multidimensional data from unidimensional tests. Paper presented at the annual meeting of the American Educational Research Association, Boston, MA.

Reckase, M. D. (1997a). A linear logistic multidimensional model for dichotomous item response data. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York, NY: Springer.

Reckase, M. D. (1997b). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21(1), 25-36.

Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1988). Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25(3), 193-203.

Reckase, M. D., & McKinley, R. L. (1982). Some latent trait theory in a multidimensional latent space. Paper presented at the Item Response Theory and Computerized Adaptive Testing Conference, Wayzata, MN.

Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15(4), 361-373.

Rosenbaum, P. R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika, 49(3), 425-435.

Roussos, L. A. (1992). Hierarchical agglomerative clustering computer program user's manual. University of Illinois at Urbana-Champaign.

Roussos, L. A. (1993). PROX help sheet. Urbana-Champaign: Statistical Laboratory for Educational and Psychological Measurement, Department of Statistics, University of Illinois.

Roussos, L. A., Stout, W. F., & Marden, J. I. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35(1), 1-30.

Roznowski, M., Tucker, L. R., & Humphreys, L. G. (1991). Three approaches to determining the dimensionality of binary items. Applied Psychological Measurement, 15(2), 109-127.

Shin, Y. S., & Moore, W. T. (2003). Explaining credit rating differences between Japanese and U.S. agencies. Review of Financial Economics, 12(4), 327-344.

Spearman, C. (1904). "General intelligence" objectively determined and measured. American Journal of Psychology, 15, 201-293.

Steiger, J. H. (1980a). Testing pattern hypotheses on correlation matrices. Multivariate Behavioral Research, 15, 335-352.

Steiger, J. H. (1980b). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87, 245-251.

Stone, C. A., & Yeh, C. C. (2006). Assessing the dimensionality and factor structure of multiple-choice exams: An empirical comparison of methods using the Multistate Bar Examination. Educational and Psychological Measurement, 66(2), 193-214.

Stout, W., Habing, B., Kim, J., Roussos, L., & Zhang, J. (1993). Conditional covariance based nonparametric multidimensionality assessment. Applied Psychological Measurement, 20, 331-354.
Stout, W. F. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52(4), 589-617.

Stratmann, T. (2002). Can special interests buy congressional votes? Evidence from financial services legislation. The Journal of Law and Economics, 45, 345-373.

Suppes, P., & Zanotti, M. (1981). When are probabilistic explanations possible? Synthese, 48, 191-199.

Sympson, J. B. (1978). A model for testing with multidimensional items. In D. J. Weiss (Ed.), Proceedings of the 1977 computerized adaptive testing conference (pp. 82-98). Minneapolis, MN: University of Minnesota.

Tanaka, J. S. (1993). Multifaceted conceptions of fit in structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models. Newbury Park, CA: Sage.

Tate, R. (2002). Test dimensionality. In G. Tindal & T. M. Haladyna (Eds.), Large-scale assessment programs for all students. Mahwah, NJ: Lawrence Erlbaum.

Tate, R. (2003). A comparison of selected empirical methods for assessing the structure of responses to test items. Applied Psychological Measurement, 23(3), 159-203.

Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple category response models. Journal of Educational and Behavioral Research, 26, 247-260.

Thompson, T. (Undated). GENDATS: A computer program for generating multidimensional item response data.

Turner, R. C. (2000). Evaluating a procedure for investigating the multidimensional parallelism of standardized tests. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.

Turner, R. L., Miller, T., Reckase, M. D., Davey, T., & Ackerman, T. A. (1996). Assessing the dimensionality of the interaction between items on a Mathematics test of the American College Testing (ACT) exam and subgroups of an ACT examinee population. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.

van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. New York, NY: Springer.

Vandaele, W. (1981). Wald, likelihood ratio, and Lagrange multiplier tests as an F test. Economics Letters, 8, 361-365.

Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What is the effect of local item dependence on reliability? Educational Measurement: Issues and Practice, 15, 22-29.

Whitely, S. E. (1980). Multicomponent latent trait models for ability tests. Psychometrika, 45, 479-494.

Wilson, D., Wood, R., Gibbons, R., Schilling, S., Muraki, E., & Bock, R. D. (2003). TESTFACT: Test scoring and full information item factor analysis (Version 4.0). Chicago, IL: Scientific Software International.

Zhang, J., & Stout, W. F. (1995). Theoretical results concerning DETECT. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA.

Zhang, J., & Stout, W. F. (1996). A new theoretical DETECT index of dimensionality and its estimation. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY.

Zheng, B., & Agresti, A. (2000). Summarizing the predictive power of a generalized linear model. Statistics in Medicine, 19(13), 1771-1781.