fit?" ‘7":‘5 , :Jx: ‘ 2 J -3: ”‘3 2M2;- - 3;- Jh-Jh. 1g ‘9'“ ' 2. ..u. lfi'ém-kyj Wyn. v 2'.’ ‘ :3 '1' u m . 4“ A". UV ustwfi w ’51 4 - V 5 1 ~ mfg-kin ‘5‘ s ‘2: .7 F, 1 ‘ :11 5:541:35“! 3.x! ~: - L fl 6"? 11‘ gm irrit- ‘ .~I:‘.-;-I.'.“. 3. LIBRARY Michigan State University This is to certify that the dissertation entitled THE IMPACT OF 'SCALE DILATION ON THE QUALITY OF THE LINKING OF MULTIDIMENSIONAL ITEM RESPONSE THEORY CALIBRATIONS presented by Kyung-Seok Min has been accepted towards fulfillment of the requirements for Ph. D. degree in Counseling, Educational Psychology & Special Education Major professor Date May 23, 2003 M5 U is an Affirmau'w Action/Equal Opportunity Institution 0-12771 PLACE IN RETURN Box to remove this checkout from your record. To AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE I DATE DUE u 3 ,. 0. [£03 3 1 2005 th "014 220% 5&2 91 Q 390? ,, A 'flBH-irEmr- 000 20 2005 MAY 1 6 2007 _j40208 092109 A000 .2; 2m 6/01 cJCIRC/DateDuepGS—sz THE IMPACT OF SCALE DILATION ON THE QUALITY OF THE LINKING OF MULTIDIMENSIONAL ITEM RESPONSE THEORY CALIBRATIONS BY KYUNG-SEOK MIN A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY Department of Counseling, Educational Psychology, and Special Education 2003 ABSTRACT THE IMPACT OF SCALE DILATION ON THE QUALITY OF THE LINKING OF MULTIDIMENSIONAL ITEM RESPONSE THEORY CALIBRATIONS BY KYUNG-SEOK MIN This study compares and evaluates multidimensional item response theory (MIRT) linking methods in terms of the accuracy and stability of metric transformations across several testing conditions. Most psychological and educational tests are sensitive to multiple traits. This suggests that the application of MIRT is needed. One factor that limits the application of MIRT in practice is difficulty in establishing equivalent scores on multiple ability dimensions. While several MIRT linking methods have been developed to solve the problem, each of them has unique properties in terms of statistical characteristics and optimization criteria, and it is not yet known whether different MIRT linking methods lead to the same/similar metric transformations. Both simulation and real data are analyzed to compare different MIRT linking methods. In addition, a new way of MIRT linking is suggested based on orthogonal Procrustes solutions using a diagonal dilation matrix. ACKNOWLEDGEMENTS My special appreciation goes to Dr. Mark Reckase, my dissertation director. His insight, encouragement, and friendship have been invaluable for me to finish this manuscript. I wish to send my true thanks to Dr. Ken Frank, my academic advisor, who has been encouraging and monitoring my academic works from the beginning to the end of my doctoral study. I also appreciate Drs, Richard Houang, James Stapleton, and Edward Wolfe, who spent their precious times to provide suggestions and comments on my work. I wish to send my recognition to Drs. Jong—Sung Lee and Sang-Jin Kang at Yonsei University, who supported me to study abroad and kept guiding me in various issues for a long time. I also wish to thank my parents and Miran, who are lifelong supporters and friends whatever happens around me. Having you as a family is more valuable than what any word can express in the world. TABLE OF CONTENTS LIST OF TABLES vi LIST OF FIGURES vii CHAPTER 1 INTRODUCTION 1 1.1 Invariance and Indeterminacy in Item Response Theory ...... 
1.2 Equating and Dimensionality
1.3 Purpose of the Study

CHAPTER 2 MULTIDIMENSIONAL IRT MODEL AND LINKING
2.1 Multidimensional IRT Models
2.2 Multidimensional IRT Linking Methods
2.2.1 Oshima, Davey and Lee's Method
2.2.2 Li and Lissitz' Method
2.3 Extension of the LL Method with a Diagonal Dilation Matrix
2.3.1 Example: LL Method and M Method
2.4 Other MIRT Linking Methods
2.4.1 Hirsch's Method
2.4.2 Thompson, Nering and Davey's Method
2.5 Evaluation Criteria

CHAPTER 3 METHODS
3.1 Simulation Study
3.1.1 Equating Design and Specification of MIRT Model
3.1.2 Generation of Item Parameters and Response Patterns
3.1.3 Simulation Factors
3.1.4 Evaluation Criteria and Data Analysis
3.2 Real Data Analysis
3.2.1 Evaluation Criteria and Data Analysis

CHAPTER 4 RESULTS
4.1 Simulation Study
4.1.1 Results of Repeated Measures Analysis of Variance
4.1.2 Comparison of the Three Linking Methods
4.2 Real Data Analysis
4.2.1 Item Estimates Comparison
4.2.2 True Score Comparison

CHAPTER 5 SUMMARY, DISCUSSION, AND CONCLUSION
5.1 Simulation Study
5.2 Real Data Analysis
5.3 Discussion
5.3.1 Rotation and Optimization Criteria
5.3.2 Evaluation Criteria
5.3.3 Relative Efficiency of Linking Methods
5.3.4 Test Response Surface and Ability Levels
5.4 Conclusion

APPENDIX A
APPENDIX B
REFERENCES

LIST OF TABLES

Table 2.1. Two Sets of Item Estimates and Rotated Estimates
Table 2.2. Comparison of Transformed Results with a Dilation Constant and with a Diagonal Dilation Matrix
Table 3.1. Five MIRT Discrimination and Difficulty Levels
Table 3.2. Item Parameters for Twenty Common Items
Table 3.3. Ability Distributions for Five Examinee Groups
Table 3.4. Composition of Two Test Forms
Table 3.5. Item Difficulties of Two Test Forms
Table 3.6. Item Parameter Estimates of Common Items
Table 4.1. Test Statistics (F), Degrees of Freedom (DF), and Effect Sizes (η²) of Biases from Repeated Measures ANOVA
Table 4.2. Test Statistics (F), Degrees of Freedom (DF), and Effect Sizes (η²) of RMSEs from Repeated Measures ANOVA

LIST OF FIGURES

Figure 2.1. Item Response Surface with a1 = 1, a2 = 1, c = .2, and d = 1
Figure 2.2. UIRT and MIRT Linking Components
Figure 3.1. Two Dimensional Structures in Simulation Data
Figure 3.2. Item Vectors, Approximate Simple Structure
Figure 3.3. Item Vectors, Mixed Structure
Figure 4.1. Bias (a1, n = 1000)
Figure 4.2. Bias (a2, n = 1000)
Figure 4.3. Bias (d, n = 1000)
Figure 4.4. RMSE (a1, n = 1000)
Figure 4.5. RMSE (a2, n = 1000)
Figure 4.6. RMSE (d, n = 1000)
Figure 4.7. Bias (a1, n = 2000)
Figure 4.8. Bias (a2, n = 2000)
Figure 4.9. Bias (d, n = 2000)
Figure 4.10. RMSE (a1, n = 2000)
Figure 4.11. RMSE (a2, n = 2000)
Figure 4.12. RMSE (d, n = 2000)
Figure 4.13. Mean Differences of Five Sets of Samples
Figure 4.14. Difference Variations of Five Sets of Samples
Figure 4.15. Differences of Transformed Test Scores and Estimated True Scores on the Base Test Form

CHAPTER 1
INTRODUCTION

The main role of standardized tests is to provide fair, reliable, and objective information regarding the examinees' ability or skill that the tests are developed to measure. Since test scores are often used to make individual, institutional, and national decisions, in conjunction with other available information, equitability and comparability of test scores have been important issues in the testing arena (Cook & Eignor, 1991).
Various situations exist that make test users concerned about different measures of allegedly the same thing (Kolen, 2001). For example, because of the security of test items, most testing programs develop multiple test forms of the same test (e.g., ACT and SAT). By the very nature of the test design, the recently popular computer adaptive tests (CAT) use different sets of items for different examinees (Dorans, 2000). Moreover, a comparable test scale provides meaningful interpretations over test questions matched to different grades (e.g., grade equivalents), types of tests (e.g., paper-and-pencil and CAT versions of the Armed Services Vocational Aptitude Battery [ASVAB], Maier, 1993), or somewhat different constructs (e.g., state and national achievement tests, Feuer, Holland, Green, Bertenthal, & Hemphill, 1999).

In this chapter, as an introduction, some background information on IRT models and the need for metric transformations under certain test administration conditions is provided. In addition, the purpose of the study and the research questions are described.

1.1 Invariance and Indeterminacy in Item Response Theory

One of the important and useful features of item response theory (IRT) compared with classical test theory is the invariance of parameters. That is, item characteristics denoted as item parameters do not depend on the distribution of examinees' abilities (Lord, 1980). Examinee ability is also invariant across different item sets. However, it is well known that there can be an infinite number of legitimate translations of IRT parameters. This is true because of scale indeterminacy similar to that of traditional factor analysis, the so-called identification problem (Baker, 1992; Lord, 1980). Therefore, the invariance property of IRT parameters holds only after a certain common metric is set across examinee samples or sets of items. In practice, IRT calibration programs (e.g., BILOG3 [Mislevy & Bock, 1990] and LOGIST [Wingersky, Barton, & Lord, 1982]) solve scale indeterminacy by setting features of the ability or difficulty distributions to specific values. Specifying these values makes the model identified and allows unique estimation of the model parameters.

1.2 Equating and Dimensionality

Equating is defined as those processes establishing equivalent scores on different instruments or subject groups (Crocker & Algina, 1986; Embretson & Reise, 2000; Hambleton & Swaminathan, 1985; Lord, 1980). Equating, as a way to build a common metric, has been used in various testing situations, such as administration of multiple test forms, detection of differential item functioning (DIF), development of an item bank, computerized adaptive testing (CAT), and more.

Traditionally, IRT models have been developed with the assumption of unidimensionality: the item-person interaction is modeled with a single latent trait. However, the mechanisms and cognitive processes that an examinee uses to respond to test items do not seem so simple (Reckase, 1985, 1995), and many psychological and educational researchers agree that multidimensional abilities/traits come into play in test performances (Ackerman, 1991; Traub, 1983). For example, a mathematical item with a verbal description may require reading ability to transform the text into a mathematical formula as well as mathematical knowledge to find a solution from the formula. Moreover, achievement tests are likely to be sensitive to multiple dimensions because several content areas are included in the particular subject matter.
Research has shown that when test responses known to be multidimensional are modeled under the unidimensional assumption, they violate the local independence assumption and result in increased measurement errors and incorrect inferences (Ackerman, 1994; Baker, 1992; Reckase, 1985).

Due to their popularity and simplicity, most IRT equating methods have been based on unidimensional item response theory (UIRT) models. These methods make adjustments for different scales (i.e., the origin and unit of the scale) (Cook & Eignor, 1991; Kolen & Brennan, 1995; Lord, 1980). But when the goal is to establish comparable scores on tests that seem to be affected by more than one dimension, the directions of the dimensions also need to be adjusted to obtain equitable meaning of the reference/coordinate system. That is, multidimensional item response theory (MIRT) models are directionally indeterminant as well as scale indeterminant. Therefore, MIRT equating requires a composite transformation of rotation and scaling to derive comparable scores (refer to Green, 1976; Lissitz, Schonemann, & Lingoes, 1976; Schonemann, 1966; Schonemann & Carroll, 1970).

1.3 Purpose of the Study

Even though most psychological and educational tests are sensitive to multiple traits/skills, implying the need for MIRT, the application of MIRT is limited in practice by difficulties in establishing equivalent scores on multiple ability dimensions. While several MIRT linking methods have already been developed to solve the problem of comparability (Davey, Oshima, & Lee, 1996; Hirsch, 1988, 1989; Li, 1997; Li & Lissitz, 2000; Oshima & Davey, 1994; Oshima, Davey, & Lee, 2000; Thompson, Nering, & Davey, 1997), each of them has unique properties in terms of statistical characteristics and optimization criteria (i.e., what is minimized or maximized). Moreover, it is not yet known whether different MIRT linking methods lead to the same/similar metric transformations. If it is found that there are significant differences across MIRT linking methods, it might be possible that one method is more appropriate than others in certain testing conditions. Then, careful consideration should be taken to apply any specific linking technique according to the properties of each method and the specific goal of test equating.

The purposes of this study are to compare and evaluate two leading linking procedures[1] for MIRT equating (i.e., Li & Lissitz, 2000; Oshima et al., 2000) in terms of the accuracy and stability of metric transformations across various testing conditions (e.g., sample sizes, structures of dimensionality, and distributional shapes of true ability) and to develop and verify a new linking method that provides a more desirable multidimensional metric transformation, especially in the dilation/contraction of a scale. One of the leading MIRT linking methods, developed by Li and Lissitz (2000), includes only a single dilation constant for multiple dimensions based on traditional factor analysis techniques (i.e., orthogonal Procrustes solutions). However, more desirable transformations might be expected when linking allows a unique dilation/contraction for each dimension. A new MIRT linking method that incorporates a diagonal dilation matrix into orthogonal Procrustes solutions was developed and compared with the two previous linking methods. Both simulation and real data were used to investigate the statistical characteristics and practical implications of the comparisons.

[1] Linking is an essential part of the equating processes that generate comparable test scores. Throughout the remainder of the study, linking is defined as the specific statistical procedures used to determine the relationship among sets of item/ability parameters.
More specifically, in the simulation study, the question is which linking procedure is better for placing item parameter estimates on the same scale. In the real data analysis, the concern is which linking procedure is better at transforming item estimates on one test form into item estimates on the other test form. Beyond the item-level comparison, it was also explored whether or not there were distinguishable linking error patterns on the ability space by investigating differences in test response surfaces (i.e., estimated true test scores on the ability space) across the different linking procedures.

CHAPTER 2
MULTIDIMENSIONAL IRT MODEL AND LINKING

In this chapter, MIRT models and linking methods are described based on a review of the literature. Two leading MIRT linking methods are examined in detail. A new linking method with a diagonal dilation matrix is proposed and demonstrated with an example.

2.1 Multidimensional IRT Models

When real phenomena turn out to be complicated, statistical models have to allow for complexity in order to explain and reflect realities as they are. In test situations, MIRT has been developed to explain the effects of multiple ability dimensions on test performances. When conducting linking in the IRT framework, it is assumed that the latent space is sufficiently represented in the model such that item responses are independent after controlling for ability levels (i.e., local independence). Even though most IRT linking research has focused on unidimensional models, modeling more than one ability dimension may better satisfy the requirements for model fit, specifically the local independence assumption. Previous studies (Ackerman, 1994; Baker, 1992; Reckase, 1985) indicated that UIRT models might violate the invariance property when a test is sensitive to more than one dimension and examinees' abilities vary on those multiple dimensions. Moreover, Reckase and Hirsch (1991) claimed that the number of dimensions is often underestimated and that overestimating the number of dimensions does little harm.

In general, unidimensional interactions between persons and items are sufficient to model test data under the following conditions: (a) both examinee ability and test item characteristics vary on one dimension, as assumed in the model; (b) examinee ability varies on only one ability dimension even though test items are sensitive to more than one ability, and vice versa; or (c) examinee abilities differ on multiple ability dimensions but all items are sensitive to the same composite of abilities (Li, 1997; Reckase, 1990). Therefore, a UIRT model is easily expressed in the form of an MIRT model by making one set of item parameters or ability parameters constant. Put another way, a UIRT model can be treated as a special case of an MIRT model.

Two types of MIRT models have been developed, i.e., compensatory and noncompensatory models. These differ with respect to the relationships among the ability dimensions that determine the probabilities of a person's item responses. In compensatory models (Lord & Novick, 1968; McDonald, 1967; Reckase, 1985, 1995), the proficiencies are additive in the logit, such that low ability on one trait can be compensated by high ability on other trait(s).
In noncompensatory (partially compensatory) models (Embretson, 1984; Sympson, 1978), the probabilities of getting each component of an item right are multiplied, so low ability on one trait is only partially compensated by high ability on other trait(s). In fact, the lowest probability for an item component sets the upper limit on the probability of a correct response for a noncompensatory model. Since most research on MIRT equating has been done using compensatory models (partly because of estimation difficulties for noncompensatory models), and the fit of the two types of MIRT models appears indistinguishable from a practical point of view (Spray, Davey, Reckase, Ackerman, & Carlson, 1990), the compensatory model is considered in this study.

The compensatory multidimensional extension of the three-parameter logistic model with m dimensions[2] is (McKinley & Reckase, 1983; Reckase, 1985, 1995)

P(u_{ij} = 1 \mid \mathbf{a}_i, c_i, d_i, \boldsymbol{\theta}_j) = c_i + (1 - c_i) \frac{\exp(\mathbf{a}_i' \boldsymbol{\theta}_j + d_i)}{1 + \exp(\mathbf{a}_i' \boldsymbol{\theta}_j + d_i)},   (1)

where P(u_{ij} = 1 | a_i, c_i, d_i, θ_j) is the probability of a correct response for examinee j on test item i in an m-dimensional space, u_{ij} is the item response for person j on item i (1 correct; 0 wrong), a_i is a vector of discrimination parameters for item i, c_i is the lower asymptote (or guessing parameter), the probability of a correct answer when an examinee's ability is very low, d_i is a parameter related to the difficulty of item i, and θ_j is a vector of the jth examinee's abilities.

[2] The corresponding noncompensatory MIRT model is

P(u_{ij} = 1 \mid \mathbf{a}_i, \mathbf{b}_i, c_i, \boldsymbol{\theta}_j) = c_i + (1 - c_i) \prod_{k=1}^{m} \frac{\exp(a_{ik}\theta_{jk} - b_{ik})}{1 + \exp(a_{ik}\theta_{jk} - b_{ik})},

where m is the number of dimensions, and a_{ik} and b_{ik} are the discrimination and difficulty parameters, respectively, for item i and dimension k.

This model implies that the probability of a correct item response is a monotonically increasing function bounded between 0 and 1. A two-dimensional three-parameter item response surface (analogous to the item characteristic curve in UIRT) with a1 = 1, a2 = 1, c = .2, and d = 1 is provided in Figure 2.1. The height represents the probability of a correct response corresponding to a pair of abilities (θ1 and θ2). The probability spans .2 to 1 because of the lower asymptote (.2). This item measures the two ability dimensions equally because it has the same item discrimination on both dimensions.

[Figure 2.1. Item Response Surface with a1 = 1, a2 = 1, c = .2, and d = 1]
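To make Equation (1) concrete, a minimal sketch (arbitrary, illustrative values only; not items from this study) evaluates the compensatory model for the Figure 2.1 item and shows the compensation property:

    import numpy as np

    def mirt_prob(a, theta, c, d):
        # Compensatory MIRT probability, Equation (1):
        # P = c + (1 - c) * logistic(a'theta + d)
        z = np.dot(a, theta) + d
        return c + (1.0 - c) / (1.0 + np.exp(-z))

    a = np.array([1.0, 1.0])   # discrimination vector of the Figure 2.1 item
    c, d = 0.2, 1.0

    # Low ability on dimension 1 is fully offset by high ability on dimension 2:
    print(mirt_prob(a, np.array([-2.0, 2.0]), c, d))   # 0.785
    print(mirt_prob(a, np.array([0.0, 0.0]), c, d))    # 0.785, identical
    # Very low ability on both dimensions approaches the lower asymptote c = .2:
    print(mirt_prob(a, np.array([-4.0, -4.0]), c, d))  # about 0.201

Under the noncompensatory model in footnote [2], the same (-2, 2) ability pair would yield a much lower probability, because the weakest component bounds the product of the component probabilities.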
Unlike in UIRT models, the multidimensional item-discrimination and person-ability parameters are vectors rather than scalars, and the difficulty-related parameter is a composite of item difficulty and discrimination on each dimension. The interpretation of the MIRT discrimination parameters is analogous to that of the UIRT parameter, but each element of the vector is related to a direction in the dimensional space. The meaning of the MIRT difficulty parameter is not directly equivalent to that of the unidimensional difficulty parameter because of the different parameterization. Two statistics (i.e., MDISC and MDIFF) were developed to capture multidimensional item characteristics corresponding to UIRT item discrimination and difficulty.

The discrimination power of a multidimensional item in the dimensional space can be defined as a function of the item discrimination parameters (Reckase, 1985, 1995; Reckase & McKinley, 1991; for graphic representations see also Ackerman, 1994, 1996) as

MDISC_i = \left( \sum_{k=1}^{m} a_{ik}^2 \right)^{1/2},   (2)

where MDISC_i denotes the ith item's multidimensional discrimination as a function of the slope at the steepest point, and a_{ik} is the ith item's discrimination on the kth dimension. The multidimensional item difficulty equivalent to unidimensional difficulty is

MDIFF_i = \frac{-d_i}{MDISC_i},   (3)

where MDIFF_i is the distance between the origin and the point of the steepest slope of the item response surface. In addition, the direction of the greatest discrimination in the dimensional space is given by

\alpha_{ik} = \arccos \frac{a_{ik}}{MDISC_i} \quad \left( \text{or } \cos \alpha_{ik} = \frac{a_{ik}}{MDISC_i} \right),   (4)

where α_{ik} is the angle from the kth dimension.
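A quick numeric check of Equations (2) to (4) may help, using a single hypothetical two-dimensional item (the values are illustrative, not items from this study):

    import numpy as np

    def mdisc(a):
        # Equation (2): multidimensional discrimination
        return np.sqrt(np.sum(np.asarray(a, dtype=float) ** 2))

    def mdiff(a, d):
        # Equation (3): multidimensional difficulty
        return -d / mdisc(a)

    def angles_deg(a):
        # Equation (4): direction of steepest slope, measured from each axis
        a = np.asarray(a, dtype=float)
        return np.degrees(np.arccos(a / mdisc(a)))

    a, d = [1.2, 0.5], -0.8     # hypothetical item parameters
    print(mdisc(a))             # 1.30
    print(mdiff(a, d))          # about 0.62: MDIFF > 0, a relatively hard item
    print(angles_deg(a))        # about [22.6, 67.4]: mostly measures dimension 1

In two dimensions the two angles sum to 90 degrees, so a single angle from the first dimension fixes an item's direction; the item vectors in Chapter 3 are generated this way.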
As is shown in Equation (1), the probability of a correct answer is a linear function of the item (a_i, d_i) and ability (θ_j) parameters in the exponent. Therefore, any linear transformation of the ability scale results in the same value of the exponent for a given response pattern if the item and ability parameters are transformed in a consistent way. The probability that an examinee gets an item right is identical when the IRT ability scale and item parameters are transformed properly. This property is referred to as invariance (Kolen & Brennan, 1995; Lord, 1980). While scale indeterminacy (the unspecified location of the origin and unit of the scale) is a concern when finding a proper transformation in UIRT equating, a rotation to determine a comparable reference system, as well as the scale alteration, has to be considered in MIRT due to multidimensionality (see Figure 2.2).

[Figure 2.2. UIRT and MIRT Linking Components. Panel (a) shows UIRT linking (translation and dilation between Scale E and Scale B); panel (b) shows MIRT linking in two dimensions (rotation, translation, and dilation). O is the location of the origin, U is the length of the unit; the subscript E denotes the metric to be transformed and B the target metric. The MIRT panel is a modification of Figure II-5 in Li's study (1997, p. 37).]

The rotation is required to line up one coordinate system with the other. Scaling procedures are about matching the origin (O, translation) and unit length (U, dilation) between the two scales.

2.2 Multidimensional IRT Linking Methods

Even though modeling more than one dimension often improves model fit, the use of MIRT models is limited in testing practice (Gosz & Walker, 2002; Reckase, 1997). One of the reasons might be the difficulty of finding comparable multidimensional scales across different test forms or examinee groups (Oshima et al., 2000). So far, several multidimensional linking methods have been proposed (Hirsch, 1989; Li & Lissitz, 2000; Oshima et al., 2000; Thompson et al., 1997). These methods use a two-dimensional compensatory model and consist of some or all of three linking components: a rotation matrix deals with directional indeterminacy to establish a common reference/coordinate system, and a translation vector and a dilation constant remove scale indeterminacy by finding a comparable origin and unit.

Even though Hirsch's study (1988, 1989) can be valued as the first attempt to deal with multidimensionality in IRT linking, his linking method is similar to later methods, especially Li and Lissitz' method. The method of Thompson et al. (1997) has strong potential, but it is still experimental. The focus of this study is on the two more recent linking methods (Li & Lissitz, 2000; Oshima et al., 2000).

2.2.1 Oshima, Davey, and Lee's Method

Oshima and her colleagues' linking method (2000), hereafter called the ODL method, is based on the anchor item design: a set of common items is included in multiple test forms to define a common scale. Transformations of the parameters of the compensatory multidimensional model with the exponent a_i'θ_j + d_i are conducted through the following set of linking equations:

\mathbf{a}_i^* = (A^{-1})' \mathbf{a}_i,   (5)

d_i^* = d_i - \mathbf{a}_i' A^{-1} \boldsymbol{\beta},   (6)

\boldsymbol{\theta}_j^* = A \boldsymbol{\theta}_j + \boldsymbol{\beta},   (7)

where A (m×m, m is the number of dimensions) is a rotation matrix, β (m×1) is a translation vector, and the asterisk (*) indicates transformed parameters. Here, the rotation matrix A has two functions: (a) to rotate to a proper dimensional orientation, and (b) to adjust the variances of the ability dimensions. The translation vector β is used to shift to a compatible origin by altering the origin of the scale. The equality of the transformed exponent and the original exponent can be illustrated by

\mathbf{a}_i^{*\prime} \boldsymbol{\theta}_j^* + d_i^* = (\mathbf{a}_i' A^{-1})(A \boldsymbol{\theta}_j + \boldsymbol{\beta}) + (d_i - \mathbf{a}_i' A^{-1} \boldsymbol{\beta}) = \mathbf{a}_i' \boldsymbol{\theta}_j + d_i.   (8)

As a result, the transformed components of the exponent are mathematically alternate ways to express the initial relationship without changing the probabilities of correct responses.

Oshima and her colleagues compared four MIRT linking procedures according to different evaluative criteria and suggested that the test characteristic function (TCF) and item characteristic function (ICF) methods are more stable than the other procedures (i.e., the direct method and the equated function method). The two favorable methods are distinguishable in terms of what is minimized. The TCF method is a multidimensional extension of Stocking and Lord's method (1983) that minimizes the squared differences between the two test response surfaces (i.e., the sums of the item response surfaces) of the common items, while the ICF method minimizes the sum of the squared differences of the item response surfaces. Finally, they concluded that the TCF method was best at estimating the rotation matrix over the other sub-methods and was also relatively good at estimating the translation vector. The minimization function for the ODL TCF method is

T(\boldsymbol{\theta}) = \sum_{i=1}^{n} P_i(\boldsymbol{\theta}), \quad \text{and} \quad F = \sum_{j} w_j \left[ T_B(\boldsymbol{\theta}_j) - T_E^*(\boldsymbol{\theta}_j) \right]^2,   (9)

where T_B and T_E^* indicate the expected number-correct scores on the common items for the examinee on the base test and the transformed equated test, respectively, n is the number of items, and w_j is a weighting value which allows some regions of the ability space of θ to be more important than others. If all weights are equal for all regions, the result is unweighted estimation.

The ODL method is unique in that it estimates the rotation matrix and the translation vector simultaneously, but the orthogonality of the rotation is not constrained. This means that the relative distances among items (e.g., the directional item vectors in the dimensional space; see Figures 3.2 and 3.3 in the next chapter) may not be the same before and after the rotation is conducted.
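The invariance identity in Equation (8) is easy to verify numerically. The following sketch uses arbitrary illustrative values; note that A is non-singular but deliberately not orthogonal, since the ODL method does not constrain it to be:

    import numpy as np

    A = np.array([[1.1, 0.3],     # nonorthogonal "rotation" matrix: it adjusts
                  [0.2, 0.9]])    # orientation and dimension variances at once
    beta = np.array([0.5, -0.2])  # translation vector

    a = np.array([1.2, 0.5])      # hypothetical item discrimination vector
    d = -0.8                      # difficulty-related parameter
    theta = np.array([0.4, -1.3]) # one examinee's ability vector

    A_inv = np.linalg.inv(A)
    a_star = A_inv.T @ a                  # Equation (5)
    d_star = d - a @ A_inv @ beta         # Equation (6)
    theta_star = A @ theta + beta         # Equation (7)

    # Equation (8): the exponent, hence P(u = 1), is unchanged.
    print(a @ theta + d)
    print(a_star @ theta_star + d_star)   # identical

In an actual application, A and beta are of course not chosen freely but estimated by minimizing Equation (9) over a set of ability points.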
2.2.2 Li and Lissitz' Method

Li and Lissitz (2000; see also Li, 1997), whose method is hereafter called the LL method, developed four different linking procedures based on the anchor item design and claimed that the best procedure was a composite transformation with three linking components: a rotation matrix from the orthogonal Procrustes solutions, a translation vector obtained by a least-squares method of minimizing differences between initial difficulty parameters and transformed parameters, and a central dilation constant obtained by the trace method of minimizing the sum of squared errors. The LL method uses the following set of linking equations to transform the model parameters in the exponent a_i'θ_j + d_i:

\mathbf{a}_i^{*\prime} = k \mathbf{a}_i' T,   (10)

d_i^* = d_i - \mathbf{a}_i' T \mathbf{m},   (11)

\boldsymbol{\theta}_j^* = (1/k)(T^{-1} \boldsymbol{\theta}_j + \mathbf{m}),   (12)

where T (m×m) is an orthogonal rotation matrix (change in orientation), m (m×1) is a translation vector (change in location), and k (1×1) is a central dilation constant (change in unit). Then the equality of the exponent terms before and after transformation is established by

\mathbf{a}_i^{*\prime} \boldsymbol{\theta}_j^* + d_i^* = (k \mathbf{a}_i' T)(1/k)(T^{-1} \boldsymbol{\theta}_j + \mathbf{m}) + (d_i - \mathbf{a}_i' T \mathbf{m}) = \mathbf{a}_i' \boldsymbol{\theta}_j + d_i.   (13)

Note that Equations (10) to (12) are mathematically the same as Equations (5) to (7) except for the pre-multiplication or post-multiplication of the rotation matrix, and the dilation constant.

Li and Lissitz provided a fair amount of information on multidimensional linking procedures by explaining the three linking components, i.e., rotation, translation, and central dilation (refer to Schonemann, 1966; Schonemann & Carroll, 1970). Their approach is straightforward in that the three components provide useful information on specific stages of the multidimensional metric transformation. While the ODL method deals with dimensional direction and unit change at once in the nonorthogonal rotation matrix, Li and Lissitz split these two components into the orthogonal rotation matrix and the central dilation constant. Here, the term "central" means that unit changes are assumed to be similar/constant across dimensions, such that one scalar (k) can account for all unit changes. They justified the central dilation constant by its mathematical tractability and relatively reasonable accuracy. The best linking procedure of the LL method is obtained by minimizing the following functions for the rotation matrix T, the dilation constant k, and the translation vector m:

E_1 = k A_E T - A_B,   (14)

tr(E_1' E_1) = tr\left[ (k A_E T - A_B)'(k A_E T - A_B) \right],   (15)

Q = \sum_{i=1}^{n} (d_{Ei}^* - d_{Bi})^2.   (16)

In Equations (14) to (16), tr is the matrix operator of the sum of the diagonal elements (trace), A and d are the item discriminations and difficulties, the subscripts B and E denote the base test and equated test, and the asterisk (*) indicates values transformed onto the dimensions of the base test. Note that there are two equations to minimize for the LL method (Equation [15] for rotation and dilation, and Equation [16] for translation), while there is one minimization equation for all linking components in the ODL method (Equation [9]).

2.3 Extension of the LL Method with a Diagonal Dilation Matrix

From the review of the ODL and LL methods, there is a clear difference between the two methods in terms of the structure of their linking components: the rotation matrix of the ODL method is supposed to adjust for the orientation of the reference system and the length of the unit simultaneously, while there are two linking components in the LL method, a rotation matrix and a central dilation scalar.
Further, the LL method uses an orthogonal rotation matrix that maintains the relative distances among item vectors before and after rotation, but the ODL method uses an oblique rotation that optimizes the minimization criterion.

The two MIRT linking methods take different positions on whether changes in the unit lengths are constant across dimensions. Li and Lissitz (2000) clearly indicated that they assumed a constant change of unit length across the multiple dimensions, such that one dilation constant was enough to cover the overall unit length adjustment. They provided two reasons for a dilation constant: mathematical tractability and reasonable accuracy. In the ODL method, this issue was not clearly stated because there is no specific component for dilation. Even though the simulation examples in the paper (Oshima et al., 2000, Table 2 on p. 364) showed that their main concern was a constant unit change across dimensions, there are no statistical constraints on a constant unit change in the ODL method, so it can allow a unique dilation for each dimension.

Returning to the previous example of a mathematics test measuring mathematical knowledge and reading skills: if examinee group A shows less variation in mathematical knowledge than examinee group B, should group A show less variation in reading skills as well? The answer is maybe yes, or maybe no. A more extensive answer could be found from theories on the relationship between mathematical knowledge and reading ability in general, and from specific characteristics of a given test, such as how extensive the reading skills needed to determine the mathematical solution are. One reasonable argument for a constant overall dilation of multiple dimensions may be that the dimensions measured by a test are strongly related: the change in one dimension goes along with the other dimension(s) at the same dilation/contraction rate. However, this may not be typical for the various constructs measured by educational and psychological tests. In addition, from a methodological perspective, the dilation constant can be treated as a special case of multiple dilation constants.

In order to model different unit changes along with an orthogonal rotation in MIRT linking, the dilation constant adopted in the LL method is replaced with a diagonal dilation matrix; the result is hereafter called the M method. The transformation equations of the M method are

\mathbf{a}_i^{*\prime} = \mathbf{a}_i' T K,   (17)

d_i^* = d_i - \mathbf{a}_i' T \mathbf{m},   (18)

\boldsymbol{\theta}_j^* = K^{-1}(T^{-1} \boldsymbol{\theta}_j + \mathbf{m}),   (19)

where K is a diagonal dilation matrix and the other terms are defined as before. For the two-dimensional case, K is defined as

K = \begin{bmatrix} k_1 & 0 \\ 0 & k_2 \end{bmatrix}.

Here, k_1 in the matrix K indicates the dilation component for the first dimension, and k_2 is that for the second dimension. The off-diagonal elements of K are set to zero because the relationship/direction between the two dimensions is defined not by K but only by the orthogonal rotation matrix, T. Then, the equality of the exponent terms before and after transformation is established by

\mathbf{a}_i^{*\prime} \boldsymbol{\theta}_j^* + d_i^* = (\mathbf{a}_i' T K)(K^{-1})(T^{-1} \boldsymbol{\theta}_j + \mathbf{m}) + (d_i - \mathbf{a}_i' T \mathbf{m}) = \mathbf{a}_i' \boldsymbol{\theta}_j + d_i.   (20)

Two points should be mentioned. First, Equations (17) to (19) are the same as Equations (10) to (12) except for including a diagonal dilation matrix rather than a dilation constant. Second, when k_1 is equal to k_2 in the dilation matrix, Equation (20) becomes the same as the LL method (Equation [13]).
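The identities in Equations (13) and (20) can be checked numerically in the same way as Equation (8). In the sketch below (arbitrary illustrative values; T is a proper rotation, so T^-1 = T'), setting k1 = k2 collapses the M method to the LL method:

    import numpy as np

    phi = 0.4                                    # rotation angle, illustrative
    T = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])  # orthogonal rotation matrix
    m = np.array([0.3, -0.1])                    # translation vector
    K = np.diag([1.16, 1.19])                    # diagonal dilation matrix (M method)

    a = np.array([1.5, 0.4])                     # hypothetical item parameters
    d = 0.6
    theta = np.array([-0.7, 1.1])                # one examinee's ability vector

    a_star = (a @ T) @ K                               # Equation (17)
    d_star = d - a @ T @ m                             # Equation (18)
    theta_star = np.linalg.inv(K) @ (T.T @ theta + m)  # Equation (19)

    # Equation (20): the exponent is reproduced exactly.
    print(a @ theta + d)
    print(a_star @ theta_star + d_star)   # identical

Replacing K with k * np.eye(2) for a scalar k gives the LL transformation of Equations (10) to (12).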
The proposed linking method in Equations (17) to (19) differs from the ODL method by splitting the rotation matrix and the dilation matrix and by using an orthogonal rotation. It also differs from the LL method by allowing a unique unit change for each dimension rather than a constant change for all dimensions. The same minimization criteria as in Equations (15) and (16) are used for each linking component of the M method. However, because the dilation component K is a diagonal matrix rather than a scalar, the solution is somewhat different from that of the LL method; further details are provided in Appendix A. In addition, a simple example is given in the next section to emphasize the similarity and difference between the LL method and the new method with a diagonal dilation matrix.

2.3.1 Example: LL Method and M Method

Suppose we have two sets of two-dimensional item parameter estimates for twenty anchor items, as in Table 2.1.

Table 2.1. Two Sets of Item Estimates and Rotated Estimates[3]

         Base Test      Equated Test    Rotated Equated Test
Item     a1     a2      a1     a2       a1     a2
1        1.81   0.86    0.33   1.55     1.44   0.66
2        1.22   0.07    0.59   0.88     1.06   0.05
3        1.57   0.36    0.56   1.29     1.37   0.32
4        0.71   0.53   -0.08   0.66     0.49   0.46
5        0.86   0.19    0.36   0.59     0.69   0.06
6        1.72   0.18    0.84   1.28     1.53   0.08
7        1.86   0.29    0.82   1.52     1.71   0.24
8        1.33   0.34    0.43   1.06     1.11   0.28
9        1.19   1.57   -0.45   1.31     0.79   1.14
10       2.00   0.00    1.04   1.30     1.66  -0.07
11       0.87   0.00    0.45   0.61     0.75   0.00
12       2.00   0.98    0.29   1.93     1.72   0.92
13       1.00   0.89   -0.20   1.21     0.85   0.89
14       1.22   0.14    0.65   0.92     1.12   0.03
15       1.27   0.47    0.33   1.10     1.08   0.39
16       1.35   1.15   -0.24   1.36     0.95   1.00
17       1.06   0.45    0.23   0.99     0.94   0.41
18       1.92   0.00    1.15   1.42     1.83  -0.07
19       0.96   0.22    0.35   0.79     0.84   0.18
20       1.20   0.12    0.57   0.93     1.08   0.09
Sqrt SS  6.32   2.73    2.55   5.29     5.41   2.28

[3] The two sets of item estimates were taken from Li's study (1997, Table II-1 on p. 57).

The estimates from the equated test are to be linked to the estimates from the base test. Based on the orthogonal Procrustes rotation solutions, the rotation matrix is

T = \begin{bmatrix} .59 & -.80 \\ .80 & .59 \end{bmatrix}.

By using this rotation matrix, the item estimates of the equated test are rotated to the orientation of the estimates of the base test (see columns 6 and 7 in Table 2.1). The mathematical measure of the length of a vector is the square root of the sum of the squared elements of the vector (Sqrt SS). The bottom row of Table 2.1 gives the lengths of the vectors, and it shows that the rotated item estimates are shrunken compared with the estimates of the base test. The lengths also show that the first dimension of the rotated matrix is less shrunken (6.32/5.41 = 1.17) than the second dimension (2.73/2.28 = 1.20), although the difference is not large in this example.

There are two ways to deal with the discrepancy in vector lengths. First, by assuming that the change of unit lengths is constant across dimensions and that the differences in dilation/contraction rates are sampling/estimation errors which can be ignored, a single dilation constant can be calibrated to account for all changes of unit lengths (the dilation constant of the LL method). Second, if the difference in dilation rates across dimensions is real, unique dilation components can be included to adjust the unit length for each dimension. In the second case, a diagonal dilation matrix is used for the adjustment of each dimensional unit change.

After modeling both a dilation constant and a diagonal dilation matrix, the final transformed matrices for the discrimination estimates can be obtained (see Table 2.2).
Table 2.2. Comparison of Transformed Results with a Dilation Constant and with a Diagonal Dilation Matrix

         Dilated matrix after rotation       Difference from base matrix
         With a constant  With a matrix      With a constant  With a matrix
Item     a1     a2        a1     a2          a1     a2        a1     a2
1        1.55   0.71      1.67   0.78       -0.26  -0.15     -0.14  -0.08
2        1.13   0.05      1.23   0.06       -0.09  -0.02      0.01  -0.01
3        1.47   0.34      1.60   0.38       -0.10  -0.02      0.03   0.02
4        0.52   0.49      0.57   0.54       -0.19  -0.04     -0.14   0.01
5        0.74   0.07      0.80   0.07       -0.12  -0.12     -0.06  -0.12
6        1.64   0.09      1.78   0.10       -0.08  -0.09      0.06  -0.08
7        1.83   0.26      1.99   0.29       -0.03  -0.03      0.13   0.00
8        1.19   0.31      1.29   0.34       -0.14  -0.03     -0.04   0.00
9        0.85   1.23      0.92   1.36       -0.34  -0.34     -0.27  -0.21
10       1.78  -0.07      1.93  -0.08       -0.22  -0.07     -0.07  -0.08
11       0.81   0.00      0.88   0.00       -0.06   0.00      0.01   0.00
12       1.85   0.98      2.00   1.09       -0.15   0.00      0.00   0.11
13       0.92   0.95      0.99   1.05       -0.08   0.06     -0.01   0.16
14       1.20   0.03      1.30   0.03       -0.02  -0.11      0.08  -0.11
15       1.16   0.42      1.26   0.46       -0.11  -0.05     -0.01  -0.01
16       1.02   1.08      1.10   1.19       -0.33  -0.08     -0.25   0.04
17       1.00   0.44      1.09   0.48       -0.06  -0.01      0.03   0.03
18       1.96  -0.08      2.12  -0.09        0.04  -0.08      0.20  -0.09
19       0.91   0.20      0.98   0.22       -0.05  -0.02      0.02   0.00
20       1.16   0.10      1.26   0.11       -0.04  -0.02      0.06  -0.01
Sqrt SS  5.81   2.45      6.30   2.71
Sum                                         -2.43  -1.22     -0.36  -0.42

For the example data, the estimated value of the dilation constant is k = 1.074, and the diagonal dilation matrix is estimated as

K = \begin{bmatrix} 1.163 & 0 \\ 0 & 1.187 \end{bmatrix}.

Note that the value of the dilation constant is smaller than either of the two diagonal components of the dilation matrix. That is, for this example data, the matrix transformed with the dilation matrix was more dilated in both dimensions than that transformed with the dilation constant. Finally, Table 2.2 shows that the dilated discrimination matrix with the diagonal dilation matrix is more similar to the base matrix than that with the dilation constant, in terms of both the lengths of the vectors (the row headed Sqrt SS) and the overall differences for the 20 items (the row headed Sum).

Because the dilation matrix always has more parameters than the dilation constant, the M method is expected to be better than the LL method at transforming one scale to the other. A way to compare the overall goodness of the two linking methods is to calculate the sum of squared differences (SS) between the base matrix (say, matrix B) and the transformed matrix (say, matrix A). The SS is calculated by tr[(A - B)'(A - B)], and the ratio of SSs can indicate the proportion of linking errors. For the present example data, the error SS with the constant is 4.14, and the SS with the diagonal matrix is .40. The ratio of the two SSs is then .10 (= .40/4.14). This means that modeling the diagonal dilation matrix reduces the linking errors, as measured by SS, by 90% compared with modeling the constant.[4]

[4] The statistically elegant way to compare two different statistical models is to take the ratio of likelihoods. However, it does not apply to the present situation. The reasons are that (a) the distribution of the MIRT discrimination parameter is not yet known, and (b) the estimation procedure of the LL method is different from that of the new method, such that the former is not nested within the latter.

2.4 Other MIRT Linking Methods

Two MIRT linking methods have been reviewed, and a new method with a diagonal dilation matrix and orthogonal Procrustes solutions has been suggested. In addition to these, there are two other MIRT linking methods, proposed by Hirsch (1988, 1989) and Thompson et al. (1997).

2.4.1 Hirsch's Method

Hirsch (1988, 1989) developed an MIRT linking method based on a common-examinee design, and it is composed of three transformation steps. First, common orthogonal basis vectors are found. Second, the orthogonal Procrustes rotation matrix is sought to align the reference systems between the two examinee groups. Third, after conducting the two rotational transformations, scaling indeterminacy is handled by the linear methods of UIRT linking (e.g., the mean and sigma method, or Stocking and Lord's method [1983]).
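The orthogonal Procrustes rotation used in the second step, and by the LL and M methods above, has a closed-form solution (Schonemann, 1966). The sketch below shows the standard construction from a singular value decomposition, applied to hypothetical discrimination matrices:

    import numpy as np

    def procrustes_rotation(A_E, A_B):
        # Orthogonal T minimizing ||A_E T - A_B||_F:
        # T = U V' from the SVD A_E' A_B = U S V' (Schonemann, 1966).
        U, _, Vt = np.linalg.svd(A_E.T @ A_B)
        return U @ Vt

    rng = np.random.default_rng(1)
    A_B = rng.uniform(0, 2, size=(20, 2))          # hypothetical base estimates
    phi = 0.9                                      # true rotation, for checking
    R = np.array([[np.cos(phi), -np.sin(phi)],
                  [np.sin(phi),  np.cos(phi)]])
    A_E = A_B @ R + rng.normal(0, 0.02, (20, 2))   # rotated plus a little noise

    T = procrustes_rotation(A_E, A_B)
    print(np.round(T.T @ T, 6))          # identity matrix: T is orthogonal
    print(np.abs(A_E @ T - A_B).max())   # small: the rotation is recovered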
Hirsch's method can be valued as the first attempt to deal with multidimensionality in IRT linking. However, it takes multiple and complicated stages to find a final rotation matrix, mainly because of the different basis vectors between the base group discrimination matrix and the equated group discrimination matrix. Recently popular MIRT calibration programs (e.g., NOHARM [Fraser, undated] and TESTFACT [Wilson, Wood, & Gibbons, 1991]) provide orthogonal ability dimensions as a default option, so there is no need to find common basis vectors for multiple test forms or examinee groups. The remaining two steps of this method are very similar to the later developed MIRT linking methods, especially Li and Lissitz' method.

2.4.2 Thompson, Nering and Davey's Method

Thompson and his colleagues (1997) developed linking procedures for multiple test forms when there are neither common items nor common examinees. They argued that linking information for different test forms could be obtained from the assumption of randomly equivalent examinee groups and the specification of common item content cluster(s). Even though this method could have important implications in practice, because it relaxes the equating conditions by not requiring common items/examinees, its assumptions are still in question. The assumption of equivalent examinee groups is hard to verify without large sample sizes and random sampling. Moreover, identifying similar item clusters is also critical and may require large item sets. As Li and Lissitz (2000) mentioned, this method is still at an experimental stage and needs more technical justification for its procedures.

2.5 Evaluation Criteria

Different linking methods may be expected to produce different results because the statistical or optimization criteria differ from one to another (Bolt, 1999; Harris & Crouse, 1993), but each method should be good at what it is supposed to minimize/maximize. Researchers who have pursued multidimensional equating methods usually provide various evaluation criteria that support their own linking methods. However, when different linking methods are adopted, different sources of linking errors play roles that make the final equating/linking results different. Therefore, the evaluation of linking methods depends on what criteria are employed for comparison and evaluation. In practice, the selection of criteria could be determined according to the purpose and practical conditions of the linking. For example, if test scores are reported on true score scales, it would be better to use a linking method which could minimize the differences between test response surfaces.

The popular evaluation criterion for MIRT linking of simulation data is to compare the transformation parameters (i.e., the rotation matrix and scaling vector/constants) with their estimates. Even though the comparison of estimates of linking components with their parameters is reasonable for evaluating different linking techniques, it is limited to the situation in which all evaluated techniques have the same structures for transforming one set of estimates into another.
For instance, the ODL method consists of two linking components (the rotation matrix and the translation vector), while the LL method is composed of three stages (the rotation matrix, the translation vector, and the central dilation constant). Further, the rotation matrix of the ODL method functions equivalently to both the rotation matrix and the dilation constant of the LL method. Therefore, these two linking methods cannot be evaluated using the comparison of the parameters and estimates of the linking components. In order to compare different linking frameworks, criteria which work for all methods need to be developed.

A way to develop evaluation criteria is to go back to the goal of equating. For example, in the anchor item design, information obtained from common items is used to estimate a transformation which makes the independent calibrations of the common items as similar as possible. In this study, the anchor item design is considered, and the three linking methods described in this chapter are evaluated in terms of how similar the transformed item estimates are to the estimates on the base test or to the item parameters. More details on the evaluation criteria will be provided in the method chapter.

CHAPTER 3
METHODS

In this chapter, the statistical procedures for the simulation and real data analyses are described. Evaluation criteria for the comparisons of the three linking methods are also provided.

3.1 Simulation Study

Simulation data are sometimes recommended rather than real data to evaluate equating methods in order to separate the effects of model misfit and equating errors (Bolt, 1999; Davey, Nering, & Thompson, 1997). Because the true parameters are known in a simulation study, it is easier to compare true parameters with transformed estimates. The purpose of the present simulation study is to evaluate the three linking methods described in the previous chapter by quantifying linking errors, the discrepancies between item parameters and transformed estimates.

3.1.1 Equating Design and Specification of MIRT Model

Two test forms which share a set of common items were considered, the so-called anchor item design. Suppose there are a base test form and another form, the equated test, and each form includes common items and unique items. The equated test scores need to be converted onto the metric of the base test scores. The common item set consisted of twenty items for both tests, and these items were used to find a comparable test scale. Because parameters are known in simulated data, a set of known item parameters was treated as the estimates for the base test, and item parameter estimates from various simulation conditions were used as the equated test estimates. The number of common items was set to twenty for all simulation conditions. Although there is no absolute agreement about the length of the common/anchor test, the most frequently cited rule of thumb is no fewer than 20 items or 20% of the total test items (Angoff, 1968, 1971; also see Budescu, 1985).

In order to obtain item parameter estimates, a compensatory two-dimensional two-parameter logistic model was used, as in Equation (1) with c = 0. The two-dimensional case is the simplest situation of multidimensionality, but the same linking procedures for each method can easily be expanded to higher dimensions. Note that lower asymptote parameters were not considered for the present simulation, mainly because the lower asymptote parameters are on the probability metric, so they do not directly relate to the linking processes.
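A minimal sketch of the response-generation step under this model follows (the full procedure is described in Section 3.1.2; the item and examinee values here are illustrative, not the study's actual parameters):

    import numpy as np

    rng = np.random.default_rng(42)

    def simulate_responses(a, d, theta):
        # Dichotomous responses under the compensatory 2-D 2PL
        # (Equation [1] with c = 0): score 1 when P exceeds a uniform draw.
        z = theta @ a.T + d                  # examinees x items exponents
        p = 1.0 / (1.0 + np.exp(-z))
        return (p > rng.uniform(size=p.shape)).astype(int)

    a = np.array([[1.2, 0.2],                # hypothetical discrimination vectors
                  [0.4, 1.0],
                  [0.8, 0.8]])
    d = np.array([0.0, -0.8, 0.6])           # difficulty-related parameters
    theta = rng.multivariate_normal([0, 0], np.eye(2), size=5)
    print(simulate_responses(a, d, theta))   # 5 examinees x 3 items of 0/1 scores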
3.1.2 Generation of Item Parameters and Response Patterns

Item parameters were specified by the selection of item dimensional structures. Two types of dimensional structures were examined: approximate simple structure (APSS) and mixed structure (MS) (Roussos, Stout, & Marden, 1998; see also Kim, 1994; Kim, 2001). APSS means that each item has relatively higher loadings on one dimension than on the other dimensions. In other words, a set of items (e.g., an item cluster) has high discrimination on the same dimension and low discrimination on the other dimension. However, in reality, test items may measure composites of dimensions as well as relatively pure dimensions. MS refers to a test that measures both relatively pure trait dimensions and their composites.

For the present simulation with twenty common items and two dimensions, APSS was represented by two sets of items, ten items for each dimension. One set of items loaded mainly on the first dimension and the other set on the second dimension. In MS, there were four sets of items, five items each. Two of these sets loaded highly on either of the two dimensions, and the remaining two sets were sensitive to composites of the two dimensions. These two dimensional structures for the twenty common items are illustrated in Figure 3.1.

[Figure 3.1. Two Dimensional Structures in Simulation Data. Panel (a), Approximate Simple Structure (APSS): items 1 to 10 (0°-15° from dimension 1) and items 11 to 20 (75°-90°). Panel (b), Mixed Structure (MS): items 1 to 5 (0°-15°), items 6 to 10 (25°-40°), items 11 to 15 (50°-65°), and items 16 to 20 (75°-90°).]

Note: All angles are defined from the first dimension. For APSS, two clusters of 10 items loaded highly on either of the two dimensions. For MS, two sets of five items clearly measured each of the two dimensions, and the remaining two sets measured composites of the dimensions. Of the two sets of composite-measuring items, one was slightly more sensitive to the first dimension and the other set to the second dimension.

To construct the dimensional structures, angles (α, see Equation [4]) between the item vectors and the first dimension were randomly drawn from a uniform distribution within the ranges of the item clusters defined in Figure 3.1. In order to define the item parameters, the fixed values of MDISC and MDIFF in Table 3.1, generated by Roussos et al. (1998), were used. These two sets of MIRT item characteristics were selected because they are realistic, cover item features usually found on a test, and do not relate dimensionality to item difficulty levels. The average value of MDISC is 1.2, and the average value of MDIFF is zero.

Table 3.1. Five MIRT Discrimination and Difficulty Levels

Level   MDISC   MDIFF
1       0.4     -1.5
2       0.8      1.0
3       1.2      0.0
4       1.6     -1.0
5       2.0      1.5
Mean    1.2      0.0

This pattern of MDISCs and MDIFFs was repeated four times for the twenty common items. Then, the discrimination and difficulty-related parameters were determined by Equations (2), (3), and (4) with the given angles, MDISCs, and MDIFFs.
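This construction simply inverts Equations (2) to (4): a1 = MDISC·cos(α), a2 = MDISC·sin(α), and, from Equation (3), d = -MDIFF·MDISC. A short sketch follows (the random draws are illustrative and will not reproduce the study's exact parameters):

    import numpy as np

    rng = np.random.default_rng(7)

    def make_item(mdisc, mdiff, angle_range_deg):
        # Build (a1, a2, d) from MDISC, MDIFF, and a direction drawn
        # uniformly within an item cluster's angle range.
        alpha = np.radians(rng.uniform(*angle_range_deg))
        a = mdisc * np.array([np.cos(alpha), np.sin(alpha)])
        d = -mdiff * mdisc          # from MDIFF = -d / MDISC
        return a, d

    # One item per MDISC/MDIFF level of Table 3.1, in the 0-15 degree cluster:
    for mdisc, mdiff in [(0.4, -1.5), (0.8, 1.0), (1.2, 0.0),
                         (1.6, -1.0), (2.0, 1.5)]:
        a, d = make_item(mdisc, mdiff, (0, 15))
        print(np.round(a, 2), round(d, 2))

The d values produced this way (0.6, -0.8, 0.0, 1.6, -3.0) match the d column of Table 3.2, which repeats with every block of five items.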
A set of item parameters that was used for the present simulation conditions is given in Table 3.2.

Table 3.2. Item Parameters for Twenty Common Items

            APSS            MS
Item    a1      a2      a1      a2      d
1       0.40    0.03    0.40    0.03    0.60
2       0.80    0.07    0.78    0.17   -0.80
3       1.19    0.16    1.20    0.07    0.00
4       1.56    0.34    1.60    0.10    1.60
5       2.00    0.04    1.98    0.29   -3.00
6       0.40    0.05    0.34    0.21    0.60
7       0.78    0.17    0.71    0.36   -0.80
8       1.20    0.06    1.01    0.64    0.00
9       1.60    0.11    1.25    1.00    1.60
10      2.00    0.09    1.68    1.08   -3.00
11      0.04    0.40    0.25    0.31    0.60
12      0.15    0.79    0.47    0.65   -0.80
13      0.09    1.20    0.64    1.01    0.00
14      0.16    1.59    0.75    1.41    1.60
15      0.47    1.94    1.03    1.71   -3.00
16      0.08    0.39    0.03    0.40    0.60
17      0.04    0.80    0.10    0.79   -0.80
18      0.30    1.16    0.14    1.19    0.00
19      0.37    1.56    0.34    1.56    1.60
20      0.23    1.99    0.21    1.99   -3.00
Mean    0.69    0.65    0.75    0.75   -0.32
SD      0.66    0.68    0.57    0.60    1.59

The average angle from the first dimension for the first ten items with APSS was 6.00°, and that of the remaining ten items was 81.04°. The average angles of the four sets of five items with MS were 6.19°, 32.39°, 56.87°, and 82.83°, respectively. The directional vectors of the twenty common items are illustrated in Figures 3.2 (APSS) and 3.3 (MS). The length of an item vector indicates the degree of discrimination (MDISC), and the distance between the origin and the starting point of the vector (the arrow point of the vector in the third quadrant) is the item difficulty (MDIFF). All vectors are extended through the origin, and they are located in the first and third quadrants because of the positive discrimination parameters (a's) (Ackerman, 1996; Reckase & McKinley, 1991).

[Figure 3.2. Item Vectors, Approximate Simple Structure]

[Figure 3.3. Item Vectors, Mixed Structure]

The probability of getting an item right was computed by the compensatory two-dimensional two-parameter IRT model, Equation (1) with c = 0. Given the MIRT item parameters, the response probability P_ij was computed for each examinee. Then P_ij was compared to a uniform random value P*, where 0 ≤ P* ≤ 1. A binary item score of x_ij = 1 (correct response) was assigned when P_ij > P*; otherwise, a score of x_ij = 0 (incorrect response) was assigned.

3.1.3 Simulation Factors

(1) Ability Distributions

Five bivariate normal distributions with different means and variances/covariances were considered for the examinees' true abilities. Different ability distributions mean that the test forms are administered to somewhat different populations. For example, considering vertical equating, examinee groups may have different ability levels, represented by different means of the distributions. It might also be possible that there are different relationships between the dimensions when examinee groups have different ability levels. Table 3.3 shows the mean vectors (μ), variance/covariance matrices (Σ), and correlation coefficients (ρ) of the two traits for the five examinee groups.

Table 3.3. Ability Distributions for Five Examinee Groups

           Group 1      Group 2      Group 3      Group 4      Group 5
μ'         (0, 0)       (0, 0)       (.5, .5)     (0, 0)       (0, 0)
Σ          1   0        1   .5       1   .5       .8  .4       1.2  .5
           0   1        .5  1        .5  1        .4  .8       .5   .8
ρ          .00          .50          .50          .50          .51

The distribution of Group 1 is the default ability distribution (standardized independent bivariate normal distribution) that is assumed in MIRT calibration programs (e.g., NOHARM [Fraser, undated] and TESTFACT [Wilson, Wood, & Gibbons, 1991]). Therefore, the least estimation errors, as well as the least linking errors, are expected for Group 1. From Group 2 on, the true abilities are assumed to have a moderate correlation (ρ = 0.5). Because MIRT calibration programs generate independent ability dimensions by default, the dimensional orientations of Groups 2 to 5 become arbitrary.
However, the item discrimination estimates reflect the dimensional dependency when correlated ability dimensions are transformed to be independent. In Group 3, a mean vector different from the default zero means on both dimensions was used. For the last two groups, differences in the variances of the abilities were implemented, but the correlations of the two dimensions were still maintained at about 0.5. Note that Group 5 shows expansion for the first dimension (1.2) and shrinkage for the second dimension (.8), while the rates of shrinkage for Group 4 are the same for the two dimensions (.8). One may consider more variation in ability distributions than those discussed so far. However, the five distributional conditions in Table 3.3 cover the essential types of distributional changes (e.g., means, variances/covariances) and still retain enough simplicity to make the comparison of the three linking methods clear.

(2) Number of Examinees

Usually 2000 or more examinees are suggested for MIRT calibration (Ackerman, 1994; Reckase, 1995). In order to evaluate the stability of the linking results under less desirable conditions, a relatively small sample size (1000) was considered along with the recommended size of 2000.

(3) Dimensional Structures

As was described in Section 3.1.2, two sets of twenty common items were specified according to the two dimensional structures (i.e., APSS and MS).

(4) Linking Methods

The three linking methods described in Chapter 2 were compared based on how closely the item estimates for the twenty common items were transformed into the item parameters, i.e., the degree of parameter recovery through the metric transformations. For each method, there are several sub-procedures which result in slightly different transformations. One relatively better, or best, sub-procedure for each of the ODL and LL methods was selected for the comparison: the test characteristic function (TCF) procedure for the ODL method, and the composite procedure of orthogonal Procrustes solutions for the LL method. The new method with a dilation matrix followed the same criteria as the LL method (see also Appendix A).

(5) Procedures and Computer Programs

Given the ability distributions (5), sample sizes (2), and dimensional structures (2), there were twenty combinations of simulation conditions. Fifty test response patterns were generated for each combination, as was described in Section 3.1.2. Even though there is no clear guideline for the number of replications needed for reliable results, at least 25 replications have been recommended in IRT-based research (Harwell, Stone, Hsu, & Kirisci, 1996).

There were three steps to conducting the simulation study: generation of test response patterns, MIRT calibration, and linking. First, dichotomous response data were generated based on the MIRT model (Equation [1] with c = 0), the item parameters (Table 3.2), and the characteristics of the ability distributions (Table 3.3). This step was completed by using GENDATS, developed by Thompson (undated). Second, item parameters were estimated from the item response patterns generated in the first step. For MIRT calibration, a modified version of NOHARM (Normal Ogive Harmonic Analysis Robust Method; Fraser, undated; Thompson, 1996) was used
While in practice the item parameter estimates of one test are equated to the estimates of another, the item parameters themselves were used as the base test parameters for the present simulation. IPLINK (Lee & Oshima, 1996) was used for the ODL method with the following options: scaling constant of 1.702, #2 parameterization, no weighting, and the TCF method. MDEQUATE (Li, 1996) was run to implement the LL method. For the expansion of the LL method with a diagonal dilation matrix, a new linking program was written using MATLAB (The MathWorks, 1995) (see Appendices A and B).

3.1.4 Evaluation Criteria and Data Analysis

Although the three linking methods easily apply to other equating designs, they were originally developed for the anchor item design. In the IRT framework, one of the evaluation criteria for linking methods with anchor item equating is based on the size of the differences between the base estimates and the transformed estimates. Adopting the statistical concepts of accuracy and stability for metric transformation, two summary statistics were used as evaluation criteria: how far the transformed estimates depart from the initial item parameters on average (linking bias), and how much the differences fluctuate across the common items (root mean square error, RMSE). Bias and RMSE were computed by

\mathrm{Bias}(a_1) = \frac{1}{I}\sum_{i=1}^{I}\left(\hat{a}^{*}_{1i} - a_{1i}\right),   (21)

\mathrm{RMSE}(a_1) = \left[\frac{1}{I}\sum_{i=1}^{I}\left(\hat{a}^{*}_{1i} - a_{1i}\right)^{2}\right]^{1/2},   (22)

respectively, where a_1i is the discrimination parameter on the first dimension of item i, â*_1i is the transformed discrimination estimate on the first dimension of item i, and I is the number of common items, twenty for the present simulation. These two summary statistics represent the quality of linking for the first-dimension discrimination. Overall patterns were then examined with the mean bias and mean RMSE over the 50 replications. The same formulas were applied to the other item characteristics: the item discrimination on the second dimension (a2) and the difficulty-related parameter (d). As each item had three parameters and three transformed values, there were three sets of bias and RMSE for each replication.
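These two statistics are simple to compute once the transformed estimates are in hand; the following is a minimal sketch with assumed variable names, for the first-dimension discrimination.

% Minimal sketch of Equations (21) and (22); a1 holds the I = 20 true
% parameters and a1star the transformed estimates (assumed names).
diffs = a1star - a1;
bias  = mean(diffs);             % Equation (21)
rmse  = sqrt(mean(diffs.^2));    % Equation (22)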
Because the three linking methods were applied to the same test response patterns, a repeated measures analysis of variance (ANOVA) model was used to detect the effects of the simulation conditions (between-factors) and the linking methods (within-factor) on bias and RMSE.^5 The repeated measures ANOVA model is

\mathrm{Bias}(a_1)_{li(ngs)} = \mu + \beta_n + \gamma_g + \lambda_s + \gamma\lambda_{gs} + \pi_{i(ngs)} + \alpha_l + \alpha\beta_{ln} + \alpha\gamma_{lg} + \alpha\lambda_{ls} + \alpha\gamma\lambda_{lgs} + e_{li(ngs)},   (23)

where
Bias(a1)_li(ngs): bias of the first-dimension discrimination for the lth linking method, ith iteration, nth sample size, gth group, and sth structure;
μ: overall mean in the population;
β_n: effect of the nth sample size (1,000 and 2,000);
γ_g: effect of the gth distributional group (Groups 1 to 5);
λ_s: effect of the sth dimensional structure (APSS and MS);
γλ_gs: interaction effect of group and structure;
π_i(ngs): effect of the ith iteration within the nth sample size, gth group, and sth structure (iterations 1 to 50);
α_l: effect of the lth linking method (three linking methods);
αβ_ln: interaction effect of linking method and sample size;
αγ_lg: interaction effect of linking method and group;
αλ_ls: interaction effect of linking method and dimensional structure;
αγλ_lgs: interaction effect of linking method, group, and dimensional structure; and
e_li(ngs): error for the lth linking method and ith iteration within the nth sample size, gth group, and sth structure.^6

In Equation (23), there are three between-factors: sample size, distributional shape of the group, and dimensional structure. The interaction term of the between-factors (group by structure) was selected based on initial examinations of the full model results. There is also one within-factor, linking method, and there are several interaction terms for between- by within-factors. Equation (23) is the model for the bias of the first-dimension discrimination. The same model applies to the bias and the log-transformed RMSE for all three item parameters. Inference statistics from this model tested whether the simulation conditions and linking methods had statistically significant effects on linking bias and RMSE, and descriptive statistics of the two summary statistics were then examined to provide more detailed patterns of the linking errors and to compare the three linking methods.

^5 In order to obtain a more desirable distribution (i.e., normality), a natural logarithm was taken of the RMSEs.
^6 Statistical tests of the repeated measures analysis of variance model are based on the symmetry conditions: 1) the variance-covariance matrix of the transformed variables used to test effects has covariances of zero and equal variances (sphericity); 2) the variance-covariance matrix must be equal across all levels of the between-subjects factors (homogeneity).

3.2 Real Data Analysis

Simulation data have the advantage that they can clarify which linking method and testing condition(s) lead to favorable metric transformations, because the true model and parameters are known. However, simulation is not reality itself, and the overall meaning of a simulation depends on how realistic the assumed conditions and the resulting data are. One way to scrutinize a simulation study is to compare its results with real data and see whether they lead to consistent conclusions. For this purpose, scoring outcomes of a statewide mathematics test for 7th-grade students were analyzed. The test consisted of three sections with 115 multiple-choice items (four alternatives each), and each item was classified into one of six content areas. More than 130,000 students took the test, and the valid sample size was 124,481.

In order to evaluate the three linking methods under real testing conditions, two test forms that shared common items (a base test form and an equated test form) were artificially assembled as follows. First, twenty common items were randomly selected from the 115 items based on the original test structure (three sections) and the content areas (six areas) in order to represent the whole test. The number of common items was decided by following Angoff's (1971) rule of thumb. Second, the remaining 95 items were randomly assigned as unique items to one of the two test forms, again based on test structure and content areas, in order to make the two sets of unique items as similar as possible; a sketch of this assignment step appears after Table 3.6. As a result, the base test form was assembled with the twenty common items and 47 unique items, and the equated test form was composed of the same twenty common items and 48 unique items. Table 3.4 shows the composition of the two test forms.
Third, in order to replicate the real data comparisons, five non-overlapping groups of 2,000 examinees each for the base test and the equated test were randomly sampled from the overall pool of examinees. The number of examinees was decided based on the recommendations of the related literature on MIRT models (Ackerman, 1994; Reckase, 1995). Descriptive statistics of classical item difficulties (the proportion of examinees who answered an item correctly) for the common and unique items are provided in Table 3.5.

Table 3.4. Composition of Two Test Forms*

Content Area   Section 1   Section 2   Section 3    Total
1              1, 2, 1     1, 1, 3     1, 0, 2      3, 3, 6
2              0, 1, 0     0, 1, 0     2, 4, 5      2, 6, 5
3              0, 0, 0     0, 0, 0     4, 10, 8     4, 10, 8
4              0, 1, 0     0, 0, 2     1, 1, 3      1, 2, 5
5              2, 5, 4     2, 4, 3     3, 7, 9      7, 16, 16
6              0, 0, 0     0, 0, 0     3, 10, 8     3, 10, 8
Total          3, 9, 5     3, 6, 8     14, 32, 35   20, 47, 48

* The first number in each cell indicates the number of common items. The second and third numbers are the numbers of unique items in the base form and in the equated form, respectively.

Table 3.5. Item Difficulties of Two Test Forms

                      Common items (20)   Unique items on      Unique items on
                                          base test (47)       equated test (48)
                      Mean     SD         Mean     SD          Mean     SD
Base      Group 1     .57      .15        .57      .16         .58      .15
          Group 2     .57      .15        .57      .16         .59      .15
          Group 3     .56      .15        .57      .17         .59      .15
          Group 4     .57      .16        .57      .17         .59      .15
          Group 5     .57      .16        .57      .17         .58      .15
Equated   Group 1     .58      .16        .58      .17         .59      .15
          Group 2     .58      .15        .58      .16         .59      .15
          Group 3     .57      .16        .57      .17         .59      .16
          Group 4     .57      .16        .58      .17         .59      .16
          Group 5     .57      .15        .57      .16         .59      .15

From Tables 3.4 and 3.5 one can see that the twenty common items were similar to the unique items in terms of test content and item difficulty, and that the unique items on the base form were similar to those on the equated form. Fourth, the two-dimensional three-parameter logistic model (Equation [1]) was applied to each data set, and the three linking methods were applied to the five pairs of random examinee groups (e.g., base group 1 and equated group 1, and so forth). Because NOHARM does not provide estimates of the lower asymptotes, these were estimated with BILOG-3 and used as input data for the MIRT calibration. The item parameter estimates of the twenty common items for the first pair of samples, after a varimax rotation, are provided in Table 3.6. Even though the test dimensions were not a main focus of this study, the first dimension could be interpreted as an elementary algebra dimension (e.g., whole numbers and basic computation) and the second as a logical reasoning dimension (e.g., problem solving and statistics).

Table 3.6. Item Parameter Estimates of Common Items

        Base test form (group 1)        Equated test form (group 1)
Item     a1     a2      d      c         a1     a2      d      c
  1     0.88   0.42   -0.58   0.21      1.01   0.47   -0.87   0.24
  2     0.69   0.25   -0.31   0.14      0.66   0.18   -0.27   0.15
  3     1.14   0.66    0.04   0.10      1.33   0.45    0.14   0.10
  4     1.05   0.48    0.23   0.12      1.23   0.51    0.20   0.16
  5     0.34   0.39    0.56   0.15      0.43   0.35    0.70   0.15
  6     0.54   0.42    0.43   0.25      0.52   0.56    0.70   0.15
  7     1.32   0.71   -0.76   0.08      1.37   0.65   -0.77   0.12
  8     0.72   0.37   -0.05   0.18      0.74   0.43   -0.10   0.24
  9     1.01   0.40   -1.11   0.15      0.98   0.41   -1.26   0.20
 10     0.17   0.31   -0.10   0.18      0.27   0.23   -0.17   0.21
 11     0.83   0.62   -0.91   0.16      0.58   1.56   -1.16   0.14
 12     0.63   0.76    0.24   0.16      0.69   0.60    0.36   0.16
 13     0.50   0.55   -0.39   0.23      0.43   0.38   -0.24   0.16
 14     0.31   1.19    1.58   0.16      0.43   0.87    1.41   0.13
 15     0.41   0.38   -0.25   0.26      0.32   0.34   -0.02   0.14
 16     0.61   0.58   -1.73   0.26      0.45   0.88   -1.67   0.22
 17     0.54   0.80    0.23   0.13      0.68   0.74    0.35   0.11
 18     0.60   0.69    0.87   0.12      0.61   0.54    0.89   0.12
 19     0.39   0.63   -0.60   0.25      0.38   0.43   -0.35   0.16
 20     0.57   1.21    0.77   0.11      0.85   0.85    0.77   0.15
Mean    0.66   0.59   -0.09   0.17      0.70   0.57   -0.07   0.16
SD      0.30   0.26    0.76   0.06      0.33   0.31    0.79   0.04
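The random assignment described in the second step can be sketched as below; all variable names are hypothetical, and the study's actual assignment may have balanced the cells differently.

% Hypothetical sketch of the stratified random split of the 115 items.
% cellID (115 x 1) labels each item's section-by-content cell and
% nCommon(k) is the number of common items drawn from cell k.
common = []; baseU = []; equatedU = [];
for k = 1:max(cellID)
    idx = find(cellID == k);
    idx = idx(randperm(numel(idx)));           % shuffle within the cell
    common = [common; idx(1:nCommon(k))];      % common items for this cell
    rest   = idx(nCommon(k)+1:end);
    baseU     = [baseU;     rest(1:2:end)];    % alternate the remainder
    equatedU  = [equatedU;  rest(2:2:end)];    % between the two forms
end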
3.2.1 Evaluation Criteria and Data Analysis

For the real data analyses, the three linking methods were evaluated in two ways: by item parameter estimates and by true test score estimates for the twenty common items. In order to compare the transformed item estimates of the equated test form with the estimates of the base test form, Equations (21) and (22) were used. Note that in the real data analysis the true item parameters were unknown, but there were two sets of item estimates for the common items. So the item estimates (â_1i) on the base form were used in place of the item parameters (a_1i) in both Equations (21) and (22); the resulting statistics are a mean difference and a difference variation rather than a bias and an RMSE.

In addition to the item-level comparisons, differences in the estimated true scores on the common items, computed from test response surfaces, were evaluated for the three linking methods. By summing Equation (1) over the twenty common items for the base group and the equated group, two test response surfaces can be estimated (see also the first part of Equation [9]). Practical information is expected to be obtained by exploring where the most and least discrepancy between the base test response surface and the transformed surface occurs for each linking method.

CHAPTER 4
RESULTS

Based on the study designs described in the previous chapter, the main results of the simulation study and the real data analysis are provided along with initial interpretations.

4.1 Simulation Study

4.1.1 Results of Repeated Measures Analysis of Variance

Linking errors for each replication with the twenty common items were calculated using two statistics: the mean and the standard deviation of the differences between the transformed estimates and the item parameters (bias and RMSE; see Equations [21] and [22]). These summary statistics were considered indicators of the quality of linking for each replication. After finding significant multivariate results for the model given by Equation (23), univariate test results for the six dependent variables are provided in Tables 4.1 and 4.2. In each cell there are three numbers: the F value, the degrees of freedom, and eta square (the proportion of explained variance to overall variance). It should be noted that the first degree of freedom regarding linking method for the difficulty parameter is 1 rather than 2. The reason is that only the ODL and LL (or M) methods were compared for the difficulty parameters, because the LL and M methods resulted in exactly the same transformations of the difficulty parameters.

Table 4.1. Test Statistics (F), Degrees of Freedom (DF), and Effect Sizes (η²) of Biases from Repeated Measures ANOVA^7

Source                                  Bias, a1       Bias, a2       Bias, d
Between factors
  Sample size (β_n)              F  =   72.86**        58.67**        8.08**
                                 DF =   (1, 989)       (1, 989)       (1, 989)
                                 η² =   .07            .06            .01
  Distributional group (γ_g)            175.04**       167.27**       .12
                                        (4, 989)       (4, 989)       (4, 989)
                                        .41            .40            .00
  Dimensional structure (λ_s)           353.82**       239.62**       28.75**
                                        (1, 989)       (1, 989)       (1, 989)
                                        .26            .20            .03
  Group x Structure (γλ_gs)             29.32**        22.27**        .40
                                        (4, 989)       (4, 989)       (4, 989)
                                        .11            .08            .00
Within factor
  Linking method (α_l)                  179.95**       101.36**       393.32**
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .15            .09            .29
  Link x Size (αβ_ln)                   58.05**        53.44**        2.74
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .06            .05            .00
  Link x Group (αγ_lg)                  67.16**        75.76**        1.07
                                        (8, 1978)      (8, 1978)      (4, 1978)
                                        .21            .24            .00
  Link x Structure (αλ_ls)              130.85**       127.24**       8.35**
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .12            .11            .01
  Link x Group x Structure (αγλ_lgs)    16.20**        16.16**        3.20*
                                        (8, 1978)      (8, 1978)      (4, 1978)
                                        .06            .05            .01

** p < .01, * p < .05

^7 When the sphericity assumption is violated, as in the present simulation, adjustments of the test statistics can be made with the Greenhouse-Geisser epsilon. Because the initial F values and the adjusted statistics are not very different, the initial statistics under the sphericity assumption are reported.
Table 4.2. Test Statistics (F), Degrees of Freedom (DF), and Effect Sizes (η²) of RMSEs from Repeated Measures ANOVA

Source                                  LN RMSE, a1    LN RMSE, a2    LN RMSE, d
Between factors
  Sample size (β_n)              F  =   184.23**       201.42**       132.38**
                                 DF =   (1, 989)       (1, 989)       (1, 989)
                                 η² =   .16            .17            .12
  Distributional group (γ_g)            165.15**       188.77**       6.77**
                                        (4, 989)       (4, 989)       (4, 989)
                                        .40            .43            .03
  Dimensional structure (λ_s)           8.47**         0.21           0.45
                                        (1, 989)       (1, 989)       (1, 989)
                                        .01            .00            .00
  Group x Structure (γλ_gs)             2.93*          1.53           6.08**
                                        (4, 989)       (4, 989)       (4, 989)
                                        .01            .01            .02
Within factor
  Linking method (α_l)                  536.22**       614.73**       1486.73**
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .35            .38            .50
  Link x Size (αβ_ln)                   49.89**        87.26**        7.65**
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .05            .08            .01
  Link x Group (αγ_lg)                  87.99**        99.68**        0.79
                                        (8, 1978)      (8, 1978)      (4, 1978)
                                        .26            .29            .00
  Link x Structure (αλ_ls)              27.53**        45.36**        5.76*
                                        (2, 1978)      (2, 1978)      (1, 1978)
                                        .03            .04            .01
  Link x Group x Structure (αγλ_lgs)    4.62**         7.49**         6.31**
                                        (8, 1978)      (8, 1978)      (4, 1978)
                                        .02            .03            .03

** p < .01, * p < .05

The statistical test results indicated that the effects of the linking methods depended on the simulation conditions (i.e., the interactions of between- and within-factors were statistically significant). In addition to these interactions, the three main factors of the simulation conditions had significant effects on linking bias and log-transformed RMSE. An interesting finding is that the discriminations (especially the first-dimension discriminations) were more sensitive than the difficulty estimates to the simulation conditions. This pattern was clearer in the effect sizes denoted by the eta squares. That is, distributional group, dimensional structure, and linking method accounted for large portions of the bias variation of the discriminations, but only the linking method was an important factor for the bias variation of the difficulties. For the log-transformed RMSEs, sample size, distributional group, and linking method were important for the discriminations, and sample size and linking method were important for the difficulties. In sum, the results of the repeated measures ANOVA showed that the type of linking method had significant effects on all six dependent variables, and the soundness of the linking results (i.e., how close the transformed estimates were to the true parameters) depended on various test conditions, the linking methods, and their interactions.

4.1.2 Comparison of the Three Linking Methods

In order to directly compare the behaviors of the three MIRT linking methods across simulation conditions, the linking errors (i.e., bias and RMSE) are illustrated in Figures 4.1 to 4.12. The bounded vertical lines represent the upper and lower limits of one standard deviation for the bias and RMSE of the fifty replications on the twenty common items, and the marked middle points of the lines are the mean values of bias and RMSE. Note that the horizontal axis represents the combinations of the five distributional shapes and the two dimensional structures. For example, AP1 indicates distributional Group 1 with APSS items.
Figure 4.1. Bias (a1, n=1000)

Figure 4.2. Bias (a2, n=1000)

Figure 4.3. Bias (d, n=1000)

Figure 4.4. RMSE (a1, n=1000)

Figure 4.5. RMSE (a2, n=1000)

Figure 4.6. RMSE (d, n=1000)

Figure 4.7. Bias (a1, n=2000)

Figure 4.8. Bias (a2, n=2000)

Figure 4.9. Bias (d, n=2000)

Figure 4.10. RMSE (a1, n=2000)

Figure 4.11. RMSE (a2, n=2000)

Figure 4.12. RMSE (d, n=2000)

In general, one can see that the M method and the ODL method were less biased and more stable than the LL method for the three item parameters. More specific points follow.

1) As the sample size became larger and the ability distribution was closer to the default conditions (i.e., an orthogonal standard bivariate normal distribution), the metric transformations were more accurate (less variation in bias) and more stable (smaller RMSE).

2) For the discrimination estimates, transformations of MS items were less biased than those of APSS items, especially with the LL method; Figures 4.1, 4.2, 4.7, and 4.8.

3) Transformations of the difficulty estimates were less stable than those of the discriminations, especially with the ODL method; Figures 4.6 and 4.12.

4) The LL method showed relatively larger biases of the discriminations compared with the two other methods.

5) Compared with the M method, the ODL method was less biased for discriminations of APSS items, but more biased for discriminations of MS items; Figures 4.1, 4.2, 4.7, and 4.8.

6) While the transformations of the ODL method were relatively stable among the different ability distributions, the two orthogonal Procrustes based methods showed drastic changes in the bias and RMSE of the discrimination parameters between Group 1 and Group 2 (i.e., whether or not the two dimensions were correlated).

7) The M method and the LL method showed very similar RMSEs for the discrimination parameters.

8) The two orthogonal Procrustes based methods over-transformed the difficulty estimates, while the ODL method under-transformed them; Figures 4.3 and 4.9.

9) The ODL method had relatively small RMSEs for the two discriminations, but it showed more RMSE variation over the 50 replications than the other methods (Figures 4.4, 4.5, 4.10, and 4.11). Compared with the M method and the LL method, the ODL method showed less accurate and less stable transformations of the difficulty-related parameters; Figures 4.3, 4.6, 4.9, and 4.12.
10) The LL method and the M method showed exactly the same transformation results for the difficulty parameter because they estimated the same rotation matrix and translation vector; Figures 4.3, 4.6, 4.9, and 4.12.

In sum, the inference statistics from the repeated measures ANOVA model indicated that the linking methods and the three testing conditions significantly affected linking bias and RMSE. From the mean bias and RMSE of the 50 replications in each simulation condition, one can see that the ODL method and the M method provided less biased metric transformations of the discrimination estimates than the LL method, and that the M method with the diagonal matrix made more stable transformations of the difficulty-related parameters than the ODL method.

4.2 Real Data Analysis

As a real data example, the original 115 test items were artificially assigned to two parallel test forms sharing twenty common items, made as similar as possible in terms of test structure, content areas, and item difficulties. The item composition of the two test forms and the common items was shown in Table 3.4. Five pairs of sub-samples for the base and equated examinee groups were randomly sampled from the 124,481 valid examinees. For the purpose of the simplest demonstration of a multidimensional analysis, a two-dimensional three-parameter logistic model was used. The twenty common-item parameter estimates with a varimax rotation for the first pair of samples were provided in Table 3.6.

4.2.1 Item Estimates Comparison

After the item estimates of the equated form were transformed onto the scale of the base test by the three linking methods, the transformed estimates were compared with the item estimates of the base form. This procedure was replicated for the five pairs of sub-samples in order to find consistent patterns. Differences between the transformed estimates and the base form estimates are illustrated in Figures 4.13 and 4.14. The bounded vertical lines indicate the ranges of maximum and minimum values, and the markers on the lines are the mean values of the five replications. The LL method showed the largest mean differences for the two discrimination parameters, and the two orthogonal Procrustes solution based methods resulted in less biased difficulty transformations than the ODL method. For difference variation, the LL method showed the most stable results, and the ODL method was the worst, though not by much. These findings for the real data generally confirmed the results of the simulation study.

Figure 4.13. Mean Differences of Five Sets of Samples

Figure 4.14. Difference Variations of Five Sets of Samples

4.2.2 True Score Comparison

In addition to the item-level comparison, overall true score comparisons for the first pair of sub-samples were conducted to evaluate the linking methods from a different aspect. When the three linking methods were compared on the estimated true score scale (the test response surface for the twenty common items), differences between the two sets of estimated true scores (transformed true scores minus base true scores) were calculated at a limited set of points (49 points, 7 by 7) in the ability space.
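A minimal sketch of this surface comparison follows, assuming the three-parameter logistic form of Equation (1) with scaling constant 1.702 and hypothetical variable names: aB, dB, cB hold the base-form estimates and aT, dT, cT the transformed equated-form estimates for the 20 common items.

% Minimal sketch (assumed form and names): estimated true scores on a
% 7 x 7 grid of (theta1, theta2) points and their difference surface.
pts = -3:1:3;
[T1, T2] = meshgrid(pts, pts);
TRSb = zeros(size(T1)); TRSt = zeros(size(T1));
for i = 1:20
    Pb = cB(i) + (1 - cB(i)) ./ (1 + exp(-1.702*(aB(i,1)*T1 + aB(i,2)*T2 + dB(i))));
    Pt = cT(i) + (1 - cT(i)) ./ (1 + exp(-1.702*(aT(i,1)*T1 + aT(i,2)*T2 + dT(i))));
    TRSb = TRSb + Pb;            % base test response surface
    TRSt = TRSt + Pt;            % transformed surface
end
diffTRS = TRSt - TRSb;           % transformed minus base, as in Figure 4.15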
The difference scores, along with the true score estimates of the base test form, are presented in Figure 4.15.

(a) ODL method

θ2 \ θ1    -3      -2      -1       0       1       2       3
  3       0.44    0.32    0.01    0.01   -0.01   -0.02   -0.01
  2       0.04    0.27    0.01   -0.09   -0.03    0.00    0.00
  1      -0.30   -0.08   -0.05   -0.25   -0.17   -0.04   -0.01
  0      -0.14   -0.09   -0.12   -0.38   -0.45   -0.23   -0.08
 -1       0.04    0.13    0.17   -0.02   -0.35   -0.39   -0.25
 -2       0.05    0.13    0.23    0.24    0.07    0.03    0.15
 -3       0.02    0.06    0.11    0.13   -0.01   -0.03    0.47

(b) LL method (individual values not reproduced)

(c) M method

θ2 \ θ1    -3      -2      -1       0       1       2       3
  3       0.25    0.17   -0.09   -0.15   -0.15   -0.12   -0.08
  2      -0.05    0.19    0.09   -0.07   -0.15   -0.14   -0.11
  1      -0.20   -0.02    0.09    0.01   -0.15   -0.21   -0.20
  0       0.02    0.15    0.26    0.23   -0.09   -0.33   -0.38
 -1       0.18    0.35    0.58    0.81    0.65    0.09   -0.34
 -2       0.16    0.32    0.58    0.98     *       *      0.61
 -3       0.10    0.20    0.37    0.70     *       *       *

(d) True score estimates on the base form

θ2 \ θ1    -3      -2      -1       0       1       2       3
  3      10.68   12.69   15.47   17.84   19.10   19.62   19.82
  2       9.01   10.80   13.41   16.36   18.32   19.25   19.64
  1       7.31    8.84   11.12   14.26   17.02   18.56   19.28
  0       5.89    6.93    8.72   11.62   15.01   17.37   18.60
 -1       4.94    5.49    6.58    8.71   12.12   15.39   17.39
 -2       4.50    4.75    5.28    6.49    8.99   12.45   15.25
 -3       4.35    4.46    4.72    5.35    6.87    9.67   12.67

Figure 4.15. Differences of Transformed Test Scores and Estimated True Scores on the Base Test Form*

* Cells marked with an asterisk (dark cells in the original figure) indicate absolute differences larger than 1.0.

The results show that the three methods had different patterns of linking errors in terms of the estimated true test scores for the 20 common items. The LL method showed the largest mismatch between the two test response surfaces, and the differences were located above and below the diagonal. This means that the LL method had relatively large discrepancies when an examinee had high or low scores. The M method improved the true score transformations over the LL method by modeling unique unit changes, but there were relatively large gaps when an examinee's ability was high on the first dimension and low on the second dimension. The ODL method showed the most favorable result among the three linking methods, and the discrepancies for all ability values were less than 0.5. The better performance of the ODL method was expected because its minimization criterion is based on the test response surface (Equation [9]).

CHAPTER 5
SUMMARY, DISCUSSION, AND CONCLUSION

In this chapter, the overall results are summarized, related issues are discussed, and conclusions and suggestions for further study are provided.

5.1 Simulation Study

Using simulated data, three MIRT linking methods based on the compensatory two-dimensional two-parameter logistic model were evaluated for the anchor item equating design. In order to emulate real test conditions, several simulation factors were incorporated: sample sizes, dimensional structures, and ability distributions. The amounts of linking error were quantified by bias and RMSE, based on the basic statistical concepts of accuracy and stability. For the statistical tests of the effects of the simulation factors and linking methods, a repeated measures design was applied, because each simulated test response pattern was transformed into three sets of metrics according to the three linking methods. Comparisons of the three linking methods were then conducted using the mean bias and mean RMSE of the 50 replications.

The results of the repeated measures ANOVA showed that the choice of linking method had statistically significant impacts on the linking errors, both bias and log-transformed RMSE, for all three item parameters. Further, the linking methods had significant interaction effects with the simulation conditions.
That is, the soundness of the metric transformations depends on the type of linking method as well as on the test administration conditions. When the degree of parameter recovery was quantified in terms of bias and RMSE, the new linking method, based on orthogonal Procrustes solutions with a diagonal dilation matrix, reduced the linking biases, especially for the discrimination parameters, compared with the LL method. The ODL method and the new method were relatively good at obtaining less biased metric transformations of the discriminations, and the new method outperformed the ODL method in terms of stable transformations of the difficulty-related parameters.

5.2 Real Data Analysis

For the real data analysis, a statewide mathematics test was used. Because there was originally one test form, two test forms with twenty common items were artificially assembled based on the initial test structure, content areas, and item difficulties. In order to examine the patterns of the linking results, five pairs of sub-samples of 2,000 examinees each (base and equated groups) were randomly sampled from the more than 120,000 original examinees, giving five replications of the metric transformations. The evaluation of the behaviors of the three linking methods was done in the same way as in the simulation study, comparing linking bias (mean differences) and RMSE (difference variation). The transformed test response surfaces were also evaluated based on their similarity to the base test response surface.

The new method and the ODL method outperformed the LL method in terms of how close the transformed discrimination estimates were to the estimates on the base test. However, the two orthogonal Procrustes based methods resulted in more favorable linking for the difficulty-related parameters than the ODL method. The linking results with the real test data were generally consistent with the simulation study results.

The comparison of the test response surfaces (i.e., estimated true scores at 49 ability points) revealed different error regions for the three linking methods. While the ODL method resulted in the closest agreement, a large region of linking error was found for the LL method when examinees had low or high scores. The new method with a dilation matrix generated more acceptable agreement than the LL method, but was a little less favorable than the ODL method.

5.3 Discussion

5.3.1 Rotation and Optimization Criteria

Statistically, the main differences between the two types of linking methods evaluated in this study lie in their linking components and optimization criteria. The ODL method consists of two linking components, the rotation matrix and the translation vector, while the two orthogonal Procrustes solution based methods include a dilation factor (a constant or a diagonal matrix) in addition to those two components. In some sense, the rotation matrix of the ODL method can be considered a composite of a rotation matrix and a dilation factor, because it alters both the variances and the covariances of the initial discrimination matrix.

A more noticeable difference between the ODL method and the two Procrustes based methods lies in the type of rotation. The rotation matrix of the ODL method follows general rotation procedures, i.e., oblique rotation, because no constraint is put on the rotation matrix, while the two Procrustes methods constrain the rotation matrix to an orthogonal structure.
One concern when using an oblique rotation in factor analysis techniques (Harman, 1976) is that the meaning of the reference axes can change after rotation, because the angles among the axes (correlations/covariances) are changed in finding the optimal rotation, while an orthogonal rotation maintains the initial structure of the reference system. In the MIRT model context, the orthogonal rotation of the two Procrustes solution based methods keeps the relative distances among the item vectors before and after the metric transformation, while the structure of the item vectors can change somewhat under the oblique rotation of the ODL method. However, it is not clear whether the item vector structure of the equated test needs to be maintained through a MIRT metric transformation, or to what degree the oblique rotation of the ODL method changes the vector structures. Further study is needed on this issue.

Another distinguishable difference between the two types of methods is the optimization criterion for estimating the linking components. The TRS ODL method is an expansion of the UIRT equating framework that minimizes the differences between two test response surfaces (Stocking & Lord, 1983), so the ODL method uses one equation (Equation [9]) to obtain both the rotation and translation components simultaneously. The orthogonal Procrustes based methods, however, adopt traditional factor analysis techniques and estimate the rotation and translation components separately (Equations [14] and [16], and Appendix A).
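To make the separate estimation concrete: for the rotation alone, Schönemann's (1966) solution can be computed from a single SVD, as in the following minimal sketch (an equivalent, more compact form than the two eigendecompositions used in the Appendix B listing). Here A and B are the n x m discrimination matrices of the equated and base forms.

% Minimal sketch: orthogonal Procrustes rotation minimizing ||B - A*T||
% over orthogonal T (Schonemann, 1966), via one SVD of A'B.
[W, Sig, V] = svd(A' * B);   % A'B = W * Sig * V'
T = W * V';                  % optimal orthogonal rotation
% The translation vector and the dilation component are then estimated
% in separate steps (Equations [14] and [16], and Appendix A).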
Therefore, comparisons of transformed ability estimates on each dimension would provide more meaningful information on MIRT linking methods than comparing overall true scores. For this purpose, TESTFACT should be used for MIRT calibrations because NOHARM does not provide ability estimates (Miller, 1991). 5.3.3 Relative Efficiency of Linking Methods From a statistical point of view, as a model includes more parameters, estimation results get better in terms of model fit, but the cost of a more complex model is the loss of degrees of freedom and model parsimony. 82 As was mentioned before, three linking methods include different number of linking components. The ODL method includes two linking components while the orthogonal Procrustes solution based methods adopt three. Further, the new linking method estimates more dilation components than the LL method by allowing different dilation rates for different ability dimensions. Then the question is whether including more linking parameters reduces linking errors significantly. A traditional way to evaluate efficiency of different statistical models is to compare likelihoods or amounts of error. The LL method and the new method were compared by calculating amounts of linking errors (in chapter 3) but a statistical test was not conducted. In order to test different models, several conditions should be satisfied such as distributional assumptions (e.g., normality) or consistent error terms (e.g., nested relationship among competing models), which were not examined in the study. Research on statistical comparison of linking methods needs to be pursued in order to provide statistically persuasive evidences about different behavior of various linking methods. 83 5.3.4 Test Response Surface and Ability Levels Because reporting one overall test score rather than multiple scores is the most popular way of giving test results, comparisons of transformed test response surfaces were conducted as part of real data analysis. As a result, the three linking methods contained different amounts of linking errors on various regions of the ability space, and the ODL method and the new method produced transformed test response surfaces closer to the base one than the LL method. This result confirms previous evaluation using linking bias and RMSE for both simulated and real data. However for practical applications, we need to notice different disagreement patterns, although the ODL method showed the best results in the present demonstration. For example, if any critical decision is made at low or high test scores (above or under the diagonal in Figure 4.15) for an examinee who takes the equated form, it would be better to adopt the ODL method or the new method than the LL method. On the other hand, at moderate test scores (around the diagonal) all linking methods do not much differ. So, it can be said that the selection/decision of which linking method is to be adopted depends on the purpose of the equating, such as low equating error for all ranges of test scores or for a certain decision point score (e.g., cutoff 84 score). That is, the selection of the linking method is a situational specific decision in real application. In general, an equating procedure requires individual judgments that are made by the individuals who are doing equating. The judgment should be informed on practical testing issues and statistical characteristics of equating techniques. 
5.4 Conclusion

The results of this study indicate that modeling a unique dilation rate for each ability dimension improves the orthogonal Procrustes metric transformation initially modeled in the LL method. The ODL method and the new method provide more favorable linking of the discriminations than the LL method, but the orthogonal Procrustes solution based methods produce better translation vectors than the ODL method. These differences among the three linking methods can be explained by the types of rotation and the numbers of linking components. The oblique rotation of the ODL method may provide closer agreement of the dimensional orientation, and unique dilation components for each dimension improve the metric transformations compared with a single dilation constant.

Two academic camps, the IRT framework and traditional factor analysis, are frequently referred to in dealing with dichotomous variables, and MIRT linking methods are not free from these theoretical origins. Therefore, the optimization criteria and statistical behaviors of the linking components need to be explored further by revisiting the theoretical/statistical origin of each method. It should also be noted that, even though the purpose of this study was to evaluate statistical methods for obtaining comparable test scores from different test forms, the focus was set on the linking of MIRT scales rather than on the whole equating procedure that finally provides conversion tables for different test forms. This means that overall equating results with common and unique items might differ somewhat from the comparison of common items discussed in this study. Moreover, only the anchor item design was discussed, so it is not known whether the presented simulation and real data results would hold under other equating designs (e.g., the common/equivalent group design). In addition, further research is needed on various issues of evaluation criteria, theoretical estimation errors, and other testing-related factors not dealt with in this study, such as the number of common items and non-normal ability distributions.

APPENDIX A

The purpose of this appendix is to derive linking components that allow a unique dilation for each dimension based on orthogonal Procrustes solutions. The essential concepts of the orthogonal Procrustes problem were well explained by Schönemann (1966), and the extension with a translation vector and a dilation constant was presented by Schönemann and Carroll (1970). The procedures of Schönemann and Carroll (1970) are followed here to derive the solution for the case with a diagonal dilation matrix. The main difference from the previous procedures is that the solutions for the rotation and dilation components are obtained from two different problem equations, while the original methods derived the solutions for both components from one problem equation. More details are discussed at the corresponding steps of the derivation.

The orthogonal Procrustes problem is defined as

B = AT + E_1,   (A-1)

T'T = TT' = I,   (A-2)

minimizing tr(E_1' E_1),   (A-3)

where A and B are known n x m matrices, T is the orthogonal m x m rotation matrix, E_1 is the n x m residual matrix (i.e., E_1 = B - AT), and I is the m x m identity matrix.
Considering anchor item equating conditions, A and B would be treated as item discrimination matrices for the equated test and the base test, respectively, n is the number of common items, m is the number of dimensions, and T indicates the rotation matrix. To obtain the solution for T in Equation (A—l) with (A-2) and (A-3), set f=f|+f2I (A-4) where f1 = tr(E’1E1) = tr(B'B — B'AT — T'A'B + T'A’AT) , and f2 = tr(L[T’T - I]) . The anlnatrix L of f; is unknown Lagrange multipliers with 88 respect to T. By taking a partial derivative of Equation (A—4) regarding T and setting it equal to zero, we obtain af/8T=A’AT—A'B+T(L+L’)=0. (A—S) By applying eigenvalue and eigenvector techniques (singular value decomposition) to Equation (A—S) the solution for T can be obtained (refer to Schonemann, 1966). After obtaining the rotation matrix, the concern is about unit lengths between the base matrix B and the rotated matrix AT. By considering a diagonal dilation matrix the second problem equation is B=ATK+E2, (A-6) where K is the diagonal anldilation matrix. Under the same minimization criteria as in Equation (A- 3), minimization of UKEQEZ) requires the partial derivative with regard to the diagonal dilation matrix K. atr(E'2E2)/ 3K = diagonal elements of (KT'A'AT - B'AT) = 0 . (A— 7) 89 Then, the solution for K is diag [KT'A'AT - B'AT] = 0 => K{diag [T’A'AT] } = diag[B'AT] => K = diag[B'AT]x (diag [T'A'ATD'l , (A- 8) where the matrix operator diag means that off-diagonal elements equal zero. Note that the traditional orthogonal Procrustes problem with a dilation constant, on which the LL method is based, needs only one equation such as Equation (A-6) (see Equation [15]) by using a constant, k instead of a matrix, K. Because of mathematical intractability, two equations are used, (A- l) for the rotation matrix and (A—6) for the dilation matrix. As a result, the LL method and the new method with the dilation matrix provide the exactly same solutions for T and m.(a translation vector, see Equation [16]), such that difference of linking results of the two methods come only from the dilation component, a scalar or a matrix. The MATLAB (The MathWorks, 1995) program used in the new linking method is provided in APPENDIX B. 90 APPENDIX B The program for the M method for the two dimensional case is mainly based on MDEQUATE developed by Li (1996). The only difference from MDEQUATE is the procedures of estimating the diagonal dilation matrix rather than the dilation constant. 
MATLAB Program for the M Method

base = x;                        % base form file: [id a1 a2 d]
global A1 A2 d A1o A2o do D;
d  = base(:,4);
A1 = base(:,2);
A2 = base(:,3);
BASE = base(:,2:3);

equated = y;                     % equated form file: [id a1 a2 d]
do  = equated(:,4);
TA1 = equated(:,2);
TA2 = equated(:,3);
EQUATED_1 = equated(:,2:3);

dM = [d do];
cordM = corrcoef(dM);
BASEEQUATED_1 = [BASE EQUATED_1];
COR_obs = corrcoef(BASEEQUATED_1);

disp('Orthogonal Procrustes Rotation, Schonemann, 1966');
S   = EQUATED_1' * BASE;
STS = S' * S;
SST = S * S';
[U1, D1, V1] = svd(STS);         % eigenvectors of S'S
[U2, D2, W1] = svd(SST);         % eigenvectors of SS'
ESTRM_1 = W1 * V1';              % rotation matrix T
ESTRM = ESTRM_1;
EQUATEDrot = EQUATED_1 * ESTRM_1;
disp('Rotation Matrix, T');
ESTRM
A1o = EQUATEDrot(:,1);
A2o = EQUATEDrot(:,2);

disp('Diagonal Dilation Matrix, K');
LEFT  = ESTRM' * (EQUATED_1' * EQUATED_1) * ESTRM;
RIGHT = BASE' * EQUATED_1 * ESTRM;
DEN = inv(diag(diag(LEFT)));
NUM = diag(diag(RIGHT));
K = NUM * DEN                    % Equation (A-8)

disp('Start Value for m using the Least Squares Procedure');
% Requires two function files, func_m1 and func_m2 (by Li, 1996).
SUMdB = sum(d);
SUMdE = sum(do);
SUMa1 = sum(A1o);
SUMa2 = sum(A2o);
Est_m = (SUMdB - SUMdE) / (SUMa1 + SUMa2);
m1 = Est_m;
m2 = Est_m;
D = 1.702;
dm1 = 0.0001;
dm2 = 0.0001;
for Iteration = 1:99
    m1p = m1 + dm1;
    m2p = m2 + dm2;
    % Numerical Jacobian of the two criterion functions
    J(1,1) = (func_m1(m1p,m2) - func_m1(m1,m2)) / dm1;
    J(1,2) = (func_m1(m1,m2p) - func_m1(m1,m2)) / dm2;
    J(2,1) = (func_m2(m1p,m2) - func_m2(m1,m2)) / dm1;
    J(2,2) = (func_m2(m1,m2p) - func_m2(m1,m2)) / dm2;
    f(1) = func_m1(m1,m2);
    f(2) = func_m2(m1,m2);
    ds = -J \ f';                % Newton step
    m1 = m1 + ds(1);
    m2 = m2 + ds(2);
    fprintf('Iteration=%2.0f, m1=%7.4f, m2=%7.4f ', Iteration, m1, m2)
    fprintf('f(1)=%8.4f, f(2)=%8.4f\n', f(1), f(2))
    if (abs(f(1)) < 0.00001 & abs(f(2)) < 0.00001), break; end
end
disp('Translation vector, m1 and m2');
m1
m2
disp('Transformed discriminations and difficulties');
Est_d = do + m1*A1o + m2*A2o;
TEMP = [A1o A2o];
M_FINAL_A = TEMP * K;
final = [M_FINAL_A Est_d]

REFERENCES

Ackerman, T. A. (1991). The use of unidimensional parameter estimates of multidimensional items in adaptive testing. Applied Psychological Measurement, 15, 12-24.

Ackerman, T. A. (1994). Using multidimensional item response theory to understand what items and tests are measuring. Applied Measurement in Education, 18, 255-278.

Ackerman, T. A. (1996). Graphical representation of multidimensional item response theory analyses. Applied Psychological Measurement, 20, 311-329.

Angoff, W. H. (1968). How we calibrate College Board scores. College Board Review, 68, 11-14.

Angoff, W. H. (1971). Scales, norms and equivalent scores. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.

Baker, F. B. (1992). Item Response Theory: Parameter Estimation Techniques. New York: Marcel Dekker.

Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12, 383-407.

Budescu, D. (1985). Efficiency of linear equating as a function of the length of the anchor test. Journal of Educational Measurement, 22, 13-20.

Cook, L. L., and Eignor, D. R. (1991). An NCME instructional module on IRT equating methods. Educational Measurement: Issues and Practice, 10, 37-45.

Crocker, L., and Algina, J. (1986). Introduction to Classical and Modern Test Theory. New York: Holt, Rinehart and Winston.

Davey, T., Nering, M. L., and Thompson, T. (1997). Realistic simulation of item response data. ACT Research Report Series ONR 97-4. Iowa City, IA: ACT, Inc.
Davey, T., Oshima, T. C., and Lee, K. (1996). Linking multidimensional item calibrations. Applied Psychological Measurement, 20, 405-416.

Dorans, N. J. (2000). Scaling and equating. In H. Wainer (Ed.), Computerized Adaptive Testing: A Primer (2nd ed., pp. 135-158). New Jersey: Lawrence Erlbaum Associates.

Embretson, S. E. (1984). A general latent trait model for response processes. Psychometrika, 40, 175-186.

Embretson, S. E., and Reise, S. P. (2000). Item Response Theory for Psychologists. New Jersey: Lawrence Erlbaum Associates.

Fraser, C. (undated). NOHARM: A computer program for fitting both unidimensional and multidimensional normal ogive models of latent trait theory.

Feuer, M. J., Holland, P. W., Green, B. F., Bertenthal, M. W., and Hemphill, F. C. (Eds.) (1999). Uncommon Measures: Equivalence and Linkage among Educational Tests. Washington, DC: National Academy Press.

Green, P. E. (1976). Mathematical Tools for Applied Multivariate Analysis. New York: Academic Press.

Gosz, J. K., and Walker, C. M. (2002). An empirical comparison of multidimensional item response data using TESTFACT and NOHARM. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Hambleton, R. K., and Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston: Kluwer.

Harman, H. (1976). Modern Factor Analysis (3rd ed.). Chicago: University of Chicago Press.

Harris, D. J., and Crouse, J. D. (1993). A study of criteria used in equating. Applied Measurement in Education, 6, 195-240.

Harwell, M. R., Stone, C. A., Hsu, T., and Kirisci, L. (1996). Monte Carlo studies in item response theory. Applied Psychological Measurement, 20, 101-125.

Hirsch, T. M. (1988). Multidimensional equating. Unpublished doctoral dissertation, Florida State University.

Hirsch, T. M. (1989). Multidimensional equating. Journal of Educational Measurement, 26, 337-349.

Kim, H. (1994). New techniques for the dimensionality assessment of standardized test data. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign.

Kim, J.-P. (2001). Proximity measures and cluster analyses in multidimensional item response theory. Unpublished doctoral dissertation, Michigan State University.

Kolen, M. J. (2001). Linking assessments effectively: Purpose and design. Educational Measurement: Issues and Practice, 20, 5-9.

Kolen, M. J., and Brennan, R. L. (1995). Test Equating: Methods and Practices. New York: Springer.

Lee, K., and Oshima, T. C. (1996). IPLINK: Multidimensional and unidimensional item parameter linking in item response theory. Applied Psychological Measurement, 20, 230.

Li, Y. H. (1996). MDEQUATE [Computer software]. Upper Marlboro, MD: Author.

Li, Y. H. (1997). An evaluation of multidimensional IRT equating methods by assessing the accuracy of transforming parameters onto a target test metric. Unpublished doctoral dissertation, University of Maryland.

Li, Y. H., and Lissitz, R. W. (2000). An evaluation of the accuracy of multidimensional IRT linking. Applied Psychological Measurement, 24, 115-138.

Lissitz, R. W., Schönemann, P. H., and Lingoes, J. C. (1976). A solution to the weighted Procrustes problem in which the transformation is in agreement with the loss function. Psychometrika, 41, 547-550.

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. New Jersey: Lawrence Erlbaum Associates.

Lord, F. M., and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

Maier, M. H. (1993). Military aptitude testing: The past fifty years (DMDC Technical Report 93-007). Monterey, CA: Defense Manpower Data Center.
McDonald, R. P. (1967). Nonlinear factor analysis (Psychometric Monographs, No. 15). Iowa City: Psychometric Society.

McKinley, R. L., and Reckase, M. D. (1983). An extension of the two-parameter logistic model to the multidimensional latent space. ACT Research Report Series ONR 83-2. Iowa City, IA: ACT, Inc.

Mislevy, R. J., and Bock, R. D. (1990). BILOG-3: Item Analysis and Test Scoring with Binary Logistic Models [Computer software]. Mooresville, IN: Scientific Software.

Miller, T. R. (1991). Empirical estimation of standard errors of compensatory MIRT model parameters obtained from the NOHARM estimation program. ACT Research Report Series ONR 91-2. Iowa City, IA: ACT, Inc.

Oshima, T. C., and Davey, T. C. (1994). Evaluation of procedures for linking multidimensional item calibrations. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Oshima, T. C., Davey, T. C., and Lee, K. (2000). Multidimensional linking: Four practical approaches. Journal of Educational Measurement, 37, 357-373.

Reckase, M. D. (1985). The difficulty of items that measure more than one ability. Applied Psychological Measurement, 9, 401-412.

Reckase, M. D. (1990). Unidimensional data from multidimensional tests and multidimensional data from unidimensional tests. Paper presented at the annual meeting of the American Educational Research Association, Boston, MA.

Reckase, M. D. (1995). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden and R. K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 271-286). New York: Springer.

Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-36.

Reckase, M. D., and Hirsch, T. M. (1991). Interpretation of number-correct scores when the true number of dimensions assessed by a test is greater than two. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

Reckase, M. D., and McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 14, 361-373.

Roussos, L. A., Stout, W. F., and Marden, J. I. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35, 1-30.

Schönemann, P. H. (1966). A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31, 1-10.

Schönemann, P. H., and Carroll, R. M. (1970). Fitting one matrix to another under choice of a central dilation and a rigid motion. Psychometrika, 35, 245-255.

Spray, J. A., Davey, T. C., Reckase, M. D., Ackerman, T. A., and Carlson, J. E. (1990). Comparison of two logistic multidimensional item response theory models. ACT Research Report Series ONR 90-8. Iowa City, IA: ACT, Inc.

Stocking, M. L., and Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.

Sympson, J. B. (1978). A model for testing with multidimensional items. In D. J. Weiss (Ed.), Proceedings of the 1977 Computerized Adaptive Testing Conference (pp. 82-98). Minneapolis: University of Minnesota.

The MathWorks. (1995). MATLAB: The Ultimate Computing Environment for Technical Education. Englewood Cliffs, NJ: Prentice-Hall.

Thompson, T. (undated). GENDAT5: A computer program for generating multidimensional item response data.
Thompson, T. (1996). NOHARM21: NOHARM (C. Fraser, undated) converted to Windows.

Thompson, T., Nering, M., and Davey, T. (1997). Multidimensional IRT scale linking without common items or common examinees. Paper presented at the annual meeting of the Psychometric Society, Gatlinburg, TN.

Traub, R. E. (1983). A priori considerations in choosing an item response model. In R. K. Hambleton (Ed.), Applications of Item Response Theory (pp. 57-70). Vancouver: Educational Research Institute of British Columbia.

Wilson, D., Wood, R., and Gibbons, R. D. (1991). TESTFACT: Test scoring, item statistics, and item factor analysis [Computer software]. Mooresville, IN: Scientific Software.

Wingersky, M. S., Barton, M. A., and Lord, F. M. (1982). LOGIST V User's Guide. Princeton, NJ: Educational Testing Service.