THE COMPARISON OF COMMON ITEM SELECTION METHODS IN VERTICAL SCALING UNDER MULTIDIMENSIONAL ITEM RESPONSE THEORY

By

Yang Lu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

2010

ABSTRACT

THE COMPARISON OF COMMON ITEM SELECTION METHODS IN VERTICAL SCALING UNDER MULTIDIMENSIONAL ITEM RESPONSE THEORY

By

Yang Lu

The characteristics of common items are generally considered an important factor affecting the quality of scale linking between tests. Although many studies have focused on common item selection under Unidimensional Item Response Theory (UIRT) models, few researchers have investigated common item selection for vertical scaling within the framework of Multidimensional Item Response Theory (MIRT). This study examines different common item selection methods when the correlation among proficiencies varies across levels and when the content structures of the tests are either identical or different. With respect to the recovery of the probability matrix, the item parameters, and the effect sizes, the results show that (1) full content coverage in the common item set is important whether or not the content structures are identical, (2) high correlation among proficiencies can partly compensate for the adverse effect of common items that do not cover all content domains, and (3) a common item set that covers all content domains with medium difficulty items yields better linking results, and common items with high item-total-test correlation also perform well when the content structures are identical for both tests.

Dedicated to my beloved husband: Yu Fang

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to Dr. Mark Reckase, who is my academic advisor and the chairperson of my dissertation committee. I am fortunate to have the privilege to be his student. I want to thank him for his excellent guidance and patience in my dissertation and research work. I also want to thank Dr. Sharon Senk, one of my committee members, for strengthening my background in the mathematics education field and giving me a warm hug when I felt frustrated in working on the project. Thanks also go to the other members of my committee, Dr. Richard Houang and Dr. Sharif Shakrani, for their valuable insights and assistance on my dissertation research. I would like to express my appreciation to Dr. Maria Teresa Tatto for providing me with assistantship opportunities to work on the TEDS-M project for most of my graduate study. Through this experience, I have developed a deep understanding of measurement theory and improved my teamwork skills. In addition, I want to thank the Measurement and Quantitative Methods program and the College of Education at Michigan State University for providing me with an excellent atmosphere for my graduate study. Finally, I owe my deepest appreciation to my husband, Yu Fang, for his continuous love, support and patience, and to my parents for their constant love and support in all aspects.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
1.1 Introduction to Vertical Scaling
1.2 Construct Shift and Dimensionality in Vertical Scaling
1.3 UIRT and MIRT
1.3.1 UIRT model
1.3.2 MIRT model
1.3.3 Indeterminacies in MIRT
1.3.4 Full information factor analysis in MIRT
CHAPTER 2
LINKING AND COMMON ITEM SELECTION
2.1 Linking Designs
2.2 Linking Methods
2.2.1 Linking methods in UIRT
2.2.2 Linking methods in MIRT
2.2.3 Linking through separate and concurrent calibrations
2.3 Research on Common Items in Equating
2.4 Research on Common Items in Vertical Scaling
2.5 Research Objectives and Questions
CHAPTER 3
DESIGNS AND METHODS
3.1 Parameter Simulation
3.2 Parameter Estimation and Vertical Scaling
3.3 Evaluation Criteria
CHAPTER 4
PART I: SAME CONSTRUCTS
4.1 Parameters and Designs
4.1.1 Unique items
4.1.2 Common items
4.1.3 Person parameters
4.2 Estimation
4.3 Results
4.3.1 Recovery of probability matrix
4.3.2 Recovery of a-parameters
4.3.3 Recovery of d-parameters
4.3.4 Recovery of effect sizes
CHAPTER 5
PART II: DIFFERENT CONSTRUCTS
5.1 Parameters and Designs
5.1.1 Unique items
5.1.2 Common items
5.1.3 Person parameters
5.2 Estimation
5.3 Results
5.3.1 Recovery of probability matrix
5.3.2 Recovery of a-parameters
5.3.3 Recovery of d-parameters
5.3.4 Recovery of effect sizes
CHAPTER 6
SUMMARY, LIMITATION AND FUTURE RESEARCH
6.1 Conclusions and Discussions
6.2 Limitations and Future Research
APPENDIX
REFERENCES
LIST OF TABLES

Table 3.1. Layout of Person by Item Response Matrix
Table 4.1. Unique Item Parameters and Statistics for Lower Grade in Part I
Table 4.2. Unique Item Parameters and Statistics for Upper Grade in Part I
Table 4.3. Number of Items for Different Content and Difficulty Categories in Item Pool of Part I
Table 4.4. Statistics of Common Items for Different Selection Methods in Part I
Table 4.5. Correlation for the Recovery of Probability Matrix in Part I
Table 4.6. Bias for the Recovery of Probability Matrix in Part I
Table 4.7. RMSE for the Recovery of Probability Matrix in Part I
Table 4.8. Correlation for the Recovery of a-parameters in Part I
Table 4.9. Bias for the Recovery of a-parameters in Part I
Table 4.10. RMSE for the Recovery of a-parameters in Part I
Table 4.11. Correlation for the Recovery of d-parameters in Part I
Table 4.12. Bias for the Recovery of d-parameters in Part I
Table 4.13. RMSE for the Recovery of d-parameters in Part I
Table 4.14. Recovery of Effect Sizes for Proficiencies in Part I
Table 5.1. Allocation of Unique Items in Different Content Domains and Grades in Part II
Table 5.2. Unique Item Parameters and Statistics for Lower Grade in Part II
Table 5.3. Unique Item Parameters and Statistics for Upper Grade in Part II
Table 5.4. Number of Items for Different Content and Difficulty Categories in Item Pool of Part II
Table 5.5. Statistics of Common Items for Different Selection Methods in Part II
Table 5.6. Mean Vectors for Proficiency Distributions of Lower and Upper Grade Examinees in Part II
Table 5.7. Variance-Covariance Matrix for Proficiency Distribution of Lower Grade Examinees in Part II
Table 5.8. Variance-Covariance Matrix for Proficiency Distribution of Upper Grade Examinees in Part II
Table 5.9. Correlation for the Recovery of Probability Matrix in Part II
Table 5.10. Bias for the Recovery of Probability Matrix in Part II
Table 5.11. RMSE for the Recovery of Probability Matrix in Part II
Table 5.12. Correlation for the Recovery of a-parameters in Part II
Table 5.13. Bias for the Recovery of a-parameters in Part II
Table 5.14. RMSE for the Recovery of a-parameters in Part II
Table 5.15. Correlation for the Recovery of d-parameters in Part II
Table 5.16. Bias for the Recovery of d-parameters in Part II
Table 5.17. RMSE for the Recovery of d-parameters in Part II
Table 5.18. Recovery of Effect Sizes for Proficiencies in Part II

LIST OF FIGURES

Figure 1.1. Representation of Item Vectors in a Two-Dimensional Space
Figure 4.1. Correlation for the Recovery of Probability Matrix in Part I
Figure 4.2. Bias for the Recovery of Probability Matrix in Part I
Figure 4.3. Bias for the Recovery of Probability Matrix for Method 1 at Zero Proficiency Correlation Level in Part I
Figure 4.4. RMSE for the Recovery of Probability Matrix in Part I
Figure 4.5. Correlation for the Recovery of a1-parameters in Part I
Figure 4.6. Correlation for the Recovery of a2-parameters in Part I
Figure 4.7. Bias for the Recovery of a1-parameters for Method 1 at Zero Proficiency Correlation Level in Part I
Figure 4.8. RMSE for the Recovery of a1-parameters in Part I
Figure 4.9. RMSE for the Recovery of a2-parameters in Part I
Figure 4.10. Correlation for the Recovery of d-parameters in Part I
Figure 4.11. Bias for the Recovery of d-parameters in Part I
Figure 4.12. Bias for the Recovery of d-parameters for Method 1 at Zero Proficiency Correlation Level in Part I
Figure 4.13. RMSE for the Recovery of d-parameters in Part I
Figure 4.14. Recovery of Effect Size for the Proficiency on Dimension 1 in Part I
Figure 4.15. Recovery of Effect Size for the Proficiency on Dimension 2 in Part I
Figure 5.1. RMSE for the Recovery of Probability Matrix in Part II
Figure 5.2. Bias for the Recovery of a1-parameters in Part II
Figure 5.3. Bias for the Recovery of a2-parameters in Part II
Figure 5.4. Bias for the Recovery of a3-parameters in Part II
Figure 5.5. Bias for the Recovery of a3-parameters for Method 2 at Zero Proficiency Correlation Level in Part II
Figure 5.6. RMSE for the Recovery of a1-parameters in Part II
Figure 5.7. RMSE for the Recovery of a2-parameters in Part II
Figure 5.8. RMSE for the Recovery of a3-parameters in Part II
Figure 5.9. Bias of the Recovery of d-parameters for Method 1 at Zero Proficiency Correlation Level in Part II
Figure 5.10. Correlation for the Recovery of d-parameters in Part II
Figure 5.11. Bias for the Recovery of d-parameters in Part II
Figure 5.12. RMSE for the Recovery of d-parameters in Part II
Figure 5.13. Comparison of Effect Size for the Proficiency on Dimension 1 in Part II
Figure 5.14. Comparison of Effect Size for the Proficiency on Dimension 2 in Part II
Figure 5.15. Comparison of Effect Size for the Proficiency on Dimension 3 in Part II
Figure 5.16. Recovery of Effect Size for the Proficiency on Dimension 1 in Part II
Figure 5.17. Recovery of Effect Size for the Proficiency on Dimension 2 in Part II
Figure 5.18. Recovery of Effect Size for the Proficiency on Dimension 3 in Part II

CHAPTER 1
INTRODUCTION

Standardized tests have become increasingly important and widely applied in performance assessment. They are used to provide fair, reliable and objective information on examinees' abilities or the skills that the tests are developed to measure. When educational achievement is assessed, it is important to estimate and track the extent to which examinees grow over time or over the course of their schooling. One method to evaluate examinees' progress is to use a single set of test questions, or equivalent forms, for all assessments across time. But this method can be problematic, since the shift in content and the challenge of the materials may be too great to be appropriate for all grade levels. For example, a test containing many items from higher grade levels could be too difficult, or too advanced with brand new topics, for students at the early grade levels, making them feel overwhelmed by the test. On the other hand, a test may also be too easy and lead to carelessness, inattention or creative but wrong thinking when items from lower grade levels are administered to students in upper grades (Kolen & Brennan, 2004, p. 372).

These problems can be avoided by administering tests of different levels to students from different grades and using a vertical scaling method to link these tests. In practice, vertical scaling is widely used in standardized testing. For example, for No Child Left Behind (NCLB), vertical scales are established to meet the requirement of evaluating the progress of children in attaining English proficiency as they grow from one year to the next.
Vertical scales are also commonly used in several elementary test batteries, such as the Iowa Tests of Basic Skills (ITBS) (Hoover, Dunbar, & Frisbie, 2003) and the Cognitive Abilities Test (CogAT) (Lohman & Hagen, 2002). According to Harris (2007), the scores from tests in some testing systems, such as those from the EXPLORE, PLAN and ACT tests in the Educational Planning and Assessment System (EPAS), are also put on one vertical scale, and it is clearly stated that the target populations for these tests are students at different grade levels.

1.1 Introduction to Vertical Scaling

The primary reason to construct vertical scales is to "develop a conceptual definition of growth, especially for test areas that are closely related to the school curriculum" (Kolen & Brennan, 2004, p. 376). Among all tests, the mathematics achievement test is one of those covering several content areas that are closely related to the school curriculum. Some content domains are taught at different difficulty levels in different grades while others are not. As the grade level increases, curriculum contents become more advanced and new contents are added as well; consequently, test items become more difficult and items measuring new constructs are also included in the test. With the increase in the depth or amount of the content that students have been taught, it is natural to ask how much students gain in knowledge according to test scores from different tests administered in different years. How can these changes be measured by the scores from time to time? Can the scores from different grades be compared directly? Do the students grow as much in one content area as in another? Vertical scaling is one of the methods commonly used to answer these questions.

There are many definitions of vertical scaling. According to Kolen and Brennan (2004) and Holland (2007), different from equating, vertical scaling tries to place tests with different difficulty but measuring similar constructs on the same scale. Kolen (2006) indicated that "vertical scaling procedures are used to relate scores on these multiple test levels to a developmental score scale that can be used to assess student growth over a range of educational levels". More specifically, vertical scaling focuses on the linking between tests with similar reliability that measure similar constructs, but with different difficulty and administered to different populations of examinees (Holland, 2007).

1.2 Construct Shift and Dimensionality in Vertical Scaling

There are many issues when the vertical scaling procedure is applied, one of which is the way to define the overlap of content structure across grades. Harris (2007) indicated that the relationship between the test content and the nature of growth plays a major role in the resulting score scales. It is generally easy to compare gains when the constructs measured by tests are fairly similar from year to year. In this situation, the linking of score scales can give sufficient and meaningful results. However, since the curriculum and instruction often change from grade to grade in practice, test designers need to modify the test content specifications to match the targets of instruction in order to accurately assess achievement at different grades. Normally, the skills and knowledge included in instruction are not simple. For example, a mathematics test could cover different content areas, such as number, algebra, arithmetic and geometry.
Furthermore, a variety of skills, such as problem solving, logical thinking, and reading and understanding, are often required to solve some mathematics problems as well. The higher the grade level, the more complex the covered contents and required skills in the test. This multidimensionality issue was addressed by Yen (1986), who pointed out that it is one of the major factors likely to affect vertical scales. Several studies have questioned the appropriateness of using only a single vertical scale to track students' growth from year to year since the instruction and curriculum change across years. Yen and Burket (1997) found that scales always vary by subject and by subtests within a subject. Martineau (2004, 2006) also showed the significant effect of construct shift when a single vertical scale was used in a value-added model. Li (2006) made a further study to capture the cross-grade content shift based on MIRT, and found that two constructs (vocabulary and problem solving) overlapped and were measured by both the Grade 6 and Grade 7 Michigan Educational Assessment Program (MEAP) tests, while an additional construct (abstracting concept) was only measured by the Grade 7 test. Note that the tests in MEAP actually measure students' learning from the previous academic year. Patz and Yao (2007) also indicated that using the unidimensional IRT model is implausible when vertical scales are developed across grades. Specifically, they noticed that the construct measured by the seventh-grade mathematics achievement test was different from that measured by the fourth-grade test. They also pointed out that failure to account for the complexity of the large differences in test content and examinee skills could be an important reason that concurrent calibration using the unidimensional IRT model did not perform well in practical settings. What is more, some other studies (Braun, 2005; Doran & Cohen, 2005; Reckase & Li, 2007; Reckase & Martineau, 2004; Schmidt, Houang, & McKnight, 2005) also addressed the issues of content shift and the inappropriateness of using one single scale to track students' growth.

On the other hand, many studies have discussed the effect of violating the unidimensionality assumption in vertical scaling (e.g., Camilli, Wang, & Fesq, 1995; Dorans & Kingston, 1985; Turhan, Tong, & Um, 2007; Yao & Mao, 2004). More specifically, using simulated test scores across grade levels, Yao and Mao (2004) found that the score distributions estimated under one-, two- or three-dimensional models did not differ significantly; however, they did not show how stable the actual dimensional structures of cross-grade tests were. Turhan et al. (2007) simulated data with a MIRT model and tested different ways of selecting common items in vertical scaling after the simulated data were calibrated with the commonly used unidimensional IRT model. They concluded that vertical scaling was robust to the types of slight violation of the unidimensionality assumption investigated in their paper, given good content coverage by the common items.

Because of these two factors, namely, the modification of content areas and curriculum across grades and the multidimensionality of the tests, a single scale score for the tests often becomes less comparable across grades.
These factors can influence the interpretation and alignment of vertical scales; hence, it is important to check the content shift and dimensionality of the constructs measured in the tests and to select appropriate vertical scaling methods to make the scores more comparable. Since the MIRT model identifies the multidimensional content structure in the test and estimates all dimensions simultaneously, it is a promising method for vertical scaling among tests measuring shifted and/or multiple constructs.

1.3 UIRT and MIRT

1.3.1 UIRT model

Birnbaum (1968) developed a three-parameter logistic unidimensional IRT model, which assumes that only one latent trait is necessary to account for variations in person-item responses and is widely used in test construction and equating. The formula for this model is

P(u_{ij} = 1 \mid a_i, b_i, c_i, \theta_j) = c_i + \frac{1 - c_i}{1 + \exp[-1.7 a_i (\theta_j - b_i)]},    (1.1)

where P(u_{ij} = 1 \mid a_i, b_i, c_i, \theta_j) is the probability of person j correctly answering item i, given that the person's ability level is \theta_j, and a_i, b_i, c_i represent the item discrimination, difficulty and guessing parameters, respectively.

1.3.2 MIRT model

Multidimensional abilities/traits are often required to produce correct responses to the items within one test, or even a correct response to a single item (Reckase, 1985). Hence, theoretically, it is more appropriate to use a multidimensional IRT model instead of a unidimensional IRT model for the calibration and estimation of such multidimensional data. Since "a set of test items can be sensitive to several traits, or a group of examinees might vary in several latent traits" (Li & Lissitz, 2000), the person-item interaction and parameter estimation can be quite complex in MIRT. According to Reckase (1997a), MIRT is useful to (1) understand the proficiency structure needed to respond to test items, (2) describe differential item functioning (DIF), and (3) choose items to fit the unidimensional IRT model.

There are two types of MIRT models, namely, the compensatory MIRT model and the noncompensatory MIRT model. The main difference between these two models lies in the relationship among the multiple proficiencies that determine the probability of person-item responses. The compensatory model follows the logic of factor analysis in that the probability of a correct response is related to a linear combination of several proficiencies; therefore, proficiencies are additive, so that high proficiency on one dimension can 'make up' for low proficiencies on other dimensions. The three-parameter compensatory MIRT model (Reckase, 1985, 1997b) is

P(u_{ij} = 1 \mid \mathbf{a}_i, d_i, c_i, \boldsymbol{\theta}_j) = c_i + \frac{1 - c_i}{1 + \exp[-1.7(\mathbf{a}_i' \boldsymbol{\theta}_j + d_i)]},    (1.2)

where P(u_{ij} = 1 \mid \mathbf{a}_i, d_i, c_i, \boldsymbol{\theta}_j) is the probability of a correct response for person j on item i, u_{ij} is the response for person j on item i (1 if correct and 0 otherwise), \mathbf{a}_i is an m-element vector that specifies the discrimination power of item i on the m dimensions, d_i is a scalar parameter related to the difficulty of item i, c_i is a guessing parameter for item i, and \boldsymbol{\theta}_j is person j's proficiency vector in an m-dimensional space.

On the other hand, Sympson (1978) proposed a noncompensatory MIRT model,

P(u_{ij} = 1 \mid \mathbf{a}_i, \mathbf{b}_i, c_i, \boldsymbol{\theta}_j) = c_i + (1 - c_i) \prod_{k=1}^{m} \frac{1}{1 + \exp[-1.7 a_{ik} (\theta_{jk} - b_{ik})]},    (1.3)

where a_{ik}, b_{ik} and \theta_{jk} are the item discrimination, item difficulty and person proficiency on the kth dimension, and c_i is the item guessing parameter.
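To make the contrast between Equations 1.2 and 1.3 concrete, the short sketch below evaluates both response functions for a single two-dimensional item. It is a minimal illustration in Python/NumPy, not code from this study, and the item parameters and proficiency vectors are hypothetical.

```python
import numpy as np

def p_compensatory(a, d, c, theta):
    """Compensatory MIRT probability (Equation 1.2): a'theta + d in the exponent."""
    z = 1.7 * (np.dot(a, theta) + d)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

def p_noncompensatory(a, b, c, theta):
    """Noncompensatory MIRT probability (Equation 1.3): product over dimensions."""
    terms = 1.0 / (1.0 + np.exp(-1.7 * a * (theta - b)))
    return c + (1.0 - c) * np.prod(terms)

# Hypothetical two-dimensional item and two examinees
a, d, b, c = np.array([1.2, 0.8]), -0.5, np.array([0.3, 0.6]), 0.0
theta_balanced = np.array([0.5, 0.5])
theta_uneven = np.array([2.5, -1.5])   # strong on dimension 1, weak on dimension 2

for theta in (theta_balanced, theta_uneven):
    print(p_compensatory(a, d, c, theta), p_noncompensatory(a, b, c, theta))
# In the compensatory model the high theta_1 offsets the low theta_2,
# while in the noncompensatory model the weak dimension caps the probability.
```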
Sympson argued that an increase in one proficiency could improve the overall probability of getting an item correct, but only to some extent. The probability of a correct response cannot exceed the value determined by the dimension on which proficiency is not at positive infinity, even when all other proficiencies increase to positive infinity.

The compensatory and noncompensatory MIRT models are quite different, in terms of both the mathematical formula and the assumption about how people use their skills and knowledge to answer items. However, Spray, Davey, Reckase, Ackerman and Carlson (1990) identified item parameters for the two models that can give similar classical item statistics, and found that when the correlation between proficiencies increases, the detectable difference between the models decreases. Therefore, they concluded that the difference between these two models could be considered practically unimportant. In practice, due to its comparative simplicity and easier estimation procedure, the compensatory model is the one commonly used in MIRT calibration, scaling and equating. Thus, the compensatory MIRT model is used in this study, and for simplicity, no guessing parameter is assumed for the items.

According to Reckase (1985), MDISC was developed to capture the discrimination power of an item in MIRT, and its formula is given by

MDISC_i = \left( \sum_{k=1}^{m} a_{ik}^2 \right)^{1/2},    (1.4)

where MDISC_i denotes item i's multidimensional discrimination and a_{ik} is the discrimination parameter of item i on the kth dimension. Also, the multidimensional difficulty of an item, MDIFF, is defined as

MDIFF_i = \frac{-d_i}{\sqrt{\mathbf{a}_i' \mathbf{a}_i}} = \frac{-d_i}{MDISC_i}.    (1.5)

These two characteristics of an item can be represented graphically by an item vector in the multidimensional θ-space. In order to describe the most discriminating direction of an item in that space, Reckase (1985) proposed the direction cosine for the item vector as

\cos \alpha_{ik} = \frac{a_{ik}}{MDISC_i},    (1.6)

where \alpha_{ik} is the angle between the vector of item i and the kth coordinate axis in an m-dimensional space.

Figure 1.1 shows these characteristics of item vectors using arrowed lines in a two-dimensional space. The length of the arrowed line represents MDISC, the distance from the origin to the base of the arrowed line is MDIFF, and the direction of the arrowed line is defined using the angles from the direction cosines.

[Figure 1.1. Representation of Item Vectors in a Two-Dimensional Space]
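As an illustration of Equations 1.4 through 1.6, the sketch below computes MDISC, MDIFF and the direction angles for a small set of hypothetical two-dimensional items. It is a minimal NumPy example, not code used in the dissertation, and the parameter values are invented for illustration.

```python
import numpy as np

def item_vector_stats(a, d):
    """Return MDISC (Eq. 1.4), MDIFF (Eq. 1.5) and direction angles in degrees (Eq. 1.6)."""
    a = np.asarray(a, dtype=float)          # item-by-dimension discrimination matrix
    d = np.asarray(d, dtype=float)          # item intercepts
    mdisc = np.sqrt((a ** 2).sum(axis=1))   # multidimensional discrimination
    mdiff = -d / mdisc                      # multidimensional difficulty
    angles = np.degrees(np.arccos(a / mdisc[:, None]))  # angle with each coordinate axis
    return mdisc, mdiff, angles

# Hypothetical discrimination and intercept parameters for three items
a = [[1.2, 0.2], [0.3, 1.1], [0.8, 0.8]]
d = [-0.4, 0.6, 0.0]
mdisc, mdiff, angles = item_vector_stats(a, d)
print(mdisc)    # length of each item vector
print(mdiff)    # signed distance from the origin to the base of the vector
print(angles)   # angles (degrees) between each item vector and the coordinate axes
```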
1.3.3 Indeterminacies in MIRT

According to Reckase (1997a), the compensatory MIRT model can be considered a special case of nonlinear factor analysis. More specifically, this model can be regarded as a combination of the factor analysis model and the unidimensional IRT model; therefore, it suffers from the indeterminacies inherent in either model, such as the orientation of the coordinate axes relative to persons' locations, the units of measurement, and the location of the origin of the coordinate system. These three indeterminacies are named the rotational indeterminacy, unit indeterminacy and origin indeterminacy, and they are often discussed in MIRT research (Hirsch, 1989; Li & Lissitz, 2000; Min, 2003; Oshima, Davey, & Lee, 2000; Reckase, 2007; Reckase & Martineau, 2004).

Suppose \mathbf{A} = (\mathbf{a}_1, \ldots, \mathbf{a}_N)', \mathbf{d} = (d_1, \ldots, d_N)', \boldsymbol{\Theta} = (\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_J) is one solution set for the MIRT calibration, where N is the total number of items and J is the total number of persons. There are always infinitely many sets \mathbf{A}^*, \mathbf{d}^*, \boldsymbol{\Theta}^*, as defined in Equation 1.7, which satisfy the MIRT invariance property shown in Equation 1.8:

\boldsymbol{\Theta}^* = \mathbf{T}^{-1}\boldsymbol{\Theta} + \mathbf{M}\mathbf{1}', \quad \mathbf{A}^* = \mathbf{A}\mathbf{T}, \quad \mathbf{d}^* = \mathbf{d} - \mathbf{A}\mathbf{T}\mathbf{M},    (1.7)

\mathbf{A}^*\boldsymbol{\Theta}^* + \mathbf{d}^*\mathbf{1}' = \mathbf{A}\mathbf{T}(\mathbf{T}^{-1}\boldsymbol{\Theta} + \mathbf{M}\mathbf{1}') + (\mathbf{d} - \mathbf{A}\mathbf{T}\mathbf{M})\mathbf{1}' = \mathbf{A}\boldsymbol{\Theta} + \mathbf{d}\mathbf{1}',    (1.8)

where \mathbf{T} is a rotation matrix, \mathbf{M} is a transformation vector and \mathbf{1} is a J-element vector of 1s. Note that the rotational indeterminacy and the unit indeterminacy are combined together in the \mathbf{T} matrix in the above formulas.

Generally, for easy computation, MIRT software packages provide one solution for the MIRT calibration by setting constraints on the person proficiencies, or more strictly speaking, on the coordinates of the person locations in the m-dimensional space, as E(\boldsymbol{\theta}) = \mathbf{0}_{m \times 1} and \operatorname{cov}(\boldsymbol{\theta}) = \mathbf{I}_{m \times m}, although this zero correlation among proficiencies is implausible in practice. Besides these constraints, the software may also use the Varimax method to change the relative positions between item vectors and coordinate axes for a better interpretation of item characteristics.
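The indeterminacy in Equations 1.7 and 1.8 can be checked numerically in a few lines. The sketch below is a hypothetical NumPy example with an arbitrary rotation matrix T and translation vector M; none of the values come from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, J = 6, 2, 5                       # items, dimensions, persons (toy sizes)
A = rng.uniform(0.4, 1.6, size=(N, m))  # discrimination matrix
d = rng.normal(0, 1, size=(N, 1))       # intercepts
Theta = rng.normal(0, 1, size=(m, J))   # person coordinates
ones = np.ones((J, 1))

phi = np.radians(30)                    # arbitrary rotation angle
T = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])
M = np.array([[0.5], [-0.2]])           # arbitrary translation vector

# Transformed solution (Equation 1.7)
Theta_star = np.linalg.inv(T) @ Theta + M @ ones.T
A_star = A @ T
d_star = d - A @ T @ M

# Invariance property (Equation 1.8): the exponent term is unchanged
lhs = A_star @ Theta_star + d_star @ ones.T
rhs = A @ Theta + d @ ones.T
print(np.allclose(lhs, rhs))            # True
```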
1.3.4 Full information factor analysis in MIRT

TESTFACT is one of the software packages for MIRT calibration (Bock, Gibbons, Schilling, Muraki, Wilson, & Wood, 2003). In this package, full information factor analysis is used for the item and person parameter estimation, and this method uses all the information available in the matrix of dichotomously scored responses. Another software package, NOHARM (Fraser, 1988), uses aggregated information, namely the product-moment matrix, for the MIRT parameter estimation.

Based on the local independence assumption that an examinee's responses to the test items are statistically independent conditional on the examinee's ability, the probability for person j with proficiency value \boldsymbol{\theta}_j to produce a particular response pattern is

P(\mathbf{u}_j \mid \mathbf{A}, \mathbf{d}, \boldsymbol{\theta}_j) = \prod_{i=1}^{N} P_{ij}^{u_{ij}} (1 - P_{ij})^{1 - u_{ij}},    (1.9)

where N is the total number of items in the test, \mathbf{u}_j is the response vector of size N for person j, u_{ij} is the response for person j on item i, and P_{ij} is defined by the aforementioned compensatory MIRT model. Hence, by incorporating a prior assumption on the distribution of person proficiencies, the marginal probability of person j's response pattern is

P(\mathbf{u}_j \mid \mathbf{A}, \mathbf{d}) = \int P(\mathbf{u}_j \mid \mathbf{A}, \mathbf{d}, \boldsymbol{\theta}) g(\boldsymbol{\theta}) d\boldsymbol{\theta},    (1.10)

where g(\boldsymbol{\theta}) is the pre-assumed distribution of person proficiencies. With the analogous local independence assumption among persons' response strings in MIRT, the joint probability of the person-by-item data matrix can be obtained by multiplying the probabilities of the individual response strings across persons. Thus the marginal likelihood function for all persons on all items is

L(\mathbf{U} \mid \mathbf{A}, \mathbf{d}) = \prod_{j=1}^{J} P(\mathbf{u}_j \mid \mathbf{A}, \mathbf{d}).    (1.11)

The TESTFACT software package then sets the constraints E(\boldsymbol{\theta}) = \mathbf{0}_{m \times 1} and \operatorname{cov}(\boldsymbol{\theta}) = \mathbf{I}_{m \times m}, and applies the Expectation-Maximization (EM) algorithm to maximize this marginal likelihood function (Bock, Gibbons, & Muraki, 1988).

The initial values of the slopes for the EM algorithm are calculated from the factor loadings, which result from a principal factor analysis of the tetrachoric correlation matrix among item responses. They are then rotated orthogonally with the Varimax criterion to serve as starting values if the Varimax or Promax rotation option is present in the TESTFACT command syntax. There is a concern that the starting values for the slopes may be negative on some dimension due to the rotational indeterminacy in factor analysis. With these starting values, the item parameter estimates can be obtained after the EM cycle converges, which generally means that the change in parameter estimates between adjacent cycles is less than some predefined value. These estimates are then regarded as fixed parameters, and person proficiency estimates are calculated under the Bayesian framework by incorporating prior information on their distribution. There might be many or all negative a-parameter estimates on certain dimensions in the calibration result, which is most likely due to the defaults used in TESTFACT for the MIRT calibration. It is reasonable and legitimate to change the sign of all estimates on these dimensions for a better interpretation.

For the scoring option in the TESTFACT software package, both MAP (Maximum A Posteriori) and EAP (Expected A Posteriori) scores can be requested, and only the latter is used in this study. The reason to use the EAP score is that, compared with the MAP and MLE (Maximum Likelihood Estimate) scores, the EAP score is not only more stable and easy to compute without any iterative procedure, but also has "smaller mean square error over the population for which the distribution of ability is specified by the prior" (Bock & Mislevy, 1982). According to Muraki and Engelhard (1985), the EAP score is calculated from the posterior distribution of person proficiency by

\tilde{\boldsymbol{\theta}}_j = E(\boldsymbol{\theta}_j \mid \mathbf{u}_j) = \int_{\boldsymbol{\theta}} \boldsymbol{\theta} P(\boldsymbol{\theta} \mid \mathbf{u}_j) d\boldsymbol{\theta} = \int_{\boldsymbol{\theta}} \frac{\boldsymbol{\theta} P(\mathbf{u}_j \mid \boldsymbol{\theta}) g(\boldsymbol{\theta})}{h(\mathbf{u}_j)} d\boldsymbol{\theta}, \quad h(\mathbf{u}_j) = \int_{\boldsymbol{\theta}} P(\mathbf{u}_j \mid \boldsymbol{\theta}) g(\boldsymbol{\theta}) d\boldsymbol{\theta},    (1.12)

where P(\mathbf{u}_j \mid \boldsymbol{\theta}) is the probability function defined in Equation 1.9, g(.) is the prior distribution of person proficiencies, h(.) is the marginal probability of the response string \mathbf{u}_j, and P(\boldsymbol{\theta} \mid \mathbf{u}_j) is the posterior distribution, that is, the conditional density of \boldsymbol{\theta} given the response vector \mathbf{u}_j.
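For a two-dimensional compensatory model without guessing, the EAP score in Equation 1.12 can be approximated on a simple quadrature grid. The sketch below is a minimal Python/NumPy illustration under a standard bivariate normal prior; it is not the TESTFACT implementation, and the item parameters and response string are hypothetical.

```python
import numpy as np

def eap_score(u, A, d, nodes=31, bound=4.0):
    """EAP estimate of a 2-D proficiency (Equation 1.12) by brute-force quadrature."""
    grid = np.linspace(-bound, bound, nodes)
    t1, t2 = np.meshgrid(grid, grid)
    theta = np.column_stack([t1.ravel(), t2.ravel()])        # quadrature points
    prior = np.exp(-0.5 * (theta ** 2).sum(axis=1))          # N(0, I) prior (constant cancels)
    p = 1.0 / (1.0 + np.exp(-1.7 * (theta @ A.T + d)))       # Equation 1.2 with c = 0
    like = np.prod(np.where(u == 1, p, 1.0 - p), axis=1)     # Equation 1.9 at each point
    post = like * prior                                      # unnormalized posterior
    return (theta * post[:, None]).sum(axis=0) / post.sum()  # posterior mean

# Hypothetical parameters for five items and one response string
A = np.array([[1.0, 0.2], [0.8, 0.3], [0.2, 1.1], [0.4, 0.9], [0.7, 0.7]])
d = np.array([0.2, -0.3, 0.1, -0.5, 0.0])
u = np.array([1, 1, 0, 0, 1])
print(eap_score(u, A, d))   # approximate EAP proficiency vector
```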
CHAPTER 2
LINKING AND COMMON ITEM SELECTION

2.1 Linking Designs

Vertical scaling is well known to be very important in tracking students' growth over time. The purpose of vertical scale construction is to develop a common score scale across grades. When a vertical scale is to be constructed, numerous decisions need to be made, one of which is the design for data collection (Yen & Burket, 1997). Since the scaling design greatly determines the quality of the collected data, no matter how properly the measurement model, the calibration or the linking method is applied, if the data collection design is not appropriate, the resulting vertical scales will not be correctly constructed across the multiple grades (Kolen & Brennan, 2004).

As is well known in the educational measurement field, horizontal equating is used to adjust the difference in difficulty among forms that are built to be similar in difficulty and content (Kolen & Brennan, 2004), while vertical scaling aims to link scales among forms that are built to differ in difficulty and to be administered to students at different levels. Different from those in horizontal equating, the test forms used in vertical scaling, which are most likely administered to adjacent grade levels, are not parallel or equivalent. Therefore, an equating of forms is not required in the vertical scaling procedure; instead, the test linkage is more crucial, and the strength of the link affects the validity of the inferences based on such linkages (Patz & Yao, 2007).

According to Kolen and Brennan (2004), there are several ways to design the linkage for tests from different grade levels. For example, in the equivalent groups design, through spiraling of test forms, examinees are randomly chosen to be administered the test designed for their own grade or for the adjacent lower grade. In the test scaling design, the scaling test consists of items covering the contents across all grade levels. Since off-grade items may be too easy or too difficult for students in different grades, special instructions are needed to advise students to do their best. This design is not often used in practical settings, because it requires the construction of a complicated scaling test of appropriate length, and the content areas covered by the scaling test may be too broad to be appropriate for all students in different grades.

The common item design is different from the above two designs and has its own characteristics. This design identifies the overlapping structure of the tests between adjacent grades. In addition to the appropriately designed items for the test at each grade, a set of common items, also called anchor items, is included in the tests of both grades in order to link the scale from one grade level to the next. These common items provide the basic statistical information for linking tests with similar constructs and reliability, so that scaled scores from both tests are comparable without the assumption of group equivalence. Although the common item design is not difficult to implement, there are many practical issues that need to be considered when this design is used in vertical scaling, especially under the MIRT framework. For example, there is no general rule on the number of common items needed for an adequate vertical scaling. Also, the characteristics that items should have when they are selected as common items for vertical scaling have not been clarified. Although there are some rules addressing the above questions in horizontal equating, the relationship between the common items and the tests to be linked has not been fully examined in vertical scaling, let alone under the MIRT framework.

2.2 Linking Methods

For linking under the common item design, item parameters estimated from different test forms can be put onto the same scale with either separate calibration or concurrent calibration. When concurrent calibration is used, item responses for all grade levels are formatted for a single computer run, with missing item responses coded as "not presented". For separate calibration, numerous research studies have been conducted in both the UIRT and MIRT fields.

2.2.1 Linking methods in UIRT

In the unidimensional IRT model, the probability of a correct answer mainly depends on the linear combination of the item discrimination parameter, the item difficulty parameter and the person proficiency parameter in the exponent of the model. When the UIRT model holds, given the probability of a correct response, a proper linear transformation of the proficiency scale results in a consistent transformation of the item parameter scale. That is to say, if the proficiency is linearly transformed from the Y scale to the X scale, that is, \theta_X = A\theta_Y + B, the item parameters can be transformed as follows so that the model produces exactly the same fitted probabilities:

a_X = \frac{a_Y}{A}, \quad b_X = A b_Y + B, \quad c_X = c_Y.    (2.1)

Note that guessing parameters are unaffected by the scale transformation. The above reflects the unit and origin indeterminacies in the UIRT model. Due to these indeterminacies, the software packages for UIRT calibration often provide item and person parameter estimates based on the constraints that E(\theta) = 0 and \operatorname{var}(\theta) = 1.
Due to these indeterminacies, the software packages for the UIRT calibration often provide item and person parameter estimates based on the constraints that E ( )  0 and var( )  1 . Therefore, if two forms under the common item design are separately calibrated, a linear transformation is needed to put the two sets of estimates onto the same scale using the assumption that the common items in both forms have the same item parameters, so as to capture 16 the proficiency difference between groups. In practice, this is often done by putting item parameters estimated from the new form on the scale of the old form or base form. Generally speaking, there are four methods for the scale transformation in UIRT: the mean/mean method and mean/sigma method that belong to the moments methods, and the Haebara method and Stocking-Lord method that belong to the characteristic curve methods. In the mean/mean method (Loyd & Hoover, 1980), the A and B parameters for the scale transformation are computed as     (aY ) A  and B   (b X )  A (bY ) .  (a X ) (2.2) On the other hand, in the mean/sigma method (Marco, 1977), the A parameter is estimated via the standard deviations of difficulty in both forms by     (b X ) A and B   (b X )  A (bY ) .   (bY ) (2.3) The mean/mean and mean/sigma methods described above do not consider all item parameters simultaneously (Kolen & Brennan, 2004). Haebara (1980) and Stocking and Lord (1983) avoided it by using the item or test characteristic curves to estimate the transformation. The Haebara method considered the difference between the item characteristic curves of common items, and for examinees of a particular proficiency level  j , the sum of the squared difference between the curves of each item is expressed as  aYi      Hdiff ( j )  [ pij ( j ; a Xi , b Xi , c Xi )  pij ( j ; , AbYi  B, cYi )]2 . (2.4) A i Then this method finds A and B by minimizing the summation across all proficiency levels as Hcrit   Hdiff ( j ) . j 17 (2.5) Comparatively, the Stocking-Lord method minimizes the sum of the squared differences between the two test characteristic curves across all proficiency levels as     a   SLcrit   [ pij ( j ; a Xi , b Xi , c Xi )   pij ( j ; Yi , AbYi  B, cYi )]2 . (2.6) A j i i 2.2.2 Linking methods in MIRT Several researchers (Li & Lissitz, 2000; Min, 2003; Oshima et al., 2000; Reckase, 2007; Reckase & Martineau, 2004) proposed the scaling methods based on separate multidimensional IRT calibrations. Oshima et al. (2000) developed several methods, which are extensions from the unidimensional IRT linking methods (e.g. the Haebara method and the Stocking-Lord method), to obtain (1) the rotation matrix to simultaneously adjust the orientation of coordinate axes and the variances of proficiencies, and (2) the translation vector to adjust the means of proficiencies. The scaling procedure in Li and Lissitz (2000) was carried out through an orthogonal Procrustes rotation matrix and a central dilation constant obtained by minimizing the sum of squared errors between the estimated item discrimination matrix from the base form and the transformed one from the alternate form, and a translation vector obtained by the least squares method of minimizing differences between the difficulty estimates from the base form and the transformed ones from the alternate form. 
2.2.2 Linking methods in MIRT

Several researchers (Li & Lissitz, 2000; Min, 2003; Oshima et al., 2000; Reckase, 2007; Reckase & Martineau, 2004) have proposed scaling methods based on separate multidimensional IRT calibrations. Oshima et al. (2000) developed several methods, which are extensions of the unidimensional IRT linking methods (e.g., the Haebara method and the Stocking-Lord method), to obtain (1) the rotation matrix that simultaneously adjusts the orientation of the coordinate axes and the variances of the proficiencies, and (2) the translation vector that adjusts the means of the proficiencies. The scaling procedure in Li and Lissitz (2000) was carried out through an orthogonal Procrustes rotation matrix and a central dilation constant, obtained by minimizing the sum of squared errors between the estimated item discrimination matrix from the base form and the transformed one from the alternate form, and a translation vector obtained by the least squares method of minimizing the differences between the difficulty estimates from the base form and the transformed ones from the alternate form. Min (2003) improved this procedure by using a dilation matrix in lieu of the central dilation constant to take the different unit scales of different dimensions into account. Later, Reckase and Martineau (2004) proposed an oblique Procrustes rotation method to match the discrimination estimates of the common items obtained from separate calibrations of the linking tests.

With the MIRT indeterminacy formula in Equation 1.7, the oblique Procrustes rotation method is shown below in detail. First, the transformation between the discrimination estimates of the common items from the alternate form to the base form is obtained by

\mathbf{T} = (\hat{\mathbf{A}}_a' \hat{\mathbf{A}}_a)^{-1} \hat{\mathbf{A}}_a' \hat{\mathbf{A}}_b,    (2.7)

where \mathbf{T} is the m by m rotation matrix, \hat{\mathbf{A}}_a is the n by m matrix of discrimination estimates from the alternate form, and \hat{\mathbf{A}}_b is the matrix of the same size from the base form, which is the target of the transformation. Then \tilde{\mathbf{A}}_b, the a-parameter estimates from the alternate form on the metric of the base form, is

\tilde{\mathbf{A}}_b = \hat{\mathbf{A}}_a \mathbf{T}.    (2.8)

The transformation of the d-parameter estimates from the alternate form to the metric of the base form is obtained by

\mathbf{M}' = (\hat{\mathbf{d}}_b - \hat{\mathbf{d}}_a)' \hat{\mathbf{A}}_b (\hat{\mathbf{A}}_b' \hat{\mathbf{A}}_b)^{-1},    (2.9)

where \hat{\mathbf{d}}_b is the n-element vector of d-parameter estimates from the base form, and \hat{\mathbf{d}}_a is the vector of the same size from the alternate form. Then the d-parameter estimates of the alternate form on the base form metric are

\tilde{\mathbf{d}}_b = \hat{\mathbf{d}}_a + \hat{\mathbf{A}}_a \mathbf{T}\mathbf{M}.    (2.10)

Accordingly, the \boldsymbol{\theta} estimates from the alternate form on the base form metric are

\tilde{\boldsymbol{\theta}}_b = \mathbf{T}^{-1} \hat{\boldsymbol{\theta}}_a + \mathbf{M}.    (2.11)
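Read as a least squares problem, the transformation in Equations 2.7 through 2.11 regresses the base-form estimates on the alternate-form estimates for the common items. The sketch below is a minimal NumPy version under that reading; it is not the code used in this study, and the parameter estimates are hypothetical.

```python
import numpy as np

def oblique_procrustes_link(Aa, Ab, da, db):
    """Oblique Procrustes linking (Equations 2.7-2.10) based on common item estimates."""
    T = np.linalg.solve(Aa.T @ Aa, Aa.T @ Ab)         # Eq. 2.7: rotation/scaling matrix
    M = np.linalg.solve(Ab.T @ Ab, Ab.T @ (db - da))  # Eq. 2.9: translation vector
    A_linked = Aa @ T                                 # Eq. 2.8: a-estimates on base metric
    d_linked = da + Aa @ T @ M                        # Eq. 2.10: d-estimates on base metric
    return T, M, A_linked, d_linked

# Hypothetical common item estimates: 5 items, 2 dimensions
Ab = np.array([[1.2, 0.3], [0.9, 0.4], [0.3, 1.1], [0.5, 0.9], [0.8, 0.8]])  # base form
Aa = np.array([[1.0, 0.5], [0.8, 0.5], [0.1, 1.2], [0.3, 1.0], [0.6, 0.9]])  # alternate form
db = np.array([0.1, -0.2, 0.4, 0.0, -0.3])
da = np.array([0.6, 0.3, 0.9, 0.5, 0.2])

T, M, A_linked, d_linked = oblique_procrustes_link(Aa, Ab, da, db)
theta_a = np.array([0.4, -0.2])                       # one person's estimate, alternate form
theta_linked = np.linalg.solve(T, theta_a) + M        # Eq. 2.11 on the base form metric
print(A_linked, d_linked, theta_linked)
```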
Most studies of MIRT linking methods have focused on separate calibrations. One of the few examples of concurrent calibration for MIRT linking can be found in Reckase and Li (2007), who discussed the issue of using concurrent calibration to link tests from adjacent grades.

2.2.3 Linking through separate and concurrent calibrations

Numerous studies have discussed and compared the linking results obtained with separate and concurrent calibrations. The findings on which method performs better are mixed. Kim and Cohen (1998) compared separate and concurrent linking methods in three ways (separate calibration with characteristic curve linking, concurrent calibration with marginal maximum a posteriori estimation, and concurrent calibration with marginal maximum likelihood estimation) using simulated unidimensional data. They found that the three methods could yield similar results when the number of common items was large; however, when the number of common items was small, they noticed that separate calibration could provide more accurate results. Contrary to that, some other studies found that concurrent calibration could produce more stable linking results even when the number of common items was not large. For example, in the study by Hanson and Béguin (2002), concurrent calibration was found to result in lower errors than separate calibration using the BILOG-MG software package when the groups were non-equivalent. One possible reason they gave was that the parameter estimates for the common items are based on larger samples. But they also reported the effect of different software packages when comparing concurrent and separate calibrations. When the MULTILOG software package was used, the concurrent estimation also performed well, except when the two groups had a mean difference of one standard deviation.

Furthermore, Kolen and Brennan (2004, p. 391) pointed out that concurrent estimation might be preferable, since it is less time consuming and is supposed to generate more stable results given that the IRT model holds. However, they also indicated that in practice separate estimation might be more popular. The reason is that the two sets of item parameters estimated from different tests can be compared to check their behavior and identify potential problems under the common item design. More importantly, according to them, violation of the unidimensionality assumption could cause problems in the concurrent calibration for vertical scaling, since this approach assumes a single proficiency estimated across all grades. Béguin, Hanson and Glas (2000) also examined the accuracy of equating with separate and concurrent calibration using the unidimensional IRT model when the data were actually generated with a two-dimensional model. It was found that under the equivalent groups design, separate calibration performed consistently better than concurrent calibration. In the non-equivalent groups design, both methods gave unsatisfactory results when compared to those from a correctly specified two-dimensional model; however, when the proficiency correlation was high, separate calibration still performed better than concurrent calibration. Therefore, separate calibration seemed to show comparative robustness to the violation of unidimensionality, and this may be due to the fact that the parameter estimation was carried out for only one grade level at each computer run. This finding was confirmed by many studies (e.g., Hoskens, Lewis, & Patz, 2003; Karkee, Lewis, Hoskens, Yao, & Haug, 2003; Kim & Cohen, 1998). However, according to the studies by Hanson and Béguin (2002), Patz and Yao (2007) and Yao and Mao (2004), there is some evidence that, given the model is correctly specified, the concurrent calibration method might produce more stable results.

There have also been several discussions arguing that concurrent calibration should be conducted with software packages that allow multiple group estimation, such as BILOG-MG. However, when the MIRT model is assumed, there is currently no efficient software package for MIRT calibration with multiple groups. On the other hand, according to Simon (2008), in MIRT linking, concurrent calibration conducted without additional parameters for multiple groups generally performs better than linking methods based on separate calibration, even when the mean difference between the proficiencies of the two groups was 0.5 standard deviation and the correlation among the proficiency dimensions was high. Based on the results of the above research, the concurrent calibration method should be used for the multidimensional IRT estimation in this study so that the MIRT estimates from different tests are automatically aligned to the same coordinate system.

2.3 Research on Common Items in Equating

Under the common item non-equivalent groups design, there are numerous research studies on the effect of different common items on equating results. Some studies (Haertel, 2004; Michaelides & Haertel, 2004) investigated the behavior of linking items in test equating and found that the error caused by the selection of common items has been overlooked in the error calculation process.
Some other studies (Raju, Edwards, & Osberg, 1983; Wingersky & Lord, 1984) investigated the minimum adequate number of common items and suggested that as few as five or six carefully chosen items could serve as satisfactory anchors in IRT equating when the item parameters of both tests are estimated in a single analysis. However, a rule of thumb for the minimum number of common items was given by Angoff (1984, p. 107), who suggested that 20 items, or 20% of the total number of items in the test, is more appropriate for linking. More importantly, in order to reflect group differences accurately, the set of common items should be proportionally representative of the total test in content and statistical characteristics (Kolen & Brennan, 2004, p. 19; Petersen, Kolen, & Hoover, 1989, p. 246). This is a commonly accepted rule for equating in both research and operational work. Sinharay and Holland (2006a) reexamined discussions about the correlation between common items and the total test in equating. Based on previous studies, they asserted that the miditest, which consists of medium difficulty items, has higher reliability and also higher correlation with the total test than the commonly used minitest. Since it has long been believed that a higher anchor-test-to-total-test correlation leads to better equating (Angoff, 1971, p. 577; Petersen et al., 1989, p. 246), Sinharay and Holland (2006b) doubted the necessity of selecting common items to form a mini-version of the total test and pointed out that an anchor test with a spread of item difficulties smaller than that of the total test seems to perform as well as or even better than a minitest. In that study, they also discussed the issue of composing anchor tests with different spreads of difficulty using data simulated from a multidimensional IRT model. They found that the content representativeness of the anchor items is crucial in equating, but there seems to be no practically significant difference in equating performance between using a minitest or a miditest as the anchor test. Note that all the equatings in their study were conducted using classical methods, although the data were simulated with IRT models.

Nevertheless, all the above research focused on principles and suggestions for common item selection in horizontal equating. Different from equating, tests in vertical scaling cannot be designed as parallel forms, since items with different difficulty levels should be selected to be consistent with the curriculum and instruction for different grades; therefore, vertical scaling can only be called linking instead of equating, which requires more restrictive conditions on the characteristics of the equated tests. Additionally, the validity of inferences is highly influenced by the strength of the linkage, which is determined by the characteristics of the common items under the common item design.

2.4 Research on Common Items in Vertical Scaling

Several studies have also been conducted on common item selection in vertical scaling. Two simulation studies (Jiao & Wang, 2006; Wang, Jiao, Young, & Jin, 2006) explored the effects of the linking items when they come from different sources: using only below-grade items, using only above-grade items, or using both below-grade and above-grade items. They showed inconsistent recovered growth patterns and variability of scale scores when different off-grade items were used for vertical linking in a common person design.
Jiao and Wang (2007) tested the effect of anchor items with respect to the source of linking items, target test difficulty and the percentage of linking items relative to the total test. They found that the separate calibration method using both below-grade and above-grade items for linking would lead to the best recovery. In addition, they concluded that more linking items could yield less mean bias in proficiency estimation and higher classification accuracy. The simulation study by Turhan et al. (2007) tested the effect of anchor items on vertical scaling according to the item difficulty level, the proficiency distribution and the dimensionality of the constructs. They concluded that with appropriate content coverage, any item from upper or lower grade tests could be selected as an anchor item, and that slight violations of the unidimensionality assumption did not distort the vertical scale, given good content coverage by the anchor items. However, in their design, anchor items were only selected according to difficulty and grade level instead of different content domains; therefore, it was insufficient to draw conclusions about the effect of content coverage by anchor items, which was not examined in their study.
2.5 Research Objectives and Questions
Most of the above studies were based on the results from unidimensional IRT calibration, even for data simulated from a MIRT model. As is well known, the unidimensional IRT model assumes that only one latent trait is necessary to account for variations in examinees' response strings (Lord, 1980); however, in practical settings, multiple abilities/traits are often required to get correct responses in standardized tests. Since the MIRT model is specially designed for multidimensional data, it is very important to choose appropriate common items for vertical scaling under the framework of multidimensional IRT instead of a misspecified unidimensional IRT model. Based on the previous studies on common item selection methods, although some criteria have been set up to select common items via unidimensional IRT, I could hardly find any guidelines in the MIRT vertical scaling literature. Nor did any research evaluate different common item selection methods when the constructs measured by the tests from different grades are not identical. Therefore, it is very important to further examine these issues, and this study aims to answer the following three research questions. First, in Part I, a design is used to evaluate different ways to select common items for vertical scaling in MIRT when both lower and upper grade tests measure the same constructs. In this part, items in the tests of both grades are manipulated to mainly differ in difficulty. Common items used for linking can be selected according to different content coverage and item difficulty levels from the MIRT framework. In addition to these MIRT methods, one classical correlation method is also applied to select common items to examine their influence on the scale linking. Second, in Part II, a design is used to evaluate different ways to select common items for vertical scaling in MIRT when the upper grade test measures more constructs than the lower grade test. This part is designed to evaluate whether it is useful to include in the anchor test items that measure the constructs only in the upper grade test, or whether it is sufficient to include items that measure the constructs in both tests.
More specifically, this design is to test the effectiveness of using items measuring common constructs to replace those measuring the unique constructs in the anchor test, especially when the proficiency correlation between these constructs is high. Third, in both Part I and Part II, designs are also used to evaluate the effect of different proficiency correlation levels on the linking results. One special interest is to examine whether the correlation between the construct measured only by the upper grade test and the construct measured by both tests has any impact on the linking strength in the Part II design. Different ways of selecting common items are evaluated and compared by checking the accuracy of item and person parameter recovery. Since parameters are not available in real data, a simulation study is used to answer the aforementioned research questions. Results of this study can provide practitioners with guidance on which common item selection method should be used in vertical scaling under the MIRT framework.
CHAPTER 3
DESIGNS AND METHODS
This chapter first describes the design and data generation method for the two parts that deal with different content structures. Then the MIRT calibration and the evaluation criteria for parameter recovery are discussed in detail.
3.1 Parameter Simulation
In order to make the parameters more realistic, efforts were made to match the generating parameters to those estimated from real data with respect to their distribution, structure and complexity. The parameters in this study were either taken from those in the study by Reckase and Li (2007) or generated from the estimated distributions from that study, and some adjustments were made to these parameters to match the research interest of this study. The parameters used in their study were revised from the research results of Li (2006), who analyzed in detail the data of the 2005 mathematics tests from the Michigan Educational Assessment Program (MEAP, 2005). For the content structure defined in that study, the Grade 6 test is considered to measure two constructs, named "Problem Solving" and "Arithmetic", while the Grade 7 test measures one more construct, "Algebra", in addition to these two. Note that these constructs were determined from item clustering rather than from the content specifications for the test. All items were generated to be approximate simple structure items (Roussos, Stout, & Marden, 1998), so that they have high discrimination values on one dimension and low values on all other dimensions. These approximate simple structure items were used for simplicity and clarity, in that the correlation between item responses is assumed to be determined only by person proficiencies, rather than by correlated composite effects caused by the items (Fang, 2008). The idea of the item cluster, first proposed by Miller and Hirsch (1992), was also used for the generation of item discrimination parameters. An item cluster is defined as a set of items whose vectors point in roughly the same direction in the multidimensional space. All items within the same cluster are supposed to measure the same proficiency, which can be placed on one continuous scale. Roussos et al. (1998) gave an example of using item clusters to define the dimensionality based on inter-cluster proximity matrices. According to Reckase (2009, p. 221), for all items within one cluster, the angle between each pair of item vectors should be small.
As is well known, the angle between any two item vectors in a multidimensional space can be computed through the following formula:

\alpha_{i_1,i_2} = \arccos\!\left(\cos\boldsymbol{\alpha}_{i_1}{}'\cos\boldsymbol{\alpha}_{i_2}\right) = \arccos\!\left(\frac{\sum_{k=1}^{m} a_{i_1 k}\, a_{i_2 k}}{\sqrt{\sum_{k=1}^{m} a_{i_1 k}^{2}}\;\sqrt{\sum_{k=1}^{m} a_{i_2 k}^{2}}}\right), (3.1)

where \alpha_{i_1,i_2} is the angle between the vectors for items i_1 and i_2, \boldsymbol{\alpha}_{i_1} and \boldsymbol{\alpha}_{i_2} are the vectors of direction angles for items i_1 and i_2, and a_{i_1 k} and a_{i_2 k} are the kth dimensional discrimination parameters for items i_1 and i_2, respectively. The angle between item vectors can range from 0° to 90°. An angle of 0° means that the two item vectors point in exactly the same direction and the underlying proficiencies measured by these two items are perfectly correlated, while an angle of 90° indicates that there is no correlation between the two underlying proficiencies. The multidimensional discrimination parameter was simulated from a lognormal distribution, with log(MDISC) having a mean of 0 and a standard deviation of 0.2. The simulation of within-cluster angles followed the idea of approximate simple structure, where item vectors measuring a certain dimension randomly fall within 15° of that dimensional axis. Therefore, the angle between an item vector and its dominant dimension was simulated from a uniform distribution ranging between 0° and 15°. In a two-dimensional situation, the angle with the other dimension was then calculated as 90° minus the first angle; however, in a three-dimensional situation, the angle with a second dimension was simulated from a uniform distribution ranging between 90° minus the first angle and 90°, and the third angle was obtained from the mathematical fact that the sum of the squared cosines of all direction angles must equal one. The MDIFF values were simulated according to normal distributions with a mean of -0.2 for the lower grade, a mean of 0 for the upper grade, and a standard deviation of 0.75 for both grades. Since the mean of these simulated MDIFFs may not be close to the proposed mean due to the small sample, the MDIFFs of all unique items were adjusted according to the difference between the proposed and sample means, and this was done for each dimension and for each grade. Finally, d_i was computed as -MDIFF_i × MDISC_i, as in Equation 1.5. This study was divided into two parts that address different research questions. In Part I, both lower and upper grade tests measure the same two constructs, while in Part II, besides the same two constructs measured by both tests, one more construct is measured by the upper grade test. In each part, parameters for 3000 persons and 40 unique items were simulated for each grade. The same unique item parameters were employed for the simulation in each part, while the person proficiency parameters were manipulated to vary according to different correlation levels. Additionally, one item pool with 100 items was created according to the generating distribution of these unique items, and 10 common items were selected from the pool according to different criteria, such as content coverage, item difficulty and classical point-biserial correlation. All the parameters were generated using Matlab (The MathWorks, 2008). With the item and person parameters, the probability matrix was computed using the MIRT model in Equation 1.2 with the guessing parameters assumed to be 0. Then the dichotomous response matrix was created by comparing the true probability matrix with a matrix whose elements were randomly simulated from a standard uniform distribution, as sketched below.
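To make the generation procedure concrete, the following is a minimal Python sketch (the study itself used Matlab, and the exact code is not reproduced here). It assumes the compensatory MIRT model of Equation 1.2 with guessing fixed at 0; the variable names, the random seed, the example correlation of 0.6, and the omission of the small-sample MDIFF mean adjustment are illustrative choices, not part of the original study.

```python
# Illustrative sketch (not the original Matlab code): approximate simple
# structure item parameters and dichotomous responses for one grade,
# assuming P = 1 / (1 + exp(-(a'theta + d))) with guessing fixed at 0.
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items, n_dims = 3000, 40, 2

# Approximate simple structure: each item vector lies within 15 degrees
# of the axis of its dominant dimension.
cluster = np.repeat([0, 1], n_items // 2)              # dominant dimension per item
mdisc = rng.lognormal(mean=0.0, sigma=0.2, size=n_items)
angle = np.deg2rad(rng.uniform(0, 15, size=n_items))   # angle with dominant axis
a = np.zeros((n_items, n_dims))
a[np.arange(n_items), cluster] = mdisc * np.cos(angle)
a[np.arange(n_items), 1 - cluster] = mdisc * np.sin(angle)

# MDIFF ~ N(-0.2, 0.75) for the lower grade; d = -MDIFF * MDISC
# (the small-sample mean adjustment described above is omitted here).
mdiff = rng.normal(-0.2, 0.75, size=n_items)
d = -mdiff * mdisc

# Correlated proficiencies, e.g. correlation 0.6, lower-grade mean (-0.2, -0.2).
rho = 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
theta = rng.multivariate_normal([-0.2, -0.2], cov, size=n_persons)

# True probability matrix and dichotomous responses.
prob = 1.0 / (1.0 + np.exp(-(theta @ a.T + d)))        # persons x items
responses = (rng.uniform(size=prob.shape) < prob).astype(int)
```

In the study, this step was repeated for both grades and for the 100-item pool, and items not administered to a grade were coded as not presented, as described next.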
In this study, in order to make the results more comparable, a full response matrix was generated for each replication. This full response matrix is a matrix of 6000 examinees by 180 items. The first 100 columns in the matrix contain responses to all items in the item pool for all examinees, while the remaining 80 columns contain responses to unique items for examinees from either the lower or the upper grade; each of the two sets of 40 unique items was answered only by the 3000 examinees in the corresponding grade. Since unique items designed for the lower grade were not administered to upper grade examinees, and vice versa, those responses were missing by design and coded as not presented in the response matrix. Based on this full response matrix, the response matrix analyzed for each common item selection method was created by combining responses to the 10 common items selected from the item pool with responses to all the unique items. The layout of this response matrix is shown in Table 3.1. Therefore, for each replication, the difference among the data matrices for the different item selection methods lies only in the responses to the different sets of common items.

Table 3.1. Layout of Person by Item Response Matrix

Person \ Item                     Common items (10)            Lower grade items (40)       Upper grade items (40)
Lower grade examinees (3000)      lower grade item responses   lower grade item responses   not presented
Upper grade examinees (3000)      upper grade item responses   not presented                upper grade item responses

3.2 Parameter Estimation and Vertical Scaling
In this study, concurrent calibration with common items as links was used for scaling, so that the item estimates for the tests in both grades were automatically put on the same coordinate system. The MIRT calibration of the response matrix was conducted with the TESTFACT software package. As mentioned in Section 1.3.4, TESTFACT estimates item parameters by applying the EM algorithm to maximize the marginal maximum likelihood, where person proficiency parameters are integrated out over a pre-assumed proficiency distribution. These item estimates are then regarded as fixed parameters for the proficiency estimation. The convergence criterion for the EM algorithm was set to a maximum of 200 cycles with a precision of 0.005, and nine quadrature points and the EAP scoring method were specified in the TESTFACT syntax for the proficiency estimation. For the concurrent calibration in TESTFACT, the item and person parameter estimates were supposed to be ready for further analysis; however, one small error was found in the person parameter estimation of this software package when concurrent calibration was used under the common item design. It seemed that person proficiencies for the second group were incorrectly estimated using the item estimates from the first group; therefore, instead of one computer run to obtain both item and person estimates simultaneously, person proficiencies were separately estimated for each grade in additional computer runs by fixing the relevant item (10 common items + 40 unique items) estimates as parameters. A sketch of this kind of fixed-item EAP scoring step is given below.
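The per-grade scoring runs were carried out in TESTFACT itself; as a rough illustration of what EAP scoring with fixed item parameters involves, a minimal Python sketch follows. It assumes an independent standard normal prior and a simple equally spaced grid rather than TESTFACT's actual quadrature scheme, and the function and variable names (eap_scores, a_hat_fixed, d_hat_fixed) are hypothetical.

```python
# Illustrative sketch of EAP proficiency scoring with fixed item parameters,
# in the spirit of the per-grade TESTFACT runs described above. Assumes a
# standard multivariate normal prior and the guessing-free MIRT model.
import numpy as np
from itertools import product

def eap_scores(responses, a, d, n_quad=9):
    """responses: persons x items (0/1); a: items x dims; d: items."""
    n_dims = a.shape[1]
    nodes = np.linspace(-4, 4, n_quad)                    # per-dimension grid
    grid = np.array(list(product(nodes, repeat=n_dims)))  # all grid points
    prior = np.exp(-0.5 * np.sum(grid ** 2, axis=1))      # independent N(0, 1)
    prior /= prior.sum()

    p = 1.0 / (1.0 + np.exp(-(grid @ a.T + d)))           # grid points x items
    # Log-likelihood of each examinee's response string at each grid point.
    loglik = responses @ np.log(p).T + (1 - responses) @ np.log(1 - p).T
    post = np.exp(loglik) * prior                          # persons x grid points
    post /= post.sum(axis=1, keepdims=True)
    return post @ grid                                     # EAP estimate per person

# Hypothetical use for one grade, with item estimates held fixed:
# theta_hat_lower = eap_scores(resp_lower, a_hat_fixed, d_hat_fixed)
```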
Due to the aforementioned three indeterminacies in the MIRT model (Li & Lissitz, 2000), TESTFACT provides one solution for the item and person parameter estimates using some convenient constraints. These estimates are subject to rotational, unit and origin transformations; however, the core part a'θ + d should be invariant to the transformation to ensure the invariance property of the MIRT model, which means that the probabilities of item responses remain unchanged through the transformation (Reckase, 2009, p. 235).
3.3 Evaluation Criteria
In order to reduce the impact of the different common items on the evaluation, the common items were not included in the calculation of the evaluation indices, although they were used to link the scales across the two grades and to estimate person proficiencies for each grade. In this study, three indices, namely Pearson's correlation, bias and Root Mean Squared Error (RMSE), were employed to evaluate the parameter recovery. Among these indices, the correlation reflects the linear trend between the estimated and true parameters, the bias is the average of the differences between the estimated and true parameters across replications, and the RMSE is the square root of the mean of those squared differences. In this study, high correlation, zero bias and low RMSE indicate a good recovery of parameters. The formulas for the bias and RMSE of a parameter estimate are shown in Equations 3.2 and 3.3:

\mathrm{Bias} = \frac{\sum_{r}\left(\hat{\theta}_r - \theta\right)}{R}, (3.2)

\mathrm{RMSE} = \sqrt{\frac{\sum_{r}\left(\hat{\theta}_r - \theta\right)^{2}}{R}}, (3.3)

where \hat{\theta}_r is the estimate of a parameter \theta in the rth replication, and R is the total number of replications. The linking performance of the different item selection methods was evaluated with four parameter recovery criteria: the probability matrix recovery, the item a-parameter recovery, the item d-parameter recovery and the effect size recovery. Although the estimated and true probability matrices can be compared directly for the different item selection methods, the recoveries of item and person parameters can only be evaluated after the estimates from the TESTFACT software are put onto the same coordinate system as the generating parameters. In the study by Reckase and Li (2007), before the evaluation of parameter recovery, the item discrimination estimates were rotated to a simple structure through the oblique Procrustes rotation method described in Section 2.2.2 to match the content dimensions measured by these items. The target matrix in their study was defined as the 0/1 matrix with 1s for the measured dimension and 0s elsewhere; in contrast, in this study, the true discrimination parameter matrix was used as the target to resolve the rotational and unit indeterminacies, as in Simon (2008), and the d estimates were then adjusted accordingly by matching them with the d-parameters. Finally, the evaluation of the effect size recovery was conducted after the person parameter estimates were transformed based on the transformations of the item estimates. A brief sketch of this matching step and of the recovery indices is given below.
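The least-squares transformation in the sketch below is the standard form of an oblique Procrustes fit to a target matrix; it is assumed, not verified, to correspond to the method cited in Section 2.2.2 (Equations 2.7 and 2.8 are not reproduced here), and the function and array names are hypothetical.

```python
# Illustrative sketch: matching TESTFACT a-estimates to the generating
# parameters and computing the recovery indices of Equations 3.2-3.3.
import numpy as np

def procrustes_to_target(a_est, a_true):
    """Transformation T minimizing ||a_est @ T - a_true|| in least squares."""
    t, *_ = np.linalg.lstsq(a_est, a_true, rcond=None)
    return t

def recovery_indices(est_stack, true_values):
    """est_stack: replications x parameters; true_values: parameters."""
    diff = est_stack - true_values
    bias = diff.mean(axis=0)                      # Equation 3.2, per parameter
    rmse = np.sqrt((diff ** 2).mean(axis=0))      # Equation 3.3, per parameter
    corr = np.array([np.corrcoef(est, true_values)[0, 1] for est in est_stack])
    return bias, rmse, corr                        # corr is per replication

# Hypothetical use for one replication's discrimination estimates:
# T = procrustes_to_target(a_hat, a_true)
# a_hat_matched = a_hat @ T
```

As described above, the d estimates and the person proficiency estimates would then be adjusted in a way consistent with the same transformation before the indices are computed.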
The computation details for each evaluation criterion are described as follows:
1. The recovery of the probability matrix. The estimated probability matrix for correct responses was computed using the item and person parameter estimates. Meanwhile, the true probability matrix could also be obtained from the parameters used for the simulation. The correlation was calculated between the two vectors obtained by vectorizing the true and estimated probability matrices for each replication and then averaged across all replications; however, the bias and RMSE were computed for each entry in the probability matrix and then averaged across items and examinees.
2. The recovery of item a-parameters. With the oblique Procrustes rotation method, the item estimates from the TESTFACT software were first transformed to match the generating parameters and then compared with those parameters. The correlation for the recovery on each dimension was calculated between the estimated and true a-parameters for each replication and then averaged across replications; however, the bias and RMSE were computed for each a-parameter and then averaged across items for each dimension.
3. The recovery of item d-parameters. The d estimate is affected not only by the rotation and unit indeterminacies but also by the origin indeterminacy. Reckase (2009, p. 242) mentioned that "the change in d-parameter is the addition of a term that is the shift in origin weighted by the a-parameter corresponding to the coordinate axis." Hence, it is the cumulative effect of the change of the coordinate system that results in the change of the d-parameters. After the d estimates from the TESTFACT software were transformed, the correlation was calculated between the adjusted estimates and the true d-parameters for each replication and then averaged across replications; however, the bias and RMSE were computed for each item and then averaged across items.
4. The recovery of effect sizes. The effect size, which is of great interest to policy makers, shows how sensitive each common item selection method is in detecting the differences in achievement across the two grades. In this study, the effect size was computed by dividing the difference in the mean proficiencies of examinees in the two grades by the pooled standard deviation of the proficiencies of these examinees, based on the true parameters as well as on the transformed estimates. The formula is as follows:

\mathrm{ES}_k = \frac{\mu_{k,\mathrm{upper}} - \mu_{k,\mathrm{lower}}}{s_k}, (3.4)

and

s_k = \sqrt{\frac{s_{k,\mathrm{upper}}^{2} + s_{k,\mathrm{lower}}^{2}}{2}},

where \mu_k is the mean of the proficiencies on the kth dimension and s_k is the pooled standard deviation of the proficiencies on the same dimension. Since the effect size is an aggregated statistic, the three indices for parameter recovery are not appropriate for its evaluation; instead, the magnitudes of the estimated and true effect sizes were compared. Therefore, for each dimension, the estimated effect size was calculated for each replication and then averaged across replications for the comparison with the true effect size.
CHAPTER 4
PART I: SAME CONSTRUCTS
4.1 Parameters and Designs
In this part, both lower and upper grade tests are assumed to measure the same two constructs. There are 40 unique items in the test of each grade; in addition, both tests contain 10 common items that are selected according to nine different methods. For the different grades, person proficiencies are simulated from multivariate normal distributions with different mean vectors but the same variance-covariance matrix.
4.1.1 Unique items
For each grade, 20 unique items measure the first construct and the other 20 unique items measure the second one. The cluster number, item parameters, multidimensional difficulty, multidimensional discrimination and direction angles of the unique items for the tests in both grades are listed in Tables 4.1 and 4.2. Items in Cluster 1, with large loadings on a1, mainly measure the proficiency on Dimension 1, while those in Cluster 2 measure the proficiency on Dimension 2.
The means of MDISCs for different grade levels are quite similar; however, the mean and standard deviation of MDIFFs are -0.2 and 0.83 for the lower grade, while they are 0 and 0.63 for the upper grade. Therefore, the mean of MDIFFs for unique items is smaller for the lower grade than for the upper grade. 4.1.2 Common items After unique items for each grade level were simulated, an item pool with 100 items was generated with half of the items simulated using the generation distribution of unique items for each grade level. Items in this pool were then divided into several categories according to the 36 Table 4.1. Unique Item Parameters and Statistics for Lower Grade in Part I Cluster 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Mean Std a1 1.02 1.07 0.83 0.97 0.89 0.90 1.37 1.18 1.12 1.08 1.04 1.27 1.13 1.04 0.96 1.27 1.13 1.08 0.75 1.00 0.12 0.17 0.09 0.24 0.13 0.14 0.00 0.19 0.16 0.22 0.19 0.11 0.13 0.12 0.26 0.02 0.12 0.14 0.02 0.23 0.60 0.48 a2 0.07 0.25 0.03 0.08 0.20 0.06 0.26 0.09 0.14 0.23 0.23 0.23 0.14 0.21 0.08 0.31 0.06 0.09 0.13 0.08 1.26 0.78 0.93 0.90 0.97 0.88 0.80 0.98 1.09 1.01 0.81 0.83 1.00 0.86 1.21 0.88 1.07 0.90 0.67 0.98 0.55 0.42 d 1.00 0.65 0.52 -0.27 -1.22 -0.47 0.39 -0.38 0.35 2.00 0.23 0.60 0.38 0.38 -1.94 1.68 1.34 1.04 -0.71 0.04 -0.76 -0.39 1.39 0.25 0.60 0.41 0.38 0.36 0.82 0.85 0.19 -1.10 0.23 0.47 0.17 1.49 -1.29 -0.08 0.15 -0.60 0.23 0.84 MDIFF -0.97 -0.59 -0.63 0.28 1.34 0.52 -0.28 0.32 -0.30 -1.81 -0.22 -0.46 -0.33 -0.35 2.03 -1.28 -1.19 -0.96 0.94 -0.04 0.60 0.49 -1.48 -0.27 -0.62 -0.46 -0.47 -0.36 -0.75 -0.82 -0.23 1.31 -0.22 -0.54 -0.14 -1.69 1.21 0.09 -0.22 0.60 -0.20 0.83 37 MDISC 1.02 1.10 0.83 0.98 0.91 0.90 1.39 1.18 1.13 1.10 1.06 1.29 1.14 1.06 0.96 1.31 1.13 1.09 0.76 1.00 1.27 0.80 0.94 0.93 0.98 0.89 0.80 1.00 1.10 1.03 0.83 0.84 1.01 0.87 1.23 0.88 1.07 0.91 0.67 1.01 1.01 0.16 1 2 4 13 2 5 13 4 11 5 7 12 13 10 7 12 5 14 3 5 10 4 85 78 84 75 82 81 90 79 82 78 77 82 82 82 78 89 84 81 88 77 86 77 88 85 77 86 79 85 83 78 77 80 83 78 85 76 87 85 80 86 5 12 6 15 8 9 0 11 8 12 13 8 8 8 12 1 6 9 2 13 Table 4.2. 
Unique Item Parameters and Statistics for Upper Grade in Part I Cluster 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Mean Std a1 1.46 1.01 0.68 0.80 0.55 1.07 1.08 1.25 1.08 0.60 1.35 0.94 1.19 1.12 0.80 1.18 1.03 1.05 0.91 1.19 0.19 0.07 0.16 0.25 0.21 0.07 0.24 0.11 0.15 0.22 0.09 0.10 0.27 0.12 0.16 0.10 0.21 0.07 0.20 0.16 0.59 0.47 a2 0.31 0.22 0.13 0.05 0.12 0.28 0.15 0.02 0.20 0.11 0.19 0.24 0.16 0.16 0.02 0.29 0.25 0.15 0.19 0.28 0.80 0.96 0.98 1.03 0.86 1.03 0.90 0.92 0.73 1.22 0.79 0.93 1.22 0.90 0.89 1.04 1.18 0.84 1.31 0.92 0.57 0.42 d -0.91 0.11 -0.03 -0.58 -0.10 0.54 -0.04 -2.22 0.72 -0.08 -0.90 -0.01 -0.77 0.60 0.47 0.36 0.00 0.58 0.58 1.18 -0.33 1.52 1.50 -0.62 0.38 -0.63 0.03 -0.11 -0.39 0.74 0.01 -0.26 -0.20 -0.72 -0.12 0.23 -0.05 -0.46 -0.35 0.06 -0.01 0.69 MDIFF 0.61 -0.11 0.04 0.72 0.17 -0.49 0.04 1.78 -0.65 0.13 0.66 0.01 0.64 -0.53 -0.58 -0.29 0.00 -0.55 -0.63 -0.96 0.41 -1.59 -1.52 0.59 -0.43 0.61 -0.03 0.12 0.53 -0.60 -0.01 0.27 0.16 0.79 0.13 -0.22 0.05 0.55 0.26 -0.06 0.00 0.63 38 MDISC 1.49 1.04 0.69 0.80 0.56 1.11 1.09 1.25 1.10 0.61 1.36 0.97 1.21 1.13 0.80 1.22 1.06 1.06 0.93 1.23 0.82 0.96 0.99 1.06 0.89 1.03 0.93 0.92 0.74 1.23 0.79 0.93 1.25 0.91 0.90 1.04 1.20 0.84 1.33 0.93 1.01 0.21 1 2 12 12 11 4 12 15 8 1 10 10 8 14 8 8 2 14 14 8 12 13 76 86 81 76 76 86 75 83 79 80 83 84 77 83 80 85 80 85 81 80 78 78 79 86 78 75 82 89 80 80 82 76 82 82 88 76 76 82 78 77 14 4 9 14 14 4 15 7 11 10 7 6 13 7 10 5 10 5 9 10 two content dimensions and three difficulty levels. The numbers of items for different combinations of these two factors are shown in Table 4.3. Table 4.3. Number of Items for Different Content and Difficulty Categories in Item Pool of Part I Dimension 1 Dimension 2 Low 15 15 Medium 20 20 High 15 15 All 50 50 Ten common items for tests in both grades were selected from the item pool according to the MIRT methods and the classical correlation method. The details for each method are shown below. (1) The MIRT methods consist of eight methods according to different combinations of content and difficulty coverage.  Content coverage. Common items are selected (a) only from items in Cluster 1, or (b) evenly from items in Cluster 1 and Cluster 2 to achieve full content coverage.  Difficulty coverage. Items in the pool are grouped into the low, medium and high difficulty levels. Common items are selected from (a) only the low level, (b) only the medium level, (c) only the high level, or (d) all three levels. Note that the item difficulty is confounded with the grade level, because most likely, difficult items come from the upper grade and easy items from the lower grade. (2) The classical method selects 10 items with high item-total-test correlation in both lower and upper grade tests. In Methods 1-4 where partial content coverage is achieved, 10 common items were selected from items in Cluster 1 according to different difficulty coverage. For Methods 1-3, a simple random sample of items was selected from the low, medium and high difficulty levels, respectively. For Method 4 with full difficulty coverage, three low, four medium and three high 39 difficulty items were randomly chosen from the corresponding categories, according to their proportions in the item pool. In Methods 5-8, five items were chosen from each of the two item clusters to achieve the full content coverage. Common items in Methods 5-7 were selected from each of the three difficulty categories, respectively. 
Note that the common item set in Method 6 can be regarded as a miditest with full content coverage. For Method 8, the set of common items covers all three difficulty levels and two content domains, and can be considered as a mini-version of the whole test. Method 9 is a post-hoc method with common items selected based on the analysis on some generated response matrices. Correlations between the scores of items in the item pool and the total score for unique items in each grade test were calculated and ranked from high to low. These correlations were computed for two samples of response matrix at each proficiency correlation level. Ten items with high item-total-test correlation in tests of both lower and upper grades were selected as common items for this method. The statistics for different common item sets are listed in Table 4.4. The means of MDIFF values range from -1.05 to 0.92. The average MDIFF values for common items in Methods 2, 4, 6 and 8 are all close to 0. This is reasonable since items in these methods are either from medium difficulty level with a small spread of difficulty values or from all three difficulty levels with a large spread. From the perspective of MIRT, common items in the classical correlation method seemed to be selected from the medium difficulty level but with an unbalanced coverage for the two content domains. This also indicated that items from the medium difficulty level are more highly correlated with the total test than other items. 40 Table 4.4. Statistics of Common Items for Different Selection Methods in Part I Selection Method (1) 1D, Low (2) 1D, Medium (3) 1D, High (4) 1D, All (5) 2D, Low (6) 2D, Medium (7) 2D, High (8) 2D, All (9) Classical MDIFF Mean -0.95 0.00 0.61 0.10 -1.05 -0.09 0.92 -0.16 -0.26 MDIFF Std 0.30 0.30 0.23 0.89 0.44 0.18 0.47 0.87 0.30 # of items in Dim 1 10 10 10 10 5 5 5 5 2 # of items in Dim 2 0 0 0 0 5 5 5 5 8 MDIFF Mean in Dim 1 -0.95 0.00 0.61 0.10 -1.05 -0.09 0.90 -0.37 -0.42 MDIFF Mean in Dim 2 NA NA NA NA -1.05 -0.09 0.95 0.05 -0.22 4.1.3 Person parameters For different grades, person proficiency parameters were simulated according to the multivariate normal distributions with different mean vectors but the same variance-covariance matrix. According to the analyses on MEAP mathematics test by Li (2006), the increases on mathematical skills from the lower grade to the adjacent upper grade were about 0.2 standard deviation units. Therefore, the mean vector of normal distribution for the person proficiency generation was set to be (-0.2, -0.2) for the lower grade and (0, 0) for the upper grade. The variances of person proficiencies on both dimensions were set to one; however, the correlation between proficiencies was manipulated to vary from 0, 0.4, 0.6 to 0.8 for both grades. Responses were simulated for 3000 examinees and 40 unique items in each grade, while those for the common items were simulated for all 6000 examinees in both grades. 4.2 Estimation In order to reduce sampling errors, 30 response matrices were simulated based on the same probability matrix for each proficiency correlation level and the MIRT calibration was conducted 41 on the data matrix for the selected common items and all unique items. This resulted in a total of 1080 computer runs (4 proficiency correlation levels x 30 replications x 9 item selection methods). It took about 20 minutes for each calibration with the TESTFACT software package. Dimensionality of the simulated data was checked with the MIRT calibration using one more dimension. 
The results showed that the item discrimination estimates on the extra dimension were very small, while the estimates were large on at least one of the other dimensions. This justified the use of the two-dimensional solution. Due to the rotational indeterminacy, the TESTFACT software may not orient the axes of the coordinate system for the parameter estimates in the same direction as those of the generating parameters. One common problem is that the a-parameter estimates on different dimensions may be switched, or the estimates on some dimension may be negated, with respect to the generating parameters. However, some additional TESTFACT runs confirmed that the proficiency estimates do not change, no matter whether the a-parameter estimates on any dimension are negated or not. This phenomenon was also observed in the study by Fang (2008) and is explained in the TESTFACT help file, which says that "it may therefore happen that negative scores are associated with above average percent responses and vice versa for below average responses. TESTFACT software attempts to reverse the signs in such a way that scores above zero are usually assigned with above average achievement." These item and person parameter estimates were rotated to match the generating parameters before the evaluation. Thus, the order of these estimates does not lead to any problem; however, it is problematic if proficiency estimates are not correctly paired with item estimates. Therefore, the signs of the item and person parameter estimates may need to be corrected so that these estimates form a valid pair as one solution of the MIRT calibration. In order to correct the sign of the item discrimination estimates, the mean of these estimates was computed for each dimension. Based on the assumption that item discrimination parameters should be positive in the MIRT model, if the mean of the estimates from the TESTFACT software was negative on any dimension, the negated estimates were regarded as the correct item discrimination estimates on that dimension; otherwise, the estimates were kept the same. Although the proficiency estimates seemed to be automatically corrected in the TESTFACT software, a double check was still conducted to examine the signs of these proficiency estimates separately for each grade. First, the percentage of correct responses was obtained for each examinee. Then, the examinees with the highest and lowest percentages were picked and their proficiency estimates were examined. It was expected that all proficiency values would be positive for the examinee with the highest percentage and negative for the examinee with the lowest percentage. If both criteria failed for the proficiency estimates on any dimension, it was very likely that the estimates provided by the TESTFACT software were incorrect, and all proficiency estimates on that dimension were negated to be paired with the item discrimination estimates on that dimension. If only one of the two criteria failed, that replication was picked for further checking. From the checking results, no proficiency estimates were found to have the sign problem; therefore, the TESTFACT software package seems to align the proficiency estimates in the correct direction, as the percentage of correct responses predicts, when both proficiencies contribute significantly to that percentage. A small sketch of this sign check is given below.
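The following is a minimal Python sketch of the sign checks just described; it mirrors the procedure in the text rather than any TESTFACT feature, and the function and array names (correct_signs, a_hat, theta_hat, resp) are hypothetical.

```python
# Illustrative sketch of the sign corrections described above.
# a_hat: items x dims estimates; theta_hat: persons x dims estimates;
# resp: persons x items dichotomous responses for one grade.
import numpy as np

def correct_signs(a_hat, theta_hat, resp):
    a_hat, theta_hat = a_hat.copy(), theta_hat.copy()
    pct_correct = resp.mean(axis=1)
    hi, lo = pct_correct.argmax(), pct_correct.argmin()
    for k in range(a_hat.shape[1]):
        # Discrimination estimates should be positive on average on every dimension.
        if a_hat[:, k].mean() < 0:
            a_hat[:, k] *= -1
        # The top examinee should have positive estimates, the bottom one negative.
        hi_ok = theta_hat[hi, k] > 0
        lo_ok = theta_hat[lo, k] < 0
        if not hi_ok and not lo_ok:        # both checks fail: flip the dimension
            theta_hat[:, k] *= -1
        elif hi_ok != lo_ok:               # only one fails: flag for hand checking
            print(f"dimension {k}: inspect this replication manually")
    return a_hat, theta_hat
```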
4.3 Results
4.3.1 Recovery of probability matrix
The probability matrix of correct responses is obtained by applying the item and person parameters in the MIRT model shown in Equation 1.2. Since the value of a'θ + d is not affected by the MIRT indeterminacies, neither are the probabilities of correct responses. The recovery of the probability matrix, which is indicated by the similarity between the estimated and true probability matrices, is evaluated with the correlation, bias and RMSE indices. For each replication, the correlation was computed using all corresponding elements in the estimated and true probability matrices. The correlation values averaged across replications for the different conditions are listed in Table 4.5. All correlation values are above 0.96, which indicates that the ordering of the estimated probabilities is very similar to that of the true ones. As the proficiency correlation increases, the correlation between the estimated and true probabilities also increases. Among all nine selection methods, Method 9 gives the highest correlation values for all four proficiency correlation levels, and the correlation values for Method 6 are the second highest. Figure 4.1 plots the correlation for the probability matrix recovery against the proficiency correlation level for each item selection method. It is easy to observe that the points representing Methods 9 (classical correlation) and 6 (full content coverage with medium difficulty items) are above all the other points. On the other hand, the points representing Methods 1 (partial content coverage with low difficulty items) and 3 (partial content coverage with high difficulty items) are at the bottom.

Table 4.5. Correlation for the Recovery of Probability Matrix in Part I

Selection Method     p0       p0.4     p0.6     p0.8
(1) 1D, Low          0.9651   0.9681   0.9700   0.9732
(2) 1D, Medium       0.9664   0.9691   0.9710   0.9741
(3) 1D, High         0.9650   0.9680   0.9700   0.9733
(4) 1D, All          0.9660   0.9689   0.9707   0.9738
(5) 2D, Low          0.9665   0.9694   0.9712   0.9741
(6) 2D, Medium       0.9675   0.9701   0.9719   0.9749
(7) 2D, High         0.9656   0.9685   0.9705   0.9735
(8) 2D, All          0.9664   0.9693   0.9711   0.9741
(9) Classical        0.9687   0.9712   0.9729   0.9757

Figure 4.1. Correlation for the Recovery of Probability Matrix in Part I (correlation plotted against the proficiency correlation level for each selection method).

Table 4.6 provides the bias results for the recovery of the probability matrix, and Figure 4.2 plots the bias for the probability matrix recovery against the proficiency correlation level for each item selection method. Note that the bias values in the table were averaged across all items and examinees. One observation is that the selection methods with low difficulty items always yield positive bias values and those with high difficulty items give negative values, while the other MIRT methods give comparatively small absolute values. It is interesting to observe that Methods 2 and 4 also give good results, which seems to indicate that the bias is not influenced by the content coverage. All the absolute values of bias are less than 0.001, which tends to suggest that no bias exists in the probability estimation; however, the values in the table are not sufficient for this judgment, since parameters at different value levels may have different degrees of bias in their estimates, and bias values with different signs can cancel out when averaged across items and examinees. Figure 4.3 shows the plot between the probability parameters and their estimation bias for Method 1 under the condition of zero proficiency correlation.
As can be observed from the figure, the probabilities of large values tend to be underestimated and those of small values tend to be overestimated. The underestimation and overestimation of parameters seem to be less severe for the probabilities of medium values. A further analysis showed that the underestimation is mostly on the probabilities for difficult items answered by examinees with extremely high proficiency on the dimension, which is dominantly measured by each of these items. This is because the estimated proficiencies yielded from the EAP scoring method tend to be smaller than the true proficiencies with large values and this makes the estimated probabilities much smaller than the true probabilities for these difficult items. The overestimation can be explained in a similar way. The plots for all other methods and conditions are similar to this one. Table 4.6. Bias for the Recovery of Probability Matrix in Part I (1) 1D, Low (2) 1D, Medium (3) 1D, High (4) 1D, All (5) 2D, Low (6) 2D, Medium (7) 2D, High (8) 2D, All (9) Classical p0 0.0005 0.0001 -0.0001 0.0001 0.0008 0.0002 -0.0003 0.0002 0.0004 46 p0.4 0.0001 -0.0004 -0.0006 -0.0004 0.0004 -0.0002 -0.0007 -0.0002 0.0000 p0.6 0.0006 0.0001 -0.0001 0.0002 0.0008 0.0003 -0.0002 0.0002 0.0004 p0.8 0.0002 -0.0001 -0.0002 0.0000 0.0003 0.0000 -0.0002 0.0000 0.0001 0.0010 0.0008 (1) 1D, Low Bias, probability 0.0006 (2) 1D, Medium 0.0004 (3) 1D, High 0.0002 (4) 1D, All 0.0000 (5) 2D, Low (6) 2D, Medium -0.0002 (7) 2D, High -0.0004 (8) 2D, All -0.0006 (9) Classical -0.0008 p0 p0.4 p0.6 p0.8 Figure 4.2. Bias for the Recovery of Probability Matrix in Part I Figure 4.3. Bias for the Recovery of Probability Matrix for Method 1 at Zero Proficiency Correlation Level in Part I 47 The RMSE values listed in Table 4.7 also give information on the recovery of the probability matrix for correct responses. There is a clear pattern that as the correlation between proficiencies increases, the value of RMSE consistently decreases for all selection methods. For the MIRT methods selecting items from the same difficulty level, the full content coverage is more important than the partial content coverage. Also, for the same content coverage, the method of including medium difficulty items is the best among the four methods based on different difficulty levels. In particular, Methods 1 and 3 give comparatively larger RMSE values while Methods 6 and 9 provide smaller RMSE values for all four proficiency correlation levels. This can also be observed from Figure 4.4, which gives the plot between the proficiency correlation level and the RMSE value for the probability matrix recovery for each common item selection method. In conclusion, the higher the correlation between proficiencies is, the better the estimated probabilities could match the true values. The classical correlation method gives the highest correlation and lowest RMSE for the recovery of probability matrix and the method of full content coverage with medium difficulty items is the second best. Table 4.7. 
RMSE for the Recovery of Probability Matrix in Part I (1) 1D, Low (2) 1D, Medium (3) 1D, High (4) 1D, All (5) 2D, Low (6) 2D, Medium (7) 2D, High (8) 2D, All (9) Classical p0 0.0752 0.0737 0.0753 0.0742 0.0738 0.0726 0.0748 0.0739 0.0711 p0.4 0.0727 0.0714 0.0727 0.0718 0.0713 0.0704 0.0723 0.0715 0.0689 48 p0.6 0.0707 0.0694 0.0707 0.0699 0.0693 0.0685 0.0702 0.0695 0.0671 p0.8 0.0668 0.0656 0.0667 0.0661 0.0657 0.0648 0.0665 0.0657 0.0635 0.0760 0.0740 RMSE, probability (1) 1D, Low 0.0720 (2) 1D, Medium (3) 1D, High 0.0700 (4) 1D, All (5) 2D, Low 0.0680 (6) 2D, Medium (7) 2D, High 0.0660 (8) 2D, All 0.0640 (9) Classical 0.0620 p0 p0.4 p0.6 p0.8 Figure 4.4. RMSE for the Recovery of Probability Matrix in Part I 4.3.2 Recovery of a-parameters As is well known, the default in the TESTFACT software package for the MIRT calibration uses the zero mean vector and identity variance-covariance matrix for the distribution of proficiency coordinates, which may not match the real situation for proficiencies, and employs no rotation or the Varimax rotation for the orientation of coordinate axes to solve the MIRT indeterminacies. Therefore, the a-parameter estimates obtained from the TESTFACT software cannot be directly compared with generating parameters since they are not in the same coordinate system. These discrimination estimates were rotated to match the parameters using Equations 2.7 and 2.8 for the oblique Procrustes rotation before they were compared with generating parameters. 49 Table 4.8 shows the average correlations between the rotated estimates and true parameters. The result shows that as the correlation between proficiencies increases, the correlation between the estimated and true a-parameters decreases slightly but consistently. Compared with all other selection methods, Methods 6, 8 and 9 could give a little higher correlation values for the recovery on both dimensions and for all four proficiency correlation levels. Figures 4.5 and 4.6 also plot the correlation for the recovery of a-parameters on each dimension. It is clear that when the correlation between proficiencies is small, the differences among methods are also small. However, with the increase of the correlation between proficiencies, the difference between the a-parameters and their rotated estimates tends to become larger for the recovery on any dimension. Table 4.9 shows the bias for the recovery of a-parameters. The value of bias is negative for each dimension and for each item selection method, which indicates that the estimates for aparameters are most likely negatively biased. Also, it seems that the magnitude of bias is larger for the lowest and highest proficiency correlation levels than for the two middle correlation levels. Figure 4.7 shows the plot between the bias values and a1 -parameters for Method 1 under the zero proficiency correlation condition. Points in each of the two clusters represent items that Table 4.8. 
Correlation for the Recovery of a-parameters in Part I (1) 1D, Low (2) 1D, Medium (3) 1D, High (4) 1D, All (5) 2D, Low (6) 2D, Medium (7) 2D, High (8) 2D, All (9) Classical p0 0.9944 0.9947 0.9946 0.9947 0.9944 0.9947 0.9946 0.9947 0.9946 Dimension 1 p0.4 p0.6 0.9942 0.9912 0.9944 0.9925 0.9940 0.9928 0.9942 0.9928 0.9943 0.9920 0.9944 0.9927 0.9940 0.9928 0.9944 0.9926 0.9942 0.9927 p0.8 0.9841 0.9867 0.9868 0.9867 0.9861 0.9871 0.9867 0.9867 0.9870 50 p0 0.9949 0.9950 0.9948 0.9949 0.9948 0.9950 0.9948 0.9951 0.9950 Dimension 2 p0.4 p0.6 0.9939 0.9917 0.9940 0.9919 0.9940 0.9917 0.9940 0.9919 0.9939 0.9920 0.9940 0.9920 0.9936 0.9918 0.9940 0.9920 0.9940 0.9920 p0.8 0.9847 0.9852 0.9844 0.9850 0.9857 0.9858 0.9846 0.9856 0.9861 0.9980 0.9960 (1) 1D, Low correlation, a1 0.9940 (2) 1D, Medium (3) 1D, High 0.9920 (4) 1D, All (5) 2D, Low 0.9900 (6) 2D, Medium 0.9880 (7) 2D, High (8) 2D, All 0.9860 (9) Classical 0.9840 p0 p0.4 p0.6 p0.8 Figure 4.5. Correlation for the Recovery of a1 -parameters in Part I 0.9980 0.9960 (1) 1D, Low correlation, a2 0.9940 (2) 1D, Medium (3) 1D, High 0.9920 (4) 1D, All (5) 2D, Low 0.9900 (6) 2D, Medium 0.9880 (7) 2D, High (8) 2D, All 0.9860 (9) Classical 0.9840 p0 p0.4 p0.6 p0.8 Figure 4.6. Correlation for the Recovery of a 2 -parameters in Part I 51 Table 4.9. Bias for the Recovery of a-parameters in Part I p0 (1) 1D, Low -0.0016 (2) 1D, Medium -0.0013 (3) 1D, High -0.0014 (4) 1D, All -0.0013 (5) 2D, Low -0.0016 (6) 2D, Medium -0.0013 (7) 2D, High -0.0012 (8) 2D, All -0.0013 (9) Classical -0.0014 Dimension 1 p0.4 p0.6 -0.0004 -0.0012 -0.0003 -0.0007 -0.0005 -0.0006 -0.0003 -0.0007 -0.0006 -0.0010 -0.0004 -0.0004 -0.0007 -0.0005 -0.0005 -0.0006 -0.0006 -0.0005 p0.8 -0.0014 -0.0006 -0.0005 -0.0006 -0.0010 -0.0006 -0.0006 -0.0007 -0.0007 p0 -0.0011 -0.0010 -0.0011 -0.0011 -0.0013 -0.0011 -0.0011 -0.0011 -0.0012 Dimension 2 p0.4 p0.6 -0.0007 -0.0003 -0.0007 -0.0005 -0.0008 -0.0006 -0.0007 -0.0005 -0.0008 -0.0006 -0.0006 -0.0005 -0.0008 -0.0005 -0.0007 -0.0004 -0.0005 -0.0006 p0.8 -0.0012 -0.0015 -0.0015 -0.0015 -0.0017 -0.0013 -0.0015 -0.0013 -0.0015 0.15 0.1 Bias 0.05 0 -0.05 -0.1 0 0.5 1 1.5 a1 Figure 4.7. Bias for the Recovery of a1 -parameters for Method 1 at Zero Proficiency Correlation Level in Part I dominantly measure each of the two dimensions. It seems that there is no clear pattern for the bias in the estimation for the a1 -parameters of small values, while the a1 -parameters of large values tend to be underestimated. The plots for all other dimensions, methods and conditions are similar to this one. 52 The RMSE values for the a-parameter recovery are listed in Table 4.10. According to the table, the value of RMSE increases as the proficiency correlation increases. Since “the observed correlations among the item scores will be accounted for solely by the a-parameters (when the proficiency correlation is forced to zero, for example, in the TESTFACT software)” (Reckase, 1997b, p. 275), it seems that as the proficiency correlation increases, it becomes more difficult to separate the effect of proficiency correlation from the estimation of a-parameters even with the Procrustes rotation method to match generating parameters. It can be observed that Method 1 (partial content coverage with low difficulty items) gives the largest RMSE values, while Methods 6, 8 and 9 generally provide comparatively lower RMSE values for the recovery of aparameters on any dimension than all other methods. 
Figures 4.8 and 4.9 show the plots of RMSE for the a-parameter recovery for each item selection method and for each dimension. In a word, as the proficiency correlation increases, the correlation between the rotated estimates and parameters decreases and the deviation increases for the recovery of a-parameters on both dimensions. Comparatively, Methods 6, 8 and 9 give slightly higher correlation and lower RMSE than other methods. A little negative bias was found in the estimation of aparameters of large values. Table 4.10. RMSE for the Recovery of a-parameters in Part I (1) 1D, Low (2) 1D, Medium (3) 1D, High (4) 1D, All (5) 2D, Low (6) 2D, Medium (7) 2D, High (8) 2D, All (9) Classical p0 0.0445 0.0436 0.0437 0.0432 0.0440 0.0427 0.0426 0.0428 0.0428 Dimension 1 p0.4 p0.6 0.0468 0.0585 0.0461 0.0535 0.0479 0.0524 0.0470 0.0526 0.0461 0.0552 0.0455 0.0525 0.0472 0.0524 0.0458 0.0530 0.0463 0.0525 p0.8 0.0793 0.0727 0.0725 0.0730 0.0741 0.0719 0.0731 0.0729 0.0717 53 p0 0.0400 0.0398 0.0405 0.0401 0.0402 0.0397 0.0404 0.0394 0.0398 Dimension 2 p0.4 p0.6 0.0438 0.0515 0.0434 0.0510 0.0435 0.0515 0.0435 0.0510 0.0439 0.0503 0.0437 0.0504 0.0449 0.0510 0.0438 0.0505 0.0438 0.0505 p0.8 0.0708 0.0696 0.0714 0.0701 0.0682 0.0681 0.0710 0.0686 0.0672 0.0800 0.0700 (1) 1D, Low RMSE, a1 (2) 1D, Medium (3) 1D, High 0.0600 (4) 1D, All (5) 2D, Low 0.0500 (6) 2D, Medium (7) 2D, High 0.0400 (8) 2D, All (9) Classical 0.0300 p0 p0.4 p0.6 p0.8 Figure 4.8. RMSE for the Recovery of a1 -parameters in Part I 0.0800 0.0700 (1) 1D, Low RMSE, a2 (2) 1D, Medium (3) 1D, High 0.0600 (4) 1D, All (5) 2D, Low 0.0500 (6) 2D, Medium (7) 2D, High (8) 2D, All 0.0400 (9) Classical 0.0300 p0 p0.4 p0.6 p0.8 Figure 4.9. RMSE for the Recovery of a 2 -parameters in Part I 54 4.3.3 Recovery of d-parameters The person proficiency parameters were separately simulated from different multivariate normal distributions for the two grades. More specifically, the mean vector of the normal distribution is (-0.2, -0.2) for the lower grade and (0, 0) for the upper grade. However, the item estimates from the TESTFACT software were obtained by assuming that person proficiency coordinates follow a standard multivariate normal distribution with a zero mean vector, in order to facilitate the calibration and solve the indeterminacies in the MIRT model. Thus, it was expected that the item MDIFF estimates would be inflated by around 0.1, which would also directly lead to a negative bias for the d estimates from the TESTFACT software. The effect incurred by the inconsistency between proficiency distributions assumed for the generation and the estimation was minimized by matching the raw estimates with generating parameters via the aforementioned oblique Procrustes rotation method. The correlation values between the adjusted estimates and the true values of d-parameters are listed in Table 4.11. From the table, all correlation values are close to one. Also, as the correlation between proficiencies increases, the correlation between the estimated and true dparameters also increases, although sometimes an opposite pattern may occur between the correlation levels of 0.4 and 0.6. The correlation values for Method 6 are the largest among all the methods; besides, the values for Methods 8 and 9 also seem to be slightly larger than those for other methods. Figure 4.10 shows the plot between the proficiency correlation level and the correlation value for the recovery of d-parameters for each item selection method. 
The points representing Methods 6, 8 and 9 are above all other points, while the points for the methods with partial content coverage are at the bottom. However, the difference between these values becomes smaller as the proficiency correlation increases. 55 Table 4.11. Correlation for the Recovery of d-parameters in Part I p0 0.9938 0.9944 0.9943 0.9938 0.9965 0.9976 0.9963 0.9971 0.9970 (1) 1D, Low (2) 1D, Medium (3) 1D, High (4) 1D, All (5) 2D, Low (6) 2D, Medium (7) 2D, High (8) 2D, All (9) Classical P0.4 0.9965 0.9966 0.9965 0.9964 0.9979 0.9983 0.9975 0.9980 0.9983 p0.6 0.9962 0.9971 0.9967 0.9966 0.9976 0.9981 0.9973 0.9978 0.9981 p0.8 0.9981 0.9983 0.9982 0.9983 0.9983 0.9986 0.9985 0.9985 0.9986 0.9990 0.9980 correlation, d (1) 1D, Low (2) 1D, Medium 0.9970 (3) 1D, High (4) 1D, All 0.9960 (5) 2D, Low (6) 2D, Medium 0.9950 (7) 2D, High (8) 2D, All 0.9940 (9) Classical 0.9930 p0 p0.4 p0.6 p0.8 Figure 4.10. Correlation for the Recovery of d-parameters in Part I The bias values for d-parameter estimates are listed in Table 4.12, and Figure 4.11 gives the plot between the proficiency correlation level and the bias for the d-parameter recovery for each item selection method. All the bias values are close to zero, which seems to indicate that there is no systematic error in the estimate. In order to examine the bias for parameters of different 56 values, Figure 4.12 shows the plot between the bias and d-parameters for Method 1 under the zero proficiency correlation condition. From the figure, the d-parameters of large values tend to be underestimated and those of small values tend to be overestimated. The plots for all other methods and conditions are similar to this one. Table 4.12. Bias for the Recovery of d-parameters in Part I p0 -0.0014 -0.0009 -0.0007 -0.0009 -0.0010 0.0000 0.0004 -0.0002 -0.0003 (1) 1D, Low (2) 1D, Medium (3) 1D, High (4) 1D, All (5) 2D, Low (6) 2D, Medium (7) 2D, High (8) 2D, All (9) Classical p0.4 -0.0003 -0.0001 0.0002 -0.0001 -0.0002 0.0003 0.0007 0.0002 0.0004 p0.6 -0.0012 -0.0005 -0.0003 -0.0006 -0.0007 0.0000 0.0003 -0.0002 -0.0001 p0.8 -0.0004 -0.0001 0.0001 0.0000 -0.0004 0.0001 0.0002 0.0000 0.0002 0.0010 0.0005 (1) 1D, Low Bias, d (2) 1D, Medium (3) 1D, High 0.0000 (4) 1D, All (5) 2D, Low -0.0005 (6) 2D, Medium (7) 2D, High (8) 2D, All -0.0010 (9) Classical -0.0015 p0 p0.4 p0.6 Figure 4.11. Bias for the Recovery of d-parameters in Part I 57 p0.8 0.3 0.2 Bias 0.1 0 -0.1 -0.2 -3 -2 -1 0 1 2 3 d Figure 4.12. Bias for the Recovery of d-parameters for Method 1 at Zero Proficiency Correlation Level in Part I Table 4.13 gives the RMSE values between the adjusted estimates and true values of dparameters, and Figure 4.13 shows the plot of RMSE for the d-parameter recovery. As the proficiency correlation increases, the value of RMSE decreases, although sometimes an opposite pattern may occur between the correlation levels of 0.4 and 0.6. One reason to explain this nonmonotonicity pattern may be that the mean differences between the simulated examinee proficiencies of the two grades at the 0.4 correlation level are the smallest among the four proficiency correlation levels, which may result in more accurate estimates with the single group MIRT calibration in TESTFACT. Also, this effect is more obvious when the common items cover all content domains. 58 The RMSE values for Methods 6, 8 and 9 are smaller, which indicates that these methods can give a better match between the estimated and true d-parameters. 
For all proficiency correlation levels, the methods with full content coverage yield lower RMSE than those with partial content coverage. However, the difference becomes smaller when the correlation between proficiencies increases. Table 4.13. RMSE for the Recovery of d-parameters in Part I p0 0.0826 0.0770 0.0786 0.0806 0.0635 0.0540 0.0661 0.0590 0.0568 (1) 1D, Low (2) 1D, Medium (3) 1D, High (4) 1D, All (5) 2D, Low (6) 2D, Medium (7) 2D, High (8) 2D, All (9) Classical p0.4 0.0647 0.0627 0.0643 0.0648 0.0524 0.0468 0.0564 0.0505 0.0461 p0.6 0.0680 0.0596 0.0630 0.0637 0.0562 0.0496 0.0587 0.0528 0.0484 p0.8 0.0490 0.0453 0.0468 0.0462 0.0463 0.0415 0.0440 0.0428 0.0406 0.0900 0.0800 (1) 1D, Low RMSE, d (2) 1D, Medium (3) 1D, High 0.0700 (4) 1D, All (5) 2D, Low 0.0600 (6) 2D, Medium (7) 2D, High (8) 2D, All 0.0500 (9) Classical 0.0400 p0 p0.4 p0.6 p0.8 Figure 4.13. RMSE for the Recovery of d-parameters in Part I 59 4.3.4 Recovery of effect sizes The person proficiency was estimated with the EAP scoring method that incorporates the prior distribution into the estimation. The observed covariance matrix for estimated proficiencies on different dimensions was compared with the true matrix. It was found that the variances were underestimated and the correlations were overestimated. This may be due to the EAP scoring method, which yields estimates biased towards the prior mean. Table 4.14 shows the true effect sizes as well as the means and standard deviations of estimated effect sizes on both dimensions for all methods and for all proficiency correlation levels. From the table, the effect size is underestimated for any dimension and for any method. For Methods 1-4 with partial content coverage, when the proficiency correlation is low, the effect size estimates on Dimension 1 are slightly better than all other methods, while those on Dimension 2 are highly negatively biased. However, both the advantage on Dimension 1 and disadvantage on Dimension 2 tend to diminish as the proficiency correlation increases. Compared with these methods, Methods 5-8 provide slightly worse estimates on Dimension 1 but much better ones on Dimension 2. Also, Method 9 gives the best results on Dimension 2, which may be due to the fact that eight out of ten common items in this method are from Cluster 2. Therefore, one conclusion is that the effect size recovery on each dimension largely depends on the number of common items measuring that dimension. For the same content coverage, the method with medium difficulty items yields the best results, followed by the method with items from all difficulty levels. However, as the correlation between proficiencies increases, the difference between the estimated effect sizes from all selection methods decreases. Generally speaking, in consideration of a good recovery on both dimensions, Methods 6 and 9 perform the best among all methods. 60 The standard deviations are quite small, especially for Methods 1-4 and when the proficiency correlation is very low. This indicates that the estimates of effect sizes are fairly stable across replications and the difference across methods is substantial in consideration of random errors. Figures 4.14 and 4.15 show the recovery of effect sizes for diferent methods. From both figures, when the proficiency correlation is 0.8, the points representing the estimates from different methods are all clustered together. 
One explanation is that the proficiency estimation on one dimension can „borrow‟ information from other dimensions in the MIRT calibration and more information can be „borrowed‟ when the correlation between them is high. However, from Figure 4.15 for the effect size recovery on Dimension 2, when the proficiency correlation is low, the points representing methods with partial content coverage are much lower than those for other methods. Table 4.14. Recovery of Effect Sizes for Proficiencies in Part I TRUE (1) 1D, Low Std (2) 1D, Medium Std (3) 1D, High Std (4) 1D, All Std (5) 2D, Low Std (6) 2D, Medium Std (7) 2D, High Std (8) 2D, All Std (9) Classical Std p0 0.2315 0.1940 0.0131 0.2165 0.0109 0.2017 0.0110 0.2087 0.0136 0.1501 0.0114 0.1757 0.0113 0.1471 0.0129 0.1632 0.0115 0.1400 0.0137 Dimension 1 p0.4 p0.6 0.1474 0.1788 0.1226 0.1384 0.0092 0.0104 0.1264 0.1586 0.0097 0.0098 0.1155 0.1439 0.0085 0.0090 0.1236 0.1487 0.0069 0.0088 0.1097 0.1305 0.0129 0.0108 0.1176 0.1452 0.0106 0.0106 0.0870 0.1181 0.0107 0.0147 0.1068 0.1380 0.0126 0.0114 0.1048 0.1374 0.0084 0.0091 p0.8 0.1936 0.1679 0.0120 0.1781 0.0112 0.1684 0.0124 0.1753 0.0106 0.1619 0.0102 0.1758 0.0104 0.1645 0.0095 0.1734 0.0124 0.1760 0.0107 61 p0 0.1891 -0.0022 0.0044 -0.0030 0.0041 0.0027 0.0044 -0.0112 0.0029 0.1348 0.0124 0.1490 0.0109 0.1159 0.0139 0.1395 0.0161 0.1854 0.0100 Dimension 2 p0.4 p0.6 0.1616 0.1841 0.0294 0.0649 0.0028 0.0083 0.0306 0.0801 0.0047 0.0074 0.0305 0.0736 0.0039 0.0070 0.0243 0.0692 0.0031 0.0056 0.1072 0.1322 0.0121 0.0118 0.1272 0.1490 0.0104 0.0122 0.0971 0.1200 0.0122 0.0136 0.1175 0.1367 0.0114 0.0108 0.1478 0.1691 0.0078 0.0112 p0.8 0.1954 0.1428 0.0101 0.1516 0.0098 0.1441 0.0103 0.1477 0.0091 0.1632 0.0093 0.1772 0.0121 0.1634 0.0125 0.1696 0.0101 0.1877 0.0094 0.2300 TRUE effect size, Dim1 (1) 1D, Low (2) 1D, Medium 0.1800 (3) 1D, High (4) 1D, All (5) 2D, Low (6) 2D, Medium 0.1300 (7) 2D, High (8) 2D, All (9) Classical 0.0800 p0 p0.4 p0.6 p0.8 Figure 4.14. Recovery of Effect Size for the Proficiency on Dimension 1 in Part I 0.2200 TRUE 0.1700 effect size, Dim2 (1) 1D, Low (2) 1D, Medium 0.1200 (3) 1D, High (4) 1D, All (5) 2D, Low 0.0700 (6) 2D, Medium (7) 2D, High 0.0200 (8) 2D, All (9) Classical -0.0300 p0 p0.4 p0.6 p0.8 Figure 4.15. Recovery of Effect Size for the Proficiency on Dimension 2 in Part I 62 CHAPTER 5 PART II: DIFFERENT CONSTRUCTS 5.1 Parameters and Designs In this part, two constructs are measured in the lower grade test; besides these two, one more construct is measured in the upper grade test. There are 40 unique items in the test of each grade. Additionally, 10 common items in both tests are selected according to four different methods. For different grades, person proficiencies are simulated from the multivariate normal distributions with different mean vectors and variance-covariance matrices. 5.1.1 Unique items The numbers of unique items in different content domains were chosen to be the same as that in the study by Reckase and Li (2007). With the context from that study, the allocation of unique items in the test of each grade is shown in Table 5.1. Note that there are no algebra items in the lower grade test and the numbers of unique items in different content domains are not balanced. Table 5.1. 
Allocation of Unique Items in Different Content Domains and Grades in Part II Grade Lower Grade Upper Grade Arithmetic 17 11 Problem Solving 23 18 Algebra 0 11 The cluster number, item parameters, multidimensional difficulty, multidimensional discrimination and direction angles of unique items for tests in both grades are listed in Tables 5.2 and 5.3. The means of MDISCs are quite similar for different grade levels; however, the mean and standard deviation of MDIFFs are -0.2 and 0.78 for the lower grade, while they are 0 and .67 for the upper grade. As can be observed, the a3 -parameters are always of small values 63 Table 5.2. Unique Item Parameters and Statistics for Lower Grade in Part II Cluster 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 Mean Std a1 0.88 0.94 0.91 0.89 0.91 0.77 1.08 1.08 1.09 0.97 1.20 0.90 1.06 1.35 0.92 1.03 1.38 0.10 0.17 0.02 0.01 0.07 0.02 0.01 0.01 0.11 0.06 0.18 0.14 0.02 0.09 0.11 0.15 0.00 0.06 0.01 0.10 0.02 0.00 0.00 0.47 0.49 a2 0.01 0.01 0.03 0.03 0.01 0.03 0.06 0.04 0.15 0.00 0.01 0.10 0.14 0.03 0.17 0.01 0.00 0.94 0.99 1.20 0.79 1.43 0.68 0.87 0.98 0.98 1.25 1.16 1.13 0.79 0.69 0.89 0.86 0.89 0.93 1.47 1.01 1.20 0.83 1.38 0.61 0.51 a3 0.02 0.04 0.20 0.22 0.04 0.13 0.09 0.05 0.22 0.01 0.18 0.18 0.11 0.02 0.12 0.25 0.00 0.14 0.18 0.03 0.20 0.12 0.10 0.13 0.24 0.07 0.18 0.10 0.05 0.06 0.16 0.11 0.03 0.13 0.12 0.01 0.03 0.04 0.02 0.02 0.10 0.07 d 0.51 -0.20 0.58 -0.04 0.95 0.21 -0.55 -1.12 -0.59 0.84 0.95 -0.99 -0.02 1.00 0.73 0.17 1.33 0.88 -1.13 0.56 1.38 -1.62 0.58 -0.53 -0.39 -0.97 0.60 0.78 1.17 -0.59 0.80 0.73 0.17 0.53 0.91 0.37 -0.81 1.52 -0.23 -0.89 0.19 0.81 MDIFF -0.58 0.21 -0.63 0.05 -1.05 -0.26 0.51 1.03 0.53 -0.87 -0.78 1.07 0.02 -0.74 -0.77 -0.16 -0.97 -0.91 1.11 -0.47 -1.68 1.13 -0.85 0.60 0.39 0.98 -0.47 -0.66 -1.02 0.75 -1.13 -0.81 -0.19 -0.58 -0.97 -0.25 0.80 -1.26 0.28 0.64 -0.20 0.78 64 MDISC 0.88 0.94 0.93 0.92 0.91 0.78 1.09 1.08 1.12 0.97 1.21 0.92 1.07 1.35 0.94 1.06 1.38 0.96 1.02 1.20 0.82 1.43 0.69 0.88 1.01 0.99 1.26 1.18 1.14 0.79 0.71 0.90 0.87 0.90 0.94 1.47 1.01 1.20 0.83 1.38 1.03 0.20 1 2 2 3 12 14 2 10 6 3 14 1 8 13 10 1 13 14 0 84 81 89 90 87 88 89 90 83 87 81 83 89 83 83 80 90 86 90 84 89 90 90 89 90 88 88 90 88 87 88 82 90 89 84 82 89 79 89 90 10 14 2 14 6 8 9 14 8 9 10 8 4 15 10 10 8 8 0 6 2 1 1 3 89 87 78 76 88 80 85 87 79 89 82 79 84 89 83 76 90 82 80 88 76 85 82 81 76 86 82 85 87 86 77 83 88 82 83 90 89 88 89 89 Table 5.3. 
Unique Item Parameters and Statistics for Upper Grade in Part II Cluster 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 Mean Std a1 1.08 1.00 0.67 0.81 0.94 0.97 0.67 0.97 0.87 1.04 0.93 0.02 0.07 0.05 0.01 0.00 0.08 0.07 0.00 0.03 0.10 0.05 0.00 0.01 0.11 0.00 0.13 0.03 0.05 0.05 0.06 0.27 0.08 0.12 0.03 0.15 0.02 0.06 0.09 0.08 0.29 0.39 a2 0.04 0.19 0.01 0.10 0.06 0.07 0.05 0.03 0.04 0.08 0.11 0.77 0.97 1.16 0.89 1.18 0.83 0.78 1.18 1.11 1.19 1.09 1.14 1.16 0.87 0.95 1.19 0.76 0.94 0.04 0.22 0.09 0.06 0.02 0.14 0.04 0.01 0.03 0.16 0.10 0.50 0.48 a3 0.16 0.19 0.04 0.18 0.21 0.04 0.04 0.09 0.02 0.08 0.03 0.15 0.14 0.18 0.15 0.01 0.04 0.05 0.17 0.15 0.12 0.28 0.14 0.03 0.16 0.17 0.07 0.05 0.13 1.19 0.91 1.09 1.52 0.95 1.02 1.02 0.71 0.79 0.92 0.76 0.35 0.42 d -0.23 -0.14 -0.69 0.10 0.86 0.88 -0.15 -0.50 0.31 -0.09 -0.07 -0.48 0.09 -0.48 -1.09 -0.27 -0.26 0.45 0.73 -0.59 0.68 0.78 -0.99 -0.75 -0.32 0.62 1.18 0.64 0.16 -0.62 -0.65 0.03 0.70 0.02 -0.99 -0.08 0.72 1.31 0.21 -0.89 0.00 0.63 MDIFF 0.21 0.14 1.02 -0.12 -0.89 -0.90 0.22 0.52 -0.36 0.09 0.07 0.61 -0.09 0.41 1.21 0.23 0.31 -0.57 -0.61 0.53 -0.57 -0.70 0.86 0.65 0.36 -0.64 -0.99 -0.84 -0.17 0.52 0.69 -0.03 -0.46 -0.02 0.96 0.08 -1.01 -1.65 -0.23 1.15 0.00 0.67 65 MDISC 1.09 1.03 0.67 0.84 0.96 0.98 0.67 0.97 0.87 1.04 0.94 0.79 0.98 1.18 0.90 1.18 0.83 0.78 1.19 1.13 1.20 1.12 1.15 1.16 0.89 0.97 1.20 0.76 0.95 1.19 0.94 1.12 1.53 0.96 1.03 1.03 0.71 0.79 0.94 0.77 0.99 0.18 1 2 9 15 3 14 13 5 5 5 3 6 7 88 86 87 89 90 85 85 90 88 85 87 90 89 83 90 84 88 87 87 86 76 87 83 88 82 88 86 85 84 88 79 89 83 86 86 86 88 87 85 83 11 9 9 10 0 6 6 8 8 7 15 7 1 13 10 7 5 9 88 76 86 88 89 82 88 89 88 80 82 3 81 80 87 78 78 88 87 85 89 86 88 79 81 81 80 90 87 86 82 82 84 75 83 89 79 80 87 86 82 3 14 15 3 7 8 9 2 5 11 9 for the lower grade, which indicates that this content domain is not designed to be measured by unique items of the test in this grade. 5.1.2 Common items An item pool with 100 items was generated in a similar way as the unique items for both grade levels. Items in this pool were divided into several categories according to three content domains and three difficulty levels as shown in Table 5.4. Table 5.4. Number of Items for Different Content and Difficulty Categories in Item Pool of Part II Dimension 1 Dimension 2 Dimension 3 Low 14 14 5 Medium 14 14 6 High 14 14 5 All 42 42 16 Ten common items were selected according to the MIRT methods and the classical correlation method. The details for each method are shown below. (1) The MIRT methods consist of three methods with different numbers of items selected from the three content domains in order to achieve different degrees of content coverage. For simplicity, common items in these MIRT methods are only selected from medium difficulty items, in view of the results in Part I.  Method 1: Common items are selected from all content domains. The numbers of items from the three item clusters are four, four and three, respectively. Note that the third content domain is only taught in the upper grade.  Method 2: Common items are only selected from the first two content domains that are measured by unique items in both tests. Six items are selected from Cluster 1 and four from Cluster 2. Given that the proficiency on Dimension 1 is manipulated to have 66 a higher correlation with that on Dimension 3, more items are selected from Cluster 1 to replace the items in Cluster 3 that are missing from the common item set. 
 Method 3: Common items are only selected from the first two content domains, with four items from Cluster 1 and six from Cluster 2 according to their proportions in the unique items. In order to make results more comparable, Methods 2 and 3 share the same eight common items with four from each cluster. (2) Method 4, the classical correlation method, chooses common items based on the high item-total-test correlation. The detailed procedure is the same as in Part I. The statistics for different common item sets are listed in Table 5.5. For each method, the mean of MDIFF values is close to zero and the standard deviation is small. From the perspective of MIRT, common items in the classical correlation method appeared to be also selected from the medium difficulty level but with an extremely unbalanced coverage for the three content domains. Almost all the common items in this method were selected from Cluster 2, which is reasonable in that the number of unique items from this cluster is the largest in both tests. Table 5.5. Statistics of Common Items for Different Selection Methods in Part II Selection Method MDIFF Mean MDIFF Std # of items in Dim 1 # of items in Dim 2 # of items in Dim 3 (1) 3D (2) 2D, Correlation (3) 2D, Proportion -0.11 -0.07 -0.01 0.25 0.24 0.18 4 6 4 3 4 6 3 0 0 (4) Classical -0.14 0.29 1 9 0 5.1.3 Person parameters For different grades, the person proficiency parameters were simulated according to the multivariate normal distributions with different mean vectors and variance-covariance matrices. 67 The mean vectors for person proficiency distribution in each grade are shown in Table 5.6. The variance was set to one for all the proficiencies except for that on Dimension 3 for the lower grade examinees. That proficiency was set to have a much lower mean and smaller variation since examinees in the lower grade were not supposed to have knowledge in this content domain. Note that although the mean differences between examinee proficiencies of the two grades were set to 0.2 in Part I, they were set to 0.7 for the first two dimensions in this part, following the study by Reckase and Li (2007). Also, the correlation matrices for both grades were manipulated to be roughly the same as those in that study. The variance-covariance matrices for person proficiency distributions in both grades are shown in Tables 5.7 and 5.8. In the lower grade, the correlation between proficiencies on the first two dimensions was 0.7; however, no correlation was assumed between proficiencies on the third and any of the first two dimensions. For the upper grade, the correlations between proficiencies on the second and the other two dimensions were fixed at 0.24 and 0.32 respectively. However, the correlation between proficiencies on the first and third dimensions was manipulated to vary from 0, 0.4, 0.6 to 0.8, in order to check the effect of different correlation levels on the linking results. There was concern that when the correlation was zero, according to the logic of Method 2, more items were supposed to be selected from Cluster 2 instead of Cluster 1. Nevertheless, for simplicity and consistency, this method always selected more items from Cluster 1 for all correlation levels. Table 5.6. Mean Vectors for Proficiency Distributions of Lower and Upper Grade Examinees in Part II Grade Lower Grade Upper Grade Arithmetic -0.5 0.2 Problem Solving -0.7 0 68 Algebra -1.5 0 Table 5.7. 
Variance-Covariance Matrix for Proficiency Distribution of Lower Grade Examinees in Part II Arithmetic Problem solving Algebra Arithmetic 1 0.7 0 Problem Solving Algebra 1 0 0.25 Table 5.8. Variance-Covariance Matrix for Proficiency Distribution of Upper Grade Examinees in Part II Arithmetic Problem solving Algebra Arithmetic 1 0.24 0.6 Problem Solving Algebra 1 0.32 1 5.2 Estimation Fifty replications of response matrix were simulated for each proficiency correlation level and the MIRT calibration for each selection method was conducted on the data matrix for the selected common items and all unique items. This resulted in a total of 800 computer runs (4 proficiency correlation levels x 50 replications x 4 item selection methods). The MIRT calibration took more than one hour for each TESTFACT run. The data layout and TESTFACT syntax in this part are similar to those in Part I, except that a three-dimensional instead of two-dimensional solution was requested for the MIRT calibration. As in Part I, item a-parameter estimates from the TESTFACT software were corrected by forcing the mean on each dimension to be positive. However, one problem was found in the sign correction on Dimension 3. It may happen that both positive and negative item discrimination estimates on that dimension have large absolute values, which may be partly due to the low proficiency on that dimension for examinees in the lower grade. Therefore, it was difficult to identify which set of item discrimination estimates, negated or non-negated, was „correct‟. 69 Furthermore, because of the weak relationship between the percentage of correct responses and the proficiency estimate on Dimension 3, it was also hard to rely on the TESTFACT software or apply the previous checking method for the sign correction to proficiency estimates on that dimension. The solution to the sign indeterminacies in both item and person estimates was to try all four sign combinations for the proficiencies on Dimension 3, which is to keep it unchanged or changed for either grade, in an attempt to make the item and person estimates a valid pair for the MIRT calibration. Combined with the item estimates, the „correct‟ signs for person proficiency estimates were selected as the combination that gave the largest correlation value for the recovery of probability matrix and these adjusted estimates were used for further analysis. 5.3 Results 5.3.1 Recovery of probability matrix The average correlation values between corresponding elements in the estimated and true probability matrices are listed in Table 5.9. As the proficiency correlation increases, so does the correlation between the true and estimated probabilities for any item selection method. Although Method 3 could give comparatively higher values and Method 4 provides lower values, the differences are very small. Table 5.9. Correlation for the Recovery of Probability Matrix in Part II (1) 3D (2) 2D, Correlation (3) 2D, Proportion (4) Classical p0 0.9641 0.9639 0.9641 0.9635 p0.4 0.9651 0.9649 0.9651 0.9646 p0.6 0.9654 0.9652 0.9655 0.9651 p0.8 0.9664 0.9664 0.9669 0.9664 Tables 5.10 and 5.11 give the bias and RMSE values for the recovery of probability matrix. All methods yield slightly negative biased estimates. With a further examination on the plot for 70 bias at different parameter values, it was found that, as observed in Part I, the probabilities of large values tend to be underestimated and those of small values tend to be overestimated. 
However, there are much more points representing the negative bias for the true probabilities with large values than points representing the positive bias for probabilities with small values, and the points representing negative bias are mostly for lower grade examinees. This is reasonable because the means of proficiencies of lower grade examinees are much lower than the mean of multidimensional difficulties for lower grade items and it was already explained in Part I that difficult items are more likely to lead to negative bias for the proficiencies with large values. Also, as the proficiency correlation increases, the RMSE value decreases for all methods. Therefore, it seems to be a general conclusion that as the proficiency correlation increases, the fit between the estimated and true probability matrices becomes better. Furthermore, Methods 3 and 4 give the lower RMSE values for the recovery of probability matrix than the other two methods. These could also be observed in Figure 5.1, which shows the plot between the proficiency correlation level and the RMSE value for each selection method. Table 5.10. Bias for the Recovery of Probability Matrix in Part II (1) 3D (2) 2D, Correlation (3) 2D, Proportion (4) Classical p0 -0.0015 -0.0011 -0.0011 -0.0009 p0.4 -0.0009 -0.0009 -0.0010 -0.0007 p0.6 -0.0011 -0.0013 -0.0011 -0.0011 p0.8 -0.0008 -0.0012 -0.0010 -0.0009 p0.6 0.0751 0.0751 0.0747 0.0747 p0.8 0.0740 0.0738 0.0732 0.0733 Table 5.11. RMSE for the Recovery of Probability Matrix in Part II (1) 3D (2) 2D, Correlation (3) 2D, Proportion (4) Classical p0 0.0763 0.0762 0.0759 0.0760 71 p0.4 0.0755 0.0754 0.0751 0.0751 0.0770 RMSE, probability 0.0760 0.0750 (1) 3D (2) 2D, Correlation (3) 2D, Proportion 0.0740 (4) Classical 0.0730 0.0720 p0 p0.4 p0.6 p0.8 Figure 5.1. RMSE for the Recovery of Probability Matrix in Part II 5.3.2 Recovery of a-parameters The average correlation values between the rotated estimates and true parameters are shown in Table 5.12. Generally speaking, as the correlation between proficiencies increases, the correlation between the rotated estimates and the parameters decreases. This pattern is opposite to that observed in the recovery of the probability matrix. As already explained in Part I, the reason is that as the proficiency correlation increases, it is more difficult to separate its effect from the item estimates obtained with the constraint of identity variance-covariance matrix. Method 1 gives the highest correlation value for the recovery of a3 -parameters, which leads to the conclusion that items from all content domains should be included in the common item set for a good recovery of item discrimination parameters on all dimensions. Method 4 does not 72 perform as well as the other three methods under almost all conditions, which may be due to the extremely unbalanced proportion of common items from different content domains. Table 5.12. Correlation for the Recovery of a-parameters in Part II (1) 3D Dimension 1 Dimension 2 Dimension 3 p0 p0.4 p0.6 p0.8 p0 p0.4 p0.6 p0.8 p0 p0.4 p0.6 p0.8 0.9759 0.9719 0.9727 0.9547 0.9856 0.9839 0.9847 0.9813 0.9695 0.9669 0.9619 0.9442 (2) 2D, Correlation 0.9758 0.9783 0.9851 0.9673 0.9856 0.9803 0.9702 0.9735 0.9211 0.9420 0.8873 0.8556 (3) 2D, Proportion 0.9789 0.9815 0.9606 0.9462 0.9865 0.9781 0.9846 0.9811 0.9239 0.9386 0.8992 0.8766 (4) Classical 0.9732 0.9357 0.9323 0.9185 0.9734 0.9699 0.9747 0.9676 0.9222 0.8990 0.8909 0.8712 Table 5.13 shows the bias values for the recovery of a-parameters. 
It seems that, in the classical correlation method, the estimates on Dimension 1 are positively biased and those on other dimensions are negatively biased. Also, the estimates on Dimension 3 are negatively biased for all methods. Figures 5.2-5.4 plot the bias values for the recovery of a-parameters on each dimension. For all plots, the points representing Method 4 deviate far away from the horizontal zero-line. In addition, the points for Methods 1-3 are very close to each other and to the zero-line for the first two dimensions; however, for the third dimension, it is clear that only the points representing Method 1 are close to the zero-line. 73 Table 5.13. Bias for the Recovery of a-parameters in Part II (1) 3D Dimension 1 Dimension 2 Dimension 3 p0 p0.4 p0.6 p0.8 p0 p0.4 p0.6 p0.8 p0 p0.4 p0.6 p0.8 0.0005 -0.0016 -0.0019 -0.0042 -0.0021 -0.0011 -0.0005 -0.0005 -0.0035 -0.0017 -0.0019 -0.0006 (2) 2D, Correlation 0.0013 -0.0010 0.0003 -0.0013 -0.0019 -0.0004 0.0001 -0.0009 -0.0119 -0.0046 -0.0054 -0.0035 (3) 2D, Proportion 0.0030 0.0002 0.0024 0.0002 -0.0028 -0.0014 -0.0011 -0.0017 -0.0111 -0.0049 -0.0067 -0.0039 (4) Classical 0.0075 0.0086 0.0056 0.0019 -0.0091 -0.0071 -0.0047 -0.0049 -0.0129 -0.0110 -0.0078 -0.0035 0.0100 0.0080 0.0060 Bias, a1 0.0040 (1) 3D 0.0020 (2) 2D, Correlation (3) 2D, Proportion 0.0000 (4) Classical -0.0020 -0.0040 -0.0060 p0 p0.4 p0.6 p0.8 Figure 5.2. Bias for the Recovery of a1 -parameters in Part II 74 0.0020 0.0000 Bias, a2 -0.0020 (1) 3D -0.0040 (2) 2D, Correlation (3) 2D, Proportion -0.0060 (4) Classical -0.0080 -0.0100 p0 p0.4 p0.6 p0.8 Figure 5.3. Bias for the Recovery of a 2 -parameters in Part II 0.0000 -0.0020 Bias, a3 -0.0040 (1) 3D -0.0060 (2) 2D, Correlation -0.0080 (3) 2D, Proportion -0.0100 (4) Classical -0.0120 -0.0140 p0 p0.4 p0.6 p0.8 Figure 5.4. Bias for the Recovery of a3 -parameters in Part II 75 In order to evaluate the bias for the parameters of different values, the plot between the bias and a-parameters was examined on each dimension and for each method. In the classical method, a1 -parameters of small values tend to be slightly overestimated but there is no clear pattern for those of large values. For the remaining a-parameters, no clear pattern is found for the bias for those of small values, but a-parameters of large values tend to be underestimated. In addition, for a3 -parameters, the underestimation tends to become worse but the magnitude of bias is smaller in Method 1 than in other methods. Figure 5.5 shows one example of those plots, which gives the bias of a3 -parameters in Method 2 under the zero proficiency correlation condition. Note that the points in the right cluster represent the items dominantly measuring the third dimension. 0.4 0.3 0.2 Bias 0.1 0 -0.1 -0.2 -0.3 -0.4 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 a3 Figure 5.5. Bias for the Recovery of a3 -parameters for Method 2 at Zero Proficiency Correlation Level in Part II 76 Table 5.14 gives the RMSE values for the recovery of a-parameters. As the proficiency correlation increases, the estimates tend to deviate further from the true values for all dimensions and for all methods. Method 1, which provides full content coverage, performs a little better on Dimension 2 and gives much lower RMSE values on Dimension 3. Therefore, it confirms again that the a-parameter estimates on Dimension 3 are closer to the true values only by including items from that dimension in the common item set. 
The classical correlation method yields high RMSE values and does not perform as well as the other three methods, which is not very surprising in view of the extremely unbalanced number of common items for each content domain. These results, which are also shown in Figures 5.6-5.8, indicate that the recovery for the a-parameters on a certain dimension mostly depends on the number of common items measuring that dimension. Table 5.14. RMSE for the Recovery of a-parameters in Part II (1) 3D Dimension 1 Dimension 2 Dimension 3 p0 p0.4 p0.6 p0.8 p0 p0.4 p0.6 p0.8 p0 p0.4 p0.6 p0.8 0.0925 0.0980 0.0967 0.1255 0.0770 0.0798 0.0803 0.0883 0.0731 0.0785 0.0842 0.1016 (2) 2D, Correlation 0.0935 0.0875 0.0727 0.1073 0.0769 0.0904 0.1141 0.1064 0.1134 0.0980 0.1390 0.1573 77 (3) 2D, Proportion 0.0882 0.0817 0.1140 0.1344 0.0752 0.0932 0.0808 0.0888 0.1101 0.1020 0.1323 0.1435 (4) Classical 0.0924 0.1386 0.1451 0.1613 0.0998 0.1052 0.0980 0.1071 0.1108 0.1279 0.1337 0.1445 0.1800 0.1600 0.1400 RMSE, a1 0.1200 (1) 3D 0.1000 (2) 2D, Correlation 0.0800 (3) 2D, Proportion 0.0600 (4) Classical 0.0400 0.0200 0.0000 p0 p0.4 p0.6 p0.8 Figure 5.6. RMSE for the Recovery of a1 -parameters in Part II 0.1200 0.1000 RMSE, a2 0.0800 (1) 3D 0.0600 (2) 2D, Correlation (3) 2D, Proportion 0.0400 (4) Classical 0.0200 0.0000 p0 p0.4 p0.6 p0.8 Figure 5.7. RMSE for the Recovery of a 2 -parameters in Part II 78 0.1800 0.1600 0.1400 RMSE, a3 0.1200 (1) 3D 0.1000 (2) 2D, Correlation 0.0800 (3) 2D, Proportion (4) Classical 0.0600 0.0400 0.0200 0.0000 p0 p0.4 p0.6 p0.8 Figure 5.8. RMSE for the Recovery of a3 -parameters in Part II 5.3.3 Recovery of d-parameters Tables 5.15-5.17 show the correlation, bias and RMSE values between the adjusted estimates and true values of d-parameters for all item selection methods. There is no consistent pattern for the value change of these indices across different proficiency correlation levels. Generally speaking, Method 3 seems to give the highest correlation values for the recovery of d-parameters, while Method 2 provides the lowest values. Also, negative bias is found in the d-parameter estimates, although the values are quite small. From the RMSE table, Methods 3 and 4 seem to have a little advantage over the other two methods, especially when the proficiency correlation is low. Figure 5.9 gives the plot between the bias values and d-parameters for Method 1 at the zero proficiency correlation level, and the plots for all other methods and conditions are similar to this 79 one. Different from that in Part I, there is no clear pattern for the bias at different values of dparameters, but the magnitude of negative bias tends to be slightly larger than that of positive bias. Figures 5.10-5.12 provide the plots of correlation, bias and RMSE for the recovery of dparameters. Table 5.15. Correlation for the Recovery of d-parameters in Part II (1) 3D (2) 2D, Correlation (3) 2D, Proportion (4) Classical p0 0.9873 0.9846 0.9892 0.9891 p0.4 0.9874 0.9871 0.9901 0.9893 p0.6 0.9895 0.9875 0.9902 0.9893 p0.8 0.9904 0.9872 0.9902 0.9885 p0.6 -0.0026 -0.0033 -0.0042 -0.0021 p0.8 -0.0023 -0.0037 -0.0043 -0.0023 p0.6 0.0949 0.0984 0.0847 0.0899 p0.8 0.0904 0.1001 0.0860 0.0945 Table 5.16. Bias for the Recovery of d-parameters in Part II (1) 3D (2) 2D, Correlation (3) 2D, Proportion (4) Classical p0 -0.0034 -0.0015 -0.0044 -0.0007 p0.4 -0.0015 -0.0022 -0.0029 -0.0006 Table 5.17. 
RMSE for the Recovery of d-parameters in Part II (1) 3D (2) 2D, Correlation (3) 2D, Proportion (4) Classical p0 0.1032 0.1148 0.0900 0.0904 p0.4 0.1046 0.0999 0.0868 0.0894 80 0.4 0.3 0.2 Bias 0.1 0 -0.1 -0.2 -0.3 -0.4 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 d Figure 5.9. Bias of the Recovery of d-parameters for Method 1 at Zero Proficiency Correlation Level in Part II 0.9910 0.9900 0.9890 correlation, d 0.9880 0.9870 (1) 3D 0.9860 (2) 2D, Correlation 0.9850 (3) 2D, Proportion 0.9840 (4) Classical 0.9830 0.9820 0.9810 p0 p0.4 p0.6 p0.8 Figure 5.10. Correlation for the Recovery of d-parameters in Part II 81 0.0000 Bias, d -0.0010 -0.0020 (1) 3D (2) 2D, Correlation (3) 2D, Proportion -0.0030 (4) Classical -0.0040 -0.0050 p0 p0.4 p0.6 p0.8 Figure 5.11. Bias for the Recovery of d-parameters in Part II 0.1400 0.1200 RMSE, d 0.1000 0.0800 (1) 3D (2) 2D, Correlation 0.0600 (3) 2D, Proportion (4) Classical 0.0400 0.0200 0.0000 p0 p0.4 p0.6 p0.8 Figure 5.12. RMSE for the Recovery of d-parameters in Part II 82 5.3.4 Recovery of effect sizes The variances of estimated proficiencies on different dimensions were underestimated and the correlations were overestimated, which is similar as in Part I. Table 5.18 shows the true effect sizes as well as the means and standard deviations of estimated effect sizes on the three dimensions for all methods and for all proficiency correlation levels. The first observation is that the estimated effect sizes from Method 4 deviate far away from the true values across all proficiency correlation levels and for all dimensions. The reason may be that almost all common items in this method are from the second content domain. For Dimension 1, it seems that no method gives a consistent good recovery and the estimates given by Method 1 are closer to the true values only when the proficiency correlation is low. For Dimension 2, Method 3 gives the best recovery among all methods; however, for Dimension 3, Method 1 performs substantially better than the other methods although the estimates still deviate from the true values. Therefore, it can be concluded that without common items dominantly measuring a certain dimension, the effect size on that dimension is highly underestimated. However, as the proficiency correlation increases, the underestimation tends to be less severe, especially for Methods 2 and 3. The values of standard deviations are quite small, which indicates that estimated effect sizes are fairly stable across replications. In order to show how substantial the difference is between different methods, Figures 5.13-5.15 provide the 95% confidence intervals of effect size for the comparison of Methods 1 and 4 across four proficiency correlation levels and for all three dimensions. From the plots, all the confidence intervals are quite narrow and there is no overlapping of the confidence intervals for the two methods. Figures 5.16-5.18 provide the plots for the recovery of effect sizes for the three dimensions, respectively. 83 Table 5.18. 
Recovery of Effect Sizes for Proficiencies in Part II Dimension 1 Dimension 2 Dimension 3 p0 Std p0.4 Std p0.6 Std p0.8 Std p0 Std p0.4 Std p0.6 Std p0.8 Std p0 Std p0.4 Std p0.6 Std p0.8 Std TRUE 0.7122 0.7283 0.6556 0.7147 0.7198 0.7377 0.6550 0.7058 1.8427 1.9016 1.8838 1.9483 (1) 3D 0.7057 0.0187 0.7551 0.0167 0.7309 0.0339 0.8263 0.0327 0.6494 0.0190 0.6193 0.0179 0.5996 0.0308 0.6547 0.0204 1.1923 0.0240 1.2743 0.0175 1.2346 0.0184 1.2384 0.0182 84 (2) 2 D, Correlation 0.8005 0.0147 0.8028 0.0123 0.7351 0.0117 0.7893 0.0136 0.6531 0.0177 0.6493 0.0149 0.5895 0.0142 0.6232 0.0162 0.3271 0.0217 0.6102 0.0140 0.6161 0.0156 0.7135 0.0161 (3) 2 D, Proportion 0.7974 0.0133 0.7995 0.0135 0.7310 0.0131 0.7878 0.0140 0.7184 0.0156 0.7140 0.0135 0.6490 0.0118 0.6853 0.0131 0.3956 0.0207 0.6357 0.0171 0.6325 0.0156 0.7262 0.0165 (4) Classical 0.6116 0.0130 0.6218 0.0143 0.5619 0.0146 0.6085 0.0226 0.7812 0.0130 0.7853 0.0115 0.7177 0.0093 0.7599 0.0089 0.4095 0.0157 0.5087 0.0165 0.4939 0.0379 0.5355 0.0922 0.85 (1) 3D (4) Classical 95% CI effect size, Dim1 0.80 0.75 0.70 0.65 0.60 0.55 p0 p0.4 p0.6 p0.8 Figure 5.13. Comparison of Effect Size for the Proficiency on Dimension 1 in Part II 0.80 (1) 3D (4) Classical 95% CI effect size, Dim2 0.75 0.70 0.65 0.60 0.55 p0 p0.4 p0.6 p0.8 Figure 5.14. Comparison of Effect Size for the Proficiency on Dimension 2 in Part II 85 1.4 (1) 3D (4) Classical 95% CI effect size, Dim3 1.2 1.0 0.8 0.6 0.4 p0 p0.4 p0.6 p0.8 Figure 5.15. Comparison of Effect Size for the Proficiency on Dimension 3 in Part II 0.9000 0.8000 effect size, Dim1 0.7000 0.6000 TRUE 0.5000 (1) 3D 0.4000 (2) 2D, Correlation (3) 2D, Proportion 0.3000 (4) Classical 0.2000 0.1000 0.0000 p0 p0.4 p0.6 p0.8 Figure 5.16. Recovery of Effect Size for the Proficiency on Dimension 1 in Part II 86 0.9000 0.8000 effect size, Dim2 0.7000 0.6000 TRUE 0.5000 (1) 3D 0.4000 (2) 2D, Correlation (3) 2D, Proportion 0.3000 (4) Classical 0.2000 0.1000 0.0000 p0 p0.4 p0.6 p0.8 Figure 5.17. Recovery of Effect Size for the Proficiency on Dimension 2 in Part II 2.5000 effect size, Dim3 2.0000 TRUE 1.5000 (1) 3D (2) 2D, Correlation 1.0000 (3) 2D, Proportion (4) Classical 0.5000 0.0000 p0 p0.4 p0.6 p0.8 Figure 5.18. Recovery of Effect Size for the Proficiency on Dimension 3 in Part II 87 CHAPTER 6 SUMMARY, LIMITATION AND FUTURE RESEARCH In this chapter, results and conclusions from the simulation studies are summarized and practical implications are discussed. In addition, limitations and suggestions for future research are provided. 6.1 Conclusions and Discussions Part I focused on the two-dimensional constructs with balanced item design and data were simulated based on proficiencies on the same constructs in both upper and lower grades. In this part, anchor items were selected according to the combination of different content coverage (partial or full content coverage) and difficulty coverage (low difficulty level, medium difficulty level, high difficulty level, or all three difficulty levels). In addition, items with high item-totaltest correlations were selected as common items under the classical method. Meanwhile, proficiency correlation level was manipulated to vary from low to high to evaluate its effect on the linking results for each item selection method. The comparisons among different methods were made with respect to the recovery of the probability matrix, item parameters and effect sizes. 
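As a compact reference for how these criteria were obtained, the sketch below mirrors the pfunction, criteria and ESfunc helpers listed in the Appendix. It is illustrative only: the placeholder inputs generated at the top (nsGrade, thetaLower, thetaUpper, Atrue, Ahat, and so on) stand in for the generating parameters and the rotated TESTFACT estimates used in the study.

% Minimal sketch of the evaluation criteria; placeholder data only.
nsGrade = 3000; ni = 50; ndim = 2;
thetaLower = randn(nsGrade, ndim) - 0.2;      % stand-in lower-grade proficiencies
thetaUpper = randn(nsGrade, ndim);            % stand-in upper-grade proficiencies
thetaTrue  = [thetaLower; thetaUpper];
Atrue = rand(ni, ndim);  dTrue = randn(ni, 1);
thetaHat = thetaTrue + 0.1*randn(size(thetaTrue));        % stand-in "estimates"
Ahat = Atrue + 0.05*randn(size(Atrue));
dHat = dTrue + 0.05*randn(size(dTrue));

% Probability matrices under the 2PL compensatory MIRT model (persons x items)
nPersons = size(thetaTrue, 1);
P    = 1./(1 + exp(-1.7*(thetaTrue*Atrue' + ones(nPersons,1)*dTrue')));
Phat = 1./(1 + exp(-1.7*(thetaHat*Ahat'   + ones(nPersons,1)*dHat')));

% Recovery criteria: correlation, bias and RMSE over corresponding elements
pCorr = corr(Phat(:), P(:));
pBias = mean(Phat(:) - P(:));
pRmse = sqrt(mean((Phat(:) - P(:)).^2));

% Effect size on each dimension: standardized mean difference between grades
es = (mean(thetaUpper) - mean(thetaLower)) ./ ...
     sqrt((var(thetaUpper) + var(thetaLower))/2);

The same correlation, bias and RMSE indices are applied to the a- and d-parameters after the estimates have been placed on the coordinate system of the generating parameters.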
The results show that as the correlation between proficiencies increases, the probability matrix recovery becomes better. In particular, the correlation between the estimated and true probabilities increases and the RMSE decreases. This was also observed in the study by Fang and Lu (2010). They pointed out that the RMSE for the probability matrix recovery decreases as the proficiency correlation increases when a unidimensional IRT or MIRT calibration is conducted on a data matrix simulated from a two-dimensional MIRT model. Therefore, for the MIRT calibration on the complete data matrix, when the proficiency correlation is high, it can be assumed that the proficiency estimation on different dimensions can borrow information from each other. Thus, it is reasonable that better estimation of the proficiencies yields a better recovery of the probability matrix.

However, this is not the case for the a-parameter recovery. As the proficiency correlation increases, the correlation between the a-parameters and their estimates decreases and the deviation between them increases. The reason may be that it becomes more difficult to separate the effect of the proficiency correlation from the a-parameter estimates when attempting to resolve the rotational indeterminacy in MIRT.

The recovery of d-parameters becomes better as the proficiency correlation increases. This may be because the estimates become less affected by differences among dimensions when the data structure approaches unidimensionality.

Generally speaking, the recovery of effect sizes is not much influenced by the magnitude of the proficiency correlation, although it seems to be slightly better as the proficiency correlation increases. One exception is when all or most of the common items dominantly measure one dimension. In this case, the effect size recovery on the other dimensions becomes substantially better as the proficiency correlation increases, which is reasonable since the proficiency parameters on different dimensions are more interrelated.

It is clear that different common item selection methods do give different linking results, as expected. Among all methods, three are of special interest: Method 6 (full content coverage with items from the medium difficulty level), Method 8 (full content coverage with items from all difficulty levels) and Method 9 (items with high item-total-test correlations). The first originated from the idea of the miditest proposed by Sinharay and Holland (2006b), the second continues to be the golden rule and favorite of practitioners, and the third traces back to the classical rationale for selecting items that contribute most to equating, although it is applied here under the framework of multidimensional constructs.

The results show that the classical correlation method gives the best probability recovery. Among all MIRT methods, with the same difficulty coverage, the method achieving full content coverage gives better results than the method achieving partial content coverage. Also, the method selecting medium difficulty items is the best among those selecting items from different difficulty levels under the same content coverage condition. Thus, it is not surprising that Method 6 performs much better than all other MIRT methods.

The a-parameter recovery varies across selection methods. Methods 6 and 8 perform better in the linking, as expected. Surprisingly, Method 9 also performs better than most methods, although the numbers of common items from the different content domains are unbalanced. All of these comparisons were made after the estimates had been rotated onto the coordinate system of the generating parameters, as sketched below.
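Because TESTFACT calibrates with the proficiency variance-covariance matrix constrained to the identity, the estimates must first be placed on the same coordinate system as the generating parameters before any comparison. The following is a minimal MATLAB sketch of that adjustment, mirroring the oblique Procrustes step in the Appendix code; the function name and argument names are placeholders rather than identifiers from the study's programs.

function [Arot, dRot, thetaRot] = procrustes_to_target(Ahat, dHat, thetaHat, Atrue, dTrue)
% Rotate MIRT estimates toward the generating parameters (the target matrix).
% Ahat, Atrue: items x dimensions; dHat, dTrue: items x 1; thetaHat: persons x dimensions.
T    = (Ahat'*Ahat) \ (Ahat'*Atrue);                 % least-squares transformation to the target
Arot = Ahat*T;                                       % a-parameter estimates on the target coordinates
m    = ((dTrue - dHat)'*Arot/(Arot'*Arot))';         % translation of the origin
dRot = dHat + Arot*m;                                % adjusted d-parameter estimates
thetaRot = (T\thetaHat' - m*ones(1,size(thetaHat,1)))';   % proficiencies on the same coordinates
end

The bias, RMSE and correlation criteria reported above are then computed between these rotated estimates and the generating values.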
For the d-parameters, Methods 6 and 9 give good recovery results, followed by Method 8. The methods selecting common items from one content domain do not work as well as those selecting items from both content domains, especially when the proficiency correlation is low. But as the proficiency correlation increases, the difference in the d-parameter recovery between these two types of methods becomes smaller.

All effect sizes are underestimated, which may be due to the EAP scoring method used for the proficiency estimation. The number of common items from each content domain plays an important role in the effect size recovery. For Methods 1 to 4, where the common items come from the first content domain, the effect size on Dimension 2 is highly underestimated. Similar results can also be observed for the classical correlation method, which selects more items from the second content domain than from the first. This method gives a good effect size recovery on Dimension 2 but not on Dimension 1. Therefore, besides full content coverage, attention should also be paid to the proportion of common items in each content domain.

All in all, this part confirmed the advantage of the miditest in the context of vertical scaling, extending the conclusion by Sinharay and Holland (2006b) that a common item set with medium difficulty items can work better than the minitest in equating. This is worthy of further attention from practitioners, although the minitest continues to be widely used in practical settings. Furthermore, the linking results also demonstrated the importance of content coverage when multiple proficiencies are measured within one test. Therefore, the conclusion for Part I is that in vertical scaling under the MIRT framework, when the same constructs are measured in both tests, the common item set achieving full content coverage with medium difficulty items performs slightly better than the minitest that covers all content domains with a spread of item difficulties similar to the total test, and these two methods are substantially better than the other MIRT-based item selection methods.

All common items selected via the classical correlation method are actually medium difficulty items, which is consistent with the idea that medium difficulty items tend to have higher item-total-test correlations than other items. This method gives good recovery for all parameters except for the effect size on Dimension 1. Since the effect size is much influenced by the proportion of common items in each content domain, given a fixed number of common items that can be used for linking, it is expected that the results would be better if this method selected an appropriate number of items from each content domain to achieve proportional representativeness. Therefore, the classical correlation method seems surprisingly promising in the linking of multiple constructs. There is a concern, however, that this method can only be used with careful design, since the correlations must be known in advance and the multidimensionality limits the rationale for, and use of, the item-total-test correlation.

Part II focused on three-dimensional constructs with an unbalanced item design. The purpose was to evaluate the common item selection methods when the measured constructs are not identical in the two grades; in particular, the upper grade test measures more constructs than the lower grade test.
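For reference, the grade-specific proficiency distributions used in this part (summarized in Tables 5.6 through 5.8) can be simulated as in the following sketch. This is an illustrative reconstruction rather than the study's actual simulation code; the sample size of 3,000 examinees per grade follows the Appendix, and the correlation between Dimensions 1 and 3 in the upper grade is set here to one of the four manipulated levels.

% Illustrative sketch of the Part II proficiency simulation (not the study's code).
% Mean vectors from Table 5.6; variance-covariance matrices from Tables 5.7 and 5.8.
nsGrade = 3000;                                 % examinees per grade, as in the Appendix
muLower = [-0.5 -0.7 -1.5];                     % Arithmetic, Problem Solving, Algebra
muUpper = [ 0.2  0.0  0.0];
SigmaLower = [1    0.7  0;
              0.7  1    0;
              0    0    0.25];                  % Algebra proficiency restricted for the lower grade
r13 = 0.6;                                      % manipulated correlation level: 0, 0.4, 0.6 or 0.8
SigmaUpper = [1     0.24  r13;
              0.24  1     0.32;
              r13   0.32  1];
thetaLower = mvnrnd(muLower, SigmaLower, nsGrade);   % requires the Statistics Toolbox
thetaUpper = mvnrnd(muUpper, SigmaUpper, nsGrade);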
In this part, "algebra" was not supposed to be taught in the lower grade; therefore, the proficiency distribution on that construct for lower grade examinees was assumed to have a low mean and a small standard deviation. Given the conclusion from Part I that medium difficulty items perform comparatively better than other items in the linking, this part focused more on the content coverage and on the proportion of common items from each content domain. As in Part I, comparisons among the different item selection methods were made using the criteria on the recovery of the probability matrix, item parameters and effect sizes.

The conclusion on the effect of the proficiency correlation on the recovery of the different parameters is consistent with that in Part I. As the proficiency correlation increases, the recovery of the probability matrix and d-parameters becomes better, the recovery of a-parameters becomes worse, and the effect size recovery does not change much, except on Dimension 3 when no common items are selected to link that dimension.

The performance of the different methods varies with the criterion used for comparison. The content coverage is not as important as expected for the probability matrix recovery; however, it is crucial for the recovery of a-parameters and effect sizes, which are also influenced by the proportions of common items selected from the different content domains. Nevertheless, the disadvantage of not covering all content domains can be partly compensated for by high correlations between proficiencies. Method 3, which selects common items according to the proportions of unique items in the two common content domains, yields better results on most criteria, except those involving Dimension 3. On the other hand, Method 1, which selects items to achieve full content coverage, is the better choice if the item a-parameters and effect sizes are expected to be reasonably estimated for all dimensions. The classical correlation method does not work well in this part. This may be due to the unbalanced item design and the more complicated content structure, which lead to extremely unbalanced numbers of common items from the different content domains. However, this classical correlation method is worth further analysis if it can be adjusted to achieve the aforementioned proportional representativeness.

The results of this part reconfirm the importance of content coverage, even when a content domain exists in only one grade. Items measuring other, highly correlated proficiencies cannot replace the items from that domain in the common item set. However, it should be noted that including items from all content domains does not ensure a better recovery of the probability matrix.

As is well known, vertical scaling with the common item design is currently implemented in many state testing programs, such as the California English Language Development Test (CELDT), Colorado Student Assessment Program (CSAP), Connecticut Mastery Test (CMT), Delaware Student Testing Program (DSTP), Mississippi Curriculum Test (MCT), North Carolina End-of-Grade Tests (NCEOG) and Texas English Language Proficiency Assessment System (TELPAS) (Reckase, 2010). Since the scale scores are not comparable across different state testing programs, common core standards and common assessments are of special interest to practitioners, policy makers and researchers.
With the Race to the Top program motivating reforms in state and local district K-12 education, two consortia, the Partnership for Assessment of Readiness for College and Careers (PARCC) and the SMARTER Balanced Assessment Consortium (SBAC), have also shown great interest in vertical scales for assessing students' achievement and growth. However, it is the unidimensional IRT models, either the Rasch model or the 3PL model, that are commonly adopted in state testing programs. Theoretically, these two unidimensional IRT models can arguably serve as a good approximation to multidimensional IRT models only when all items in the test measure roughly the same composite of multiple proficiencies (Reckase, Ackerman & Carlson, 1988), or when all the proficiencies are highly correlated, with a correlation of 0.7 or higher as is commonly seen in practice. The high correlation among multiple proficiencies not only explains why mathematical skills can be finely divided into geometry, algebra and so on, or attributed to one general mathematical ability, but also raises heated debates on when to report subscores rather than one general score from the perspective of psychometrics. Due to the complex constructs measured in tests and the changes in curriculum and policy requirements across grade levels, if the constructs measured by tests at different grade levels are not identical, the interpretation of a single scale score obtained from those tests may not be the same. Therefore, the MIRT model seems more appropriate under conditions of multidimensionality and content shift, but there are four main concerns when the MIRT model is used in vertical scaling in practice.

First, even with expert judgment and dimensionality analysis, it is still difficult to define the constructs measured by the different tests across grades. This difficulty is compounded by the high correlation among proficiencies.

Second, from the results of this study, in order to link the scales for all content areas, off-grade items need to be administered to students even if the content area is not covered in that grade. The importance of off-grade content in tests is confirmed by Lazer, Mazzeo, Twing, Way, Camara, and Sweeney (2010), who assumed that "this out-of-grade content will mirror the instruction the student has received regardless of his or her grade level or age". Although students may be reluctant to answer items on material that was never taught in class, it is difficult to evaluate students' gains after learning activities if we have no idea about their pre-knowledge of that content. It is a trade-off; unfortunately, NCLB has prohibited this off-grade testing.

Third, the rotational indeterminacy is a major problem in MIRT vertical scaling. The Varimax and Promax methods are commonly used to constrain items to follow a simple structure by assuming that each item dominantly measures only one proficiency. However, this indeterminacy problem becomes complicated in the presence of mixed structure items and different correlation structures of proficiencies for students in different grades; therefore, more research is needed to better construct the coordinate system for the interpretation of parameter estimates in MIRT vertical scaling.
Finally, the MIRT calibration is extremely computationally intensive, whether the EM algorithm or the Markov chain Monte Carlo (MCMC) algorithm is used, and the computation time increases rapidly as the number of dimensions in the MIRT model increases. This may not be acceptable for most testing programs, since test results need to be delivered within a short period of time. But as computers become more powerful, this may cease to be a problem.

All in all, the MIRT model is more appropriate for accounting for the multidimensionality in vertical scales; however, there is still a long way to go to implement this model in practice.

6.2 Limitations and Future Research

The above conclusions should be interpreted in light of the limitations inherent in this simulation study. Future research needed to make the conclusions more solid and generalizable is also discussed below.

First, items in this study were simulated to be approximate simple structure items, which results in a rough alignment of proficiency dimension and content domain. Therefore, these two terms are sometimes used interchangeably in this study. In practice, however, multiple proficiencies are often needed to get an item correct (Reckase, 1985). Future studies could examine whether mixed structure items can be used in lieu of simple structure items measuring different proficiencies to achieve full content coverage.

Second, in order to compare the recovery results from the different selection methods, the item parameter estimates need to be rotated and adjusted so that they are on the same coordinate system as the generating parameters. In this study, the target matrix of the rotation was defined as the generating parameters, while the study by Reckase and Li (2007) adopted a target matrix with 1s as indicators of the measured dimensions and 0s elsewhere. Some trials of matching 0/1 target matrices were conducted as well, and the results were compared with those from matching the generating parameters; the difference was quite subtle.

Third, since probability values range from 0 to 1, the nonparametric Spearman's rank order correlation coefficient may be a better choice than the Pearson correlation coefficient, which assumes normally distributed variables. The probability recovery using the rank order correlation coefficient was compared with that using the Pearson correlation coefficient, and the results were quite similar.

Fourth, although multi-group analysis may be deemed more appropriate for the concurrent calibration in this study, this option is not available in the TESTFACT software. For multi-group calibration under unidimensional IRT, most software packages set the default constraints only on the distribution of the reference group and treat the means and variances of the distributions for the other groups as unknown parameters. It can be imagined, however, that in MIRT, if the elements of the variance-covariance matrices are regarded as unknown parameters for the other groups, the calibration would become much more complicated and time-consuming. Future research can verify whether the above conclusions still hold once efficient software for MIRT multi-group calibration becomes available.

Fifth, concurrent calibration is used to align the scales across the different grade levels in this study.
However, vertical scales can also be created by applying an orthogonal or oblique Procrustes rotation method to match the common item parameters estimated from separate calibrations of the two tests. This can be another topic for further study, and some thought should be given to the dimensionality issue when the constructs measured by the two tests are not identical.

Sixth, the performance of the classical correlation method is surprisingly good in Part I. However, the feasibility of this method is somewhat questionable, since the item-total-test correlation values, which depend on the population of examinees, cannot be known in advance. Although results from field tests might be used for reference, they should be used with caution, since the field test sample may not be representative of the population. Also, the performance of this method becomes worse when more distinct proficiencies are measured in the test, which makes the unidimensionality assumption more vulnerable.

Finally, because the MIRT calibration in the TESTFACT software is time-consuming, the number of replications in this study is somewhat small compared with other studies. In the future, more replications could be conducted as computation times shorten. In addition, in order to make the conclusions more generalizable, further studies could focus on data with more dimensions, since the number of constructs measured by a test is often more than two in practical settings.

APPENDIX

% code for the evaluation of Part II results in MATLAB
function evaluation_final
load item2.dat;
A=item2(:,1:3);
d=item2(:,4);
ns=3000; np=4; ndim=3; ni=40; nj=4; nr=50;
result.d=zeros([2*ni,np,nj,nr]);
result.A=zeros([2*ni,ndim,np,nj,nr]);
result.theta=zeros([2*ns,ndim,np,nj,nr]);
result.PBias=zeros([np,nj]);
result.PRmse=zeros([np,nj]);
result.PCorr=zeros([np,nj]);
result.ABias=zeros([ndim,np,nj]);
result.ARmse=zeros([ndim,np,nj]);
result.ACorr=zeros([ndim,np,nj]);
result.dBias=zeros([np,nj]);
result.dRmse=zeros([np,nj]);
result.dCorr=zeros([np,nj]);
result.ES=zeros([ndim,np,nj]);
ES=zeros([ndim,np]);
ESTemp=zeros(ndim,nr);
thetaT=zeros([ns,ndim]);
thetaTemp=zeros([2*ns,ndim]);
numT=zeros([1,ns]);
PBiasTemp=zeros([2*ns,ni,nr]);
PRmseTemp=zeros([2*ns,ni,nr]);
PCorrTemp=zeros([1,nr]);
ABiasTemp=zeros([2*ni,ndim,nr]);
ARmseTemp=zeros([2*ni,ndim,nr]);
ACorrTemp=zeros([ndim,nr]);
dBiasTemp=zeros([2*ni,nr]);
dRmseTemp=zeros([2*ni,nr]);
dCorrTemp=zeros([1,nr]);
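% The loop below iterates over the np proficiency-correlation levels (i), the
% nj selection methods (j) and the nr replications (r). For each TESTFACT run
% it reads the item parameter estimates (*.PAR) and the EAP proficiency scores
% for the two grades (*_s1.FSC, *_s2.FSC), resolves the sign indeterminacy in
% the item and person estimates, rotates the estimates onto the coordinate
% system of the generating parameters with an oblique Procrustes step, and
% accumulates the correlation, bias, RMSE and effect-size criteria that are
% averaged over replications after the replication loop.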
result.A(:,2,i,j,r)=(-1)*result.A(:,2,i,j,r); end if mean(result.A(:,3,i,j,r))<0 result.A(:,3,i,j,r)=(-1)*result.A(:,3,i,j,r); end % correction for person proficiency estimates by choosing the pair which gives the highest correlation for the recovery of probability matrix 101 PTemp1=pfunction(result.theta(:,:,i,j,r), result.A(:,:,i,j,r),result.d(:,i,j,r)); PTemp2=pfunction([result.theta(:,1:2,i,j,r),[(-1)*result.theta(1:3000,3,i,j,r); result.theta(3001:6000,3,i,j,r)]], result.A(:,:,i,j,r),result.d(:,i,j,r)); PTemp3=pfunction([result.theta(:,1:2,i,j,r),[result.theta(1:3000,3,i,j,r);(-1)* result.theta(3001:6000,3,i,j,r)]], result.A(:,:,i,j,r),result.d(:,i,j,r)); PTemp4=pfunction([result.theta(:,1:2,i,j,r),(-1)*result.theta(:,3,i,j,r)], result.A(:,:,i,j,r),result.d(:,i,j,r)); PCorrTemp1=corr(reshape(PTemp1,[],1),reshape(P,[],1)); PCorrTemp2=corr(reshape(PTemp2,[],1),reshape(P,[],1)); PCorrTemp3=corr(reshape(PTemp3,[],1),reshape(P,[],1)); PCorrTemp4=corr(reshape(PTemp4,[],1),reshape(P,[],1)); PTemp=PTemp1; [PCorrTemp_value,ind]=max([PCorrTemp1,PCorrTemp2,PCorrTemp3, PCorrTemp4]); if ind==2 result.theta(1:3000,3,i,j,r)=(-1)*result.theta(1:3000,3,i,j,r); PTemp=PTemp2; elseif ind==3 result.theta(3001:6000,3,i,j,r)=(-1)*result.theta(3001:6000,3,i,j,r); PTemp=PTemp3; elseif ind==4 result.theta(:,3,i,j,r)=(-1)*result.theta(:,3,i,j,r);PTemp=PTemp4; end % oblique Procrustes rotation to match with generating parameters T=inv(result.A(:,:,i,j,r)'*result.A(:,:,i,j,r))*result.A(:,:,i,j,r)'*A; ATemp=result.A(:,:,i,j,r)*T; m=((d-result.d(:,i,j,r))'*ATemp*inv(ATemp'*ATemp))'; dTemp=result.d(:,i,j,r)+ATemp*m; thetaTemp=(inv(T)*result.theta(:,:,i,j,r)'-m*ones(1,size(result.theta(:,:,i,j,r),1)))'; [PBiasTemp(:,:,r),PRmseTemp(:,:,r)]=criteria(PTemp, P); [ABiasTemp(:,:,r),ARmseTemp(:,:,r)]=criteria(ATemp,A); [dBiasTemp(:,r),dRmseTemp(:,r)]=criteria(dTemp, d); PCorrTemp(r)=PCorrTemp_value; ACorrTemp(:,r)=[corr(ATemp(:,1),A(:,1)),corr(ATemp(:,2),A(:,2)), corr(ATemp(:,3),A(:,3))]; dCorrTemp(r)=corr(dTemp, d); ESTemp(:,r)=ESfunc(thetaTemp(3001:end,:),thetaTemp(1:3000,:)); end result.PBias(i,j)=mean(mean(mean(PBiasTemp,3),1),2); result.PRmse(i,j)=mean(mean(sqrt(mean(PRmseTemp,3)),1),2); result.PCorr(i,j)=mean(PCorrTemp); result.ABias(:,i,j)=mean(mean(ABiasTemp,3),1); 102 result.ARmse(:,i,j)=mean(sqrt(mean(ARmseTemp,3)),1); result.ACorr(:,i,j)=mean(ACorrTemp,2); result.dBias(i,j)=mean(mean(dBiasTemp,2),1); result.dRmse(i,j)=mean(sqrt(mean(dRmseTemp,2)),1); result.dCorr(i,j)=mean(dCorrTemp); result.ES(:,i,j)=mean(ESTemp,2); end end result.PCorr result.PBias result.PRmse result.ACorr result.ABias result.ARmse result.dCorr result.dBias result.dRmse ES result.ES % function for probability calculation under 2PL compensatory MIRT function P=pfunction(theta, A, d) ns=0.5*size(theta,1); ni=0.5*size(A,1); P1=1./(1+exp(-1.7*(theta(1:ns,:)*A(1:ni,:)'+ones(ns,1)*d(1:ni)'))); P2=1./(1+exp(-1.7*(theta(ns+1:end,:)*A(ni+1:end,:)'+ones(ns,1)*d(ni+1:end)'))); P=[P1;P2]; % function related to bias and RMSE calculation function [diff,diff2]=criteria(estimate, true) diff=estimate-true; diff2=(estimate-true).^2; % function for effect size calculation function ESvalue=ESfunc(theta1, theta2) ESvalue=(mean(theta1)-mean(theta2))./sqrt((var(theta1)+var(theta2))*.5); 103 REFERENCES 104 REFERENCES Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education. Angoff, W. H. (1984). Scales, norms, and equivalent scores. 
Princeton, New Jersey: Educational Testing Service. Bé guin, A. A., Hanson, B. A., & Glas, C. A. W. (2000). Effect of multidimensionality on separate and concurrent estimation in IRT equating. Paper presented at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA. Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley. Bock, D., Gibbons, R., & Muraki, E. (1988). Full-information item factor analysis. Applied Psychological Measurement, 12(3), 261-280. Bock, D., Gibbons, R., Schilling, S., Muraki, E., Wilson, D., & Wood, R. (2003). TESTFACT 4.0 [Computer software and manual]: Test scoring, item statistics, and item factor analysis. Lincolnwood, IL: Scientific Software International. Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431-444. Braun, H. I. (2005). Using student progress to evaluate teaching: A primer on value-added models (Tech. Rep.). Princeton, New Jersey: Educational Testing Service. Camilli, G., Wang, M.-M., & Fesq, J. (1995). The effects of dimensionality on equating the law school admission test. Journal of Educational Measurement, 32(1), 79-96. Doran, H. C., & Cohen, J. (2005). The confounding effect of linking bias on gains estimated from value-added models. In R. Lissitz (Ed.), Value-added models in education: Theory and applications. Maple Grove, MN: JAM Press. Dorans, N. J., & Kingston, N. M. (1985). The effects of violations of unidimensionality on the estimation of item and ability parameters and on item response theory equating of the GRE verbal scale. Journal of Educational Measurement, 22(4), 249-262. Fang, Y. (2008). Using a projection method to estimate subscores from tests with multidimensional structures. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI. 105 Fang, Y., & Lu, Y. (2010). The effect of proficiency correlation on the application of multidimensional IRT model. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Denver, CO. Fraser, C. (1988). NOHARM II: A Fortran program for fitting unidimensional and multidimensional normal ogive models in latent trait theory. The University of New England, Center for Behavioral Studies, Armidale, Australia. Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144‐149. Haertel, E. H. (2004). The behavior of linking items in test equating (CSE Technical Report 630). Los Angeles, CA: CRESST/CSE, University of California, Los Angeles, Graduate School of Education and Information Studies. Hanson, B. A., & Bé guin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common item equating design. Applied Psychological Measurement, 26(1), 3-24. Harris, D. J. (2007). Practical issues in vertical scaling. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 233-251). New York: Springer. Hirsch, T. M. (1989). Mutidimensional equating. Journal of Educational Measurement, 26(4), 337–349. Holland P. W. (2007). A framework and history for score linking. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 5-29). New York: Springer. 
Hoover, H. D., Dunbar, S. B., & Frisbie, D. A. (2003). ITBS forms A & B guide to research and development. Itasca, IL: Riverside.
Hoskens, M., Lewis, D. M., & Patz, R. J. (2003). Maintaining vertical scales using a common item design. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Jiao, H., & Wang, S. (2006). Comparison of vertical linking designs. Paper presented at the National Conference on Large-Scale Assessment, San Francisco, CA.
Jiao, H., & Wang, S. (2007). The effects of the selection of vertical linking items on modeling student growth. Paper presented at the National Conference on Large-Scale Assessment, Chicago, IL.
Karkee, T., Lewis, D. M., Hoskens, M., Yao, L., & Haug, C. (2003). Separate versus concurrent calibration methods in vertical scaling. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Kim, S., & Cohen, A. S. (1998). A comparison of linking and concurrent calibration under item response theory. Applied Psychological Measurement, 22(2), 131-143.
Kolen, M. J. (2006). Scaling and norming. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 155-186). Westport, CT: American Council on Education and Praeger Publishers, jointly.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer.
Lazer, S., Mazzeo, J., Twing, J. S., Way, W. D., Camara, W., & Sweeney, K. (2010). Thoughts on an assessment of common core standards. Retrieved from http://www.pearsonassessments.com/NR/rdonlyres/6063DE04-2372-4EC4-96427B8A584F942F/0/ThoughtonaCommonCoreAssessmentSystem.pdf
Li, T. (2006). The effect of dimensionality on vertical scaling. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.
Li, Y. H., & Lissitz, R. W. (2000). An evaluation of the accuracy of multidimensional IRT linking. Applied Psychological Measurement, 24(2), 115-138.
Lohman, D. F., & Hagen, E. (2002). Cognitive abilities test (Form 6): Research handbook. Itasca, IL: Riverside.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17(3), 179-193.
Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14(2), 139-160.
Martineau, J. A. (2004). The effects of construct shift on growth and accountability models. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.
Martineau, J. A. (2006). Distorting value added: The use of longitudinal, vertically scaled student achievement data for growth-based, value-added accountability. Journal of Educational and Behavioral Statistics, 31(1), 35-62.
Michaelides, M. P., & Haertel, E. H. (2004). Sampling of common items: An unrecognized source of error in test equating. Los Angeles: University of California, Center for the Study of Evaluation (CSE).
Michigan Department of Education (2005). Mathematics Grade Level Expectations. Lansing, MI: Author.
Miller, T. R., & Hirsch, T. M. (1992). Cluster analysis of angular data in applications of multidimensional item-response theory. Applied Measurement in Education, 5(3), 193-211.
Min, K.-S. (2003). The impact of scale dilation on the quality of the linking of multidimensional item response theory calibrations.
Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.
Muraki, E., & Engelhard, G. (1985). Full-information item factor analysis: Applications of EAP scores. Applied Psychological Measurement, 9(4), 417-430.
Oshima, T. C., Davey, T. C., & Lee, K. (2000). Multidimensional linking: Four practical approaches. Journal of Educational Measurement, 37(4), 357-373.
Patz, R. J., & Yao, L. (2007). Methods and models for vertical scaling. In N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and scales (pp. 253-272). New York: Springer.
Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). Washington, DC: American Council on Education.
Raju, N. S., Edwards, J. E., & Osberg, D. W. (1983, April). The effect of anchor test size in vertical equating with the Rasch and three-parameter models. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Canada.
Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9(4), 401-412.
Reckase, M. D. (1997a). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21(1), 25-36.
Reckase, M. D. (1997b). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York: Springer.
Reckase, M. D. (2007). Multidimensional item response theory. In C. R. Rao & S. Sinharay (Eds.), Handbook of Statistics 26: Psychometrics (pp. 607-641). Amsterdam: North-Holland.
Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer.
Reckase, M. D. (2010). Study of best practices for vertical scaling and standard setting with recommendations for FCAT 2.0. Retrieved from http://www.fldoe.org/asp/k12memo/pdf/StudyBestPracticesVerticalScalingStandardSetting.pdf
Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1988). Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25(3), 193-203.
Reckase, M. D., & Li, T. (2007). Estimating gain in achievement when content specifications change: A multidimensional item response theory approach. In R. W. Lissitz (Ed.), Assessing and modeling cognitive development in school (pp. 189-204). Maple Grove, MN: JAM Press.
Reckase, M. D., & Martineau, J. (2004). The vertical scaling of science achievement tests. Paper presented to the Committee on Test Design for K-12 Science Achievement, Washington, DC.
Roussos, L. A., Stout, W. F., & Marden, J. I. (1998). Using new proximity measures with hierarchical cluster analysis to detect multidimensionality. Journal of Educational Measurement, 35(1), 1-30.
Schmidt, W. H., Houang, R. T., & McKnight, C. C. (2005). Value-added research: Right idea but wrong solution? In R. Lissitz (Ed.), Value-added models in education: Theory and applications (Chapter 6). Maple Grove, MN: JAM Press.
Simon, M. K. (2008). Comparison of concurrent and separate multidimensional IRT linking of item parameters. Unpublished doctoral dissertation, University of Minnesota, Minneapolis, MN.
Sinharay, S., & Holland, P. (2006a). The correlation between the scores of a test and an anchor test (ETS Research Rep. No. RR-06-04). Princeton, NJ: ETS.
Sinharay, S., & Holland, P. (2006b). Choice of anchor test in equating (ETS Research Rep. No. RR-06-35).
Princeton, NJ: ETS.
Spray, J. A., Davey, T. C., Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1990). Comparison of two logistic multidimensional item response theory models (Tech. Rep. No. ONR 90-8). Iowa City, IA: ACT.
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201-210.
Sympson, J. (1978). A model for testing with multidimensional items. In D. J. Weiss (Ed.), Proceedings of the 1977 Computerized Adaptive Testing Conference (pp. 82-98). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.
The MathWorks. (2008). MATLAB 2008: The language of technical computing [Computer program]. Natick, MA.
Turhan, A., Tong, Y., & Um, K. R. (2007). Effects of anchor item properties and dimensionality of test on vertical scaling. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.
Wang, S., Jiao, H., Young, M. J., & Jin, Y. (2006). The effects of linking designs in vertical scaling on the growth patterns of student achievement. Paper presented at the 13th International Objective Measurement Workshop, San Francisco, CA.
Wingersky, M. S., & Lord, F. M. (1984). An investigation of methods for reducing sampling error in certain IRT procedures. Applied Psychological Measurement, 8(3), 347-364.
Yao, L., & Mao, X. (2004). Unidimensional and multidimensional estimation of vertically scaled tests with complex structure. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.
Yen, W. M. (1986). The choice of scale for educational measurement: An IRT perspective. Journal of Educational Measurement, 23(4), 299-325.
Yen, W. M., & Burket, G. R. (1997). Comparison of item response theory and Thurstone methods of vertical scaling. Journal of Educational Measurement, 34(4), 293-313.