REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT By Qi Chen A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Measurement and Quantitative Methods - Doctor of Philosophy 2013 ABSTRACT REMOVE OR KEEP: LINKING ITEMS SHOWING ITEM PARAMETER DRIFT By Qi Chen IRT-based procedures using common items are widely used in test score linking. A critical assumption of linking methods is the invariance property of IRT. This assumption is that item parameters remain the same on different testing occasions when they are reported on the same θ-scale. In practice, however, there are occasions when an item parameter drifts from its original value. This study investigated the impact of keeping or removing linking items that were showing item parameter drift. Simulated data were generated under a modified three-parameter logistic model with common item non-equivalent group linking design. Factors manipulated were percentage of drifting items, type of drift, magnitude of drift, group achievement differences and choice of linking methods. The effect of item drift was studied by examining mean difference between true θs and θ-estimates, and the accuracy of the classification of examinees into proper performance categories. Results indicated that the characteristics of the drift had little impact upon the performance of the Fixed Common Item Parameter (FCIP) method or Stocking & Lord’s Test Characteristic Curve (TCC-ST) method and its influence on the performance of the Concurrent method varied depending on whether the drifting items were removed from the linking. In addition, better estimation was achieved if the drifting items were removed from the linking when the Concurrent method was used. ACKNOWLEDGEMENTS There are many people who have been supportive and helpful along the road to the completion of my dissertation. I feel grateful to their assistance, inspiration and encouragement. First of all, my deepest gratitude goes to Dr. Mark Reckase, my academic advisor and chairperson of my dissertation committee, for his scholarly guidance and constant support. He has shown substantial support, discussing the project with me, reviewing the research drafts and providing critical feedbacks. I have benefited greatly from his valuable insights and assistance during the writing process and throughout the whole journey of my Ph.D. study. I would like to express my sincere appreciation to other members in my dissertation committee: Dr. Tenko Raykov, Dr. Edward Roeber, and Dr. Ann Marie Ryan, for their excellent insights, suggestions, and assistance. They have made their time from their busy schedules to attend the meetings and have provided me with their deep insights, constructive feedbacks and thoughtful suggestions. My appreciation also extends to Dr. Sharif Shakrani and Dr. Cassandra Guarino for their valuable comments and insightful suggestions on my dissertation proposal. Their valuable comments on the design and analyses of the dissertation study are highly appreciated. I would like to express my thanks to Dr. Michael Kozlow and Dr. Xiao Pang for sharing their insights and allowing me the access to the data. I feel blessed with the opportunities to work with them. Their assistance and enthusiasm have motivated me to pursue this research topic. I would also like to express my gratitude to Dr. Yong Zhao for providing me with iii assistantship opportunities during my graduate study. I appreciate the opportunities he has given me to participate in many research projects which have helped my development as a professional researcher. I appreciate the great support of my friends Brad and Tinker for editing and proofreading the proposal and the drafts. I am deeply grateful and indebted to my family for their love and support. I could not have completed the journey without the encouragement and patience from my dear husband Haonan, the inspiration and love from my wonderful daughter Karen and the support from my loving parents. iv TABLE OF CONTENTS LIST OF TABLES………………………………………………………………………………vii LIST OF FIGURES………………………………………………………………………………ix Chapter I: Introduction………………………………………………………………….................1 1.1 Research Background………………………………………………………………….…1 Chapter II: Background and Objective of Study……………………….…………………………4 2.1 Common-Item Linking Design ……………………..……………………………………4 2.2 Linking Methods………………………………………………………………………….4 2.3 Item Parameter Drift………..…………………………………………………………….6 Chapter III: Methods and Research Design……………….…………………………………….12 3.1 Data Generation…………………..……………………………………………………..12 3.11 Generating Parameters…………………………….,…………..……………..13 3.12 Simulation of Item Parameter Drift…………………………………………..13 3.13 Group Achievement Differences………...…………………….……………...15 3.2 Calibration and Linking Procedures…………………..………………………………...16 3.3 Handling of Drifting Items………….……….…………………………………………..17 3.4 Evaluation Criteria………………..……………………………………………………..17 Chapter IV: Results and Discussion……………………………………………………………...21 4.1 Drift on Discriminating Parameter a…………………………………………………….21 4.11 Correlation between θ Estimate and True θ………………………………......22 4.12 Accuracy of θ Estimates………………………………………………………24 4.121 Bias and RMSE in Four a-drift Situations…………………...….....24 4.122 Effect of Percentage of Items Showing a-parameter Drift…………27 4.123 Effect of the Direction of a-parameter Drift…………………...…..29 4.124 Effect of the Linking Method………………………………………31 4.125 Effect of Group Difference………………………………………...37 4.126 Effect of Drifted Items Handling…………………………………..41 4.127 Effect of Drifted Items Handling at Different θ Levels……………47 4.13 Accuracy of Performance Level Classification……………………………… 68 4.2 Drift on Difficulty Parameter b……………………………………………………….... 88 4.21 Correlation between θ Estimate and True θ………………………………......88 4.22 Accuracy of θ Estimates………………………………………………………92 4.221 Bias and RMSE in Eight b-drift Situations…………………...........92 v 4.222 Effect of Percentage of Items Showing b-parameter Drift…………97 4.223 Effect of the Direction of b-parameter Drift………………….......100 4.224 Effect of the Magnitude of b-parameter Drift…………………….102 4.225 Effect of the Linking Method……………………………………..104 4.226 Effect of Group Difference……………………………………......112 4.227 Effect of Drifted Items Handling………………………………....121 4.228 Effect of Drifted Items Handling at Different θ Levels………..…131 4.23 Accuracy of Performance Level Classification……………………………..171 Chapter V: Conclusions, Implications and Future Research……………………….…………200 5.1 Conclusions…………………………………………………………………………….200 5.2 Implications…………………………………………………………………………….203 5.3 Limitations and Future Directions……………………………………………………..205 APPENDIX……………………………………………………………………………………..207 BIBLIOGRAPHY…………………………………………………………………………..…..209 vi LIST OF TABLES Table 3.1 Descriptive Statistics of the Item Parameters………………………………………...13 Table 3.2 Summary of Simulated Conditions…………..………………………………………..16 Table 4a.1 Average Correlation Coefficients between θ Estimates and True θs when a-parameter Drifting (with SDs in Parentheses)……………………………………...……………………….23 Table 4a.2 Bias for θ Estimates when a-parameter Drifting……………..……….……………..26 Table 4a.3 RMSE for θ Estimates when a-parameter Drifting………..…………………………27 Table 4a.4 Change in bias and RMSE as More Items Drifting in a-parameter……………...….30 Table 4a.5 Change in bias and RMSE with a-parameter Drifting in Different Directions...........32 Table 4a.6 Percentage in Each Performance Level Classification (N=3, Drift=a+0.4)………..71 Table 4a.7 Percentage in Each Performance Level Classification (N=3, Drift=a-0.4)……..….76 Table 4a.8 Percentage in Each Performance Level Classification (N=8, Drift=a+0.4)………..81 Table 4a.9 Percentage in Each Performance Level Classification (N=8, Drift=a-0.4)…….…..85 Table 4b.1 Average Correlation Coefficients between θ Estimates and True θs when b-parameter Drifting (with SDs in Parentheses)………………………………………………………………90 Table 4b.2 Bias for θ Estimates when b-parameter Drifting…………..………………………...95 Table 4b.3 RMSE for θ Estimates when b-parameter Drifting…….……….……………………96 Table 4b.4 Changes in bias and RMSE with More Items Drifting in b-parameter (3 items 8 items)…………………………………………………………………………………………….99 Table 4b.5 Changes in bias and RMSE with b-parameter Drifting in Different Directions (Positive Drift Negative Drift)………………………………………………………………101 vii Table 4b.6 Changes in bias and RMSE as the size of b-parameter Drift Increases (0.2 0.4)……………………………………………………………………………………………..103 Table 4b.7 Percentage in Each Performance Level Classification (N=3, Drift=b+0.2).….......176 Table 4b.8 Percentage in Each Performance Level Classification (N=3, Drift=b-0.2)….…….179 Table 4b.9 Percentage in Each Performance Level Classification (N=3, Drift=b+0.4)…..…..182 Table 4b.10 Percentage in Each Performance Level Classification (N=3, Drift=b-0.4)………185 Table 4b.11 Percentage in Each Performance Level Classification (N=8, Drift=b+0.2)..……188 Table 4b.12 Percentage in Each Performance Level Classification (N=8, Drift=b-0.2)………191 Table 4b.13 Percentage in Each Performance Level Classification (N=8, Drift=b+0.4)….….194 Table 4b.14 Percentage in Each Performance Level Classification (N=8, Drift=b-0.4)..……..197 Table 6 Population Item Parameters Used for Simulations…………...……………...……….208 viii LIST OF FIGURES Figure 3.1 Design of Dataset Simulation………………………………………………………..13 Figure 4a.0.1 Comparison of Linking Methods (Mean Year 1= Mean Year 2; All Items Included)…………………………………………………………………………………………34 Figure 4a.0.2 Comparison of Linking Methods (Mean Year 1< Mean Year 2; All Items Included)…………………………………………………………………………………………34 Figure 4a.0.3 Comparison of Linking Methods (Mean Year 1> Mean Year 2; All Items Included)…………………………………………………………………………………………35 Figure 4a.0.4 Comparison of Linking Methods (Mean Year 1= Mean Year 2; Drifted Items Dropped)…………………………………………………………………………………………35 Figure 4a.0.5 Comparison of Linking Methods (Mean Year 1< Mean Year 2; Drifted Items Dropped)…………………………………………………………………………………………36 Figure 4a.0.6 Comparison of Linking Methods (Mean Year 1> Mean Year 2; Drifted Items Dropped)…………………………………………………………………………………………36 Figure 4a.1.1 Effect of Group Difference (Concurrent; All Items Included)………………….....38 Figure 4a.1.2 Effect of Group Difference (FCIP; All Items Included)……………………..........38 Figure 4a.1.3 Effect of Group Difference (TCC-ST; All Items Included)……………………......39 Figure 4a.1.4 Effect of Group Difference (Concurrent; Drifted Items Dropped)………………..39 Figure 4a.1.5 Effect of Group Difference (FCIP; Drifted Items Dropped)………………...........40 Figure 4a.1.6 Effect of Group Difference (TCC-ST; Drifted Items Dropped in Linking but Kept in Scoring)………..………………………………………………………………………………....40 Figure 4a.1.7 Effect of Group Difference (TCC-ST; Drifted Items Dropped)……………….......41 Figure 4a.2.1 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 = Mean Year 2)……………………………………………………………………………………………...….43 ix Figure 4a.2.2 Comparison of Handling Drifted Items (FCIP; Mean Year 1 = Mean Year 2)..…43 Figure 4a.2.3 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 = Mean Year 2)...44 Figure 4a.2.4 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 < Mean Year 2)……………………………………………………………………………………………...….44 Figure 4a.2.5 Comparison of Handling Drifted Items (FCIP; Mean Year 1 < Mean Year 2)…...45 Figure 4a.2.6 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 < Mean Year 2)...45 Figure 4a.2.7 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 > Mean Year 2)……………………………………………………………………………………………...….46 Figure 4a.2.8 Comparison of Handling Drifted Items (FCIP; Mean Year 1 > Mean Year 2)…..46 Figure 4a.2.9 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 > Mean Year 2)...47 Figure 4a.3.1 Mean bias at θ Intervals (3 Items a-drift +0.4)………...……...............................48 Figure 4a.3.2 Mean bias at θ Intervals (3 Items a-drift -0.4)……………...................................54 Figure 4a.3.3 Mean bias at θ Intervals (8 Items a-drift +0.4)………………..............................59 Figure 4a.3.4 Mean bias at θ Intervals (8 Items a-drift -0.4)……………...................................64 Figure 4b.0.1 Comparison of Linking Methods (Mean Year 1= Mean Year 2; All Items Included)………………………………………………………………………………………..106 Figure 4b.0.2 Comparison of Linking Methods (Mean Year 1= Mean Year 2; Drifted Items Dropped)……………………………………………………………………………………….107 Figure 4b.0.3 Comparison of Linking Methods (Mean Year 1< Mean Year 2; All Items Included)……………………………………………………………………………………..…108 Figure 4b.0.4 Comparison of Linking Methods (Mean Year 1< Mean Year 2; Drifted Items Dropped)…………………………………………………………………………………..……109 Figure 4b.0.5 Comparison of Linking Methods (Mean Year 1> Mean Year 2; All Items Included)……………………………………………………………………………………..…110 x Figure 4b.0.6 Comparison of Linking Methods (Mean Year 1> Mean Year 2; Drifted Items Dropped)…………………………………………………………………………………….….111 Figure 4b.1.1 Effect of Group Difference (Concurrent; All Items Included)……………….......114 Figure 4b.1.2 Effect of Group Difference (Concurrent; Drifted Items Dropped)……..………..115 Figure 4b.1.3 Effect of Group Difference (FCIP; All Items Included)…..…………………......116 Figure 4b.1.4 Effect of Group Difference (FCIP; Drifted Items Dropped)..…………...............117 . Figure 4b.1.5 Effect of Group Difference (TCC-ST; All Items Included)…………..………..…118 Figure 4b.1.6 Effect of Group Difference (TCC-ST; Drifted Items Dropped in Linking but Kept in Scoring)…….…………………………………………………………………………………...119 Figure 4b.1.7 Effect of Group Difference (TCC-ST; Drifted Items Dropped)…...……………..120 Figure 4b.2.1 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 = Mean Year 2)………………………………………………………………………………...……………...122 Figure 4b.2.2 Comparison of Handling Drifted Items (FCIP; Mean Year 1 = Mean Year 2)…123 Figure 4b.2.3 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 = Mean Year 2).124 Figure 4b.2.4 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 < Mean Year 2)……………………………………………………………………………………………..…125 Figure 4b.2.5 Comparison of Handling Drifted Items (FCIP; Mean Year 1 < Mean Year 2)….126 Figure 4b.2.6 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 < Mean Year 2).127 Figure 4b.2.7 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 > Mean Year 2)…………………………………………………………………………………………......…128 Figure 4b.2.8 Comparison of Handling Drifted Items (FCIP; Mean Year 1 > Mean Year 2)….129 Figure 4b.2.9 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 > Mean Year 2).130 Figure 4b.3.1 Mean bias at θ Intervals (3 Items b-drift +0.2)…...………….............................132 xi Figure 4b.3.2 Mean bias at θ Intervals (3 Items b-drift -0.2)……..………...............................137 Figure 4b.3.3 Mean bias at θ Intervals (3 Items b-drift +0.4)……...……….............................142 Figure 4b.3.4 Mean bias at θ Intervals (3 Items b-drift -0.4)…………….................................147 Figure 4b.3.5 Mean bias at θ Intervals (8 Items b-drift +0.2)………...…….............................152 Figure 4b.3.6 Mean bias at θ Intervals (8 Items b-drift -0.2)……...………..............................157 Figure 4b.3.7 Mean bias at θ Intervals (8 Items b-drift +0.4)……...……….............................162 Figure 4b.3.8 Mean bias at θ Intervals (8 Items b-drift -0.4)………...……...............................167 xii Chapter I: Introduction 1.1 Research Background Scores from large scale assessment are commonly used as good indicators of student performance. Educational policy makers, administrators and educators hope to compare how students are doing from year to year. However, because the test is administrated each year, different test forms need to be used to ensure test security. Practically, no test developer can guarantee the equivalency of different test forms despite vigorous efforts to ensure that equivalency. Hence, to achieve comparability, practitioners need to link the test scores. One way to achieve comparability is through a common item linking design. There are various approaches to common item linking -- some based on Item Response Theory (IRT) models. In Item Response Theory, item parameters are estimated and are assumed to be invariant under linear transformation (Lord, 1980). Several methods have been developed to place the item parameters on a common metric, including linear transformation of separate calibrations, fixed common item parameter (FCIP) calibrations and concurrent calibration. The invariance property of item response theory assumes that the item characteristic curves for a test item should be the same when estimated from data from two different populations. The linear relationship between θ-estimates and item parameter estimates indicates that the difference in the scaled scores is the result of difference in θs across groups or over time, when the parameters would remain unchanged if they were on the same scale. However, in practice, this assumption of invariance does not always hold. When an item in the same test functions differently for different subgroups with the same degree of proficiency, it is called differential 1 item functioning (DIF; Holland & Wainer, 1993). When the statistical properties of the same items change on different testing occasions, it is called item parameter drift (IPD) (Goldstein, 1983; Bock, Muraki, & Pfeiffenberger, 1988). Item parameter drift has the potential to negatively affect the validity of the score scale conversion. Although there has been some research on item parameter drift, it is not as extensive as research on DIF. In practice, items flagged for IPD are often removed from the linking items in the estimation of linking coefficients (Cook & Eignor, 1991). However, since research comparing the IPD detection methods has found that the effectiveness of these methods depends on the testing situation (Donoghue & Isham, 1998; DeMars, 2004), it is likely that some item parameter drift goes undetected while some items are improperly flagged as drifting. Research has found some possible sources of drift, such as change in curriculum (Goldstein, 1983; Bock, Muraki, & Pfeiffenberger, 1988), context effects (Eignor, 1985), sample statistics (Cook, Eignor, & Taft, 1988), content of items (Chan, Drasgow, & Sawin, 1999), and item over-exposure (Veerkamp & Glas, 2000). These reasons for item parameter drift may or may not be related to the construct being measured. It will be a source of linking error to keep items that are drifting due to construct-irrelevant factors. However, it creates another source of linking error to remove items whose drift is closely related to the construct being measured (Miller & Fitzpatrick, 2009). The research on the impact of IPD on θ-estimates has produced mixed results. Some research studies found that IPD had little effect on θ-estimates (Wells, Subkoviak, & Serlin, 2002; Witt, Stahl, & Bergstrom, 2003; Rupp & Zumbo, 2003). On the other hand, other research has 2 found that IPD could compound over multiple testing occasions and that the choice of linking model could have a large effect on the θ-estimates (Wollack, Sung & Kang, 2005; 2006). In most of the research on the effect of IPD, items exhibiting IPD were removed from the linking set of items and the test characteristic curve method (TCC) was often chosen as the linking method (Stocking & Lord, 1983). However, drifted items may remain as the linking items due to the ineffective detection of IPD. When the drift is related to the construct being measured, the items should not be removed. This study intends to compare the effects of IPD on θ-estimates when the drifted items are either kept or removed from the linking set. As well, the interaction between the handling of the drifted items and the linking methods will be examined. The linking methods used in this study are Stocking and Lord’s test characteristic curve method (CCM) (Stocking & Lord, 1983), fixed common item parameter (FCIP) calibrations, and concurrent calibration (Hambleton, Swaminathan & Rogers, 1991; Kolen & Brennan, 1995). 3 Chapter II: Background and Objective of Study 2.1 Common-Item Linking Design Essentially, the process of linking is to change θ-estimates from one test to θ-estimates on the equivalent trait scale on another test (Holland & Dorans, 2006). In many large-scale testing programs, the common item non-equivalent group linking design is widely used. In this design, two or more test forms are created with a set of items in common, and these test forms are used in different test administrations (Kolen & Brennan, 1995). The item parameters obtained from different test forms are placed on the same scale by using the common items as linking items. 2.2 Linking Methods Item parameters obtained from different test forms need to be aligned on the same scale. There are a variety of alignment approaches to achieve this purpose: linear procedures based on θ-estimates, fixed common item parameters, the mean-mean method, the mean-sigma method, the test characteristic curve method, and concurrent calibration (Kolen & Brennan, 1995; Yen & Fitzpatrick, 2006; Holland & Dorans, 2006). Because of the importance of choosing an appropriate linking method to ensure the accuracy of the linking results, there has been research comparing the merits of different linking methods. Some of the studies have investigated the strengths and weaknesses of those IRT-model-based linking methods (Baker & Al-Karni, 1991; Kim & Cohen, 1998; Henson & Beguin, 1999, 2002; Li, Griffith & Tam, 1997; Jodoin, Keller & Swaminathan, 2003; Li, Tam & Tompkins, 2004). One kind of linking method is to transform parameter estimates obtained from two separate calibrations onto a common scale through a linear scale transformation. The Stocking & Lord 4 (1983) CCM method was found to yield more stable results than moment methods when data sets are typically troublesome to calibrate (Baker, 1991). The better performance of the Stocking-Lord method over the moment methods was documented in the literature (Hanson & Beguin, 2002). Another commonly used linking method is the fixed common item parameter method. The pre-calibrated item parameter estimates for the common items are fixed while calibrating the non-common items, and then item parameter estimates from the non-common items are placed on the same scale as the fixed parameters. This method does not require the computation of scale transformation coefficients. Li, Griffith, and Tam (1997) found that both the FCIP linking method and the characteristic curve linking method provided stable θ estimates, except for students with low θ values and that the item parameter estimates calibrated with these two methods were consistent except for the estimation of guessing parameter under CCM. Concurrent calibration is also a widely used linking method. Parameters for items from multiple test forms are estimated in a single calibration run. The simulation study by Kim and Cohen (1998) found that separate and concurrent calibration provided similar results when the number of common items was large, but separate calibration provided more accurate results when the number of common items was small. Hanson and Beguin (2002) found that concurrent calibration generally yielded more accurate results, although the results were not sufficient to support total preference for concurrent estimation. Although there has been research comparing these IRT linking methods, there has been not sufficient evidence as to which one is the best. Each method has its own merit. Keller, Jodoin, 5 Rogers and Swaminathan (2003) compared linear, FCIP and concurrent linking procedures in detecting academic growth and found that the type of linking method used resulted in differences in mean growth and classification. Lee and Ban (2007) compared concurrent calibration, the Stocking-Lord method and the fixed item parameter linking procedures in the random group linking design. They found that the relative performance of different linking procedures varies with the measurement conditions. So no conclusion can be drawn about one preferred procedure for all occasions. 2.3 Item Parameter Drift In Item Response Theory, the IRT estimates are assumed to be invariant up to a linear transformation (Lord, 1980). For example, under a 3PL model, the probability of a correct response to the i th item is given by P(θ ) = ci + 1 − ci −1.7 a (θ −b ) i i 1+ e (2.1) where ai is the item discrimination, bi is the item difficulty and ci is the pseudo-guessing parameter for item i . A linear transformation of the parameters will produce the same probability of a correct response. For example, let θ Jk = Aθ Ik + B , (2.2) a Ji = aIi / A , (2.3) bJi = AbIi + B , (2.4) and c Ji = cIi (2.5) 6 where A and B are constants, θ Jk and θ Ik are values of θ for individual k on Scale J and Scale I, aJi, bJi and c Ji are the item parameters for item i on Scale J and a Ii , bIi , and cIi are the item parameters for item j on Scale I. The c-parameters do not change with the linear transformation of scale. The probability of correctly answering item i for an examinee with θ Jk (equation 2.1) is c Ji + 1 − c Ji −1.7 a (θ −b ) , Ji Jk Ji 1+ e Which equals (with the expressions from equations (2.2)-(2.5)) cIi + 1 − cIi 1+ e ( −1.7 a / A ( Aθ + B ) −( Ab + B ) Ii Ik Ii = cIi + ) 1 − cIi 1+ e −1.7 a (θ −b ) Ii Ik Ii which is exactly the probability of correctly answering item i for the same examinee with θ Ik on the alternative scale ( Hambleton & Swaminathan, 1985; Kolen & Brennan, 1995). However, in practice, item parameters do not always remain unchanged. When an item performs differently for examinees of comparable proficiency, it is defined as differential item functioning (DIF, Holland & Wainer, 1993). Goldstein (1983) developed a general framework of the change of item characteristics or parameter values over time. Research has found a number of possible sources of item parameter drift. Goldstein (1983) suggested some possible reasons such as changing curriculum content, different social demands for knowledge and skills, and so on. Bock, Muraki and Pfeiffenberger (1988) analyzed item response data from 10-year administrations of the College Board Physics Achievement Test. 7 They found differential drift occurred with the change in curricular emphasis. When teachers began to focus more on basic topics of mechanics rather than advanced topics, the difficulty of the mechanics questions increased. As the English units of measurement were phased out of the physics curriculum, the slopes for items using English units and metric units were changing directions. Chan, Drasgow and Sawin (1999) observed item parameter drift on the Armed Services Vocational Aptitude Battery over a 16-year period and found that the drift was related to changing demands for knowledge. They found that tests with more semantic/knowledge content seemed to have higher rates of item drift. Eignor (1985) found that for reading tests, item drift could be explained by the location of reading passages and position of items in the test. Sykes and Fitzpatrick (1992) tried to explain the drift in item difficulty after consecutive administrations of a professional licensure examination and found that change in difficulty parameter was related neither to changes in the booklet or test position of the items, nor to the item type. Cook, Eignor and Taft (1988) investigated curriculum-related achievement tests given in spring or fall and concluded that tests taken at different times during the school year might have measured different attributes. Veerkamp and Glas (2002) investigated item drift in adaptive testing due to previous exposure of the item. Giordano, Subhiyah, and Hess (2005) analyzed the item exposure on take-home examinations in the medicine and its influences on the difficulty of the exam. There has been research on how to detect item parameter drift. Researchers have used DIF 2 procedures such as the Mantel-Haenszel procedure, Lord’s X measure , Kim & Cohen’s (1991) closed interval measures and Raju’s (1988) exact signed- and unsigned-integral measures and 8 2 Kim, Cohen, & Park’s (1995) X test for multiple-group differential DIF; analysis of covariance models (Sykes & Fitzpatrick, 1992); restricted item response models (Stone & Lane, 1991); the cumulative sum (CUSUM) chart – a statistical quality control used in production processes (Veerkamp & Glas, 2002); and the procedure in BILOG-MG for estimating linear trends in item difficulty. In their study comparing the procedures for detecting IPD, Donoghue and Isham (1998) found that Lord’s measure was the most effective in detecting drift on the condition that the item’s guessing parameter was controlled to be equal across calibrations. Their findings suggested that the effective functioning of the detection method depended on the specific testing situation. In a study by DeMars (2004), the linear drift procedure in BILOG-MG and the modified-KPC were found effective in identifying drift, similar to those represented in this study, but these procedures falsely identified non-drift items. There has been research on the effect of item parameter drift on the estimation of an examinee’s θ, but the research is not extensive and the conclusions are not conclusive. Wells, Subkoviak, and Serlin (2002) simulated item response data using the two-parameter logistic model for two testing occasions. The factors they manipulated in the study included the percentage of items that exhibited IPD, types of drift, sample size and test length. Drift was simulated by increasing the difficulty parameters by 0.4 and the discriminating parameters by 0.5. Their results suggested that item parameter drift as simulated in their study had a small effect on θ-estimates. The study also illustrated the robustness of the 2PL model despite the violation of the invariance property. Witt, Stahl, and Bergstrom (2003) investigated the effects of IPD on the 9 stability of test taker θ-estimates and pass/fail status under the Rasch model. The researchers used real, non-normal distribution of examinee θ values. Six levels of shifts in difficulty parameter were simulated. The results of the study illustrated the robustness of the Rasch model in spite of item drift, even when the true θs were not normally distributed. The θ-estimation was stable under moderate drift in item difficulties. Similarly, Rupp and Zumbo (2003) concluded from their study that IRT θ-estimates were relatively robust, with moderate amounts of item parameter drift. In Wollack, Sung & Kang’s study(2005) of longitudinal item parameter drift, over 7-year’s worth of test forms from a German placement test were linked. The results showed that the choice of linking/IPD model could have a large impact on the resulting θ-estimates and passing rates. The simulation and real data studies by Wollack, Sung & Kang (2006) further supported this conclusion. They found that direct linking of each new form to the base form was slightly better than indirect linking. Models with TCC linking were compared with models that used the fixed parameter linking method in the study and the TCC linking process was found to perform better. The inconsistency in the findings about the effect of item parameter drift indicates that further studies need to be conducted to explore the effect of IPD from the perspectives of its interaction with factors such as linking procedures and treatment of the drifting items. First, for the common-item linking design, no conclusion has been reached as to what is the most effective linking procedure. Therefore, it might provide some interesting insights if the comparison of linking procedures could be combined with the study of the effect of IPD. 10 Second, there has been very limited research on how to handle the drifting items. In practice, items flagged for IPD are often removed from the set of items linking two or more test forms (Cook & Eignor, 1991). However, that is not always a proper way of treating item parameter drift. Before removing the drifting items from the linking set, an examination should be conducted about the property of the drift to see whether the drift is related to the construct being measured. Some possible sources for item parameter drift could be irrelevant to the construct being measured, e.g. the drift due to over-exposure of the item, or change in item parameter because of the change of item position. If item drift occurs as a result of the construct-irrelevant factors, then keeping these drifting items in the linking set will be an incorrect way of handling IPD, resulting in linking errors. However, if the item parameter drift cannot be explained by construct-irrelevant factors, it is likely that the drift is related to the construct being measured. In this case, if the drifting items are dropped out of the linking set, they become another source of linking error (Miller and Fitzpatrick, 2009). Moreover, in a real testing situation, items that drift are only items “flagged” by one or more methods of detecting IPD. There can be linking errors from the false detection of IPD. The objective of the proposed study is to investigate the effects of item parameter drift on θ-estimates when the items exhibiting drift are treated in different ways: either kept or removed from the set of linking items. The study also explores, in the presence of item parameter drift, the performance of three commonly used linking procedures – the Stocking and Lord TTC method, the fixed common item parameter method and concurrent calibration. 11 Chapter III: Methods and Research Design 3.1 Data Generation In the study, data were generated to simulate a large scale assessment of mathematics. To focus on the effect of item parameter drift under the 3PL model, only multiple-choice items were considered in this study. Item response data were generated to simulate two test administrations one year apart. The test form included 30 operational items and 30 field test items. The items that appeared as field test items in one testing year became operational items in the following testing year and they serve as common items that link the two testing occasions that are one year apart. Item responses were simulated for 3000 examinees taking the test each year. The study tried to model a real testing situation where the test form consisted of operational items and field test items. The operational items in one testing year were field test items in the previous year. So the number of operational items was almost the number of common items. However, in the real situation, a matrix-sample design was used in field testing items. The field test items were divided into subsets and placed into several booklets. Each examinee worked on all the operational items plus a subset of field test items. In this study, to minimize sampling errors, the matrix sampling design was not simulated. Instead, item responses were generated for all examinees answering all the operational and field test items. Figure 3.1 shows the design of the simulated datasets. 12 Figure 3.1 Design of Dataset Simulation Year One Group Year Two Group 3.11 30 Unique Items Operational Year One (Considered missing responses in Linking) 30 Common Items Field test Year One Operational Year Two Field test Year Two (not simulated) Generating Parameters A set of 60 item parameter estimates from the 2006/2007 Canadian provincial mathematics assessment was used as true parameters for generating baseline data. A modified three-parameter logistic model was used in estimating the item parameters where the guessing parameter was fixed to 0.2. The mean difficulty is -0.1199 and the mean slope is 0.5624. Modifications were made to randomly selected a or b parameters to reflect item parameter drift, while the c-parameter was set at 0.2. Table 3.1 describes the distribution of the true item parameters. Table 3.1 Descriptive Statistics of the Item Parameters Parameter a b c 3.12 N Minimum Maximum 60 0.2621 1.1495 60 -2.1495 1.7984 60 0.2000 0.2000 Mean 0.5624 -0.1199 0.2000 Std. Deviation 0.1885 1.1225 0.0000 Simulation of Item Parameter Drift To ascertain whether the effect of item parameter drift on θ-estimates differed when more items were showing drift or when items drifted further away from their original values, the number of items drifting and level of drift were manipulated. 13 In one condition about 10% and in another condition 25% of the items were randomly selected to exhibit item parameter drift. When 10% of the items were drifting, it suggested a scenario that the number of items drifting was not large and the drift might go undetected. When 25% of the items were drifting, the drift was not likely to be ignored and whether to keep or remove those drifting items could have a larger effect on θ-estimates. Data were also generated with a no-drift condition to serve as baseline. Two types of drift were simulated: drift on discriminating parameter a and drift on difficulty parameter b. The a-parameter drift was simulated by manually increasing or decreasing the a parameter by 0.4. In previous research, a similar magnitude of a-drift was adopted, e.g. a-drift of 0.3 (Donoghue & Isham, 1998) or 0.5 (Wells et al., 2002). Two levels of b-parameter drift were simulated by increasing or decreasing the parameter by 0.2 or 0.4. A drift of 0.2 simulated a moderate amount of item parameter drift while a drift of 0.4 simulated a large amount of drift. The same magnitude of drift (0.4) was used in other studies of IPD, such as Wells et al. (2002), Donoghue and Isham (1998) and Wollack (2006). The changes in p-values of the items with each amount of drift were also examined. To study the effects of a-drift and b-drift separately, the two types of drifts were not mixed in any one condition. Meanwhile, simulated drift were restricted to one direction. In practice, drift is likely to go in either direction. However, mixing positive and negative drift within a test can result in cancellation of drift and give less information on the effects of IPD. Thus, under either of the conditions studied, there was one parameter drifting with the drift always increasing or always decreasing. Though this unidirectional drift design was an oversimplification of how drift 14 actually occurred in real testing, it represented a worst-case-scenario where the effect of drift was unlikely to be ignored. 3.13 Group Achievement Differences When item parameters drift, it sometimes indicates change in achievement of the group taking the test, such as changes in curriculum, or in policy. To investigate how drifting items would help identify the changes in group achievement, data were simulated for both equivalent and non-equivalent groups. For examinees taking the test in YEAR1, the set of item responses was generated by sampling the latent trait (θ ) from a normal independent distribution (NID) with mean 0 and standard deviation 1 (NID(0,1)). For examinees taking the test in YEAR2, three sets of items responses were generated by sampling θ from an NID (0,1) distribution, an NID (0.2,1) distribution and an NID (-0.2,1) distribution. The 0.2 shift represented a moderate increase in θ from one year to another. A similar magnitude of increase in achievement was used in other research. Donoghue & Isham (1998) used 0.1 and Wollack et al. (2006) used 0.15 as a yearly increase in student achievement. The group differences were designed such that item parameter drift could be examined in different situations: 1) when there was no remarkable policy change between the two testing years and populations were assumed to be equivalent in achievement; 2) when there was noticeable policy implementation that might affect student learning and students were expected to fluctuate in achievement. Under a modified 3PL model, with the guessing parameter fixed at 0.2, item response data 15 were generated for the above 36 conditions: percentage of drift (2) X type and level of drift (6) X group achievement difference (3). 3.2 Calibration and Linking Procedures When different methods are used for linking tests, the effect of keeping or removing the drifting items on θ-estimates might differ. Three commonly used IRT linking methods were examined in the study: 1) Concurrent calibration, 2) Fixed Common Item Parameter calibration, and 3) Stocking & Lord’s test characteristic curve method. Table 3.2 Summary of Simulated Conditions Drift Group Difference Linking Method 10% (3 items) drifting 25% (8 items) drifting a-parameter +0.4 -0.4 +0.4 -0.4 +0.2 -0.2 +0.2 -0.2 b-parameter +0.4 -0.4 +0.4 -0.4 Year One [NID(0,1)] and Year Two [NID(0,1)] Year One [NID(0,1)] and Year Two [NID(0.2,1)] Year One [NID(0,1)] and Year Two [NID(-0.2,1)] Concurrent calibration (Concurrent) Fixed Common Item Parameter calibration (FCIP) Stocking & Lord’s test characteristic curve method (TCC-ST) The computer program PARSCALE 4 (Muraki & Bock, 2003) was used to calibrate the parameters. In concurrent calibration, responses from the YEAR1 test and YEAR2 test were combined for one concurrent run to estimate the parameters. When the fixed common item parameter method was used, item parameters were first estimated from a separate calibration for the YEAR2 test. Then the item parameters for the common items were fixed while a calibration for YEAR1 test was done. Thus, the item parameters were placed on the YEAR2 scale. 16 With Stocking & Lord’s test characteristic curve method, item parameters were first estimated through two separate calibrations for YEAR1 test items and YEAR2 test items. The linking coefficients were then obtained from the common items using the test characteristic curve method of Stocking & Lord. Using the linking coefficients, item parameter estimates from the YEAR1 test were placed on the scale of the YEAR2 test. For Stocking-Lord transformation, the computer program ST (Hanson, Zeng & Cui, 2004) was used. 3.3 Handling of Drifting Items Two ways of handling the drifting items were compared: either to treat them or to ignore them. Those items whose parameters had been manually altered during data generation were considered drifting items. The drift was considered as treated when the drifting items were dropped from the linking items. The drift was considered as ignored when the drifting items were kept in the linking items, which was to simulate a scenario where item parameter drift was either undetected or drift was construct-relevant. To compare the two ways of handling drifting items, when item parameter drift was present, each of the linking methods was applied twice, once with drifting items included in the common items and later with the drifting items removed from the linking items. For the Stocking & Lord’s method, it was applied a third time when the drifting items were removed from the linking items but they were included in the scoring. 3.4 Evaluation Criteria Several indices were used to evaluate the effect of the choice of treating item parameter drift and linking method on θ-estimates. Correlation between true θs and the θ-estimates was one index. Bias and the root mean square error (RMSE) were used to assess the accuracy of 17 θ-estimates. One benefit of a simulation study is that the bias and RMSDs between the estimates and the true θs can be obtained. These indices will indicate the accuracy of θ-estimates. If negatively biased, the θ is under-estimated, otherwise it is over-estimated. The smaller the RMSDs are, the better the estimation method is. The bias and RMSE were calculated as follows (as in Li, Tam & Tompkins, 2004): p ˆ ∑ (H i − H i ) bias ( H ) = i =1 (3.1) p and p ˆ ∑ (H i − H i )2 RMSE ( H ) = i =1 p (3.2) ˆ where H i is the true θ, H i is the corresponding estimate, and p is the total number of examinees. In calculating RMSE and bias the θs that were used to generate the item response data were transformed to match the scales from the estimates. With the Stocking-Lord TCC method and the fixed common item parameter method used in this study, θ-estimates for the YEAR One students were placed on the scale of the YEAR Two estimates after linking. With the concurrent calibration linking method, θ-estimates for the YEAR One students were placed on the scale of the combined YEAR One and YEAR Two estimates after linking. When computing RMSE and bias, the θ-values generating data for YEAR One students were transformed onto the YEAR Two 18 scale (YEAR One and YEAR Two combined scale for concurrent calibration linking method) first and then these transformed θ-values will be used as the true θ values to be compared with the θ-estimates of the YEAR One students. The θ values used in generating data are transformed to scaled θ-values through linear transformations (Kolen, 2006): S ( x) = ⎛ ⎞ σs σ x + ⎜ μs − s μ x ⎟ ⎜ ⎟ σx σx ⎝ ⎠ where X is the raw score (the generating θ-values), S is the scale score (scaled θ-values), and (3.3) μx σ x are the mean and standard deviations of raw values, and μ s and σ s are the mean and standard deviations of the scaled values. Another index is the percentage of examinees classified into appropriate performance levels, especially the examinees in levels meeting or above the proficient level. Many large-scale assessments report the performance level of an examinee in addition to or instead of an individual score. Therefore, the proportion of correct classification is one indication of the quality of the estimate. To check this index, the θ cut score for each performance level was set, following the guidelines used in the real assessment. In the 2006/2007 Canadian provincial mathematics assessment, there are four performance levels, Level 3 being the provincial target level. Examinees in Level 3 or 4 are considered to meet or surpass the provincial target level. After θ-estimates from the previous year are placed on the same scale of the current testing year, the percentage of students in each level in the previous year will be used to find the θ thresholds for performance levels in the current year. For this simulation study, the cumulative percentages 17.5%, 57.3%, 94.5% were used to find θ cuts for each level. Performance level, classified 19 according to the examinee’s true θ, was considered as his/her true performance level, while the level classified through estimated θ was the estimated performance level. To examine different ways of treating drifting items, the proportion of correct classification for each level were compared, and special attention were paid to the pass/fail status at the provincial level, as the percentage of students at or above this level was the important index of how schools were progressing. 20 Chapter IV: Results and Discussion The results are presented and discussed for the different types of parameter drift simulated in this study. The conditions that were manipulated included the number of drifted items, the level of drift and the direction of drift. The effects of the item parameter drift on θ-estimates are compared when the drifted items are handled differently. The first part will focus on the effect of a-parameter drift on θ-estimates and the second part will examine the effect of b-parameter drift. In all the tables and figures presented in the results part, “Group Difference” refers to three types of group ability change: 1) examinees taking exams in both years were equivalent groups in ability (“Year One (0,1), Year Two (0,1)”), 2) examinees were non-equivalent in ability with higher ability in the following year (“Year One (0,1), Year Two (0.2, 1)”), and 3) examinees were non-equivalent in ability with lower ability in the following year (“Year One (0,1), Year Two (-0.2,1)”). “Linking Method” refers to the three methods to link the Year One and Year Two scores: 1) Concurrent calibration (“Concurrent”), 2) Fixed Common Item Parameter method (“FCIP”) and 3) Stocking & Lord’s test characteristic curve method (“TCC-ST”). “Drifted Item Handling” refers to the way to treat item parameter drift: 1) keeping the drifted items in both calibration and scoring (“keep/keep”), and 2) dropping the items in both calibration and scoring (“drop/drop”), and 3) dropping the items in calibration while keeping them in scoring (“drop/keep”), which was only applied with the TCC-ST method. 4.1 Drift on Discriminating Parameter a The effect of different ways to handle the drifted items were studied in four kinds of a-parameter drift: 1) three items drifting with a-parameter increasing by 0.4; 2) three items 21 drifting with a-parameter decreasing by 0.4 ; 3) eight items drifting with a-parameter increasing by 0.4 and 4) eight items drifting with a-parameter decreasing by 0.4. 4.11 Correlation between θ Estimate and True θ Table 4a.1 lists the correlation coefficients between the θ estimates and the true θs when a-parameter drifts. The correlations are high, ranging from 0.912 to 0.927. This indicates that the θ estimates have a very strong and positive association with the true θs. Compared with the no-drift baseline condition where the average correlation is 0.927, the strong and positive relationship between the θ estimates and true θ are consistent across the four conditions of a-parameter drifts. The correlations tend to drop slightly when more items are showing drift and when the drifted items are dropped from the linking and scoring. But the drop is quite small, less than 0.014. These consistently high correlations are good indications that different ways to handle the drifting items has a negligible effect on the relationship between θ estimate and true θ, regardless of the group abilities and linking methods. 22 Table 4a.1 Average Correlation Coefficients between θ Estimates and True θs when a-parameter Drifting (with SDs in Parentheses) Group Linking Difference Methods Year One (0,1) Year Two (0,1) Concurrent FCIP TCC-ST Year One (0,1) Year Two (0.2,1) Concurrent FCIP TCC-ST Year One (0,1) Year Two (-0.2,1) Concurrent FCIP TCC-ST Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop No drift 3 items a-drift +0.4 3 items a-drift -0.4 8 items a-drift +0.4 8 items a-drift -0.4 0.927(0.002) 0.927(0.002) 0.924(0.002) 0.926(0.002) 0.923(0.002) 0.927(0.002) 0.927(0.002) 0.922(0.005) 0.927(0.002) 0.923(0.002) 0.926(0.002) 0.923(0.002) 0.927(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.924(0.002) 0.926(0.002) 0.922(0.005) 0.927(0.002) 0.927(0.002) 0.924(0.002) 0.926(0.002) 0.925(0.002) 0.925(0.002) 0.922(0.002) 0.927(0.002) 0.927(0.002) 0.922(0.002) 0.926(0.002) 0.925(0.002) 0.925(0.002) 0.922(0.002) 0.927(0.002) 0.927(0.002) 0.922(0.002) 0.926(0.002) 0.926(0.002) 0.925(0.002) 0.922(0.003) 0.927(0.002) 0.927(0.002) 0.922(0.002) 0.926(0.002) 0.924(0.002) 0.925(0.002) 0.919(0.002) 0.927(0.002) 0.927(0.002) 0.919(0.002) 0.927(0.002) 0.923(0.002) 0.925(0.002) 0.919(0.002) 0.927(0.002) 0.927(0.002) 0.919(0.002) 0.927(0.002) 0.924(0.002) 0.925(0.002) 0.919(0.002) 0.927(0.002) 0.927(0.002) 0.919(0.002) 0.926(0.002) 0.924(0.002) 0.923(0.003) 0.912(0.002) 0.927(0.002) 0.927(0.002) 0.913(0.002) 0.926(0.002) 0.923(0.002) 0.923(0.003) 0.912(0.002) 0.927(0.002) 0.927(0.002) 0.913(0.002) 0.926(0.002) 0.924(0.002) 0.923(0.002) 0.912(0.002) 0.927(0.002) 0.927(0.002) 0.913(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 23 4.12 Accuracy of θ Estimates The correlation table (Table 4a.1) showed a strong relationship between θ estimates and true θs and this relationship was not affected by the types of a-drift, group difference, linking methods and way of handling drifted items. However, these differences may have effect on the accuracy of θ estimation. To further examine the accuracy of θ estimation, bias and RMSE between θ estimates and true θs were calculated. Table 4a.2 and 4a.3 give the bias and RMSE values for θ estimates when a-parameter drift is present. 4.121 Bias and RMSE in Four a-drift Situations Four situations of a-drift were examined in the study: 3 items (10%) showing an increase of 0.4 in a-parameter, 3 items (10%) showing a decrease of 0.4 in a-parameter, 8 items (25%) showing the same +0.4 increase, and 8 items (25%) showing the same -0.4 decrease. When 10% of the items were showing a 0.4 increase in a-parameter, bias for θ estimates ranged from -1.210 to 0.217. Most of the bias values were negative, indicating that θ was under-estimated except when datasets of non-equal groups were linked by concurrent calibration with all items included. The under-estimation was most obvious when concurrent calibration was applied with the drifting items dropped, the bias being -1.045, -1.210 and -0.890. The RMSE ranged from 0.155 to 1.229. RMSE was relatively smaller when concurrent calibration was used with all the linking items included. However, the largest RMSE occurred when concurrent calibration was used with the drifting items dropped. When the same drift occurred to 25% of the items, bias for θ estimates ranged from -1.084 to 0.176 and RMSE ranged from 0.154 to 1.111. bias values were negative for most of the datasets, except when TCC-ST method with all linking items included was applied with equivalent groups. 24 As with the 10%-item-drift, bias was more obvious when concurrent calibration was used with drifting items dropped and a relatively larger RMSE could be observed as well. The results were different when the drift was a 0.4 decrease in a-parameter. When 10% of the items were showing the -0.4 a-drift, bias for θ estimates ranged from -0.691 to -0.008. θ-values were underestimated for all the cases with this parameter drift. Bias values closer to zero were observed with TCC-ST method and FCIP method when non-equivalent groups with lower ability in Year Two were linked. However, when the same methods were applied in linking groups with higher ability in Year Two, the bias values had a significant range below zero. RMSE values ranged from 0.177 to 0.723 and showed a similar pattern. Linking groups with lower ability in Year Two showed smaller RMSE values; while linking groups with higher ability in Year Two showed larger RMSE values. One exception was the concurrent calibration with no dropped item, which showed a small RMSE value even when linking groups with higher ability in Year Two. When the -0.4 drift occurred to 25% of the items, a similar tendency could be observed as with the 10%-item-drift. Bias values for θ estimates ranged from -0.734 to -0.023, indicating that there was no overestimation of θs. Estimation was more accurate in linking groups when the groups in Year Two had a lower ability, with FCIP and TCC-ST methods doing a relatively better job than the Concurrent method. The estimation was less accurate in situations when groups in Year Two had a higher ability, in which case the Concurrent method with all linking items included was doing a better job than the FCIP and TCC-ST methods. RMSE values ranged from 0.160 to 0.762, showing that all the linking methods were doing better when groups of Year Two were of lower ability than when groups of Year Two were of higher ability. 25 Table 4a.2 Bias for θ Estimates when a-parameter Drifting Group Difference Year One (0, 1) Year Two (0, 1) Linking Method Concurrent FCIP TCC-ST Year One (0, 1) Year Two (0.2, 1) Concurrent FCIP TCC-ST Year One (0, 1) Concurrent Year Two (-0.2, 1) FCIP TCC-ST Drifted Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop No drift -0.227 -0.337 -0.337 -0.208 -0.668 -0.668 -0.130 -0.013 -0.036 26 3 items a-drift +0.4 -0.011 -1.045 -0.354 -0.330 -0.334 -0.334 -0.333 0.217 -1.210 -0.684 -0.658 -0.663 -0.662 -0.659 0.103 -0.890 -0.033 -0.015 -0.013 -0.012 -0.010 3 items a-drift -0.4 -0.242 -0.497 -0.353 -0.369 -0.339 -0.360 -0.357 -0.198 -0.588 -0.678 -0.691 -0.667 -0.690 -0.685 -0.143 -0.309 -0.025 -0.043 -0.008 -0.029 -0.027 8 items a-drift +0.4 -0.018 -0.954 -0.348 -0.359 -0.305 -0.367 -0.361 -0.220 -1.084 -0.689 -0.700 -0.644 -0.709 -0.701 -0.122 -0.777 -0.027 -0.040 0.018 -0.046 -0.042 8 items a-drift -0.4 -0.276 -0.361 -0.360 -0.385 -0.340 -0.390 -0.380 -0.392 -0.549 -0.699 -0.723 -0.679 -0.734 -0.719 -0.167 -0.186 -0.044 -0.068 -0.023 -0.070 -0.062 Table 4a.3 RMSE for θ Estimates when a-parameter Drifting Group Difference Linking Method Drifted Items Handling (link/score) Year One (0, 1) Concurrent keep/keep Year Two (0, 1) drop/drop FCIP keep/keep drop/drop TCC-ST keep/keep drop/keep drop/drop Year One (0, 1) Concurrent keep/keep Year Two (0.2, 1) drop/drop FCIP keep/keep drop/drop TCC-ST keep/keep drop/keep drop/drop Year One (0, 1) Concurrent keep/keep Year Two (-0.2, 1) drop/drop FCIP keep/keep drop/drop TCC-ST keep/keep drop/keep drop/drop No drift 0.275 0.397 0.398 0.257 0.698 0.698 0.205 0.203 0.202 27 3 items a-drift +0.4 0.155 1.066 0.419 0.404 0.405 0.404 0.404 0.267 1.229 0.718 0.696 0.699 0.698 0.696 0.194 0.913 0.235 0.237 0.236 0.236 0.240 3 items a-drift -0.4 0.286 0.518 0.404 0.425 0.383 0.413 0.414 0.248 0.607 0.706 0.723 0.691 0.719 0.717 0.212 0.342 0.194 0.212 0.177 0.203 0.210 8 items a-drift +0.4 0.154 0.983 0.418 0.426 0.391 0.428 0.429 0.263 1.111 0.723 0.732 0.686 0.738 0.733 0.205 0.809 0.243 0.244 0.252 0.233 0.243 8 items a-drift -0.4 0.315 0.393 0.406 0.448 0.377 0.441 0.444 0.420 0.571 0.724 0.759 0.698 0.762 0.756 0.228 0.239 0.189 0.237 0.160 0.212 0.235 4.122 Effect of Percentage of Items Showing a-parameter Drift The changes in bias and RMSE resulting from the change in the number of drifting items were small in most linking conditions. Table 4a.4 showed the change in bias and RMSE with the increase in the number of items showing a-parameter drift. When a-parameter was drifting in positive direction, the mean change was 0.027 for bias and 0.005 for RMSE. When a-parameter was drifting in negative direction, the mean change was 0.015 for bias and -0.010 for RMSE. The small change in bias and RMSE indicated that the percentage of items with a-parameter drift have little effect on the accuracy of θ estimates, if FCIP and TCC-ST methods were applied to link the groups. However, when the Concurrent method was applied, the effect of the number of drifting items varied depending on how the drifted items were handled. The effect of the number of drifting items was smaller when the drifted items were kept in the linking and the effect became larger when the drifted items were removed from the linking. This indicated that when more items were showing drift, if drifted items were removed, there were fewer items linking the two groups and the θ estimates became less accurate. The exception was when the group in Year Two was of higher ability than the group in Year One, while the drift was showing a decrease in a-parameter. When more θ values in the Year Two group were at the higher end of the θ scale, items with smaller a-parameters did not function as well as those with bigger a-parameters. In that case, θ estimates were more accurate when these drifted items with a relatively smaller a-parameter were left out. It was noticeable that there were cases when the bias change was big while the RMSE change was small, such as in the condition of Concurrent method with non-equivalent groups. This discrepancy occurred because θ was overestimated with three drifted items while it was underestimated with eight drifted items. However, the sizes of the bias were similar whether three or eight items were drifting. So the tendency still holds that the 28 increase in the number of items showing drift has little effect on the accuracy of θ estimates when the drifted items were kept in linking. 29 Table 4a.4 Change in bias and RMSE as More Items Drifting in a-parameter Group Difference Year One (0, 1) Year Two (0, 1) Linking Method Concurrent FCIP TCC-ST Year One (0, 1) Year Two (0.2, 1) Concurrent FCIP TCC-ST Year One (0, 1) Year Two (-0.2, 1) Concurrent FCIP TCC-ST Drifted Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop Mean Std Deviation a-drift +0.4 (3 items -> 8 items) a-drift -0.4 (3 items -> 8 items) bias change bias change 0.007 -0.091 -0.006 0.028 -0.030 0.034 0.029 0.436 -0.126 0.005 0.042 -0.019 0.047 0.043 0.225 -0.113 -0.006 0.025 -0.030 0.034 0.032 0.027 0.118 RMSE change 0.002 0.083 0.001 -0.022 0.014 -0.024 -0.025 0.004 0.118 -0.006 -0.036 0.013 -0.040 -0.038 -0.011 0.105 -0.009 -0.007 -0.016 0.003 -0.003 0.005 0.044 0.034 -0.135 0.006 0.016 0.001 0.031 0.023 0.194 -0.039 0.021 0.032 0.011 0.044 0.034 0.025 -0.123 0.019 0.025 0.015 0.041 0.035 0.015 0.064 RMSE change -0.029 0.126 -0.002 -0.023 0.007 -0.027 -0.030 -0.171 0.036 -0.018 -0.036 -0.007 -0.043 -0.039 -0.016 0.102 0.005 -0.024 0.016 -0.009 -0.025 -0.010 0.057 4.123 Effect of the Direction of a-parameter Drift If other conditions hold the same, the effect of the direction of the a-parameter drift varies depending on the linking methods used. Table 4a.5 lists the difference in the bias and RMSE values when a-parameter drift is positive versus when the drift is negative. The results showed that with the FCIP or TCC-ST method, the size of the differences in bias due to drift direction were between 0 and 0.041 and the size of changes in RMSE ranged from 0.001 to 0.092. This 30 indicated that when FCIP or TCC-ST method was used to link the groups, if other conditions were the same, the accuracy of θ estimates when the drift was positive was similar to the accuracy when the drift was negative. However, when Concurrent method was used, the mean differences due to drift direction were bigger, size of changes ranging from 0.044 to 0.622 for bias and ranging from 0.019 to 0.622 for RMSE. Moreover, the change was bigger when the drifted items were dropped from the linking. These results indicated that the Concurrent method was more sensitive to the direction of the drift than the FCIP and TCC-ST method, and the effect of the direction of the drift was stronger when drifted items were removed from the linking. 31 Table 4a.5 Change in bias and RMSE with a-parameter Drifting in Different Directions Group Difference Year One (0, 1) Year Two (0, 1) Linking Method Concurrent FCIP TCC-ST Year One (0, 1) Year Two (0.2, 1) Concurrent FCIP TCC-ST Year One (0, 1) Year Two (-0.2, 1) Concurrent FCIP TCC-ST Drifted Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop Mean Std Deviation a-drift 3 Items (+0.4 -> -0.4) bias change 0.258 -0.593 0.012 0.026 0.035 0.023 0.018 0.173 -0.535 0.010 0.024 0.035 0.024 0.018 0.045 -0.591 0.018 0.028 0.041 0.023 0.020 0.027 0.118 RMSE change -0.161 0.591 0.012 -0.022 0.015 -0.013 -0.016 -0.156 0.541 -0.001 -0.027 -0.012 -0.024 -0.022 -0.023 0.569 0.054 0.007 0.092 0.021 0.009 0.005 0.044 a-drift 8 Items (+0.4 -> -0.4) bias change 0.258 -0.593 0.012 0.026 0.035 0.023 0.018 0.173 -0.535 0.010 0.024 0.035 0.024 0.018 0.045 -0.591 0.018 0.028 0.041 0.023 0.020 -0.042 0.230 RMSE change -0.161 0.591 0.012 -0.022 0.015 -0.013 -0.016 -0.156 0.541 -0.001 -0.027 -0.012 -0.024 -0.022 -0.023 0.569 0.054 0.007 0.092 0.021 0.009 0.068 0.216 4.124 Effect of the Linking Method Use of the Concurrent method to link the data played an important role on the performance of θ estimation. Figure 4a.0.1 to Figure 4a.0.6 compared the accuracy of θ estimates in terms of RMSE when the Concurrent, FCIP and TCC-ST linking methods were used in different situations. From the graphs it was obvious that there was little difference between the FCIP linking and the TCC-ST linking in their performance. However, when the Concurrent method was applied in linking, it was a different story. The accuracy of θ estimation differed depending 32 on the group ability difference and the handling of drifted items. When there was no item dropped from the linking, the Concurrent method did a better job in θ estimation than the FCIP and TCC-ST methods when the mean ability of the Year Two group was equal or higher than that of the Year One group, and it did almost equally well as the other two methods when the mean ability of the Year Two group was lower than that of the Year One group. However, when the drifted items were dropped from the linking, the Concurrent method could not keep up its advantage. In half of the cases, it managed to do an equally satisfactory job as the other two methods, but it did much worse when the items dropped were those with an increased a-parameter. This suggested that the Concurrent method was sensitive to the discriminating power of its linking items. With the Concurrent method, item and ability parameters were estimated simultaneously from one single computer run, and items with higher a-parameters had more influence in the accuracy of θ estimation. 33 Figure 4a.0.1 RMSE Comparison of Linking Methods (Mean Year 1 = Mean Year 2; All Items Included) 0.45 0.40 0.35 0.30 0.25 0.20 0.15 0.10 0.05 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Concurrent FCIP TCC-ST *For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this dissertation. Figure 4a.0.2 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; All Items Included) RMSE 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 Drift Concurrent FCIP 34 TCC-ST I8_a-0.4 Figure 4a.0.3 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; All Items Included) 0.30 0.25 RMSE 0.20 0.15 0.10 0.05 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Concurrent FCIP TCC-ST Figure 4a.0.4 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; Drifted Items Dropped) 1.20 1.00 RMSE 0.80 0.60 0.40 0.20 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 Drift Concurrent FCIP 35 TCC-ST I8_a-0.4 Figure 4a.0.5 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; Drifted Items Dropped) 1.40 1.20 RMSE 1.00 0.80 0.60 0.40 0.20 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Concurrent FCIP TCC-ST Figure 4a.0.6 RMSE Comparison of Linking Methods (Mean Year 1 > Mean Year 2; Drifted Items Dropped) 1.00 0.90 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 Drift Concurrent FCIP 36 TCC-ST I8_a-0.4 4.125 Effect of Group Difference The difference in the ability of the groups taking the two test forms influenced the accuracy of θ estimates. The effect of group difference was displayed in Figures 4a.1.1 to 4a.1.7. Compared with the performance when the two groups of equal abilities were linked, the θ estimation tended to be more accurate when the Year One group was of a higher ability than the Year Two group and the estimation tended to be less accurate when the Year One group was of a lower ability than the Year Two group. This tendency was particularly true with the FCIP and TCC-ST methods examined in this study, where the scores from Year One test were put on the Year Two test scale through linking. This might be explained by the guessing parameter. In the IRT model used in this study, the c-parameter was set at 0.2. The group of lower ability might have a lower guessing parameter than the group of higher ability and if put on the scale of the higher-ability group, the estimation might be less accurate. This tendency was observed in some cases with the Concurrent method when the drifted items were dropped from the linking. However, the tendency did not hold in the cases when the Concurrent method was used with all items included. In the Concurrent method, the two groups were combined as one group and placed on a combined scale after the calibration, and the influence of the guessing parameter was mixed with the effect of a-parameter. 37 Figure 4a.1.1 RMSE Effect of Group Difference (Concurrent; All Items Included) 0.45 0.40 0.35 0.30 0.25 0.20 0.15 0.10 0.05 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Year1=Year2 Year1Year2 Figure 4a.1.2 Effect of Group Difference (FCIP; All Items Included) 0.80 RMSE 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Year1=Year2 Year1Year2 Figure 4a.1.3 Effect of Group Difference (TCC-ST; All Items Included) 0.80 RMSE 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Year1=Year2 Year1Year2 Figure 4a.1.4 Effect of Group Difference (Concurrent; Drifted Items Dropped) 1.40 1.20 RMSE 1.00 0.80 0.60 0.40 0.20 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Year1=Year2 Year1Year2 Figure 4a.1.5 Effect of Group Difference (FCIP; Drifted Items Dropped) 0.80 RMSE 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Year1=Year2 Year1Year2 Figure 4a.1.6 RMSE Effect of Group Difference (TCC-ST; Drifted Items Dropped in Linking but Kept in Scoring) 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Year1=Year2 Year1Year2 Figure 4a.1.7 Effect of Group Difference (TCC-ST; Drifted Items Dropped) RMSE 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Year1=Year2 Year1Year2 4.126 Effect of Drifted Item Handling The way to handle the drifting items could affect the accuracy of θ estimation when the Concurrent linking method was used. However, the choice of removing the drifted items had little effect on the accuracy of θ estimation when FCIP and TCC-ST linking methods were applied. The effect of the way drifted items were handled was shown in Figure 4a.2.1 to Figure 4a.2.9. When the FCIP linking method was used, there was very little difference in terms of RMSE whether the drifted items were kept in or dropped from the linking. In the condition where 8 items were a negative drift of 0.4 on the a-parameter, the RMSE was a little bit higher when the drifted items were dropped, but the difference was negligible (smaller than 0.05). When the TCC-ST linking method was used, there were three ways to handle the drifted items: keeping all the items, dropping the drifted items in the calibration of the linking coefficients but keeping 41 these items in the scoring, or dropping the drifted items from both the calibration and the scoring process. The results indicated that keeping the drifted items in both linking and scoring did equally well as, in some cases even better than, the other two ways of handling the drifted items in θ estimation. Dropping the drifted items from both linking and scoring sometimes resulted in a higher RMSE in θ estimation than the other two ways, but the difference was not very large. For example, when the mean ability of the Year One group was greater than that of the Year Two group and a-parameter had a negative drift with eight items, the RMSE while dropping these eight items from both linking and scoring was about 0.07 bigger than the RMSE while all the items were kept in both linking and scoring. However, when the Concurrent linking method was used, the way of handling the drifted items did make a difference, and in some cases, the difference was too large to be ignored. In all the conditions with the Concurrent method, dropping the drifted items resulted in less accurate θ estimates. The extreme cases were when the a-parameter had a positive drift. In these conditions whether the drifted items were dropped from linking could mean a big difference in RMSE, a size as big as over 1 θ unit. Dropping the drifted items greatly reduced the accuracy of θ estimation. This might be explained by the power of the a-parameter. A positive drift in a-parameter means the increased discriminating power. More items with higher discriminating power contributed to a better estimation of ability. However, if this a-parameter drift was a false increase, the accuracy of the estimation of ability would be jeopardized. Items with the increasing a-parameter were to be handled with caution. It was necessary to investigate the items to decide whether the drift was the result of true increase in the ability. 42 Figure 4a.2.1 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 = Mean Year 2) 1.20 1.00 RMSE 0.80 0.60 0.40 0.20 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Keep/Keep Drop/Drop Figure 4a.2.2 Comparison of Handling Drifted Items (FCIP; Mean Year 1 = Mean Year 2) 0.50 RMSE 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Keep/Keep Drop/Drop 43 Figure 4a.2.3 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 = Mean Year 2) 0.50 RMSE 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Keep/Keep Drop/Keep Drop/Drop Figure 4a.2.4 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 < Mean Year 2) 1.40 1.20 RMSE 1.00 0.80 0.60 0.40 0.20 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Keep/Keep Drop/Drop 44 Figure 4a.2.5 Comparison of Handling Drifted Items (FCIP; Mean Year 1 < Mean Year 2) 0.80 RMSE 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Keep/Keep Drop/Drop Figure 4a.2.6 RMSE Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 < Mean Year 2) 0.90 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Keep/Keep Drop/Keep 45 Drop/Drop Figure 4a.2.7 RMSE Comparison of Handling Drifted Items (Concurrent; Mean Year 1 > Mean Year 2) 1.00 0.90 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Keep/Keep Drop/Drop Figure 4a.2.8 Comparison of Handling Drifted Items (FCIP; Mean Year 1 > Mean Year 2) 0.30 0.25 RMSE 0.20 0.15 0.10 0.05 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Keep/Keep Drop/Drop 46 Figure 4a.2.9 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 > Mean Year 2) 0.30 0.25 RMSE 0.20 0.15 0.10 0.05 0.00 No_Drift I3_a+0.4 I3_a-0.4 I8_a+0.4 I8_a-0.4 Drift Keep/Keep Drop/Keep Drop/Drop 4.127 Effect of Drifted Items Handling at Different θ Levels To further examine the effect of the handling of drifting items, mean differences between θ estimates and true θs were calculated and plotted for different θ intervals. The θ scale was divided into equally spaced intervals of 0.25, ranging from -4.00 to 4.00. Mean differences between θ estimates and true θs for examinees whose true θs fall in the corresponding interval were calculated. The scatter plots between the θ mean differences and the interval of true θs are displayed in Figures 4a.3.1 to 4a.3.4, for a-drift conditions 1 to 4, respectively. As shown in Figure 4a.3.1, when there were 3 items showing a positive a-parameter drift, there was almost no difference whether these drifted items were removed from the linking or not, when the FCIP or TCC-ST method was used. However, when the Concurrent method was used, 47 it mattered whether the drifted items were dropped from the linking. While θs were extremely small (less than -3.0), bias was smaller when drifted items were dropped. As the θs got larger, the bias became smaller while all items were included in the linking, but the bias became larger while all items were removed from the linking. When the θs were in the interval between -3.0 and -2.0, the method with drifted items and the method without these items produced similar bias. When the θ values were bigger than -2.0, the method without the drifted items yielded much larger bias than the method with all the items. The difference between the two methods reached its peak when the θ values were around -1.0. Within most θ intervals, θ estimates were more accurate when all the items were included in the linking. Figure 4a.3.1 Mean bias at θ Intervals (3 Items a-drift +0.4) Mean BIAS at θ Intervals (3 Items a -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 48 concurrent_drop Figure 4a.3.1 (cont’d) Mean BIAS at θ Intervals (3 Items a -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (3 Items a -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 49 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4a.3.1 (cont’d) Mean BIAS at θ Intervals (3 Items a -drift +0.4; Year 1 < Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items a -drift +0.4; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 50 FCIP_drop 4.0 Figure 4a.3.1 (cont’d) Mean BIAS at θ Intervals (3 Items a -drift +0.4; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (3 Items a -drift +0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 51 concurrent_drop Figure 4a.3.1 (cont’d) Mean BIAS at θ Intervals (3 Items a -drift +0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (3 Items a -drift +0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 52 TCC-ST_keep TCC-ST_drop/drop 4.0 From Figure 4a.3.2, it can be observed that when three items had a negative a-parameter drift, the effect of dropping drifted items still existed as in positive a-parameter drift, though the effect was relatively smaller. When the a-parameter was drifting either in an increasing or decreasing manner, it did not affect the accuracy of θ estimation for the FCIP or TCC-ST method whether these drifted items were removed from the linking. The accuracy of θ estimation was affected only when the Concurrent method was used. When a-parameter was drifting in a decreasing manner, for θs within the -2.0 to 2.0 range, the Concurrent method yielded more accurate θ estimates if the drifted items were kept in the linking than if these items were dropped. However for θs at the lower end of the θ scale (smaller than -2.0), estimation was more accurate if the drifted items were dropped from the linking. Moreover, for θs at the higher end of the θ scale (above 2.0), it made no difference on the accuracy of θ estimation whether the drifted items were removed from the linking or not. 53 Figure 4a.3.2 Mean bias at θ Intervals (3 Items a-drift -0.4) Mean BIAS at θ Intervals (3 Items a -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items a -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 54 FCIP_drop 4.0 Figure 4a.3.2 (cont’d) Mean BIAS at θ Intervals (3 Items a -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (3 Items a -drift -0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 55 concurrent_drop Figure 4a.3.2 (cont’d) Mean BIAS at θ Intervals (3 Items a -drift -0.4; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (3 Items a -drift -0.4; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 56 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4a.3.2 (cont’d) Mean BIAS at θ Intervals (3 Items a -drift -0.4; Year 1 > Year 2) 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items a -drift -0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 57 FCIP_drop 4.0 Figure 4a.3.2 (cont’d) Mean BIAS at θ Intervals (3 Items a -drift -0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop As shown in Figure 4a.3.3, when eight items were having a positive a-parameter drift, it showed a similar pattern of bias as observed when only three items were having such a drift. The estimation of θ was less accurate at the extreme ends of the θ scale, no matter what kind of linking method was used. For the FCIP and TCC-ST methods, the way to handle the drifted items had no effect on the accuracy of θ estimation. For the Concurrent method, θ estimation was less accurate when the drifted items were dropped from the linking, except at the lower end of the θ scale. 58 Figure 4a.3.3 Mean bias at θ Intervals (8 Items a-drift +0.4) Mean BIAS at θ Intervals (8 Items a -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items a -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 59 FCIP_drop 4.0 Figure 4a.3.3 (cont’d) Mean BIAS at θ Intervals (8 Items a -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (8 Items a -drift +0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 60 concurrent_drop Figure 4a.3.3 (cont’d) Mean BIAS at θ Intervals (8 Items a -drift +0.4; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (8 Items a -drift +0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 61 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4a.3.3 (cont’d) Mean BIAS at θ Intervals (8 Items a -drift +0.4; Year 1 > Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items a -drift +0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 62 FCIP_drop 4.0 Figure 4a.3.3 (cont’d) Mean BIAS at θ Intervals (8 Items a -drift +0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop When more items were showing negative a-parameter drift, as shown in Figure 4a.3.4, the effect of dropping drifted items with the Concurrent method was reduced. The difference in bias as a result of handling the drifted items differently was small. The difference became even smaller as the θs became bigger. When the θs were larger than zero on the -4 to 4 scale, there was virtually little difference in terms of bias if the drifted items were dropped from the linking or not. This indicated that items with small discriminating power were not effective in estimating relatively larger θs. Therefore, if the a-parameter became small as a result of the parameter drift, it would make little difference whether to keep these items with smaller a-parameters in the linking or not. 63 Figure 4a.3.4 Mean bias at θ Intervals (8 Items a-drift -0.4) Mean BIAS at θ Intervals (8 Items a -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items a -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 64 FCIP_drop 4.0 Figure 4a.3.4 (cont’d) Mean BIAS at θ Intervals (8 Items a -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (8 Items a -drift -0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 65 concurrent_drop Figure 4a.3.4 (cont’d) Mean BIAS at θ Intervals (8 Items a -drift -0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (8 Items a -drift -0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 66 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4a.3.4 (cont’d) Mean BIAS at θ Intervals (8 Items a -drift -0.4; Year 1 > Year 2) 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items a -drift -0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 67 FCIP_drop 4.0 Figure 4a.3.4 (cont’d) Mean BIAS at θ Intervals (8 Items a -drift -0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop 4.13 Accuracy of Performance Level Classification In the assessment simulated in this study, examinees were classified into four performance levels after they had been linked onto the same scale. Following the guidelines used in the actual assessment, a θ cut score was set for each performance level. The level classified using the estimated θ was considered as the examinee’s estimated level while the one classified according to the examinee’s true θ was considered his/her true level. The percentage of examinees classified into the appropriate performance levels was a good indication of the quality of estimation. Moreover, when using this as the index of accuracy, special attention was paid to the classification of Performance Level 3, which was the provincial standard of pass/fail. To examine the effect of how the drifting items were treated, the proportion of correct classification in each 68 level was calculated for different situations. The percentages were listed in Table 4a.6 to Table 4a.9 for the four a-parameter drifting conditions, respectively. The accuracy of the performance level classification was summarized in Table 4a.6 for the condition when three items were showing a positive drift of size 0.4 in a-parameter. When the two groups to be linked were of the same ability, the Concurrent method worked best when the drifting items were included in the linking. Overall, 92.3 % of the examinees were classified into the correct category. This is much higher than the 71.2% and the 72.6% achieved by the FCIP and TCC-ST methods. The performance of the Concurrent method was more outstanding in classifying examinees at Level 3. 92.3 % of the Level 3 examinees were correctly classified, which was a much larger number than the 54.2% and 55.5% acquired with the FCIP and TCC-ST methods. However, if the drifting items were dropped from the linking, the Concurrent method performed much worse than the FCIP and TCC-ST methods, with a large percent (67.6% overall and 71.3% at Level 3) of the examinees classified one level lower. When the Year Two group had a higher ability than the Year One group, if the drifting items were kept in the linking, the Concurrent method did much better than the FCIP and TCC-ST methods. The percentage of correct classification was much higher (82.2% overall and 94.3% at Level 3) than the FCIP method (46.1% overall and 25.9% at Level 3) and the TCC-ST method (47.5% overall and 26.5% at Level 3). However, if the drifted items were removed from the linking process, the Concurrent method was doing a poor job. The percentage of correct classification was lower (25.9% overall and 19.5% at Level 3) than the FCIP method (48.0% overall and 27.1% at Level 3) and the TCC-ST method (47.6% overall and 26.8% at Level 3). When the Year Two group had a lower ability than the Year One group, if the drifting items were kept in the linking, the three linking methods did equally well in classifying the examinees 69 into the right level. The percentages of correct classification with these three linking methods were high (more than 87% overall and more than 95% at Level 3). However, when the drifting items were removed from the linking, the Concurrent method did a poor job, much worse than the FCIP and TCC-ST methods. The percentage of correct classification was only 39% overall and 32.7% at Level 3, much lower than the corresponding percentages with the FCIP method (88.3% and 96.4%) and the TCC-ST method (88% and 96.1%). When the a-parameter drift was a positive one, the Concurrent method did a much better job when the drifting items were kept in the linking process. Items with a greater discriminating power helped improve the accuracy of θ estimates and the Concurrent method was sensitive to the effect of the discriminating parameter. 70 Table 4a.6 Percentage in Each Performance Level Classification (N=3, Drift=a+0.4) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 91.0 2 2.9 93.1 Concurrent 3 7.1 92.3 4 24.2 75.8 Overall 5.1 91.5 1 0.0 99.4 Year One 2 17.6 82.4 (0,1) FCIP 3 45.8 54.2 Year Two 4 84.8 15.2 (0,1) Overall 28.7 71.2 1 0.0 99.0 2 15.4 84.6 TCC-ST 3 44.5 55.5 4 83.0 17.0 Overall 27.3 72.6 Group Linking Method 71 1 9.0 3.9 0.6 0.0 3.4 0.6 0.0 0.0 0.0 0.1 1.0 0.0 0.0 0.0 0.2 Items Dropped Est.Level - True Level -2 -1 0 0.0 0.0 100.0 0.0 91.9 8.1 2.7 71.3 26.0 0.0 81.8 18.2 1.0 67.6 31.4 0.0 99.0 15.1 84.9 44.1 55.9 83.6 16.4 27.0 72.8 0.0 99.0 15.6 84.4 44.4 55.6 82.4 17.6 27.3 72.6 1 1.0 0.0 0.0 0.0 0.2 1.0 0.0 0.0 0.0 0.2 Table 4a.6 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 62.9 2 0.0 77.3 Concurrent 3 0.2 94.3 4 2.4 97.6 Overall 0.2 82.2 1 0.0 100.0 Year One 2 52.6 47.4 (0,1) FCIP 3 74.1 25.9 Year Two 4 97.6 2.4 (0.2,1) Overall 53.9 46.1 1 0.0 100.0 2 49.9 50.1 TCC-ST 3 73.5 26.5 4 96.4 3.6 Overall 52.5 47.5 Group Linking Method 72 1 37.1 22.7 5.6 0.0 17.6 Items Dropped Est.Level - True Level -2 -1 0.0 0.0 0.0 97.9 10.4 70.4 0.0 91.5 3.9 70.2 0.0 49.2 72.9 97.0 52.0 0.0 49.9 73.2 96.4 52.4 0 1 100.0 2.1 19.2 8.5 25.9 100.0 50.8 27.1 3.0 48.0 100.0 50.1 26.8 3.6 47.6 Table 4a.6 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 74.7 2 0.3 89.4 Concurrent 3 1.6 96.7 4 12.1 87.9 Overall 1.4 89.4 1 0.0 77.1 Year One 2 0.3 93.8 (0,1) FCIP 3 4.6 95.4 Year Two 4 59.4 40.6 (-0.2,1) Overall 5.1 88.6 1 0.0 73.5 2 0.1 93.0 TCC-ST 3 3.9 96.1 4 57.0 43.0 Overall 4.6 88.0 Group Linking Method 73 1 25.3 10.4 1.7 0.0 9.2 22.9 5.9 0.0 0.0 6.3 26.5 6.9 0.0 0.0 7.4 Items Dropped Est.Level - True Level -2 -1 0 0.0 0.0 100.0 0.0 80.0 20.0 0.3 67.0 32.7 0.0 74.5 25.5 0.1 60.9 39.0 0.0 74.7 0.3 93.5 3.6 96.4 60.0 40.0 4.8 88.3 0.0 73.5 0.1 93.0 3.9 96.1 57.0 43.0 4.6 88.0 1 25.3 6.2 0.0 0.0 6.9 26.5 7.0 0.0 0.0 7.4 When three items were showing a negative drift in a-parameter, the Concurrent method performed as well as or even better than the FCIP and the TCC-ST methods did. Its performance is demonstrated in Table 4a.7. When the Year One group and the Year Two group were of equal ability, the Concurrent method was doing slightly better than the other two methods in classifying the examinees into the right categories, if the drifting items were included in the linking. The percentages of correct classifications were 79.9% (overall) and 71.5% (at Level 3) for the Concurrent method, while the corresponding percentages were 71.5% and 55.9% for the FCIP method, and 73.2% and 59.2% for the TCC-ST method. If the drifting items were removed from the linking, the Concurrent method was doing as well as the FCIP and TCC-ST methods. The percentages of correct classifications were 62% (overall) and 52.2% (at Level 3) for the Concurrent method, while the corresponding percentages were 70% and 52.2% for the FCIP method, and 71.4% and 55% for the TCC-ST method. When the Year Two group had a higher ability than the Year One group, the Concurrent method was doing much better than the FCIP and TCC-ST methods, if the drifting items were kept in the linking. The Concurrent method had much higher percentages of correct classifications (83.1 % overall and 76.1% at Level 3) than the other two methods (46.8% overall and 28% at Level 3 for the FCIP method; 48% overall and 30.7% at Level 3 for the TCC-ST method). When the drifting items were dropped from the linking, the performance of the Concurrent method became worse, but still slightly better than the other two methods. The percentages of correct classification were 56% overall and 46.2% at Level 3 for the Concurrent method, while the corresponding numbers were 45.8% and 26.3% for the FCIP method and 46% and 26.8% for the TCC-ST method. When the Year Two group had a lower ability than the Year One group, the three linking 74 methods were doing equally well in the level classification, if the drifting items were included in the linking. The overall percentages of correct classification were around 90% for any of the linking method. The FCIP and TCC-ST methods were doing even better at the classification of Level 3 (95.8% and 96.7% versus 81%). When the drifting items were removed from the linking, the FCIP and TCC-ST methods were doing better than the Concurrent method, especially in the classification at Level 3. The percentages of correct classifications were 76%, 89.6% and 90% overall and 68%, 93.9% and 95.3% at Level 3, for the Concurrent, the FCIP and the TCC-ST method respectively. It seemed that the more items were included in the linking, the better the Concurrent method could perform. 75 Table 4a.7 Percentage in Each Performance Level Classification (N=3, Drift=a-0.4) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 99.4 2 16.5 83.5 Concurrent 3 28.5 71.5 4 50.9 49.1 Overall 20.0 79.9 1 0.0 99.8 Year One 2 19.7 80.3 (0,1) FCIP 3 44.1 55.9 Year Two 4 77.0 23.0 (0,1) Overall 28.5 71.5 1 0.0 100.0 2 19.2 80.8 TCC-ST 3 40.8 59.2 4 73.3 26.7 Overall 26.8 73.2 Group Linking Method 76 1 0.6 0.0 0.0 0.0 0.1 0.2 0.0 0.0 0.0 0.0 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 42.5 57.5 47.8 52.2 60.0 40.0 38.0 62.0 0.0 99.8 19.4 80.6 47.8 52.2 80.6 19.4 29.9 70.0 0.0 100.0 18.9 81.1 45.0 55.0 79.4 20.6 28.6 71.4 1 0.2 0.0 0.0 0.0 0.0 Table 4a.7 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 98.7 2 13.3 86.6 Concurrent 3 23.8 76.1 4 43.6 56.4 Overall 16.6 83.1 1 0.0 100.0 Year One 2 53.4 46.6 (0,1) FCIP 3 72.0 28.0 Year Two 4 94.5 5.5 (0.2,1) Overall 53.2 46.8 1 0.0 100.0 2 53.1 46.9 TCC-ST 3 69.3 30.7 4 92.7 7.3 Overall 52.0 48.0 Group Linking Method 77 1 1.3 0.1 0.1 0.0 0.3 Items Dropped Est.Level - True Level -2 -1 0.0 51.1 53.8 66.7 44.0 0.0 54.1 73.7 95.8 54.2 0.0 54.1 73.2 95.8 54.0 0 100.0 48.9 46.2 33.3 56.0 100.0 45.9 26.3 4.2 45.8 100.0 45.9 26.8 4.2 46.0 1 Table 4a.7 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 97.3 2 9.2 90.5 Concurrent 3 18.9 81.0 4 40.6 59.4 Overall 12.9 86.5 1 0.0 83.8 Year One 2 1.3 92.3 (0,1) FCIP 3 4.1 95.8 Year Two 4 42.4 57.6 (-0.2,1) Overall 4.4 90.2 1 0.0 85.3 2 1.3 91.1 TCC-ST 3 3.1 96.7 4 33.3 66.7 Overall 3.5 90.8 Group Linking Method 78 1 2.7 0.3 0.1 0.0 0.6 16.2 6.4 0.1 0.0 5.4 14.7 7.5 0.2 0.0 5.6 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 25.0 75.0 31.9 68.0 39.4 60.6 24.0 76.0 0.0 83.2 1.3 93.6 6.1 93.9 49.1 50.9 5.5 89.6 0.0 83.0 1.0 92.9 4.7 95.3 44.2 55.8 4.6 90.0 1 0.0 0.0 0.1 0.0 0.0 16.8 5.0 0.0 0.0 4.9 17.0 6.1 0.0 0.0 5.4 Examination of Table 4a.8 suggested that when more items were showing a positive 0.4 drift in the a-parameter, the Concurrent method did a much better job when all the drifting items were included in the linking. When the two groups of equivalent ability were linked, the Concurrent method did a better job than the other two linking methods. The better performance of the Concurrent method could be observed when the percentages of correct classifications were compared – 91.5% overall and 92% at Level 3 for the Concurrent method, versus 71.7% and 54.7% for the FCIP method and 74.7% and 57.1% for the TCC-ST method. However, when the drifting items were removed from the linking, the performance of the Concurrent method took a sharp downturn. It did a much worse job than the other two methods. The percentages of correct classification were only 36.5% overall and 31.8% at Level 3, while the corresponding numbers were 70.7% and 52.9% for the FCIP method and 70.2% and 52% for the TCC-ST method. The similar tendency was observed when the Year Two group had a higher ability than the Year One group. When the drifting items were included in the linking, the Concurrent method did a much better job than the other two methods. The percentages of correct classification were 81.7% overall and 75.2% at Level 3 for the Concurrent method, while the corresponding percentages were 46.5% and 25.6 % for the FCIP method, and 48.8% and 26.5% for the TCC-ST method. However, when the drifting items were dropped from the linking, the Concurrent method did worse than the other two methods. The percentages of correct classification dropped to 31.1% overall and 25.9% at Level 3 for the Concurrent method, while the corresponding numbers were 45% and 25.9 % for the FCIP method, and 44.4% and 25.4% for the TCC-ST method. When the Year Two group had a lower ability than the Year One group, with the drifting items in the linking process, the Concurrent method performed as well as the other two methods. 79 The overall percentage was around 90% and percentage at Level 3 was bigger than 80%. However, when the drifting items were taken out of the linking, the performance of the Concurrent method worsened. The percentages of correct classifications dropped to 46.8% overall and 41.6% at Level 3. On the contrary, the other two methods kept on doing well in the classification. The corresponding percentages were 88.4% and 94.6% for the FCIP method, and 89.1% and 94% for the TCC-ST method. 80 Table 4a.8 Percentage in Each Performance Level Classification (N=8, Drift=a+0.4) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 92.0 2 3.5 92.9 Concurrent 3 7.3 92.0 4 24.2 75.8 Overall 5.5 91.5 1 0.0 99.4 Year One 2 16.5 83.5 (0,1) FCIP 3 45.3 54.7 Year Two 4 87.9 12.1 (0,1) Overall 28.2 71.7 1 0.0 98.3 2 11.1 88.9 TCC-ST 3 42.9 57.1 4 84.2 15.8 Overall 25.0 74.7 Group Linking Method 81 1 8.0 3.6 0.6 0.0 3.1 0.6 0.0 0.0 0.0 0.1 1.7 0.0 0.0 0.0 0.3 Items Dropped Est.Level - True Level -2 -1 0 0.0 0.0 100.0 0.0 85.8 14.2 0.6 67.6 31.8 0.0 72.7 27.3 0.2 63.3 36.5 0.0 100.0 17.6 82.4 47.1 52.9 86.7 13.3 29.3 70.7 0.0 99.8 18.5 81.5 48.0 52.0 83.0 17.0 29.8 70.2 1 0.2 0.0 0.0 0.0 0.0 Table 4a.8 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 99.0 2 16.2 83.7 Concurrent 3 24.7 75.2 4 44.2 55.8 Overall 18.1 81.7 1 0.0 100.0 Year One 2 51.3 48.7 (0,1) FCIP 3 74.4 25.6 Year Two 4 98.2 1.8 (0.2,1) Overall 53.5 46.5 1 0.0 100.0 2 46.5 53.5 TCC-ST 3 73.5 26.5 4 97.0 3.0 Overall 51.2 48.8 Group Linking Method 82 1 1.0 0.1 0.1 0.0 0.2 Items Dropped Est.Level - True Level -2 -1 0.0 0.0 0.0 93.0 3.8 70.3 0.0 78.8 1.4 67.5 0.0 55.3 74.1 98.2 55.0 0.0 56.6 74.6 96.4 55.6 0 100.0 7.0 25.9 21.2 31.1 100.0 44.7 25.9 1.8 45.0 100.0 43.4 25.4 3.6 44.4 1 Table 4a.8 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 96.2 2 7.3 92.2 Concurrent 3 16.7 83.2 4 43.0 57.0 Overall 11.5 87.6 1 0.0 75.0 Year One 2 0.3 93.2 (0,1) FCIP 3 4.0 96.0 Year Two 4 66.1 33.9 (-0.2,1) Overall 5.3 87.8 1 0.0 64.8 2 0.0 91.7 TCC-ST 3 2.2 97.8 4 57.6 42.4 Overall 4.0 86.6 Group Linking Method 83 1 3.8 0.5 0.1 0.0 0.9 25.0 6.4 0.0 0.0 6.9 35.2 8.3 0.0 0.0 9.5 Items Dropped Est.Level - True Level -2 -1 0 0.0 0.0 100.0 0.0 70.9 29.1 0.1 58.3 41.6 0.0 58.8 41.2 0.0 53.2 46.8 0.0 77.7 0.5 94.4 5.4 94.6 62.4 37.6 5.6 88.4 0.0 78.9 0.3 95.4 6.0 94.0 57.6 42.4 5.5 89.1 1 22.3 5.1 0.0 0.0 5.9 21.1 4.3 0.0 0.0 5.4 Examination of Table 4a.9 indicated that when there were eight items showing a negative drift of 0.4 in the a-parameter, the performance of the Concurrent method was not significantly influenced by the way the drifting items were handled. When the mean ability of the Year Two group were equivalent to or lower than that of the Year One group, the three methods were performing equally well. The overall percentages of correct classifications were around 70% when the groups were equal in ability and around 90% when the ability of the Year Two group was lower. When the Year Two group had a higher ability than the Year One group, the Concurrent method was doing better than the FCIP and TCC-ST methods. The overall percentage of correct classification was 69.5% for the Concurrent method, while it was 45% and 47.7% for the FCIP and TCC-ST methods. The percentage at Level 3 was 59.2% for the Concurrent method, while it was 27.4% and 32.4% for the other two methods. Moreover, when the drifting items were dropped from the linking process, the resulting percentages of correct classification declined slightly, but the decreases in the percentages were not significant. Furthermore, the pattern of the performances observed when the drifting items were kept in the linking remained the same when the items were removed from the linking. The Concurrent method did a better job than the other two methods when the Year Two group had a higher ability than the Year One group. Otherwise, the three methods were doing equally well in classifying the examinees into the right performance levels. 84 Table 4a.9 Percentage in Each Performance Level Classification (N=8, Drift=a-0.4) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 99.8 2 18.2 81.8 Concurrent 3 31.7 68.3 4 55.2 44.8 Overall 22.1 77.9 1 0.0 100.0 Year One 2 20.8 79.2 (0,1) FCIP 3 43.4 56.6 Year Two 4 76.4 23.6 (0,1) Overall 28.6 71.4 1 0.0 100.0 2 21.6 78.4 TCC-ST 3 38.7 61.3 4 66.7 33.3 Overall 26.7 73.3 Group Linking Method 85 1 0.2 0.0 0.0 0.0 0.0 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 32.7 67.3 33.9 66.1 41.2 58.8 27.9 72.1 0.0 99.6 21.3 78.7 49.7 50.3 85.5 14.5 31.7 68.3 0.0 100.0 21.4 78.6 48.8 51.2 81.8 18.2 31.2 68.8 1 0.4 0.0 0.0 0.0 0.1 Table 4a.9 (cont’d) Items Kept True Level Est.Level-2 1 2 Concurrent 3 4 Overall 1 Year One 2 (0,1) FCIP 3 Year Two 4 (0.2,1) Overall 1 2 TCC-ST 3 4 Overall Group Linking Method True Level -1 0.0 29.9 40.8 62.4 30.5 0.0 57.3 72.6 93.9 55.0 0.0 55.9 67.6 89.7 52.3 86 0 100.0 70.1 59.2 37.6 69.5 100.0 42.7 27.4 6.1 45.0 100.0 44.1 32.4 10.3 47.7 Items Dropped Est.Level - True Level 1 -2 -1 0.0 48.2 49.7 63.0 41.2 0.0 58.6 75.5 97.6 56.8 0.0 59.5 75.5 96.4 57.1 0 100.0 51.8 50.3 37.0 58.8 100.0 41.4 24.5 2.4 43.2 100.0 40.5 24.5 3.6 42.9 1 Table 4a.9 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 98.1 2 10.4 89.4 Concurrent 3 21.2 78.7 4 41.2 58.8 Overall 14.3 85.2 1 0.0 88.2 Year One 2 2.4 92.5 (0,1) FCIP 3 5.0 94.9 Year Two 4 41.2 58.8 (-0.2,1) Overall 5.1 90.8 1 0.0 91.4 2 3.0 89.9 TCC-ST 3 3.9 95.5 4 26.1 73.9 Overall 4.1 91.4 Group Linking Method 87 1 1.9 0.3 0.1 0.0 0.5 11.8 5.1 0.1 0.0 4.1 8.6 7.0 0.5 0.0 4.5 Items Dropped Est.Level - True Level -2 -1 0 0.0 99.4 17.5 82.1 19.1 80.2 23.6 76.4 15.4 84.1 0.0 84.4 2.3 93.7 7.7 92.3 58.8 41.2 7.0 88.7 0.0 88.4 2.1 94.0 7.7 92.3 49.7 50.3 6.4 90.0 1 0.6 0.4 0.7 0.0 0.5 15.6 4.0 0.0 0.0 4.3 11.6 3.9 0.0 0.0 3.6 On the whole, when the a-parameter drift was positive, it mattered whether the drifting items were removed from the linking, as far as the Concurrent method was concerned. Moreover, when the inclusion of the drifting items had an effect on the performance of the Concurrent method, the drifting items were helpful in improving the performance of the Concurrent method. Therefore, in those cases, the Concurrent method was more likely to do a better job than the FCIP and TCC-ST methods. 4.2 Drift on Difficulty Parameter b Eight kinds of drift on difficulty parameter b were examined to explore how different ways of handling the drifted items might influence the accuracy of θ estimates: 1) three items drifting with b-parameter increasing by 0.2; 2) three items drifting with b-parameter decreasing by 0.2 ; 3) three items drifting with b-parameter increasing by 0.4 and 4) three items drifting with b-parameter decreasing by 0.4; 5) eight items drifting with b-parameter increasing by 0.2; 6) eight items drifting with b-parameter decreasing by 0.2 ; 7) eight items drifting with b-parameter increasing by 0.4 and 8) eight items drifting with b-parameter decreasing by 0.4. 4.21 Correlation between θ Estimate and True θ Table 4b.1 lists the correlation coefficients between the θ estimates and the true θs when b-parameter drifts. Similar to what have been observed with a-parameter drifts, the correlations range from 0.914 to 0.927. The high correlations indicate that the θ estimates have a very strong and positive association with the true θs. In the no-drift baseline condition the correlations have an average of 0.927, and this strong and positive relationship between the θ estimates and true θs is consistent across the eight conditions of b-parameter drifts. Compared with the conditions when drifted items are kept, in conditions when drifted items are dropped the correlation coefficients show a tendency to drop slightly, but this kind of drop is insignificant, less than 88 0.013. This indicates that the strong and positive relationship between θ estimate and true θ still holds regardless of the different ways to handle the drifting. 89 Table 4b.1 Average Correlation Coefficients between θ Estimates and True θs when b-parameter Drifting (with SDs in Parentheses) Group Linking Difference Methods Year One (0,1) Year Two (0,1) Concurrent FCIP TCC-ST Year One (0,1) Year Two (0.2,1) Concurrent FCIP TCC-ST Year One (0,1) Year Two (-0.2,1) Concurrent FCIP TCC-ST Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop No drift 3 items b-drift +0.2 3 items b-drift -0.2 3 items b-drift +0.4 3 items b-drift -0.4 0.927(0.002) 0.927(0.002) 0.926(0.002) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.926(0.002) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.926(0.002) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.926(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.926(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.926(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.927(0.002) 0.922(0.004) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 90 Table 4b.1 (cont’d) Group Linking Difference Methods Year One (0,1) Year Two (0,1) Concurrent FCIP TCC-ST Year One (0,1) Year Two (0.2,1) Concurrent FCIP TCC-ST Year One (0,1) Year Two (-0.2,1) Concurrent FCIP TCC-ST Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop No drift 8 items b-drift +0.2 8 items b-drift -0.2 8 items b-drift +0.4 8 items b-drift -0.4 0.927(0.002) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.917(0.003) 0.927(0.002) 0.927(0.002) 0.917(0.003) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.917(0.003) 0.927(0.002) 0.927(0.002) 0.917(0.003) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.917(0.003) 0.927(0.002) 0.927(0.002) 0.917(0.003) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.917(0.002) 0.927(0.002) 0.927(0.002) 0.917(0.002) 0.927(0.002) 0.923(0.002) 0.927(0.002) 0.917(0.002) 0.927(0.002) 0.927(0.002) 0.917(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.917(0.002) 0.927(0.002) 0.927(0.002) 0.917(0.002) 0.927(0.002) 0.926(0.002) 0.927(0.002) 0.914(0.002) 0.927(0.002) 0.927(0.002) 0.914(0.002) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.914(0.002) 0.927(0.002) 0.927(0.002) 0.914(0.002) 0.927(0.002) 0.926(0.002) 0.927(0.002) 0.914(0.002) 0.927(0.002) 0.927(0.002) 0.914(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.917(0.002) 0.927(0.002) 0.927(0.002) 0.917(0.002) 0.927(0.002) 0.924(0.002) 0.927(0.002) 0.917(0.002) 0.927(0.002) 0.927(0.002) 0.916(0.004) 0.927(0.002) 0.925(0.002) 0.927(0.002) 0.917(0.002) 0.927(0.002) 0.927(0.002) 0.918(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 0.927(0.002) 91 4.22 Accuracy of θ Estimates Examination of the correlation between θ estimates and true θs in different conditions shows that the relationship between the θ estimates and true θs are consistently strong and positive, regardless of the differences in the types of b-drift, group abilities, linking methods and ways of handling drifted items. This is one indication of the accuracy of θ estimation. To further examine the accuracy of the θ estimation, bias and RMSE values between θ estimates and true θs were also calculated for different situations of b-parameter drifts. Table 4b.2 and 4b.3 give the bias and RMSE values respectively for θ estimates when b-parameter drifting occurs. 4.221 Bias and RMSE in Eight b-drift Situations Eight scenarios of b-drift were examined in the study: three items (10%) showing an increase of 0.2 in b-parameter; three items showing an decrease of 0.2 in b-parameter; three items showing an increase of 0.4 in b-parameter; three items showing an decrease of 0.4 in b-parameter; eight items (25%) showing the increase of 0.2 and 0.4; eight items showing the decrease of 0.2 and 0.4. It was noticeable that most of the bias values were negative across all the conditions, which indicated that θ was underestimated under most circumstances. This might be explained by the fact that the average b-parameter of the items used in simulating the datasets was below zero. When the difficulty parameter was lower, it was more likely that the θ value was underestimated for groups with average ability at or above zero. The largest bias was -0.807, -1.092, -0.686, -0.694, -0.810, -0.872, -0.708, -0.765 for b-parameter drifting conditions of three items with 0.2 increase, three items with 0.2 decrease, three items with 0.4 increase, three items with 0.4 decrease, eight items with 0.2 increase, eight items with 0.2 decrease, eight items with 0.4 increase and eight items with 0.4 decrease, 92 respectively. Examination of the bias values showed that in six out of the eight drifting conditions, the largest underestimation occurred when the Concurrent linking method was used, drifted items were dropped and the Year Two group had a higher ability than the Year One group. Two exceptions were when the b-parameter showed a 0.4 increase drift. In these two conditions, the absolute value of bias was largest when the TCC-ST method was used to link the higher-ability Year Two group with the Year One group while drifting items were dropped from the calibration and yet kept in scoring. The smallest bias value was 0.001, -0.016, 0.015, -0.006, -0.016, 0.001, -0.039, 0.006 for the same eight b-parameter drifting conditions. It was obvious that the absolute bias value was among the smallest when the Year Two group with a lower ability was linked with the Year One group through FCIP or TCC-SC methods, and it did not seem to make much difference whether the drifting items were kept or dropped from the linking process. Examination of the largest RMSE values showed similar patterns as observed with the bias values. For each of the eight b-parameter drifting conditions, the largest RMSE was 0.827, 1.117, 0.714, 0.739, 0.837, 0.909, 0.737 and 0.813, respectively. In six out of the eight drifting conditions, the largest RMSE value occurred when Year Two group with a higher ability was linked with the Year One group through the Concurrent method while the drifting items were dropped from the linking. The two exceptions were in conditions when the b-parameter was showing a 0.4 increase. In these two cases, the largest RMSE also occurred in linking the Year Two group with a higher ability with the Year One group while the drifted items were dropped from the linking process, but with the TCC-ST method instead of the Concurrent method. However, examination of the smallest RMSE values for each drifting condition showed a pattern somewhat different from what was observed in bias values. The smallest RMSE value for 93 each of the eight b-parameter drifting conditions was 0.205, 0.207, 0.206, 0.166, 0.199, 0.198, 0.203 and 0.193, respectively. Most of the smallest RMSE values occurred when the Year Two group with a lower ability was linked with the Year One group, but the RMSE was smallest when the Concurrent method was used without dropping any drifting items, although the RMSE produced by the FCIP and TCC-ST methods was small as well. Examination of the bias and RMSE values also showed a tendency that the accuracy of estimation varied with the group ability and the linking methods. When the FCIP and TCC-ST linking methods were used, RMSE became smaller if the Year Two group had a lower ability than the Year One group and it became bigger if the Year Two group had a higher ability. However, when the two groups were linked with the Concurrent method, RMSE did not change a lot with the change in group ability, as long as all the items were kept in the linking process. Nevertheless, when the drifted items were dropped from the linking process, the RMSE values became bigger with the Concurrent method. 94 Table 4b.2 Bias for θ Estimates when b-parameter Drifting Group Difference Year One (0, 1) Year Two (0, 1) Linking Method Concurrent FCIP TCC-ST Year One (0, 1) Year Two (0.2, 1) Concurrent FCIP TCC-ST Year One (0, 1) Year Two (-0.2, 1) Concurrent FCIP TCC-ST Drifted Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop No drift -0.227 -0.337 -0.337 -0.208 -0.668 -0.668 -0.130 -0.013 -0.036 3 items b-drift +0.2 -0.234 -0.577 -0.327 -0.334 -0.326 -0.337 -0.336 -0.212 -0.807 -0.663 -0.669 -0.664 -0.674 -0.672 -0.130 -0.369 0.001 -0.007 0.002 -0.007 -0.006 95 3 items b-drift -0.2 -0.216 -0.940 -0.348 -0.327 -0.349 -0.332 -0.330 -0.305 -1.092 -0.686 -0.664 -0.687 -0.669 -0.665 -0.135 -0.825 -0.037 -0.017 -0.034 -0.018 -0.016 3 items b-drift +0.4 -0.230 -0.402 -0.310 -0.339 -0.310 -0.342 -0.340 -0.267 -0.558 -0.650 -0.680 -0.653 -0.686 -0.683 -0.129 -0.211 0.015 -0.015 0.017 -0.015 -0.035 3 items b-drift -0.4 -0.066 -0.576 -0.353 -0.320 -0.354 -0.323 -0.322 -0.044 -0.694 -0.688 -0.654 -0.688 -0.659 -0.659 -0.134 -0.377 -0.036 -0.006 -0.036 -0.007 -0.008 8 items b-drift +0.2 -0.232 -0.608 -0.301 -0.338 -0.303 -0.344 -0.337 -0.330 -0.810 -0.635 -0.673 -0.640 -0.685 -0.675 -0.125 -0.425 0.024 -0.016 0.022 -0.022 -0.018 8 items b-drift -0.2 -0.239 -0.697 -0.368 -0.326 -0.370 -0.332 -0.325 -0.339 -0.872 -0.700 -0.653 -0.705 -0.664 -0.654 -0.139 -0.509 -0.036 0.001 -0.040 -0.003 0.001 8 items b-drift +0.4 -0.227 -0.361 -0.247 -0.354 -0.265 -0.366 -0.357 -0.341 -0.561 -0.588 -0.695 -0.604 -0.708 -0.696 -0.120 -0.170 0.069 -0.041 0.054 -0.045 -0.039 8 items b-drift -0.4 -0.238 -0.586 -0.406 -0.315 -0.400 -0.323 -0.317 -0.375 -0.765 -0.725 -0.632 -0.725 -0.645 -0.650 -0.116 -0.408 -0.079 0.013 -0.076 0.006 0.009 Table 4b.3 RMSE for θ Estimates when b-parameter Drifting Group Difference Year One (0, 1) Year Two (0, 1) Linking Method Concurrent FCIP TCC-ST Year One (0, 1) Year Two (0.2, 1) Concurrent FCIP TCC-ST Year One (0, 1) Year Two (-0.2, 1) Concurrent FCIP TCC-ST Drifted Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop No drift 0.275 0.397 0.398 0.257 0.698 0.698 0.205 0.203 0.202 3 items b-drift +0.2 0.280 0.601 0.390 0.398 0.389 0.397 0.399 0.259 0.827 0.694 0.701 0.694 0.704 0.703 0.205 0.402 0.211 0.213 0.210 0.208 0.212 96 3 items b-drift -0.2 0.265 0.969 0.407 0.389 0.406 0.390 0.390 0.340 1.117 0.718 0.697 0.718 0.699 0.697 0.209 0.855 0.221 0.216 0.215 0.207 0.212 3 items b-drift +0.4 0.278 0.436 0.376 0.402 0.376 0.400 0.402 0.307 0.587 0.683 0.711 0.684 0.714 0.714 0.206 0.262 0.218 0.219 0.215 0.211 0.218 3 items b-drift -0.4 0.166 0.627 0.409 0.386 0.410 0.385 0.386 0.152 0.739 0.719 0.689 0.720 0.692 0.694 0.205 0.446 0.210 0.216 0.208 0.208 0.214 8 items b-drift +0.2 0.280 0.637 0.366 0.407 0.366 0.401 0.404 0.362 0.837 0.666 0.708 0.670 0.713 0.709 0.202 0.458 0.203 0.222 0.199 0.202 0.219 8 items b-drift -0.2 0.285 0.737 0.426 0.392 0.424 0.388 0.391 0.371 0.909 0.731 0.687 0.734 0.693 0.688 0.210 0.556 0.214 0.217 0.208 0.198 0.215 8 items b-drift +0.4 0.277 0.401 0.325 0.422 0.335 0.419 0.423 0.373 0.592 0.625 0.732 0.639 0.737 0.733 0.203 0.234 0.224 0.231 0.215 0.208 0.231 8 items b-drift -0.4 0.281 0.645 0.454 0.388 0.449 0.383 0.388 0.402 0.813 0.754 0.674 0.754 0.680 0.690 0.193 0.482 0.216 0.226 0.215 0.206 0.224 4.222 Effect of Percentage of Items Showing b-parameter Drift The effect of the change in the number of drifting items was not strong in most of the linking conditions examined. Table 4b.4 showed the change in bias and RMSE when the number of items showing b-parameter drift increased from three to eight. When more items were showing a positive b-parameter drift, the resulted change was small. The mean change was 0.006 in bias and -0.006 in RMSE when b-parameter drifted 0.2 in the positive direction. The mean change remained small (-0.007 in bias and 0.002 in RMSE) when b-parameter drifted 0.4 in the positive direction. The mean change became bigger when more items were showing a negative b-parameter drift. The mean change was -0.034 in bias and 0.032 in RMSE at the -0.2 drift and it was 0.036 in bias and -0.030 in RMSE at the -0.4 drift. However, these changes could still be considered small, which indicated that the change in the number of drifting items did not have big influence on the accuracy of θ estimation. Furthermore, it was noticeable that the changes in the number of drifting items had very little effect on the accuracy of θ estimation, if FCIP or TCC-ST methods were used in the linking process. However, it did have some effect if the Concurrent method was used. When the b-parameter was showing a negative 0.2 drift, the accuracy of θ estimation was influenced by the number of items showing the drift, if the drifting items were dropped from the linking with the Concurrent method. However, when the b-parameter was showing a negative 0.4 drift, the effect of the change in the number of items was stronger for the Concurrent method without dropping any drifting items, especially when the Year Two group had a higher ability. Examination of the drifted items might explain why the change in the number of drifting items affected the performance of the Concurrent method differently. When the items with the -0.2 drift were examined, it was found that the average b-parameter of the three drifted items was much bigger 97 than that of the eight drifted items. Therefore, it had more influence on the accuracy of the estimation if the three items with bigger b-parameters were dropped from the linking process. On the other hand, in the case of the -0.4 drift, examination of the drifted items showed that the average b-parameter of the three drifted items were not much different from that of the eight drifted items, but both of the average b-parameters were negative. Items with negative b-parameter were less effective in estimating the ability of groups with higher θ values. Therefore when more items with negative b-parameter were kept in the linking, it would be less accurate to estimate the θ when the Year Two group had a higher ability than the Year One group. The different effect of the change in the number of the drifting items indicated that the Concurrent method was more sensitive to the characteristic of the drifting items. 98 Table 4b.4 Changes in bias and RMSE with More Items Drifting in b-parameter (3 items Group Difference Linking Method Year One (0, 1) Year Two (0, 1) Concurrent FCIP TCC-ST Year One (0, 1) Year Two (0.2, 1) Concurrent FCIP TCC-ST Year One (0, 1) Year Two (-0.2, 1) Concurrent FCIP TCC-ST Drifted Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop Mean Std Deviation 8 items) Change in bias Change in RMSE b+0.2 b-0.2 b+0.4 b-0.4 b+0.2 b-0.2 b+0.4 b-0.4 -0.001 0.031 -0.025 0.004 -0.023 0.006 0.001 0.118 0.004 -0.028 0.005 -0.024 0.010 0.003 -0.006 0.056 -0.023 0.009 -0.020 0.015 0.012 0.006 0.033 0.023 -0.244 0.020 -0.002 0.021 0.000 -0.005 0.034 -0.219 0.014 -0.011 0.018 -0.004 -0.011 0.004 -0.316 0.000 -0.018 0.006 -0.014 -0.017 -0.034 0.097 -0.002 -0.041 -0.063 0.015 -0.045 0.025 0.018 0.074 0.003 -0.062 0.015 -0.049 0.023 0.013 -0.009 -0.041 -0.054 0.026 -0.037 0.030 0.005 -0.007 0.037 0.172 0.010 0.053 -0.006 0.046 0.000 -0.005 0.331 0.071 0.037 -0.022 0.037 -0.013 -0.009 -0.018 0.030 0.043 -0.019 0.040 -0.012 -0.017 0.036 0.081 0.001 -0.036 0.024 -0.010 0.023 -0.004 -0.006 -0.103 -0.009 0.028 -0.007 0.024 -0.010 -0.006 0.004 -0.056 0.009 -0.009 0.011 0.006 -0.007 -0.006 0.030 -0.020 0.232 -0.019 -0.003 -0.018 0.001 -0.001 -0.031 0.208 -0.013 0.010 -0.016 0.007 0.009 -0.001 0.299 0.006 -0.001 0.007 0.009 -0.003 0.032 0.092 0.001 0.035 0.051 -0.020 0.040 -0.018 -0.021 -0.066 -0.005 0.057 -0.021 0.045 -0.023 -0.019 0.004 0.028 -0.006 -0.012 0.001 0.003 -0.013 0.002 0.031 -0.115 -0.017 -0.045 -0.001 -0.039 0.001 -0.002 -0.250 -0.074 -0.035 0.016 -0.034 0.012 0.004 0.012 -0.036 -0.007 -0.010 -0.007 0.002 -0.010 -0.030 0.059 99 4.223 Effect of the Direction of b-parameter Drift The effect of the change in the direction of the b-parameter drift varied with the different linking methods used. The bias and RMSE values when the b-parameter was positive were compared with the corresponding values when the b-parameter was negative and the results were listed in Table 4b.5. For the FCIP method, the size of the difference ranged from 0.004 to 0.159 in bias, and from 0.002 to 0.129 in RMSE. For the TCC-ST method, the size of the change ranged from 0.005 to 0.135 in bias and from 0 to 0.115 in RMSE. For the Concurrent method, the size of the difference ranged from 0.004 to 0.455 in bias and from 0.001 to 0.453 in RMSE. Examination of the differences showed that there was a similar pattern of the difference for the FCIP and TCC-ST methods. For these two methods, in most of the situations, the change of the direction of the drift had very little effect on the accuracy of θ estimation. The only exception was when eight items were showing a size 0.4 drift in b-parameter and these eight items were kept in the linking process. In this situation, θ estimation was less accurate when the b-parameter drift was negative than when the drift was positive. However, for the Concurrent method, the effect of the direction of b-parameter drift was stronger when the drifted items were dropped from the linking process, which was different from FCIP and TCC-ST methods. Moreover, this effect of the change in the direction of the drift could be observed in all of the conditions except when three items had a size 0.2 drift. Nevertheless, the impact of the change in the direction of the drift worked in the similar way regardless of the linking method used. The θ estimation was less accurate when the items were showing a negative b-parameter drift than when a positive drift was present. The reason behind this might be that as the b-parameter drifted in the negative direction, some items with very small b-parameter were very easy and they were not useful in distinguishing different levels of abilities. 100 Table 4b.5 Changes in bias and RMSE with b-parameter Drifting in Different Directions (Positive Drift Group Difference Linking Method Year One (0, 1) Year Two (0, 1) Concurrent FCIP TCC-ST Year One (0, 1) Year Two (0.2, 1) Concurrent FCIP TCC-ST Year One (0, 1) Year Two (-0.2, 1) Concurrent FCIP TCC-ST Drifted Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop Mean Std Deviation N=3, 0.2drift -0.018 0.363 0.021 -0.007 0.023 -0.005 -0.006 0.093 0.285 0.024 -0.004 0.023 -0.006 -0.008 0.005 0.455 0.037 0.010 0.037 0.010 0.009 0.064 0.132 Negative Drift) Change in bias Change in RMSE N=3, N=8, N=8, N=3, N=3, N=8, N=8, 0.4 drift 0.2 drift 0.4 drift 0.2 drift 0.4 drift 0.2 drift 0.4 drift -0.164 0.007 0.010 0.015 0.112 -0.006 -0.004 0.174 0.088 0.225 -0.368 -0.191 -0.100 -0.244 0.043 0.067 0.159 -0.017 -0.033 -0.059 -0.129 -0.018 -0.012 -0.039 0.009 0.015 0.016 0.034 0.044 0.067 0.135 -0.018 -0.034 -0.058 -0.114 -0.018 -0.012 -0.044 0.007 0.015 0.012 0.035 -0.018 -0.012 -0.040 0.008 0.016 0.013 0.035 -0.223 0.009 0.034 -0.081 0.155 -0.009 -0.028 0.136 0.062 0.204 -0.290 -0.152 -0.072 -0.221 0.038 0.065 0.137 -0.025 -0.036 -0.065 -0.129 -0.025 -0.020 -0.062 0.004 0.022 0.021 0.058 0.035 0.065 0.121 -0.024 -0.035 -0.064 -0.115 -0.027 -0.021 -0.063 0.004 0.023 0.021 0.058 -0.024 -0.021 -0.045 0.006 0.020 0.021 0.043 0.005 0.014 -0.004 -0.003 0.001 -0.008 0.009 0.166 0.084 0.238 -0.453 -0.183 -0.097 -0.248 0.052 0.060 0.148 -0.009 0.008 -0.012 0.008 -0.009 -0.017 -0.053 -0.002 0.003 0.005 0.005 0.053 0.062 0.130 -0.005 0.008 -0.009 0.000 -0.008 -0.019 -0.051 0.001 0.003 0.003 0.002 -0.026 -0.019 -0.048 0.000 0.005 0.004 0.007 0.009 0.024 0.052 -0.059 -0.012 -0.021 -0.045 0.091 0.042 0.108 0.134 0.082 0.041 0.100 101 4.224 Effect of the Magnitude of b-parameter Drift When the b-parameter drift increased in size from 0.2 to 0.4, its effect on the accuracy of θ estimation varied with the linking methods used. If the FCIP and TCC-ST methods were used, the change in the size of the drift had very little effect on the accuracy of θ estimation. If the Concurrent method was used, the impact of the change in the drift size was different depending on the way of handling the drifting items. Table 4b.6 listed the effect of the change in the magnitude of the b-parameter drift in terms of changes in bias and RMSE values. When FCIP or TCC-ST method was used, the difference in bias caused by the change in the size of the b-parameter drift was no more than 0.055 and the difference in RMSE was less than 0.042. This small size of difference indicated that there was little change in accuracy of the θ estimation whether the b-parameter drift was changing in size, as long as FCIP or TCC-ST method was used in linking the groups. However, the result was different for the Concurrent method. When the drifting items were kept in the linking process, in most conditions the change in the size of the drift had little effect on the accuracy of θ estimation. On the other hand, when the drifting items were removed from the linking process, the change in the size of the drift had an impact on the size of the difference in θ estimation. The resulting change in bias was between 0.101 and 0.447, and the change in RMSE ranged from 0.092 and 0.409. These differences in bias and RMSE were relatively larger, compared with the corresponding values for FCIP and TCC-ST methods. This indicated that when the b-parameter drift increased in size the accuracy of the θ estimation decreased, if the two groups were linked through the Concurrent method without the use of the drifting items. 102 Table 4b.6 Changes in bias and RMSE as the size of b-parameter Drift Increases (0.2 Group Difference Linking Method Year One (0, 1) Year Two (0, 1) Concurrent FCIP TCC-ST Year One (0, 1) Year Two (0.2, 1) Concurrent FCIP TCC-ST Year One (0, 1) Year Two (-0.2, 1) Concurrent FCIP TCC-ST Drifted Items Handling (link/score) keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop keep/keep drop/drop keep/keep drop/drop keep/keep drop/keep drop/drop Mean Std Deviation N=3, (+)drift -0.004 -0.175 -0.017 0.004 -0.016 0.004 0.004 0.055 -0.249 -0.012 0.011 -0.011 0.011 0.010 -0.002 -0.158 -0.015 0.008 -0.015 0.008 0.028 -0.025 0.074 Change in bias N=3, N=8, (-)drift (+)drift -0.150 -0.005 -0.364 -0.247 0.005 -0.055 -0.007 0.016 0.004 -0.038 -0.009 0.023 -0.008 0.021 -0.261 0.011 -0.398 -0.249 0.002 -0.047 -0.010 0.021 0.002 -0.036 -0.010 0.023 -0.006 0.021 -0.001 -0.004 -0.447 -0.255 0.000 -0.046 -0.010 0.024 0.002 -0.032 -0.011 0.023 -0.007 0.021 -0.080 -0.039 0.149 0.093 103 N=8, (-)drift -0.001 -0.111 0.037 -0.011 0.030 -0.009 -0.008 0.036 -0.107 0.025 -0.020 0.020 -0.019 -0.004 -0.023 -0.101 0.042 -0.011 0.036 -0.009 -0.008 -0.010 0.045 0.4) N=3, (+)drift 0.003 0.165 0.014 -0.004 0.013 -0.003 -0.004 -0.048 0.240 0.011 -0.011 0.010 -0.011 -0.010 -0.001 0.139 -0.007 -0.006 -0.005 -0.003 -0.006 0.023 0.070 Change in RMSE N=3, N=8, (-)drift (+)drift 0.099 0.002 0.342 0.236 -0.002 0.041 0.002 -0.014 -0.003 0.031 0.005 -0.018 0.004 -0.019 0.188 -0.011 0.378 0.244 -0.001 0.041 0.007 -0.024 -0.002 0.031 0.008 -0.024 0.003 -0.023 0.004 -0.001 0.409 0.224 0.011 -0.021 -0.001 -0.009 0.007 -0.016 -0.001 -0.007 -0.001 -0.011 0.069 0.031 0.136 0.088 N=8, (-)drift 0.004 0.092 -0.028 0.004 -0.024 0.005 0.002 -0.031 0.096 -0.023 0.013 -0.020 0.013 -0.002 0.017 0.074 -0.002 -0.010 -0.007 -0.008 -0.009 0.007 0.036 4.225 Effect of the Linking Method Choice of the linking method had a significant impact on the accuracy of θ estimation. It was particularly true when the Concurrent method was compared with the FCIP and TCC-ST methods. The effect of the linking method was compared in Figure 4b.0.1 to Figure 4b.0.6. The performance of the FCIP linking method was almost as same as the TCC-ST linking method across all conditions. However, the performance of the Concurrent method varied with different situations. When all the drifting items were kept in the linking process, the Concurrent method worked better than, or as well as, the FCIP and TCC-ST methods. With all the items included in the linking, when the Year Two group was of equal or higher ability than the Year One group, the Concurrent method performed much better than the FCIP and TCC-ST methods. Its superiority was most obvious when the Year Two group had a higher ability than the Year One group. When the Year Two group had a lower ability than the Year One group, there was not much difference between the performance of the Concurrent method and the other two methods. However, when the drifting items were dropped from the linking process, the performance of the Concurrent method took a down turn. The Concurrent method produced less accurate θ estimation than the FCIP and TCC-ST methods in most situations, regardless of equivalency of the group abilities. Nevertheless, the Concurrent method managed to catch up with the FCIP and TCC-ST methods in terms of estimation accuracy when there was a positive 0.4 drift in b-parameter. The different performance of the Concurrent method suggested that when more items were kept in the linking process, the Concurrent method worked better. In the Concurrent calibration, the item and ability parameters were estimated on one single computer run where the Year One group and the Year Two group were combined and the responses of the Year Two group to the unique items were considered missing. It was likely that the inclusion of more items, 104 especially items with similar characteristics, would help improve the accuracy of the estimation. Therefore, when all the drifting items were removed from the linking process, the performance of the Concurrent method was undermined. Moreover, in this study, the mean b-parameter of the items used in simulating the item responses was - 0.1199. Therefore, a big positive drift in the b-parameter might result in items with distinctive characteristics and removing those drifting items from the linking process might help the performance of the concurrent method. 105 Figure 4b.0.1 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; All Items Included) 0.50 0.45 0.40 0.35 RMSE 0.30 0.25 0.20 0.15 0.10 0.05 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Concurrent FCIP 106 TCC-ST I8_b+0.4 I8_b-0.4 Figure 4b.0.2 Comparison of Linking Methods (Mean Year 1 = Mean Year 2; Drifed Items Dropped) 1.10 1.00 0.90 0.80 RMSE 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Concurrent FCIP 107 TCC-ST I8_b+0.4 I8_b-0.4 Figure 4b.0.3 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; All Items Included) 0.80 0.70 0.60 RMSE 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Concurrent FCIP 108 TCC-ST I8_b+0.4 I8_b-0.4 Figure 4b.0.4 Comparison of Linking Methods (Mean Year 1 < Mean Year 2; Drifed Items Dropped) 1.20 1.10 1.00 0.90 0.80 RMSE 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Concurrent FCIP 109 TCC-ST I8_b+0.4 I8_b-0.4 Figure 4b.0.5 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; All Items Included) 0.25 0.20 RMSE 0.15 0.10 0.05 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Concurrent FCIP 110 TCC-ST I8_b+0.4 I8_b-0.4 Figure 4b.0.6 Comparison of Linking Methods (Mean Year 1 > Mean Year 2; Drifed Items Dropped) 0.90 0.80 0.70 RMSE 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Concurrent FCIP 111 TCC-ST I8_b+0.4 I8_b-0.4 4.226 Effect of Group Difference The difference in the ability of the groups to be linked had an impact on the accuracy of θ estimation. However, this effect of group difference varied depending on the linking method and the way of handling the drifting items. Figure 4b.1.1 to Figure 4b.1.7 compared the accuracy of θ estimates in terms of RMSE when the equivalent or non-equivalent groups were linked under different circumstances. The impact of the group difference was most consistent with the FCIP method. The θ estimation was least accurate when the mean ability of the Year Two group was higher than the Year One group. When the two equivalent groups were linked, the estimation of θ became more accurate. The estimation of θ was most accurate when the Year Two group had lower ability than the Year One group. This pattern could be observed with the FCIP method across all the eight drifting conditions, no matter how drifting items was handled. One reason of the formation of such a pattern might be that there were more items with a lower b-parameter which were less efficient in estimating higher θ values but better at estimating lower θ values. A similar pattern was found with the TCC-ST method when the drifting items were removed from the set of the linking items in calibration. The estimation of θ was most accurate when the Year Two group had a lower ability, became less accurate as the mean ability of the Year Two group increased. However, when the drifting items were kept in the calibration, although the estimation was still more accurate in situations when the mean ability of the Year Two group was lower, there was little difference in the accuracy of θ estimation whether the mean ability of the Year Two group was equal to or higher than the Year One group. For the Concurrent method, when the drifting items were dropped from the linking process, the same pattern occurred as with the FCIP method. However, when the drifting items were kept 112 in the linking process, there was a little change in the pattern. The difference in the accuracy of θ estimation was not as big as what was observed with the FCIP and TCC-ST methods, when there were only three items showing b-parameter drift. In the situation when three items were exhibiting a negative 0.4 drift in b-parameter, the θ estimation became least accurate when the Year Two group had a lower ability. This exception might be explained by the fact that one of the drifting items happened to have an extreme low b-parameter, and therefore it helped the performance of estimation for groups with lower ability. 113 Figure 4b.1.1 Effect of Group Difference (Concurrent; All Items Included) 0.45 0.40 0.35 RMSE 0.30 0.25 0.20 0.15 0.10 0.05 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 I8_b+0.4 Drift Year1=Year2 Year1Year2 I8_b-0.4 Figure 4b.1.2 Effect of Group Difference (Concurrent; Drifted Items Dropped) 1.20 1.10 1.00 0.90 0.80 RMSE 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 I8_b+0.4 Drift Year1=Year2 Year1Year2 I8_b-0.4 Figure 4b.1.3 Effect of Group Difference (FCIP; All Items Included) 0.80 0.70 0.60 RMSE 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 I8_b+0.4 Drift Year1=Year2 Year1Year2 I8_b-0.4 Figure 4b.1.4 Effect of Group Difference (FCIP; Drifted Items Dropped) 0.80 0.70 0.60 RMSE 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 I8_b+0.4 Drift Year1=Year2 Year1Year2 I8_b-0.4 Figure 4b.1.5 Effect of Group Difference (TCC-ST; All Items Included) 0.80 0.70 0.60 RMSE 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 I8_b+0.4 Drift Year1=Year2 Year1Year2 I8_b-0.4 Figure 4b.1.6 Effect of Group Difference (TCC-ST; Drifted Items Dropped in Linking but Kept in Scoring) 0.80 0.70 0.60 RMSE 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 I8_b+0.4 Drift Year1=Year2 Year1Year2 I8_b-0.4 Figure 4b.1.7 Effect of Group Difference (TCC-ST; Drifted Items Dropped) 0.80 0.70 0.60 RMSE 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 I8_b+0.4 Drift Year1=Year2 Year1Year2 I8_b-0.4 4.227 Effect of Drifted Items Handling When it comes to the question whether the accuracy of θ estimation was influenced by the way how the drifting items were handled, the choice of the linking method played an important role. Figure 4b.2.1 to Figure 4b.2.9 displayed the effect of the way of handling drifting items. When FCIP and TCC-ST linking methods were used, it did not make much difference in accuracy of the θ estimation whether the drifting items were kept or removed from the linking process. However, as more items were showing larger drift, there seemed to be a tendency that keeping the drifting items would produce better estimation than dropping the items when the drift was positive, but dropping the items would lead to better estimation when the drift was negative. This suggested that more items with larger b-parameter would help improve the accuracy of θ estimation, but items with smaller b-parameter would not help and sometimes even do more harm than benefit to the θ estimation. Nevertheless, the observed differences in RMSE were not big enough, and different ways of handling drifting items had no significant effect on the performance of θ estimation, as far as FCIP and TCC-ST linking methods were concerned. On the other hand, when the Concurrent method was applied, the effect of removing or keeping drifted items was more substantial. The θ estimation was more accurate when the drifting items were kept in the calibration. This pattern was observed in all of the situations included in this study, although the size of difference in the accuracy of estimation varied. In conditions when three items were making a negative 0.2 drift in b-parameter, the biggest difference was bigger than 0.7 in terms of RMSE. In conditions when eight items were making a positive 0.4 drift in b-parameter, the smallest difference was less than 0.04 in terms of RMSE. These observations suggested that the Concurrent method did a better job in θ estimation when more items were included in the linking process. 121 Figure 4b.2.1 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 = Mean Year 2) 1.10 1.00 0.90 0.80 RMSE 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Keep/Keep Drop/Drop 122 I8_b+0.4 I8_b-0.4 Figure 4b.2.2 Comparison of Handling Drifted Items (FCIP; Mean Year 1 = Mean Year 2) 0.50 0.45 0.40 0.35 RMSE 0.30 0.25 0.20 0.15 0.10 0.05 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Keep/Keep Drop/Drop 123 I8_b+0.4 I8_b-0.4 Figure 4b.2.3 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 = Mean Year 2) 0.50 0.45 0.40 0.35 RMSE 0.30 0.25 0.20 0.15 0.10 0.05 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Keep/Keep Drop/Keep 124 Drop/Drop I8_b+0.4 I8_b-0.4 Figure 4b.2.4 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 < Mean Year 2) 1.20 1.10 1.00 0.90 0.80 RMSE 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Keep/Keep Drop/Drop 125 I8_b+0.4 I8_b-0.4 Figure 4b.2.5 Comparison of Handling Drifted Items (FCIP; Mean Year 1 < Mean Year 2) 0.80 0.70 0.60 RMSE 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Keep/Keep Drop/Drop 126 I8_b+0.4 I8_b-0.4 Figure 4b.2.6 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 < Mean Year 2) 0.80 0.70 0.60 RMSE 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Keep/Keep Drop/Keep 127 Drop/Drop I8_b+0.4 I8_b-0.4 Figure 4b.2.7 Comparison of Handling Drifted Items (Concurrent; Mean Year 1 > Mean Year 2) 0.90 0.80 0.70 RMSE 0.60 0.50 0.40 0.30 0.20 0.10 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Keep/Keep Drop/Drop 128 I8_b+0.4 I8_b-0.4 Figure 4b.2.8 Comparison of Handling Drifted Items (FCIP; Mean Year 1 > Mean Year 2) 0.25 0.20 RMSE 0.15 0.10 0.05 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Keep/Keep Drop/Drop 129 I8_b+0.4 I8_b-0.4 Figure 4b.2.9 Comparison of Handling Drifted Items (TCC-ST; Mean Year 1 > Mean Year 2) 0.25 0.20 RMSE 0.15 0.10 0.05 0.00 No_Drift I3_b+0.2 I3_b-0.2 I3_b+0.4 I3_b-0.4 I8_b+0.2 I8_b-0.2 Drift Keep/Keep Drop/Keep 130 Drop/Drop I8_b+0.4 I8_b-0.4 4.228 Effect of Drifted Items Handling at Different θ Levels When the effects of the handling of drifting items were studied in conditions of a-parameter drift, scatter plots between the θ differences and the interval of the true θs were drawn to examine the impact of the items handling at different θ levels. The θ scale was divided into equally spaced intervals and the θ estimates and the true θ values were compared and averaged for each interval. The same procedure was conducted for the conditions of b-parameter drift. Figure 4b.3.1 to Figure 4b.3.8 showed the resulting scatter plots comparing the mean θ differences at different θ intervals. Figure 4b.3.1 demonstrated the effect of drifting items handling when three items were showing a positive b-parameter drift with a size of 0.2. When the FCIP or TCC-ST method was used, it did not make any difference whether these drifting items were removed from the linking or not. The size of bias was larger at the two extreme ends of the θ scale. Small θs tended to be overestimated while large θs tended to be underestimated. This pattern did not change whether drifting items were removed from the linking or not. However, when the Concurrent method was applied, ways to handling the drifting items made a difference and the difference varied depending on the size of the θ values. When the θ values fell within a moderate range, for instance, between -2 to 2 on the θ scale, the size of bias increased when the drifting items were removed from the linking. When the θ values were at the extremely lower end of the θ scale, for example, lower than -3, the size of bias became bigger when the drifting items were kept in the linking process. When θ values were at intervals between -3 and -2 or when θ values were quite bigger, there was not much difference whether the drifting items were kept or removed from the linking process. 131 Figure 4b.3.1 Mean bias at θ Intervals (3 Items b-drift +0.2) Mean BIAS at θ Intervals (3 Items b -drift +0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items b -drift +0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 132 FCIP_drop 4.0 Figure 4b.3.1 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift +0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (3 Items b -drift +0.2; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 133 concurrent_drop Figure 4b.3.1 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift +0.2; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (3 Items b -drift +0.2; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 134 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4b.3.1 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift +0.2; Year 1 > Year 2) 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items b -drift +0.2; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 135 FCIP_drop 4.0 Figure 4b.3.1 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift +0.2; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Figure 4b.3.2 displayed the effect of drifting items handling when three items were showing a negative b-parameter drift with a size of 0.2. When the FCIP or TCC-ST method was applied in the linking process, the accuracy of θ estimation was not affected whether the drifting items were kept or removed from the linking. θ estimation was more accurate for θs in the middle of the θ scale. When the concurrent method was used, if the drifting items were dropped from the linking, the estimation became less accurate. The size of bias became larger as θ values fell within the middle range of the scale. As the θ moved towards the higher end of the scale, the size of bias became smaller. Despite of the change in the size of bias, it was consistent that keeping the drifting items helped yielding better estimation. The only exception was when the θ value was at the extremely lower end of the θ scale. In that case, it produced better θ estimation to drop the 136 drifting items. Figure 4b.3.2 Mean bias at θ Intervals (3 Items b-drift -0.2) Mean BIAS at θ Intervals (3 Items b -drift -0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items b -drift -0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 137 FCIP_drop 4.0 Figure 4b.3.2 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift -0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (3 Items b -drift -0.2; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 138 concurrent_drop Figure 4b.3.2 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift -0.2; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (3 Items b -drift -0.2; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 139 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4b.3.2 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift -0.2; Year 1 > Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items b -drift -0.2; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 140 FCIP_drop 4.0 Figure 4b.3.2 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift -0.2; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop As shown in Figure 4b.3.3, when there was a positive 0.4 drift in b-parameter for three items, the way to handle the drifting items had no effect on the accuracy of θ estimation, if the FCIP or TCC-ST method was used. For the Concurrent method, θ estimation was more accurate when the drifted items were kept in the linking, except at the two extreme ends of the θ scale. At the lower end of the θ scale, estimation was more accurate when the drifting items were dropped from the linking. At the higher end of the θ scale, when the Year One group was of equal or higher ability than the Year Two group, θ estimation was more accurate when the drifting items were removed from the linking; when the Year One group was of lower ability than the Year Two group, θ estimation was not influence by whether the drifting items were kept or removed from the linking. Furthermore, compared with conditions where the size of drift was 0.2, as the size of the drift increased, the discrepancy in bias caused by the different ways of handling the drifting 141 items became smaller. Figure 4b.3.3 Mean bias at θ Intervals (3 Items b-drift +0.4) Mean BIAS at θ Intervals (3 Items b -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items b -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 142 FCIP_drop 4.0 Figure 4b.3.3 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (3 Items b -drift +0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 143 concurrent_drop Figure 4b.3.3 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift +0.4; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (3 Items b -drift +0.4; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 144 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4b.3.3 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift +0.4; Year 1 > Year 2) 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items b -drift +0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 145 FCIP_drop 4.0 Figure 4b.3.3 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift +0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Figure 4b.3.4 showed the effect of ways to handling the drifting items when the items were experiencing negative b-parameter drift of a size 0.4. The pattern was not much different from what had been observed in other conditions of b-parameter drift. The effect of ways to handling the drifting items was close to zero for the FCIP and TCC-ST methods. When the Concurrent method was used, keeping the drifting items increased the accuracy of θ estimates when the estimates fell in the middle range of the θ scale. To estimate θs with extreme low values, it was more accurate if the drifting items were dropped from the linking. To estimate θs with larger values, the effect varied. It did not make difference whether to keep the drifting items or not, if the mean ability of the Year One group was equivalent to or smaller than that of the Year Two group. Otherwise, if the Year One group had a higher ability than the Year Two group, the estimation was more accurate when the drifting items were dropped from the linking. 146 Figure 4b.3.4 Mean bias at θ Intervals (3 Items b-drift -0.4) Mean BIAS at θ Intervals (3 Items b -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items b -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 147 FCIP_drop 4.0 Figure 4b.3.4 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (3 Items b -drift -0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 148 concurrent_drop Figure 4b.3.4 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift -0.4; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (3 Items b -drift -0.4; Year 1 < Year 2) 1.0 0.5 BIAS 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 149 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4b.3.4 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift -0.4; Year 1 > Year 2) 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (3 Items b -drift -0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 150 FCIP_drop 4.0 Figure 4b.3.4 (cont’d) Mean BIAS at θ Intervals (3 Items b -drift -0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop As shown in Figure 4b.3.5, when there were eight items showing a positive b-parameter drift of size 0.2, a very similar pattern could be observed about the effect of keeping or removing the drifting items, as in the condition when 3 items were showing the similar drift. There was almost no difference whether these drifting items were dropped when the FCIP or TCC-ST method was used. When the Concurrent method was used, for θs that were in the middle (-2.5 to 2), better estimation was achieved when the drifting items were kept in the linking. For θs at the extremely lower end (less than -3), the estimation performed better if the drifting items were dropped from the linking. For θs in other ranges on the scales, they were estimated equally well either with or without the drifting items. 151 Figure 4b.3.5 Mean bias at θ Intervals (8 Items b-drift +0.2) Mean BIAS at θ Intervals (8 Items b -drift +0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items b -drift +0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 152 FCIP_drop 4.0 Figure 4b.3.5 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift +0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (8 Items b -drift +0.2; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 153 concurrent_drop Figure 4b.3.5 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift +0.2; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (8 Items b -drift +0.2; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 154 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4b.3.5 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift +0.2; Year 1 > Year 2) 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items b -drift +0.2; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 155 FCIP_drop 4.0 Figure 4b.3.5 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift +0.2; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop When eight items were showing a negative b-parameter drift of size 0.2, as displayed in Figure 4b.3.6, there was not much difference in the pattern of the effect of handling drifting items from those observed in other conditions. The θ estimation was equally accurate whether the drifting items were kept or removed, if the FCIP or TCC-ST method was used. With the application of the Concurrent method, in most intervals on the θ scale (-2.5 to 2), it was better to keep the drifting items. But when θ values were extremely low, it was better to drop the drifting items to get more accurate estimation. For θ values at other intervals, it made no difference in the accuracy of the estimation whether to keep the drifting items or not. The only difference might be that when the θ value was small (less than -2, for instance), it tended to be overestimated when the drifting items were kept in the linking and it was likely to be underestimated when the drifting items were dropped from the linking. But the size of bias was similar in those situations. 156 Figure 4b.3.6 Mean bias at θ Intervals (8 Items b-drift -0.2) Mean BIAS at θ Intervals (8 Items b -drift -0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items b -drift -0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 157 FCIP_drop 4.0 Figure 4b.3.6 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift -0.2; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (8 Items b -drift -0.2; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 158 concurrent_drop Figure 4b.3.6 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift -0.2; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (8 Items b -drift -0.2; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 159 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4b.3.6 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift -0.2; Year 1 > Year 2) 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items b -drift -0.2; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 160 FCIP_drop 4.0 Figure 4b.3.6 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift -0.2; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop When eight items had positive b-parameter drift of size 0.4, as shown in Figure 4b.3.7, the effect still existed as in other drifting conditions for the Concurrent method, though the effect was relatively smaller compared with those shown in other drifting conditions. When the Concurrent method was used, for θs at the two ends of the θ scale, estimation was more accurate when the drifting items were dropped from the linking. For θs in the middle part of the scale, better estimation was achieved when the drifting items were kept in the linking. In this drifting condition the difference was relatively smaller between the size of bias when drifting items were kept and the size of bias when drifting items were removed. As observed in other conditions, for the FCIP or TCC-ST method, the accuracy of θ estimation was not significantly affected by whether the drifting items were dropped from the linking or not. Keeping the drifting items seemed to be doing a little better than removing them, 161 but the difference was too small to be statistically significant. Figure 4b.3.7 Mean bias at θ Intervals (8 Items b-drift +0.4) Mean BIAS at θ Intervals (8 Items b -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items b -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 162 FCIP_drop 4.0 Figure 4b.3.7 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift +0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (8 Items b -drift +0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 163 concurrent_drop Figure 4b.3.7 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift +0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (8 Items b -drift +0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 164 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4b.3.7 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift +0.4; Year 1 > Year 2) 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items b -drift +0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 165 FCIP_drop 4.0 Figure 4b.3.7 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift +0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop When eight items were having a negative b-parameter drift of size 0.4, the pattern of the effect of ways to handling drifting items was no different from those observed in other conditions. From Figure 4b.3.8 it was obvious that the performance of θ estimation was not affected by the way how the drifting items were handled, as long as the FCIP and TCC-ST methods were used. When the Concurrent method was used, for extremely smaller θ values (less than -2.5) and for relatively larger θ values (bigger than 1), estimation was more accurate when the drifting items were dropped from the linking. On the other hand, for θ values that were of moderate size (between -2.5 and 1), estimation was more accurate when the drifting items were kept in the linking. 166 Figure 4b.3.8 Mean bias at θ Intervals (8 Items b-drift -0.4) Mean BIAS at θ Intervals (8 Items b -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items b -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 167 FCIP_drop 4.0 Figure 4b.3.8 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift -0.4; Year 1 = Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop Mean BIAS at θ Intervals (8 Items b -drift -0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep 168 concurrent_drop Figure 4b.3.8 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift -0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ FCIP_no_drift FCIP_keep FCIP_drop Mean BIAS at θ Intervals (8 Items b -drift -0.4; Year 1 < Year 2) 1.5 1.0 BIAS 0.5 0.0 -0.5 -1.0 -1.5 -2.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ TCC-ST_no_drift TCC-ST_drop/keep 169 TCC-ST_keep TCC-ST_drop/drop 4.0 Figure 4b.3.8 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift -0.4; Year 1 > Year 2) 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ concurrent_no_drift concurrent_keep concurrent_drop Mean BIAS at θ Intervals (8 Items b -drift -0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 θ FCIP_no_drift FCIP_keep 170 FCIP_drop 4.0 Figure 4b.3.8 (cont’d) Mean BIAS at θ Intervals (8 Items b -drift -0.4; Year 1 > Year 2) 2.0 1.5 BIAS 1.0 0.5 0.0 -0.5 -1.0 -1.5 -4.0 -3.0 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 θ TCC-ST_no_drift TCC-ST_drop/keep TCC-ST_keep TCC-ST_drop/drop All in all, from the examination of the effect of ways of handling drifting items in all the eight b-parameter drifting conditions, it was found that the Concurrent method was sensitive to how the drifting items were handled and the effect worked differently for θs at different locations of the θ scale. On the contrary, it did not matter whether the drifting items were removed from the linking process, when the FCIP and TCC-ST methods were used. 4.23 Accuracy of Performance Level Classification One index of the quality of the estimation was the percentage of examinees classified into the appropriate performance levels. Correct classification of the performance level was one of the indications that the θ estimates were accurate. Table 4b.7 to Table 4b.14 summarized the percentages of correct classification in each performance level when the b-parameter was drifting 171 in different size and direction for different numbers of the items: three items showing a positive drift of size 0.2 (Condition 1), three items showing a negative drift of size 0.2 (Condition 2), three items showing a positive drift of size 0.4 (Condition 3), three items showing a negative drift of size 0.4 (Condition 4), eight items showing a positive drift of size 0.2 (Condition 5), eight items showing a negative drift of size 0.2 (Condition 6), eight items showing a positive drift of size 0.4 (Condition 7), and eight items showing a negative drift of size 0.4 (Condition 8). Examination of Tables 4b.7 to 4b.14 suggested that there was a pattern in the percentage of correct classification, regardless of the different drifting conditions of b-parameter. In all eight drifting conditions, for the FCIP and TCC-ST methods, there was little difference whether the drifting items were removed from the linking. However, it made a difference whether the drifting items were kept in the linking or not, as far as the Concurrent method was concerned. If the drifting items were kept in the linking process, the Concurrent method performed as well as or even better than the FCIP and TCC-ST methods. When the mean ability of the Year One Group was equivalent to that of the Year Two group, the Concurrent method was doing slightly better than the other two methods. When the Year Two group had a higher ability than the Year One group, the superiority of the Concurrent method to the other two methods was outstanding. When the Year Two group had a lower ability than the Year One group, the Concurrent method was doing equally well as the FCIP and TCC-ST methods. This tendency could be observed in all of the b-parameter drifting conditions. When the Year One group and the Year Two group were of equal ability, the Concurrent method did better than the other two methods. The overall percentages of correct classifications were over 80% for the Concurrent method, while the corresponding percentages for the FCIP and TCC-ST methods were around 70%. For the Concurrent method, the percentages of correct 172 classifications at Level 3 were over 70%, and it was close to 90% in the condition when three items were showing a size 0.4 b-parameter drift. In comparison, the corresponding percentages for the FCIP and TCC-ST methods were about 60%, and it was less than 50% in condition when eight items were showing a negative b-parameter drift of size 0.4. When the mean ability the Year Two group was higher than that of the Year One group, the Concurrent method yielded much better estimation than the other two methods. For the Concurrent method, the overall percentages of correct classifications were over 70% and in some conditions even as high as 82.1% (Condition 1) and 90.8% (Condition 4). In comparison, the corresponding percentages were about 50% for the FCIP and TCC-ST methods. In terms of correct classifications at Level 3, the percentages for the Concurrent method were around 60% in most conditions, the highest being 74.9% in Condition 1. On the contrary, the corresponding percentages for the FCIP and TCC-ST methods were around 30%. When the Year Two group had a lower ability than the Year One group, the Concurrent method performed as well as the other two groups, though the FCIP and TCC-ST methods did better than the Concurrent method at the classification of Level 3. The overall percentages of correct classifications were close to 90% for all conditions of the b-parameter drifting, regardless of the linking method used. In terms of the percentage of correct classifications at level 3, the FCIP and TCC-ST methods outperformed the Concurrent method (over 90% versus 80%). There was a big decline in the performance of the Concurrent method when the drifting items were dropped from the linking, but the performance of the FCIP and TCC-ST methods was not affected. Therefore, the Concurrent method did much worse than the FCIP and TCC-ST methods when the Year Two group had a lower or equivalent ability than the Year One group. When the Year Two group was of a higher ability, there was not much difference in the 173 performances of the three methods. This pattern was consistent across all of the b-parameter drifting conditions except when items were showing a positive drift of size 0.4. In that particular condition, the Concurrent method performed as well as the other two methods when the two groups were of equal ability. When the Year One group and the Year Two group were of equal ability, with drifting items dropped, the Concurrent method did a worse job than the other two methods. The only exception was when items were showing a positive drift of 0.4, in which case, the three methods were doing equally well. In most of the conditions, the overall percentages of correct classifications were less than 60% for the Concurrent method for conditions, while the corresponding percentages were above 70% for the FCIP and TCC-ST methods. In conditions 3 and 7 with 0.4 b-parameter drift, the overall percentages of correct classifications were about 70% for all three methods. In terms of percentage of correct classifications at Level 3, the differences among the three methods were small. For the Concurrent method, the percentages were around 50%, except for Condition 1 (32.7%), and the percentages were between 50% to 60% for the FCIP and TCC-ST methods. When the Year Two group had a higher ability than the Year One group, there was not much difference in the relatively poor performance of the three methods, although the Concurrent method was doing slightly better when the drift was big, especially in classifying examinees at the Level 3. The overall percentages of the correct classifications were consistently a little below 50% for the FCIP and TCC-ST methods. The corresponding percentages for the Concurrent method were around 50% for most of the situations except for condition 1(30.7%). The percentages of correct classifications at Level 3 were less than 30% for the FCIP and TCC-ST methods for all of the 8 conditions. The corresponding percentages for the Concurrent method 174 were higher (near 40% for conditions 1, 5 and 6, around 50% for conditions 3, 4, 7 and 8), except for condition 2 (27.7%). When the Year Two group had a lower ability than the Year One group, it was obvious that the FCIP and TCC-ST methods were doing a better job, because the Concurrent method was sensitive to the way drifting items were handled. But when the items were showing a positive drift of 0.4, the Concurrent method managed to do as well as the other two methods. The overall percentages of correct classifications were close to 90% for the FCIP and TCC-ST methods in all of the eight conditions. On the contrary, the corresponding percentages for the Concurrent method were only around 70% or even lower, except in Condition 3 and 8 (83% and 84.9%, respectively). In terms of the percentage of correct classifications at Level 3, the FCIP and TCC-ST methods performed much better than the Concurrent method. The percentages were well above 90% for the FCIP and TCC-ST methods, while they were around 70% for the Concurrent method (with highest being 81.2% for Condition 7 and lowest being 37.7% for Condition 2). 175 Table 4b.7 Percentage in Each Performance Level Classification (N=3, Drift=b+0.2) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 99.0 2 15.2 84.8 Concurrent 3 27.2 72.8 4 52.1 47.9 Overall 19.0 80.8 1 0.0 99.0 Year One 2 16.6 83.4 (0,1) FCIP 3 42.3 57.7 Year Two 4 79.4 20.6 (0,1) Overall 26.7 73.1 1 0.0 99.0 2 16.4 83.6 TCC-ST 3 41.9 58.1 4 78.8 21.2 Overall 26.5 73.4 Group Linking Method 176 1 1.0 0.0 0.0 0.0 0.2 1.0 0.0 0.0 0.0 0.2 1.0 0.0 0.0 0.0 0.2 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 51.8 48.2 50.4 49.6 57.6 42.4 42.5 57.5 0.0 99.4 17.3 82.7 43.5 56.5 79.4 20.6 27.4 72.5 0.0 99.8 17.3 82.7 43.5 56.5 78.8 21.2 27.4 72.6 1 0.6 0.0 0.0 0.0 0.1 0.2 0.0 0.0 0.0 0.0 Table 4b.7 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 99.0 2 14.9 85.0 Concurrent 3 25.1 74.9 4 43.6 56.4 Overall 17.7 82.1 1 0.0 100.0 Year One 2 51.0 49.0 (0,1) FCIP 3 72.0 28.0 Year Two 4 94.5 5.5 (0.2,1) Overall 52.3 47.7 1 0.0 100.0 2 51.0 49.0 TCC-ST 3 72.0 28.0 4 94.5 5.5 Overall 52.3 47.7 Group Linking Method 177 1 1.0 0.1 0.0 0.0 0.2 Items Dropped Est.Level - True Level -2 -1 0.0 0.0 0.0 71.8 0.1 64.2 0.0 72.7 0.0 56.5 0.0 51.3 72.3 96.4 52.6 0.0 52.3 72.2 94.5 52.9 0 100.0 28.2 35.7 27.3 43.5 100.0 48.7 27.7 3.6 47.4 100.0 47.7 27.8 5.5 47.1 1 Table 4b.7 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 96.8 2 8.3 91.2 Concurrent 3 17.7 82.2 4 40.0 60.0 Overall 12.1 87.1 1 0.0 76.4 Year One 2 0.3 92.0 (0,1) FCIP 3 2.9 97.1 Year Two 4 44.2 55.8 (-0.2,1) Overall 3.6 89.2 1 0.0 76.0 2 0.3 92.0 TCC-ST 3 2.8 97.2 4 43.6 56.4 Overall 3.6 89.2 Group Linking Method 178 1 3.2 0.5 0.1 0.0 0.8 23.6 7.7 0.0 0.0 7.2 24.0 7.7 0.0 0.0 7.3 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 34.0 66.0 34.9 65.0 37.6 62.4 28.6 71.4 0.0 77.3 0.3 92.6 3.6 96.4 44.8 55.2 3.9 89.3 0.0 78.1 0.3 92.3 3.1 96.9 43.6 56.4 3.7 89.5 1 0.0 0.0 0.1 0.0 0.0 22.7 7.0 0.0 0.0 6.8 21.9 7.4 0.0 0.0 6.8 Table 4b.8 Percentage in Each Performance Level Classification (N=3, Drift=b-0.2) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 98.9 2 14.0 86.0 Concurrent 3 25.9 74.1 4 47.9 52.1 Overall 17.8 82.0 1 0.0 99.8 Year One 2 17.8 82.2 (0,1) FCIP 3 44.8 55.2 Year Two 4 79.4 20.6 (0,1) Overall 28.1 71.9 1 0.0 99.8 2 18.2 81.8 TCC-ST 3 44.4 55.6 4 80.0 20.0 Overall 28.2 71.8 Group Linking Method 179 1 1.1 0.0 0.0 0.0 0.2 0.2 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 Items Dropped Est.Level - True Level -2 -1 0 0.0 0.0 100.0 0.0 84.7 15.3 0.4 66.9 32.7 0.0 73.9 26.1 0.1 62.7 37.2 0.0 99.8 17.0 83.0 41.8 58.2 80.0 20.0 26.7 73.3 0.0 99.8 17.3 82.7 41.9 58.1 78.8 21.2 26.8 73.1 1 0.2 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 Table 4b.8 (cont’d) Items Kept True Level Est.Level-2 1 2 Concurrent 3 4 Overall 1 Year One 2 (0,1) FCIP 3 Year Two 4 (0.2,1) Overall 1 2 TCC-ST 3 4 Overall Group Linking Method True Level -1 0.0 20.3 34.1 57.6 23.9 0.0 53.0 74.1 96.4 54.0 0.0 53.2 73.8 96.4 53.9 180 0 100.0 79.7 65.9 42.4 76.1 100.0 47.0 25.9 3.6 46.0 100.0 46.8 26.2 3.6 46.1 Items Dropped Est.Level - True Level 1 -2 -1 0.0 0.0 0.0 93.2 3.9 70.8 0.0 80.6 1.5 67.9 0.0 50.8 71.5 97.0 52.2 0.0 51.3 72.3 95.8 52.6 0 100.0 6.8 25.3 19.4 30.7 100.0 49.2 28.5 3.0 47.8 100.0 48.7 27.7 4.2 47.4 1 Table 4b.8 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 97.0 2 8.6 91.0 Concurrent 3 18.0 81.9 4 41.2 58.8 Overall 12.4 86.9 1 0.0 80.0 Year One 2 0.5 94.5 (0,1) FCIP 3 5.2 94.8 Year Two 4 53.9 46.1 (-0.2,1) Overall 5.1 89.4 1 0.0 80.8 2 0.8 93.9 TCC-ST 3 5.3 94.7 4 50.9 49.1 Overall 5.1 89.4 Group Linking Method 181 1 3.0 0.4 0.1 0.0 0.7 20.0 5.0 0.0 0.0 5.5 19.2 5.4 0.0 0.0 5.5 Items Dropped Est.Level - True Level -2 -1 0 0.0 0.0 100.0 0.0 74.8 25.2 0.1 62.2 37.7 0.0 63.6 36.4 0.0 56.4 43.6 0.0 77.9 22.1 0.6 92.2 3.8 96.2 49.1 50.9 4.3 88.9 0.0 79.8 0.4 92.7 3.9 96.1 44.2 55.8 4.1 89.7 1 7.2 0.0 0.0 6.7 20.2 6.9 0.0 0.0 6.3 Table 4b.9 Percentage in Each Performance Level Classification (N=3, Drift=b+0.4) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 99.0 2 14.5 85.5 Concurrent 3 27.1 72.9 4 52.7 47.3 Overall 18.7 81.1 1 0.0 99.0 Year One 2 14.5 85.5 (0,1) FCIP 3 40.9 59.1 Year Two 4 78.8 21.2 (0,1) Overall 25.3 74.5 1 0.0 99.0 2 14.7 85.3 TCC-ST 3 40.9 59.1 4 78.8 21.2 Overall 25.4 74.4 Group Linking Method 182 1 1.0 0.0 0.0 0.0 0.2 1.0 0.0 0.0 0.0 0.2 1.0 0.0 0.0 0.0 0.2 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 37.5 62.5 36.2 63.7 37.6 62.4 30.5 69.5 0.0 99.8 17.1 82.9 43.5 56.5 78.8 21.2 27.3 72.6 0.0 99.8 17.8 82.2 43.8 56.2 78.8 21.2 27.7 72.3 1 0.0 0.0 0.1 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 Table 4b.9 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 99.8 2 18.1 81.9 Concurrent 3 31.3 68.7 4 55.2 44.8 Overall 21.9 78.1 1 0.0 100.0 Year One 2 49.1 50.9 (0,1) FCIP 3 72.0 28.0 Year Two 4 95.2 4.8 (0.2,1) Overall 51.6 48.4 1 0.0 100.0 2 50.0 50.0 TCC-ST 3 71.6 28.4 4 94.5 5.5 Overall 51.7 48.3 Group Linking Method 183 1 0.2 0.0 0.0 0.0 0.0 Items Dropped Est.Level - True Level -2 -1 0.0 51.3 46.9 47.3 40.4 0.0 52.8 73.0 95.8 53.4 0.0 53.5 72.8 95.8 53.7 0 100.0 48.7 53.1 52.7 59.6 100.0 47.2 27.0 4.2 46.6 100.0 46.5 27.2 4.2 46.3 1 Table 4b.9 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 96.6 2 7.9 91.6 Concurrent 3 17.8 82.1 4 41.2 58.8 Overall 12.0 87.1 1 0.0 73.3 Year One 2 0.2 91.6 (0,1) FCIP 3 2.4 97.6 Year Two 4 43.6 56.4 (-0.2,1) Overall 3.4 88.7 1 0.0 73.5 2 0.1 91.2 TCC-ST 3 2.2 97.8 4 43.6 56.4 Overall 3.2 88.7 Group Linking Method 184 1 3.4 0.5 0.1 0.0 0.8 26.7 8.2 0.0 0.0 7.9 26.5 8.7 0.0 0.0 8.1 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 19.6 80.2 20.8 78.3 19.4 80.6 16.6 83.0 0.0 76.6 0.3 93.2 4.5 95.5 47.3 52.7 4.4 88.9 0.0 78.3 0.3 92.7 4.0 96.0 45.5 54.5 4.1 89.3 1 0.0 0.2 0.9 0.0 0.4 23.4 6.4 0.0 0.0 6.7 21.7 7.0 0.0 0.0 6.6 Table 4b.10 Percentage in Each Performance Level Classification (N=3, Drift=b-0.4) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 94.3 2 5.3 92.6 Concurrent 3 10.9 88.6 4 30.9 69.1 Overall 7.9 90.1 1 0.0 99.8 Year One 2 18.6 81.4 (0,1) FCIP 3 44.9 55.1 Year Two 4 79.4 20.6 (0,1) Overall 28.5 71.5 1 0.0 99.8 2 18.4 81.6 TCC-ST 3 44.9 55.1 4 80.0 20.0 Overall 28.4 71.5 Group Linking Method 185 1 5.7 2.1 0.4 0.0 2.0 0.2 0.0 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 56.4 43.6 43.8 55.7 33.9 66.1 40.6 59.2 0.0 99.2 16.1 83.9 42.1 57.9 79.4 20.6 26.4 73.4 0.0 99.0 16.5 83.5 41.3 58.7 78.8 21.2 26.3 73.6 1 0.0 0.0 0.4 0.0 0.2 0.8 0.0 0.0 0.0 0.1 1.0 0.0 0.0 0.0 0.2 Table 4b.10 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 94.5 2 5.3 91.7 Concurrent 3 9.2 90.0 4 21.8 78.2 Overall 6.7 90.8 1 0.0 100.0 Year One 2 53.7 46.3 (0,1) FCIP 3 73.5 26.5 Year Two 4 96.4 3.6 (0.2,1) Overall 54.0 46.0 1 0.0 100.0 2 53.4 46.6 TCC-ST 3 73.7 26.3 4 96.4 3.6 Overall 54.0 46.0 Group Linking Method 186 1 5.5 3.0 0.8 0.0 2.5 Items Dropped Est.Level - True Level -2 -1 0.0 65.0 51.4 43.0 47.4 0.0 49.0 72.2 95.8 51.6 0.0 50.1 72.2 95.8 52.1 0 100.0 35.0 48.6 57.0 52.6 100.0 51.0 27.8 4.2 48.4 100.0 49.9 27.8 4.2 47.9 1 Table 4b.10 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 97.1 2 9.0 90.5 Concurrent 3 18.0 81.9 4 38.8 61.2 Overall 12.4 86.9 1 0.0 82.3 Year One 2 1.0 93.7 (0,1) FCIP 3 4.9 95.1 Year Two 4 47.3 52.7 (-0.2,1) Overall 4.8 90.0 1 0.0 83.2 2 1.0 93.6 TCC-ST 3 5.2 94.8 4 46.7 53.3 Overall 4.9 90.0 Group Linking Method 187 1 2.9 0.4 0.1 0.0 0.7 17.7 5.3 0.0 0.0 5.2 16.8 5.4 0.0 0.0 5.1 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 40.7 59.3 30.4 67.7 12.1 87.9 28.2 71.1 0.0 75.8 0.4 92.1 3.5 96.5 44.8 55.2 3.9 88.9 0.0 77.9 0.3 92.2 3.3 96.7 43.6 56.4 3.8 89.4 1 0.0 0.0 1.9 0.0 0.7 24.2 7.5 0.0 0.0 7.2 22.1 7.5 0.0 0.0 6.8 Table 4b.11 Percentage in Each Performance Level Classification (N=8, Drift=b+0.2) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 99.0 2 15.2 84.8 Concurrent 3 27.3 72.7 4 51.5 48.5 Overall 19.1 80.8 1 0.0 98.9 Year One 2 14.9 85.1 (0,1) FCIP 3 39.6 60.4 Year Two 4 78.2 21.8 (0,1) Overall 25.0 74.8 1 0.0 99.0 2 14.9 85.1 TCC-ST 3 39.6 60.4 4 77.6 22.4 Overall 24.9 74.9 Group Linking Method 188 1 1.0 0.0 0.0 0.0 0.2 1.1 0.0 0.0 0.0 0.2 1.0 0.0 0.0 0.0 0.2 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 56.7 43.3 51.1 48.9 53.3 46.7 44.5 55.5 0.0 99.4 16.9 83.1 44.5 55.5 81.2 18.8 27.8 72.1 0.0 99.8 18.1 81.9 44.0 56.0 78.8 21.2 27.9 72.1 1 0.6 0.0 0.0 0.0 0.1 0.2 0.0 0.0 0.0 0.0 Table 4b.11 (cont’d) Items Kept True Level Est.Level-2 1 2 Concurrent 3 4 Overall 1 Year One 2 (0,1) FCIP 3 Year Two 4 (0.2,1) Overall 1 2 TCC-ST 3 4 Overall Group Linking Method True Level -1 0.0 22.5 36.2 57.6 25.6 0.0 48.2 69.6 93.9 50.3 0.0 48.6 69.9 93.9 50.5 189 0 100.0 77.5 63.8 42.4 74.4 100.0 51.8 30.4 6.1 49.7 100.0 51.4 30.1 6.1 49.5 Items Dropped Est.Level - True Level 1 -2 -1 0.0 0.0 0.0 72.4 0.1 61.9 0.0 66.7 0.0 55.5 0.0 53.4 72.6 97.0 53.6 0.0 53.4 72.8 95.8 53.6 0 100.0 27.6 38.0 33.3 44.4 100.0 46.6 27.4 3.0 46.4 100.0 46.6 27.2 4.2 46.4 1 Table 4b.11 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 96.6 2 8.0 91.5 Concurrent 3 17.3 82.6 4 40.0 60.0 Overall 11.8 87.3 1 0.0 74.9 Year One 2 0.3 90.7 (0,1) FCIP 3 2.2 97.8 Year Two 4 39.4 60.6 (-0.2,1) Overall 3.1 88.9 1 0.0 76.2 2 0.3 90.4 TCC-ST 3 2.2 97.8 4 38.2 61.8 Overall 3.0 89.1 Group Linking Method 190 1 3.4 0.5 0.1 0.0 0.8 25.1 9.0 0.1 0.0 8.0 23.8 9.3 0.1 0.0 7.9 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 39.7 60.3 38.3 61.6 38.2 61.8 32.1 67.8 0.0 77.3 0.8 92.8 4.3 95.7 49.7 50.3 4.7 88.8 0.0 82.3 0.9 92.8 4.2 95.8 43.6 56.4 4.3 90.1 1 0.0 0.0 0.1 0.0 0.0 22.7 6.4 0.0 0.0 6.5 17.7 6.3 0.0 0.0 5.6 Table 4b.12 Percentage in Each Performance Level Classification (N=8, Drift=b-0.2) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 99.0 2 15.9 84.1 Concurrent 3 28.2 71.8 4 53.3 46.7 Overall 19.8 80.1 1 0.0 99.8 Year One 2 18.9 81.1 (0,1) FCIP 3 47.7 52.3 Year Two 4 81.8 18.2 (0,1) Overall 29.8 70.2 1 0.0 100.0 2 19.3 80.7 TCC-ST 3 47.0 53.0 4 81.8 18.2 Overall 29.7 70.3 Group Linking Method 191 1 1.0 0.0 0.0 0.0 0.2 0.2 0.0 0.0 0.0 0.0 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 64.6 35.4 52.5 47.5 50.9 49.1 48.0 52.0 0.0 99.4 17.1 82.9 40.6 59.4 81.2 18.8 26.4 73.5 0.0 99.8 17.6 82.4 41.7 58.3 78.8 21.2 26.8 73.1 1 0.6 0.0 0.0 0.0 0.1 0.2 0.0 0.0 0.0 0.0 Table 4b.12 (cont’d) Items Kept True Level Est.Level-2 1 2 Concurrent 3 4 Overall 1 Year One 2 (0,1) FCIP 3 Year Two 4 (0.2,1) Overall 1 2 TCC-ST 3 4 Overall Group Linking Method True Level -1 0.0 23.5 37.0 59.4 26.4 0.0 55.1 74.6 96.4 55.0 0.0 55.9 74.3 95.8 55.2 192 0 100.0 76.5 63.0 40.6 73.6 100.0 44.9 25.4 3.6 45.0 100.0 44.1 25.7 4.2 44.8 Items Dropped Est.Level - True Level 1 -2 -1 0.0 0.0 0.0 78.1 0.2 61.6 0.0 63.0 0.1 57.5 0.0 49.7 70.9 97.0 51.5 0.0 51.4 71.7 94.5 52.3 0 100.0 21.9 38.3 37.0 42.5 100.0 50.3 29.1 3.0 48.5 100.0 48.6 28.3 5.5 47.7 1 Table 4b.12 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 97.0 2 9.0 90.7 Concurrent 3 18.9 81.0 4 40.0 60.0 Overall 12.8 86.5 1 0.0 81.0 Year One 2 0.8 93.8 (0,1) FCIP 3 5.3 94.7 Year Two 4 50.3 49.7 (-0.2,1) Overall 5.1 89.5 1 0.0 83.6 2 1.1 93.8 TCC-ST 3 5.4 94.6 4 46.7 53.3 Overall 5.0 90.1 Group Linking Method 193 1 3.0 0.3 0.1 0.0 0.7 19.0 5.4 0.0 0.0 5.5 16.4 5.1 0.0 0.0 4.9 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 49.8 50.2 40.3 59.5 33.9 66.1 36.7 63.2 0.0 76.8 0.4 91.4 3.0 97.0 47.9 52.1 3.9 88.8 0.0 79.6 0.4 92.0 3.0 97.0 41.2 58.8 3.5 89.8 1 0.0 0.0 0.2 0.0 0.1 23.2 8.2 0.0 0.0 7.3 20.4 7.6 0.1 0.0 6.6 Table 4b.13 Percentage in Each Performance Level Classification (N=8, Drift=b+0.4) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 98.9 2 14.2 85.8 Concurrent 3 27.1 72.9 4 54.5 45.5 Overall 18.7 81.1 1 0.0 97.9 Year One 2 10.1 89.9 (0,1) FCIP 3 34.6 65.4 Year Two 4 75.2 24.8 (0,1) Overall 21.0 78.6 1 0.0 98.5 2 12.1 87.9 TCC-ST 3 35.7 64.3 4 75.8 24.2 Overall 22.2 77.5 Group Linking Method 194 1 1.1 0.0 0.0 0.0 0.2 2.1 0.0 0.0 0.0 0.4 1.5 0.0 0.0 0.0 0.3 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 34.8 65.2 32.9 66.8 29.7 70.3 27.7 72.2 0.0 99.4 18.8 81.2 47.3 52.7 82.4 17.6 29.6 70.3 0.0 100.0 19.3 80.7 45.6 54.4 80.0 20.0 29.1 70.9 1 0.0 0.0 0.4 0.0 0.1 0.6 0.0 0.0 0.0 0.1 Table 4b.13 (cont’d) Items Kept True Level Est.Level-2 1 2 Concurrent 3 4 Overall 1 Year One 2 (0,1) FCIP 3 Year Two 4 (0.2,1) Overall 1 2 TCC-ST 3 4 Overall Group Linking Method True Level -1 0.0 22.9 37.5 60.6 26.4 0.0 42.4 67.1 94.5 47.0 0.0 44.0 68.0 93.9 48.0 195 0 100.0 77.1 62.5 39.4 73.6 100.0 57.6 32.9 5.5 53.0 100.0 56.0 32.0 6.1 52.0 Items Dropped Est.Level - True Level 1 -2 -1 0.0 52.4 46.8 44.8 40.7 0.0 55.6 74.6 96.4 55.2 0.0 56.3 74.6 96.4 55.4 0 100.0 47.6 53.2 55.2 59.3 100.0 44.4 25.4 3.6 44.8 100.0 43.7 25.4 3.6 44.6 1 Table 4b.13 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 96.2 2 7.4 92.1 Concurrent 3 16.9 83.0 4 41.8 58.2 Overall 11.5 87.6 1 0.0 66.7 Year One 2 0.0 85.8 (0,1) FCIP 3 0.7 99.2 Year Two 4 37.6 62.4 (-0.2,1) Overall 2.3 86.1 1 0.0 69.5 2 0.1 87.7 TCC-ST 3 1.1 98.8 4 37.6 62.4 Overall 2.5 87.3 Group Linking Method 196 1 3.8 0.5 0.1 0.0 0.9 33.3 14.2 0.1 0.0 11.5 30.5 12.2 0.1 0.0 10.2 Items Dropped Est.Level - True Level -2 -1 0 0.0 99.8 18.2 81.3 17.1 81.2 10.9 89.1 14.2 84.9 0.0 79.8 0.9 94.6 6.1 93.9 52.1 47.9 5.5 89.2 0.0 84.4 1.2 94.1 5.6 94.4 47.3 52.7 5.2 90.2 1 0.2 0.5 1.7 0.0 0.9 20.2 4.5 0.0 0.0 5.3 15.6 4.7 0.0 0.0 4.6 Table 4b.14 Percentage in Each Performance Level Classification (N=8, Drift=b-0.4) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 99.2 2 16.6 83.4 Concurrent 3 27.1 72.9 4 46.7 53.3 Overall 19.2 80.6 1 0.0 100.0 Year One 2 23.0 77.0 (0,1) FCIP 3 50.5 49.5 Year Two 4 81.2 18.8 (0,1) Overall 32.4 67.6 1 0.0 100.0 2 22.4 77.6 TCC-ST 3 50.2 49.8 4 81.8 18.2 Overall 32.1 67.9 Group Linking Method 197 1 0.8 0.0 0.0 0.0 0.1 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 57.5 42.5 42.7 56.8 29.7 70.3 40.4 59.4 0.0 99.2 15.4 84.6 42.2 57.8 79.4 20.6 26.2 73.7 0.0 99.2 16.8 83.2 41.1 58.9 78.8 21.2 26.3 73.6 1 0.0 0.0 0.4 0.0 0.2 0.8 0.0 0.0 0.0 0.1 0.8 0.0 0.0 0.0 0.1 Table 4b.14 (cont’d) Items Kept True Level Est.Level-2 1 2 Concurrent 3 4 Overall 1 Year One 2 (0,1) FCIP 3 Year Two 4 (0.2,1) Overall 1 2 TCC-ST 3 4 Overall Group Linking Method True Level -1 0.0 29.2 38.9 57.6 29.3 0.0 58.7 75.4 96.4 56.7 0.0 58.7 75.2 96.4 56.6 198 0 100.0 70.8 61.1 42.4 70.7 100.0 41.3 24.6 3.6 43.3 100.0 41.3 24.8 3.6 43.4 Items Dropped Est.Level - True Level 1 -2 -1 0.0 0.0 0.0 70.5 0.1 54.1 0.0 46.7 0.0 50.8 0.0 46.1 71.9 96.4 50.4 0.0 48.5 72.1 95.8 51.4 0 100.0 29.5 45.8 53.3 49.2 100.0 53.9 28.1 3.6 49.6 100.0 51.5 27.9 4.2 48.6 1 Table 4b.14 (cont’d) Items Kept True Level Est.Level- True Level -2 -1 0 1 0.0 96.6 2 8.1 91.1 Concurrent 3 15.5 84.4 4 36.4 63.6 Overall 11.0 88.1 1 0.0 89.1 Year One 2 2.2 94.5 (0,1) FCIP 3 8.4 91.6 Year Two 4 52.7 47.3 (-0.2,1) Overall 6.9 89.9 1 0.0 89.3 2 2.2 94.5 TCC-ST 3 8.6 91.4 4 53.3 46.7 Overall 7.0 89.8 Group Linking Method 199 1 3.4 0.8 0.1 0.0 0.9 10.9 3.4 0.0 0.0 3.2 10.7 3.4 0.0 0.0 3.2 Items Dropped Est.Level - True Level -2 -1 0 0.0 100.0 43.7 56.3 31.8 66.4 9.7 90.3 29.8 69.6 0.0 72.6 0.4 90.5 2.8 97.2 47.9 52.1 3.8 87.7 0.0 76.8 0.3 91.8 2.7 97.2 41.8 58.2 3.4 89.3 1 0.0 0.0 1.8 0.0 0.7 27.4 9.1 0.0 0.0 8.4 23.2 7.9 0.1 0.0 7.2 Chapter V: Conclusions, Implications and Future Research This chapter summarizes the findings from this simulation study. The practical implications of these findings are explored. The limitations of the study are discussed to provide directions for future research. 5.1 Conclusions The purpose of the simulation study was to investigate the impact of the drift in item parameters on the accuracy of θ estimation when forms administered to different populations are linked using a common item linking method. Several factors are manipulated in exploring the effect of drift – characteristics of the drift such as the number of the drifting items, the direction of the drift and the size of the drift, the linking method, the group ability difference and the handling of the drifting items. The accuracy of θ estimation (i.e. estimation of θ after the parameters are transformed onto the same scale) is examined at the individual θ level, through the average of mean difference between the estimates and the true values at intervals on the θ scale, and also from the perspective of the correct classification of the performance level associated with the true level of performance. Findings from the investigation are discussed below. First of all, for the conditions studied, the drift in item parameters does not affect the strong and positive relationship between the θ estimates and the true θs regardless of the characteristics of the drift, the group difference or the choice of linking methods. Second, in general, the characteristics of the drift have little influence on the performance of the FCIP or the TCC-ST method. Likewise, their impact on the performance of the Concurrent method is not significant, as long as the drifting items are kept in the linking process. However, if 200 the drifting items are removed from the linking process, the performance of the Concurrent method is likely to be affected by the number of drifting items, the direction of the drift, or the size of the drift. When the linking was conducted with the Concurrent method with deletion of the drifting items from the common item link, estimation of θs tends to be more accurate if fewer items are deleted, or if the drift is positive, or if the drift is large. Third, the difference in the ability of the groups involved in the linking plays an important role in the θ estimation. In this study, two groups are taking tests with common items in different administration years. θ estimates for the Year One group are to be placed onto the scale of the Year Two group (for FCIP and TCC-ST methods) or onto a common scale of Year One and Two (for the Concurrent method). In comparison with accuracy of θ estimation achieved when the two groups are equivalent in ability, when the mean ability of the Year One group is higher than that of the Year Two group, θ estimation is more accurate. When the mean ability of the Year One group is lower than the Year Two group, the estimation is less accurate. This is particularly tangible for the FCIP and TCC-ST methods. As for the Concurrent method, this relationship between the accuracy of θ estimates and the group ability can be observed if the drifting items are removed from the linking process. However, when the Concurrent method is used without removing the drifting items, the effect of the group difference is diminished. Fourth, when it comes to the choice of the linking method, both of the differences in group abilities and the way of handling the drifting items are to be considered to determine whether one method is better than the others. There is very little difference between the performance of the FCIP and TCC-ST methods, but the Concurrent method behaves in a different way from the other two methods. If the drifting items are kept in the linking process, the Concurrent method does a better job when the two groups are equal in ability or when the mean ability of the Year 201 Two group is higher than that of the Year One group. Otherwise, when the mean ability of the Year One group is higher than that of the Year Two group, there is no significant difference among the three linking methods. On the contrary, when the drifting items are dropped from the linking process, the Concurrent method does a worse job than the other two methods when the two groups are equal in ability or when the mean ability of the Year Two group is lower than that of the Year One group. However, when the mean ability of the Year One group is lower than that of the Year One group, there is no substantial difference between the three linking methods, because in which case, all of the linking methods are doing a poor job in θ estimation. The next conclusion is the answer to one of the main research questions: does removing or keeping drifting items make a difference? The answer is: It depends. The way the drifting items are handled is of little importance to the FCIP or the TCC-ST method, but it plays a crucial role on the performance of the Concurrent method. On average, when the drifting items are dropped from the linking process, θ estimation becomes less accurate. Similar conclusions can be drawn when the performances of the linking methods are examined through the percentage of examinees classified into the appropriate performance levels. When the drifting items are dropped from the linking process, the Concurrent method does a worse job in classifying θs into the correct levels. There is a drop in both of the overall percentage of correct classification and the percentage of correct classifications of examinees at the Level 3 (considered as passing level in this study), if the drifting items are removed from the linking process. However, the influence of removing the drifting items differs for estimation of θs at different locations of the scale. For θs at the two ends of the scale, accuracy of estimation improves if the drifting items are removed from the linking. On the contrary, for θs at the middle part of the scale, estimation is more accurate if the drifting items are kept in the linking process. 202 5.2 Implications The investigation of the effect of drift is intended to shed some light on what strategies to be adopted to achieve accurate θ estimations when item parameter drift is present in practice. The findings of this study suggest that it is reasonable to keep the drifting items in the linking. For one thing, whether to keep or drop the drifting items has little influence on the performance of the FCIP and TCC-ST methods. For another, the Concurrent method is sensitive to the dropping of items. Removing dropped items is likely to affect the performance of the Concurrent method. If the drifting items are to be removed, the use of the Concurrent method is limited. Therefore, keeping drifting items in the set of the common items enables a bigger pool of linking methods to choose from. However, in extreme cases when the examinees are of extremely low or high ability, it is better to drop the drifting items if the Concurrent method is to be used. Furthermore, the findings of this study are helpful in explaining the inconsistent conclusions in research on linking methods. Much research has been done to explore which linking method gives more reliable results. However, the conclusions about the performance of the Concurrent method from some research are contrary to those found in other research. One possible reason might be that the inconsistency in conclusions is the result of the effect of item parameter drift on the performance of the Concurrent method. The Concurrent method is sensitive to many aspects of the item parameter drift. If item parameter drift exists in the linking, the findings on the Concurrent method are likely to be inconclusive. There are many things to consider when it comes to selecting the appropriate linking method, such as cost, time, and consistency of the choice of linking method if a longitudinal study is involved. The findings of this study suggest that one more thing to take into consideration is the effect of item parameter drift. If item parameter drift is suspected but the characteristics of the 203 drift are not clear, it is more conservative to resort to the use of FCIP or TCC-ST method, since these two methods are less likely to be affected by the item parameter drift. But if the drift has been analyzed and it is found to be the kind of drifting condition that may have little effect on the performance of the Concurrent method, the Concurrent method is preferred, because the Concurrent method has the potential to yield more accurate estimates, if the confounding factors are sorted out. In this simulation study, the existence of item parameter drift and the properties of the drift are known. However, in practice, it is uncertain whether there is true shift in distribution or whether the correct items are detected as having drift. From the perspective of the uncertainty of item parameter drift, it is advisable that FCIP or TCC-ST method be selected. Nevertheless, the Concurrent method does give more accurate results in certain conditions. Therefore, it is helpful to run the Concurrent method in company with the selected FCIP or TCC-ST method. Big discrepancy between the results from the Concurrent method and FCIP or TCC-ST method would be an indication that further investigation is needed as to group differences and item parameter drift. Moreover, the study indicates that the percentage of correct classification of performance level has the potential of being an important criterion in the research of estimation accuracy and it is encouraged that researchers use the percentage of correct classification in evaluating results in addition to other criteria. In practice, in terms of teacher or school performance, educational policy makers, administrators and educators pay attention to the progress of students in total and they would like to know the percentage of students achieving at different levels. Therefore, the percent correct criterion is a straightforward indicator and helps them to interpret the research results. 204 5.3 Limitations and Future Directions As with any simulation study, there are limitations with this investigation. The simulation in this study is just a simplified version of the real testing setting. In practice, there are more confounding factors contributing to the accuracy of θ estimation. Future research may attempt to include more variables that may have impact on the item parameter drift. The drift conditions simulated in this study are simplified. In any particular condition, the drift occurs in either a-parameter or b-parameter, the items are drifting in one direction and the size of the drift is fixed. Variations in the size of the drift, the direction of the drift and type of the parameter are thus examined in separate drifting conditions. But in reality, the drift is more complicated. In future research, compound drift can be simulated by combining two or more drifting conditions in one set of common items. It is expected that the effect of the compound drift will be more difficult to interpret, but the results from the study of the simple drifts will be helpful in the interpretation of the compound drift. While simulating the item parameter drift, it may be meaningful to check whether the drift is identifiable. In the simulation study, the identification of the drift is not a problem. However, in practice, it is likely that some drifts exist unnoticed while some items are falsely identified as drifting. Yet, the parameter drift has to be identified first before decisions can be made on how to deal with the drifting items. Without identification of the drift, some findings that involve the dropping of the drifting items can be practically meaningless. In addition, the identification of drift can be an inspiring perspective to investigate the effect of parameter drift. Further research on the power of detection of item parameter drift of different amounts will be helpful. Some factors have been studied in the research of linking methods and have been found to be contributing to the performance of the linking methods. Similarly, these factors can be 205 explored in the investigation of the effect of the item parameter drift, such as the location of the drifting items, the length of the test, the number of common items, and other characteristics of the items that may have an impact on the functioning of the drifting items. The effect of dropping the drifted items on the performance of the Concurrent method is noticeable in this study. Further research may explore more explanations behind it. Items included in this study are short constructed response items that are scored as correct or incorrect. A modified 3-parameter model with guessing parameter set at 0.2 is used with these dichotomous items. In practice, a test form may consist of more than one item type. There can be open ended questions with more than two possible score levels along with the multiple choice items. More types of items can be included in the future study. Polytomous items with more than two possible score levels can be included in the simulation and analyzed with a generalized partial credit model. 206 APPENDIX 207 Table 6 Population Item Parameters Used for Simulations Unique Parameters Common Parameters Items a b c Items a b c 0.74678 0.2 1 0.72254 1 0.57468 -1.44364 0.2 0.58672 0.2 0.58391 0.2 2 0.56372 2 0.53469 0.31765 0.2 3 0.56018 3 0.33565 -1.16553 0.2 4 0.57526 -1.06793 0.2 4 0.61154 -2.14948 0.2 0.18754 0.2 0.41077 0.2 5 0.45493 5 0.55166 0.51235 0.2 6 0.65534 -1.25667 0.2 6 0.71273 7 0.30423 -1.17147 0.2 7 0.43509 -2.04783 0.2 1.75506 0.2 8 0.32111 8 0.26379 -0.89548 0.2 -1.4499 0.2 9 0.36325 9 0.49211 -0.49774 0.2 0.23493 0.2 -0.9313 0.2 10 0.54163 10 0.76061 1.36101 0.2 11 0.67998 -0.34346 0.2 11 1.14946 0.7498 1.24709 0.2 12 12 0.75617 -0.20063 0.2 0.71809 0.2 1.78164 0.2 13 0.54751 13 0.38295 1.5273 0.2 14 0.35562 -1.76456 0.2 14 0.26207 0.4562 -1.26656 0.2 15 15 0.35157 -1.76583 0.2 0.0811 0.2 16 0.74717 16 0.52074 -1.50204 0.2 -0.7611 0.2 1.54113 0.2 17 0.65129 17 0.34359 0.01191 0.2 1.07221 0.2 18 0.79095 18 0.46569 0.3527 -0.86567 0.2 1.46023 0.2 19 19 0.94583 20 0.70557 -1.13245 0.2 20 0.38408 -1.52826 0.2 1.08491 0.2 21 0.26988 -1.42414 0.2 21 0.56123 0.21671 0.2 1.79836 0.2 22 0.88485 22 0.35709 0.24138 0.2 23 0.69931 23 0.63443 -1.12987 0.2 24 0.46743 -1.55173 0.2 24 0.81237 -1.29526 0.2 0.16086 0.2 25 0.32602 -0.94629 0.2 25 0.36358 0.68469 -0.01511 0.2 0.59573 1.31145 0.2 26 26 27 0.58506 -0.26278 0.2 27 0.65013 -0.43379 0.2 0.86041 0.2 1.13526 0.2 28 0.83568 28 0.69662 0.15776 0.2 0.6089 -0.30114 0.2 29 0.61425 29 1.32624 0.2 0.7615 0.94246 0.2 30 0.39992 30 -0.0201 0.2 average 0.56220 -0.21968 0.2 average 0.56254 208 BIBLIOGRAPHY 209 BIBLIOGRAPHY Baker, F. B., & Al-Karni, A. (1991). A comparison of two procedures for computing IRT equating coefficients. Journal of Educational Measurement, 28(2), 147-162. Bock, R. D., Muraki, E., & Pfeiffenberger, W. (1988). Item pool maintenance in the presence of item parameter drift. Journal of Educational Measurement, 25(4), 275-285. Chan, K.-Y., Drasgow, F., & Sawin, L. L. (1999). What is the shelf life of a test? The effect of time on the psychometrics of a cognitive ability test battery. Journal of Applied Psychology, 84(4), 610-619. Cook, L. L., & Eignor, D. R. (1991). IRT equating methods. Educational Measurement: Issues and Practices, 10(3), 37-45. Cook, L. L., Eignor, D. R., Taft, H. L. (1988). A comparative study of the effects of recency of instruction on the stability of IRT and conventional item parameter estimates. Journal of Educational Measurement, 25(1), 31-45. DeMars, C. E. (2004). Detection of item parameter drift over multiple test administrations. Applied Measurement in Education, 17(3), 265-300. Donoghue, J. R., & Isham, S. P. (1998). A comparison of procedures to detect item parameter drift. Applied Psychological Measurement, 22(1), 33-51. Eignor, D. R. (1985). An investigation of the feasibility and practical outcomes of preequating the SAT verbal and mathematical sections (Research Report 85-10). Princeton, NJ: Educational Testing Service. Giordano, C., Subhiyah, R., & Hess, B. (2005, April). An analysis of item exposure and item parameter drift on a take-home recertification exam. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal. Goldstein, H. (1983). Measuring changes in educational attainment over time: problems and possibilities. Journal of Educational Measurement, 20(4), 369-377. Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer. Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park: Sage. Hanson, B. A., & Beguin, A. A. (1999). Separate versus concurrent estimation of IRT item parameters in the common items equating design (ACT Research Report Series 99-8). Iowa City, IA: American College Testing. 210 Hanson, B. A., & Beguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26(1), 3-24. Hanson, B., Zeng, L., & Cui, Z. (2004). ST: A Computer Program for IRT Scale Transformation (Version 1.1) [Computer Program]. Iowa City, IA. Holland, P. W., & Dorans, N. J. (2006). Linking and equating. In R. L. Brennan (Ed.), Educational Measurement (4th ed., pp. 187-220). Westport, CT: Praeger Publishers. Holland, P.W., & Wainer, H. (1993). Differential Item Functioning. Hillsdale, NJ: Lawrence Erlbaum Associates. Jodoin, M. G., Keller, L. A., & Swaminathan, H. (2003). A comparison of linear, fixed common item, and concurrent parameter estimation equating procedures in capturing academic growth. The Journal of Experimental Education, 71(3), 229-250. Kim, S.-H., & Cohen, A. S. (1991). A comparison of two area measures for detecting differential item functioning. Applied Psychological Measurement, 15(3), 269-278. Kim, S.-H., & Cohen, A. S. (1998). A comparison of linking and concurrent calibration under item response theory. Applied Psychological Measurement, 22(2), 131-143. Kim, S.-H., Cohen, A. S., & Park, T.-H. (1995). Detection of differential item functioning in multiple groups. Journal of Educational Measurement, 32(3), 261-276. Kolen, M. J. (2006). Scaling and Norming. In R. L. Brennan (Ed.), Educational measurement (4th ed., p.163). Westport, CT: Praeger. Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices. New York: Springer. Lee, W.-C., & Ban, J.-C. (2007). Comparision of three IRT linking procedures in the random groups equating design (Research Report). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, College of Education, University of Iowa. Li, Y. H., Griffith, W. D., & Tam, H. P. (1997). Equating multiple tests via an IRT linking design: Utilizing a single set of anchor items with fixed common item parameters during the calibration process. Paper presented at the annual meeting of the Psychometric Society, Knoxville, TN. Li, Y. H., Tam, H. P., & Tompkins, L. J. (2004). A comparison of using the fixed common-precalibrated parameter method and the matched characteristic curve method for linking multiple-test items. International Journal of Testing, 4(3), 267-293. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc. 211 Miller, G. E., & Fitzpatrick, S. J. (2009). Expected equating error resulting from incorrect handling of item parameter drift among the common items. Educational and Psychological Measurement, 69(3), 357-368. Muraki, E., & Bock, R. D. (2003). PARSCALE 4 (Version 4.1) [Computer Program]. Chicago: Scientific Software International, Inc. Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53(4), 495-502. Rupp, A., & Zumbo, B. (2003, April). Bias coefficients for lack of invariance in unidimentional IRT models. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA. Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7(2), 201-210. Stone, C. A., & Lane, S. (1991). Use of restricted item response theory models for examining the stability of item parameter estimates over time. Applied Measurement in Education, 4(2), 125-141. Sykes, R. C., & Fitzpatrick, A. R. (1992). The stability of IRT b values. Journal of Educational Measurement, 29(3), 201-211. Veerkamp, W. J. J., & Glas, C. A. W. (2000). Detection of known items in adaptive testing with a statistical quality control method. Journal of Educational and Behavioral Statistics, 25(4), 373-389. Wells, C. S., Subkoviak, M. J., & Serlin, R. C. (2002). The effect of item parameter drift on examinee ability estimates. Applied Psychological Measurement, 26(1), 77-87. Witt, E. A., Stahl, J. A., Bergstrom, B. A., & Muckle, T. (2003, April). Impact of item drift with non-normal distributions. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL. Wollack, J. A., Sung, H. J., & Kang, T. (2005, April). Longitudinal effects of item parameter drift. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal. Wollack, J. A., Sung, H. J., & Kang, T. (2006, April). The impact of compounding item parameter drift on ability estimation. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA. Yen, W. M., & Fitzpatrick, A. R. (2006). Item Response Theory. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 111-154). Westport, CT: Praeger. Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG 3.0 [computer software]. Lincolnwood, IL: Scientific Software International. 212