This is to certify that the dissertation entitled

The Effect of Weighting in Kernel Equating Using Counter-Balanced Designs

presented by

Yanxuan Qu

has been accepted towards fulfillment of the requirements for the Ph.D. degree in Counseling, Educational Psychology, and Special Education.

MSU is an Affirmative Action/Equal Opportunity Institution

THE EFFECT OF WEIGHTING IN KERNEL EQUATING USING COUNTER-BALANCED DESIGNS

By

Yanxuan Qu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

2007

ABSTRACT

The Counter-Balanced (CB) design for test equating is often used in pilot studies for testing programs when sample size is limited. When a CB design is used to conduct equating, the data are usually treated as if they came from an Equivalent Group design or a Single Group design (Kolen & Brennan, 2004). In contrast, von Davier, Holland, and Thayer (2004) proposed a new approach under the Kernel Equating (KE) framework, which treats the data as a weighted, synthesized mixture of data from the two groups. This new approach is called the two independent Single Group approach (2SG approach). This study investigates the performance of the 2SG approach in comparison to other data treatment approaches under different sample sizes and order effect situations.
Both linear and equipercentile equating methods under the KE and traditional equating frameworks were applied to two real datasets and six simulated datasets. The results from traditional equipercentile equating on each simulated population were used as the benchmark against which all the other equating methods were compared. Standard Error of Equating (SEE), Root Mean Square Error (RMSE), equating bias, and Standard Error of Equating Difference (SEED) were reported for each equating of the simulated data. The Standard Error of Equating and Root Mean Square Error were reported for equating of the real data samples. The results indicated that the 2SG approach unifies the Equivalent Group approach and the Single Group approach within its flexible framework. The weighting mechanism in the 2SG approach seemed to be sensitive to different order effects. Possible criteria for selecting the best weights are discussed.

DEDICATION

To my dear parents, my husband, and my little brother

ACKNOWLEDGEMENTS

This dissertation was completed with the help of many people. First, I am deeply indebted to Professor Mark D. Reckase for his guidance and encouragement in this dissertation work. Without his constant and unconditional support, this work would not have been possible. I learned from him not only his knowledge, but also his dedication to work and his peaceful and respectful attitude to people. I would like to thank Dr. Alina von Davier for her generous help and guidance. She is enthusiastic, upbeat, and proactive. Thanks also go to Dr. Richard Houang and Dr. Sharif Shakrani for their insightful comments on this dissertation; to Dr. Ning Han and Dr. Henry Chen for their assistance with the KE software; and to Dr. Linda Chard for her assistance in editing the early version of my dissertation. I am also very grateful to Dr. Betsy Becker, Dr. Mary Kennedy, and Dr. Edward Wolfe. Working with them on different projects broadened my scope of knowledge.
Their financial support allowed me to concentrate on my studies, and they made me feel a family-like atmosphere. Finally, my deep gratitude goes to my husband Lixiong Gu for his love and support, and to my parents and my brother for their understanding and encouragement.

TABLE OF CONTENTS

LIST OF FIGURES
NOTATION
CHAPTER I: INTRODUCTION
  1.1 EQUATING PROCEDURE IN GENERAL
  1.2 COUNTER-BALANCED DESIGN AND EQUATING
  1.3 LITERATURE REVIEW
  1.4 RESEARCH QUESTIONS
  1.5 RESEARCH EXPECTATIONS
CHAPTER II: THEORETICAL FRAMEWORK
  2.1 COUNTER-BALANCED DESIGN
  2.2 EQUATING USING COUNTER-BALANCED DESIGNS
    2.2.1 Approaches to Treating Data in a CB Design
    2.2.2 Equating Methods for a CB Design
  2.3 EQUATING WITH A CB DESIGN UNDER THE KERNEL EQUATING FRAMEWORK
    2.3.1 Step 1. Log-linear Pre-smoothing
    2.3.2 Step 2. Estimating Score Probabilities on the Target Population
    2.3.3 Step 3. Continuization
    2.3.4 Step 4. Equating
    2.3.5 Step 5. Calculating Standard Error of Equating (SEE) and Standard Error of Equating Difference (SEED)
  2.4 EQUATING ERROR
  2.5 EVALUATING THE RESULTS OF EQUATING
    2.5.1 Standard Error of Equating
    2.5.2 Root Mean Squared Deviation (RMSD)
    2.5.3 Equating Bias
    2.5.4 Root Mean Square Error
    2.5.5 Standard Error of Equating Difference
CHAPTER III: METHODS
  3.1 QUANTIFICATION OF DIFFERENTIAL ORDER EFFECT
  3.2 DATA
    3.2.1 Real Data
    3.2.2 Simulated Data
  3.3 ANALYSIS
    3.3.1 Equating Methods Applied for Simulated Data
    3.3.2 Procedure for Estimating Empirical SEE for Simulated Data
    3.3.3 Evaluating Equating Results from Simulated Data
CHAPTER IV: RESULTS
  4.1 REAL DATA 1
    4.1.1 Selecting the Best Equating Function Using RMSE
    4.1.2 Selecting the Best Equating Function Using SEED
  4.2 REAL DATA 2
    4.2.1 Selecting the Best Equating Function Using RMSE
    4.2.2 Selecting the Best Equating Function Using SEED
  4.3 SIMULATED DATA
    4.3.1 Model Fit
    4.3.2 Evaluating the Equating Results by RMSE
    4.3.3 Evaluating the Equating Results by SEED
CHAPTER V: DISCUSSION
  5.1 PERFORMANCE OF THE KE METHODS
  5.2 EFFECTS OF THE WEIGHTING METHOD
  5.3 LIMITATIONS OF THIS STUDY
    5.3.1 Arbitrary Nature of the Equating Criterion
    5.3.2 Problem with Simulated Data
  5.4 FUTURE STUDY
APPENDICES
REFERENCES

LIST OF TABLES

TABLE 1. Equivalent-Groups design
TABLE 2. Single-Group design
TABLE 3. Counter-Balanced design
TABLE 4. Ways of treating data in a CB design appearing in the literature
TABLE 5. KE methods and corresponding traditional equating methods
TABLE 6. All equating methods compared in this study for simulated data
TABLE 7. Summary statistics for real data 1
TABLE 8. Summary statistics for real data 2
TABLE 9. Descriptive statistics for simulated data 1
TABLE 10. Descriptive statistics for simulated data 2
TABLE 11. Descriptive statistics for simulated data 3
TABLE 12. Descriptive statistics for simulated data 4
TABLE 13. Descriptive statistics for simulated data 5
TABLE 14. Descriptive statistics for simulated data 6
TABLE 15. Evaluation of equating results from real data 1
TABLE 16. Evaluation of equating results from real data 2
TABLE 17. Summary statistics for POP1 linear equating methods
TABLE 18. Summary statistics for POP1 equipercentile equating methods
TABLE 19. Summary statistics for POP2 linear equating methods
TABLE 20. Summary statistics for POP2 equipercentile equating methods
TABLE 21. Summary statistics for POP3 linear equating methods
TABLE 22. Summary statistics for POP3 equipercentile equating methods
TABLE 23. Summary statistics for POP4 linear equating methods
TABLE 24. Summary statistics for POP4 equipercentile equating methods
TABLE 25. Summary statistics for POP5 linear equating methods
TABLE 26. Summary statistics for POP5 equipercentile equating methods
TABLE 27. Summary statistics for POP6 linear equating methods
TABLE 28. Summary statistics for POP6 equipercentile equating methods
TABLE 29. Selected equating function based on SEED
TABLE 30. Selected equating function based on RMSE
TABLE A1. Standard error of linear equating for real data 1
TABLE A2. Standard error of equipercentile equating for real data 1
TABLE A3. Standard error of linear equating for real data 2
TABLE A4. Standard error of equipercentile equating for real data 2

LIST OF FIGURES

FIGURE 1. Observed score distributions for X1 and Y1 in real data 1.
FIGURE 2. Observed score distributions for X2 and Y2 in real data 1.
FIGURE 3. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear and the ±2SEED confidence interval band around zero line, real data 1.
FIGURE 4. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile and the ±2SEED confidence interval band around zero line, real data 1.
FIGURE 5. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile and the ±2SEED confidence interval band around zero line, real data 1.
FIGURE 6. Observed score distributions for X1 and Y1 in real data 2.
FIGURE 7. Observed score distributions for X2 and Y2 in real data 2.
FIGURE 8. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear and the ±2SEED confidence interval band around zero line, real data 2.
FIGURE 9. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile and the ±2SEED confidence interval band around zero line, real data 2.
FIGURE 10. Equating difference between 2SG(1, 1) linear and 2SG(1, 1) equipercentile, and the ±2SEED confidence interval band around zero line, real data 2.
FIGURE 11. One example of Freeman-Tukey residual plot for POP3.
FIGURE 12. Equating differences and the ±2SEED band for simulated data 1.
FIGURE A1. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear, POP1, n=1000.
FIGURE A2. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile, POP1, n=1000.
FIGURE A3. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile, POP1, n=1000.
FIGURE A4. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear, POP2, n=1000.
FIGURE A5. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile, POP2, n=1000.
FIGURE A6. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile, POP2, n=1000.
FIGURE A7. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear, POP3, n=1000.
FIGURE A8. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile, POP3, n=1000.
FIGURE A9. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile, POP3, n=1000.
FIGURE A10. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear, POP4, n=1000.
FIGURE A11. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile, POP4, n=1000.
FIGURE A12. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile, POP4, n=1000.
FIGURE A13. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear, POP5, n=500.
FIGURE A14. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile, POP5, n=500.
FIGURE A15. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile, POP5, n=500.
FIGURE A16. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear, POP5, n=1000.
FIGURE A17. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile, POP5, n=1000.
FIGURE A18. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile, POP5, n=1000.
FIGURE A19. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear, POP6, n=300.
FIGURE A20. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile, POP6, n=300.
FIGURE A21. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile, POP6, n=300.
FIGURE A22. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear, POP6, n=500.
FIGURE A23. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile, POP6, n=500.
FIGURE A24. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile, POP6, n=500.
FIGURE A25. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear, POP6, n=1000.
FIGURE A26. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile, POP6, n=1000.
FIGURE A27. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile, POP6, n=1000.
FIGURE A28. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP1, n=50.
FIGURE A29. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP1, n=100.
FIGURE A30. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP4, n=50.
FIGURE A31. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP4, n=100.
FIGURE A32. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP4, n=300.
FIGURE A33. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP6, n=50.
FIGURE A34. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP6, n=1000.
NOTATION

Symbol          Explanation
X, Y            Names of two test forms to be equated
X, Y            Scores on X and Y, random variables
P               Population of examinees
T               Target population of examinees on which the equating of X and Y takes place
CB              Counter-Balanced data collection design
EG              Equivalent Group design
SG              Single Group design
DF              Design Function
X1              Test X that is taken first
X2              Test X that is taken second
Y1              Test Y that is taken first
Y2              Test Y that is taken second
F(x)            Cumulative distribution of variable X
G(y)            Cumulative distribution of variable Y
J               Number of possible X scores
K               Number of possible Y scores
x_j             A possible score value for X; j is from 1 to J
y_k             A possible score value for Y; k is from 1 to K
R               Generic symbol for the population probability of X after pre-smoothing for all designs
S               Generic symbol for the population probability of Y after pre-smoothing for all designs
r               Estimated probabilities on target population T, transformed by DF from R into r
s               Estimated probabilities on target population T, transformed by DF from S into s
ê_Y(x)          Estimated score x on Form X equated to Form Y
ê_X(y)          Estimated score y on Form Y equated to Form X
r̂_j             An estimated specific value of r
ŝ_k             An estimated specific value of s
p̂_j             Estimated probability of getting a score x_j on X
p̂_k             Estimated probability of getting a score y_k on Y
p_jk            Estimated joint probability of getting a score x_j on X and a score y_k on Y over the target population, T
p̂(12)jk         Estimated population probability of getting a score x_j on test X1, which is taken first, and a score y_k on test Y2, which is taken second
p̂(21)jk         Estimated population probability of getting a score x_j on test X2, which is taken second, and a score y_k on test Y1, which is taken first
h_X, h_Y        Bandwidths used to define the KE continuizations of F(x) and G(y). They are positive numbers. Large values of the bandwidths lead to linear equating, while smaller values give more "equipercentile-like" equating functions.
X(h_X)          Continuized random variable for scores on Form X
Y(h_Y)          Continuized random variable for scores on Form Y
J_eY(r̂, ŝ)      Jacobian matrix of the KE function, which is a function of r̂ and ŝ
J_DF(R̂, Ŝ)      Jacobian matrix of the design function, which is a function of R̂ and Ŝ

Chapter I: Introduction

Test equating is an important statistical procedure in educational testing. It is used to produce scores that are comparable across different but parallel test forms, both within a year and across years. Although many comparative studies have investigated the accuracy of different equating methods, very few have examined equating with a Counter-Balanced (CB) design. Traditionally, as in Lord (1950), Angoff (1971), and Kolen and Brennan (2004), data collected by a CB design were either pooled together and treated as a Single Group (SG) design or partly discarded and treated as an Equivalent Group (EG) design. Recently, a new approach to treating data collected by a CB design was proposed by von Davier, Holland, and Thayer (2004). This new approach involves weighting the data before pooling them together. To evaluate the performance of this new approach, this study compared the overall equating accuracy of the two independent single group approach, abbreviated as the 2SG approach, with that of the other approaches to treating data collected by a CB design.

The rest of this chapter introduces the general procedure for equating using the counter-balanced design and the equating approaches for a CB design, including the new 2SG approach under the Kernel Equating (KE) framework, and gives a brief summary of the literature on KE. At the end of this chapter, the research questions and research expectations of this study are presented. Chapter II describes the CB design and the KE framework as well as equating errors and the evaluation of equating results.
Chapter III describes the real and simulated datasets to which the equating methods were applied and the procedure of this study. Chapter IV presents the study results, and Chapter V discusses the findings and limitations of this study.

1.1 Equating Procedure in General

Every equating procedure consists of two basic components: an equating design and an equating method. Typical equating designs include the Equivalent Group (also called random group) design, the Single Group design, the Counter-Balanced design, and the Non-Equivalent Anchor Test (NEAT) design. Typical equating methods can be classified into the following three categories: 1) classical observed score equating; 2) Item Response Theory (IRT) true score equating; and 3) IRT observed score equating. Classical observed score equating methods include the mean, linear, and equipercentile equating methods reported by Kolen (1988). They define the score correspondence between two forms by setting certain characteristics of the observed score distributions equal for a specified group of examinees. IRT true score equating defines the score correspondence by setting the true scores of examinees to be equal (Cook & Eignor, 1991).

1.2 Counter-Balanced Design and Equating

Counterbalancing, or the Latin square, is often used in pure experimental designs to cancel out order effects (Montgomery, 2000). In educational testing, a CB design is often used to collect data in pilot studies of testing programs. In a CB design, two independent groups of examinees take two parallel test forms, X and Y, in different orders. Various ways of dealing with data from a CB design in test equating were described in Lord (1950), Angoff (1971), and Kolen and Brennan (2004). None of these approaches is satisfactory when the order effects cannot be cancelled out.
In order to improve equating practice for a CB design, especially when order effects cannot be cancelled out, von Davier, Holland, and Thayer (2004) proposed a new way of treating data collected by a CB design under their Kernel Equating framework. This new way of treating data is named the two independent single group approach (2SG approach). It creates a synthetic target group by assigning different weights to the tests taken in different orders, and applies linear and equipercentile equating methods to the synthetic group. The significance of this approach lies in its weighting mechanism, which has the potential to provide optimal equating results with the smallest equating error by using as much of the information in the data as possible. However, the effectiveness of the 2SG approach has not been evaluated.

The 2SG approach, the EG approach, and the SG approach all concern the data collection design component of an equating procedure. The 2SG approach belongs to the Kernel Equating framework, so the equating methods related to it are the KE linear and KE equipercentile equating methods. The EG approach and the SG approach can be implemented under both the KE and the traditional equating frameworks. Therefore, the equating methods related to these two approaches are the KE linear, KE equipercentile, traditional linear, and traditional equipercentile equating methods (see more details in Chapter II).

1.3 Literature Review

Descriptions of equating using a CB design can be found in Lord (1950), Angoff (1971), Kolen and Brennan (2004), Zeng and Cope (1995), and von Davier, Holland, and Thayer (2004). The 2SG approach to treating data collected by a CB design was introduced in von Davier, Holland, and Thayer (2004). The only study that has compared the performance of the 2SG approach with the EG and SG approaches in improving the accuracy of CB design equating was conducted by Qu and von Davier (2006).
They compared the 2SG approach to the SG and EG approaches under the KE framework using a real dataset collected by a CB design. They found that, when the order effects can be cancelled out, the 2SG approach with equal weights produces equating results similar to those of the SG approach under the KE framework. It is still unclear how the 2SG approach performs when order effects cannot be cancelled out. Moreover, it is not well documented in the literature how to test whether the order effects can or cannot be cancelled out.

The 2SG approach is carried out under the KE framework. KE is a unified approach to test equating based on a flexible family of equipercentile-like equating functions that contain the linear equating function as a special case. It belongs to the category of classical observed score equating. Studies comparing the KE methods with other equating procedures concluded that the KE procedure can improve on or approximate the equating results of the corresponding traditional equating methods.

Livingston (1993a) compared KE methods with traditional linear and equipercentile equating methods using small samples collected by a NEAT design. He evaluated the equating methods in terms of random equating error and equating bias and found that the KE methods with log-linear smoothing provided more accurate equating results than traditional equating methods without smoothing. He also found that, compared to the empirical standard error of equating, the analytic standard error of equating calculated by the delta method is larger in the lower and higher score ranges when the sample size is less than 200.

Mao and von Davier (2005) compared Kernel Equating methods with their corresponding traditional equating methods using real data from a NEAT design and an EG design. For the NEAT design, they compared traditional frequency estimation equipercentile equating with the KE post-stratification equating method, and the Tucker method with the KE linear post-stratification equating method.
They found that the KE methods and their corresponding traditional equating methods produce very similar equating results. Von Davier, Holland, and others (2005) did a similar study using pseudo-test data with a NEAT design and drew the same conclusion. Han, Li, and Hambleton (2005) compared KE with IRT true score equating methods using data collected by a NEAT design. Again, they found that the KE methods provide equating results similar to those of the IRT equating methods.

1.4 Research Questions

This study intends to quantify differential order effects, to compare the 2SG equating procedures under the KE framework with other traditional equating procedures, and to discover whether the weighting mechanism can enhance equating accuracy under different order effect situations. The specific research questions are:

1) How should differential order effects in CB designs be quantified?
2) Are the KE methods better than their corresponding traditional equating methods?
3) Does the weighting in the 2SG approach provide better results under certain order effect situations?
4) What weights should be used for a 2SG approach?

Table 6 displays the 22 equating procedures compared in this dissertation. They are distinguished from each other by the way they treat the data collected by a CB design (EG, SG, or 2SG with weighting) and by the equating method (linear or equipercentile) they adopt. To compare the performance of KE with traditional equating methods, the equating results of two KE procedures are compared to the equating results of their corresponding traditional equating procedures (as listed in Table 5).

1.5 Research Expectations

1) The KE equating methods and their corresponding traditional equating methods provide similar equating results.
2) As the differential order effect (DOE) increases, the weights that the 2SG approach assigns to the tests taken first increase accordingly.
3) The selection of an equating function with the optimal weights may vary depending on the statistical criterion used to evaluate the equating results.

As presented above, the literature on CB design equating is sparse. Since the CB design is still used in research projects and in the pilot studies of testing programs (Yu, 2003) when examinees are hard to find, it is useful to understand the 2SG approach and to evaluate how much it can enhance overall equating accuracy compared to other methods in various order effect situations. Such a study will contribute to the general knowledge about the CB design and the methods available for equating with data collected by a CB design.

Chapter II: Theoretical Framework

This chapter first introduces the equating designs related to a CB design, the linear and equipercentile equating methods, and the Kernel Equating framework, and then describes the concept of equating error and the criteria used for evaluating equating results.

2.1 Counter-Balanced Design

A CB design is often used in practice for administering two forms to examinees when it is difficult to obtain a sufficiently large group of examinees (Kolen & Brennan, 2004). To explain the CB design in more detail, a brief description of the EG design and the SG design is necessary.

Equivalent Group Design

TABLE 1. Equivalent-Groups design
Population   Sample   X   Y
P            1        ✓
P            2            ✓

In an EG design, two independent random samples are drawn from a common population of examinees, P. Each group of examinees is randomly assigned to take one of the two parallel forms X and Y, as shown in Table 1.

Single Group Design

TABLE 2. Single-Group design
Population   Sample   X   Y
P            1        ✓   ✓

In a SG design, only one random sample of examinees is selected from population P, and all the examinees take the two test forms X and Y in one administration, as shown in Table 2.
Because the two test forms are parallel and are taken by the same examinees, it is almost certain that an examinee's performance on the second form will be affected by having taken the first form. The effect may be a "practice/learning effect" or a "fatigue effect." If familiarity with the test increases performance, then Form Y could appear to be easier than Form X. On the other hand, if fatigue is a factor in examinee performance, then Form Y could appear relatively more difficult than Form X because examinees would be tired when administered Form Y (Kolen & Brennan, 2004). For simplicity, all such possible effects will be called "order effects" (Lord, 1950). If the two test forms are administered in the same order to all examinees, as in a SG design, it is impossible to obtain any estimate of the amount of order effect. Consequently, to control for the order effect, it is usual to counterbalance the order of administration by dividing the group in a SG design into two random halves and giving both test forms to each half, but in a different order. This design is what is often called a CB design.

TABLE 3. Counter-Balanced design
Population   Sample   X1   Y1   X2   Y2
P            1        ✓              ✓
P            2             ✓    ✓
*The subscripts of X and Y indicate the order. E.g., X1 means test X is taken first, and Y2 means test Y is taken second.

Table 3 illustrates a CB design, in which two samples of examinees are randomly chosen from the same population P and randomly assigned as sample 1 and sample 2. Sample 1 takes test X first (denoted X1) and test Y second (denoted Y2); sample 2 takes test Y first (denoted Y1) and test X second (denoted X2). The purpose of counterbalancing the order of testing is to ensure that any order effects are present equally in the scores obtained for both test forms X and Y, so that the order effects on Form X and Form Y can be cancelled out.
Theoretically, if random selection and random assignment of the examinees are carried out strictly in operation, the purpose of cancelling out the order effect can be accomplished by collecting data with a CB design. However, in practice, the assumption of random selection is often violated. Usually, random sampling is replaced by random cluster sampling. The violation of these two assumptions leads to an interaction between group abilities and form difficulties, which is the reason why order effects often cannot be cancelled out. For example, some groups of people might do better on the second test after practicing on the first test, while other groups might do worse.

There have been different definitions of order effects in the literature. Lord (1950) and Angoff (1971) defined the order effect on Form X as $K_X = X_2 - X_1 = C\sigma_{X_1} = C\sigma_{X_2} = C\sigma_X$, and the order effect on Form Y as $K_Y = Y_2 - Y_1 = C\sigma_{Y_1} = C\sigma_{Y_2} = C\sigma_Y$ (where $C$ is a constant). They assumed that order effects are constant for all examinees and are proportional to the standard deviations. Kolen and Brennan (2004) explained order effects without assuming they are constant for each examinee. They defined the Differential Order Effect (DOE) as $(\bar{X}_1 - \bar{Y}_1) - (\bar{X}_2 - \bar{Y}_2)$ and suggested that a significant DOE would indicate that order effects cannot be cancelled out in a CB design. However, no significance test is described in their book. In Chapter III, this dissertation adopts their definition of DOE, describes a hypothesis test for the statistical significance of DOE, and suggests using an effect size statistic for the magnitude of DOE.

2.2 Equating Using Counter-Balanced Designs

Like every equating procedure, equating using a CB design has two parts: the data collection design and the equating method.

2.2.1 Approaches to Treating Data in a CB Design: The nature of the CB design leads to different ways of dealing with the data.
Comparing Tables 1, 2, and 3, we see that the CB design actually contains both EG and SG designs. There are two (dependent) EG designs, one for X1 and Y1, and the other for X2 and Y2. In addition, there are two (independent) SG designs, one for X1 and Y2, and the other for X2 and Y1. Finally, the two groups of examinees can be pooled together, and all the data from X1, Y2, X2, and Y1 can be treated as a pooled SG design. Because of these different ways of considering data in a CB design, several data treatment approaches have been used to equate test forms X and Y.

Lord (1950) and Angoff (1971) described a linear equating method that in effect treats the data as a pooled single group design. They assume a constant order effect and bivariate normal distributions of tests X and Y in the population. By constant order effect, they mean that order effects are the same for all examinees and are proportional to the relevant standard deviations. Kolen and Brennan (2004) did not assume constant order effects across examinees. They suggested using the pooled SG approach when order effects can be cancelled out. Otherwise, only the EG approach with X1 and Y1 should be used, since it is perhaps the only unbiased way of treating data in a CB design. Nonetheless, each of these two approaches has its own weaknesses. Although the EG approach using only X1 and Y1 is unbiased, it throws away half of the data and makes no use of the correlation between X and Y, which is implicit in the SG aspects of the CB design. The pooled SG approach is considered problematic when order effects cannot be cancelled out, because it is hard to interpret the pooled distribution of X1 and X2 (or Y1 and Y2) when they each have a different distribution (von Davier, Holland, & Thayer, 2004).
In an attempt to find a better way of using data collected with a CB design, von Davier, Holland, and Thayer (2004) proposed the 2SG approach, a new approach that uses as much of the data as possible and does so more flexibly. It is expected to unify the other three approaches into a single approach and to provide an optimal equating solution while taking into account different sizes of order effects. Section 2.3 explains this approach under the KE framework in detail. Table 4 summarizes the different ways of dealing with data in a CB design discussed in the literature.

TABLE 4. Ways of treating data in a CB design appearing in the literature

1. EG design for X1 and Y1 only
   Explanation: Use data from X1 and Y1 only
   Assumptions: Random selection from a single population and random assignment; suggested when DOE is significant
   Advantage/Disadvantage: Unbiased / loss of half of the data
   Source: Kolen and Brennan (2004), von Davier et al. (2004)

2. EG design for X2 and Y2 only
   Explanation: Use data from X2 and Y2 only
   Assumptions: Random selection from a single population; random assignment; definitely not when DOE is significant
   Advantage/Disadvantage: (none listed) / biased; loss of half of the data
   Source: Kolen and Brennan (2004)

3. EG pooling approach
   Explanation: Average two EG equating functions
   Assumptions: Random selection; random assignment; DOE is not significant
   Advantage/Disadvantage: [illegible in source] information / ignores dependency between the two [truncated in source]
   Source: von Davier et al. (2004)

4. SG design for X1 and Y2 only
   Explanation: Use data from X1 and Y2 only
   Assumptions: Random selection; DOE is not significant
   Advantage/Disadvantage: (none listed) / loss of data information
   Source: Kolen and Brennan (2004)

5. SG design for X2 and Y1 only
   Explanation: Use data from X2 and Y1 only
   Assumptions: Random selection; DOE is not significant
   Advantage/Disadvantage: (none listed) / loss of data information
   Source: Kolen and Brennan (2004)

6. Pooled SG approach
   Explanation: Use all data from X1, Y1, X2, and Y2 equally when the order effect can be cancelled out
   Assumptions: Random selection; random assignment; DOE is not significant
   Advantage/Disadvantage: Uses full data information / not applicable when DOE is significant
   Source: Kolen and Brennan (2004), Lord (1950), von Davier et al. (2004)

7. 2SG approach
   Explanation: Use all data information unequally when different order effects are present
   Assumptions: Random selection and random assignment; all kinds of DOE
   Advantage/Disadvantage: Uses full data information / (none listed)
   Source: von Davier et al. (2004)

* Approaches 2, 3, 4, and 5 are possible ways of treating data in a CB design but are of no interest to this study.

2.2.2 Equating Methods for a CB Design: Linear and equipercentile equating methods, following either the KE or the traditional equating procedure, are the equating methods related to a CB design found in the literature. Every equating method defines a target population T, on which scores on the two test forms are to be made equivalent (for the population as a whole, not necessarily for every individual in the population) (Livingston, 2004; von Davier, Holland, & Thayer, 2004; etc.). The target population depends on the data collection design. This study focuses on the CB, EG, and SG designs, where there is only one population P of test takers from which particular samples are drawn. For these designs the target population T is assumed to be the same as the underlying population P (von Davier, Holland, & Thayer, 2004).

The linear equating method is appropriate when tests X and Y have the same distributional shape on the target population, while the equipercentile equating method adjusts for differences in the distributions. Linear equating defines the equating relationship as the equivalence of z-scores, whereas the equipercentile equating method defines the equating relationship as the equivalence of the cumulative distribution functions of X and Y in the population.
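The two definitions can be made concrete with a small sketch. The code below is illustrative only and is not taken from the dissertation or from any KE software: the function names are hypothetical, scores are assumed to be the integers 0, 1, 2, ..., and the equipercentile function uses the percentile-rank continuization (each score's probability spread uniformly over a unit interval) associated with traditional equipercentile equating in Kolen and Brennan (2004), not the Gaussian kernel used in KE.

```python
from itertools import accumulate


def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Linear equating of x onto the Y scale by matching z-scores."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)


def percentile_rank(x, probs):
    """Proportion of the continuized score distribution below x.

    probs[j] is the probability of integer score j; it is spread
    uniformly over the interval [j - 0.5, j + 0.5].
    """
    below = [0.0] + list(accumulate(probs))  # below[j] = P(score < j)
    j = min(max(int(round(x)), 0), len(probs) - 1)
    return below[j] + (x - j + 0.5) * probs[j]


def equipercentile_equate(x, probs_x, probs_y):
    """Find the Y score whose percentile rank equals that of x on X."""
    p = percentile_rank(x, probs_x)
    cdf_y = list(accumulate(probs_y))  # cdf_y[k] = P(Y <= k)
    for k, upper in enumerate(cdf_y):
        if upper > p:
            lower = cdf_y[k - 1] if k > 0 else 0.0
            # Linear interpolation inside the unit interval around k.
            return k - 0.5 + (p - lower) / (upper - lower)
    return len(probs_y) - 0.5  # p sits at the very top of the scale


# Hypothetical score distributions: Y is X shifted up by one point.
probs_x = [0.1, 0.2, 0.4, 0.2, 0.1, 0.0]
probs_y = [0.0, 0.1, 0.2, 0.4, 0.2, 0.1]
equated = [equipercentile_equate(x, probs_x, probs_y) for x in range(5)]
```

In this toy case each X score is mapped to the Y score one point higher, which is what a pure shift in difficulty should produce; with real data the two distributions differ in shape and the equated scores are generally not integers.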
Equation (1) and Equation (2) define the equating relationship for linear and equipercentile equating when equating X onto Y, which means that each raw score $x_j$ is transformed to $e_Y(x_j)$, or $y$, by these equating functions; i.e., a raw score of $x_j$ on test X is interchangeable with a raw score of $e_Y(x_j)$, or $y$, on test Y.

$$\frac{x - \mu_X}{\sigma_X} = \frac{y - \mu_Y}{\sigma_Y} \;\Rightarrow\; y = \mu_Y + \frac{\sigma_Y}{\sigma_X}(x - \mu_X) \qquad (1)$$

$$G(y) = F(x) \;\Rightarrow\; y = G^{-1}(F(x)) \qquad (2)$$

Equation 2 holds only when X and Y are continuous. KE applies the Gaussian kernel continuization procedure (von Davier, Holland, & Thayer, 2004), while the traditional equipercentile equating in this study uses linear interpolation to continuize the score distributions.

2.3 Equating with a CB Design under the Kernel Equating Framework

The KE framework accommodates both linear and equipercentile equating procedures with pre-smoothing and continuization. Pre-smoothing is log-linear smoothing applied before scores are equated. Continuization converts discrete score distributions to continuous distributions by using a normal (Gaussian) "kernel" (Holland & Thayer, 1989; von Davier, Holland, & Thayer, 2004). In the case of a CB design, the KE framework incorporates three different ways of treating data: the EG approach, the pooled SG approach, and the 2SG approach. Both linear and equipercentile equating methods are available for each of the three ways of treating data. The following section introduces the five steps of the KE framework for a CB design in particular and presents how the three approaches differ with respect to each of these five steps.

2.3.1 Step 1. Log-linear Pre-smoothing

In pre-smoothing, the empirical score distributions are smoothed. Smoothing can remove irregularities in the empirical score distributions and make them resemble the smooth population score distributions. Smoothing is necessary, especially when the sample size is small (Livingston, 1993). KE conducts pre-smoothing using a log-linear method.
Compared to other pre-smoothing methods, the log-linear method has the flexibility to accommodate many distributions, is well-behaved, and is relatively easy to estimate. Because log-linear models are part of the exponential family, the estimated distribution can match the sample distribution on as many moments as desired (Holland & Thayer, 2000; Kolen & Brennan, 2004). In this step, the log-linear model with the best fit is selected to fit the sample data and to estimate discrete score probabilities. The fit of the log-linear models can be evaluated by examining changes in the likelihood ratio chi-square index over different models and by examining conditional Freeman-Tukey residual plots. The Freeman-Tukey residual plot displays the deviation between $e_Y(X)$ and $Y$ or between $e_X(Y)$ and $X$. A log-linear model with good fit will have conditional Freeman-Tukey residuals randomly distributed within 3 units above or below the zero line. In addition, the fit of a log-linear model is reflected to some extent by the Standard Error of Equating introduced in step 5. A bad model fit could lead to a large SEE.

Let $J$ and $K$ denote the total number of possible scores on Form X and Form Y, respectively; let $x_j$ represent a possible score value for test X, $j = 1$ to $J$; let $y_k$ represent a possible score value for test Y, $k = 1$ to $K$; let $p_{jk} = \mathrm{Prob}\{X = x_j, Y = y_k \mid T\}$ be the bivariate score probability of $X = x_j$ and $Y = y_k$ over the target population T; let the $\beta$'s be the slope parameters to be estimated by the maximum likelihood method; let $\alpha$ and $\alpha^*$ be the normalizing constants selected to make the sum of the population score probabilities equal to one; let $T_X$ and $T_Y$ denote the number of moments matched between the fitted probabilities and the observed score probabilities; and let $I$ and $L$ denote the number of cross moments matched between the fitted and the observed score probabilities. Then a univariate log-linear model takes the form of:
$$\hat{e}_Y(x) = G_{h_Y}^{-1}\big(F_{h_X}(x)\big) \qquad (11)$$

where $F_{h_X}$ and $G_{h_Y}$ represent the cumulative distribution functions of $X(h_X)$ and $Y(h_Y)$, respectively. The linear equating method is considered a special case in the KE framework.

2.3.5 Step 5. Calculating Standard Error of Equating (SEE) and Standard Error of Equating Difference (SEED)

KE provides a formula for calculating the SEE derived from the delta method (see von Davier, Holland, & Thayer, 2004):

$$SEE(\hat{e}_Y(x)) = SEE\big(e_Y(x; \hat{r}, \hat{s})\big) = \sqrt{J_{e_Y}(\hat{r}, \hat{s})\, J_{DF}(\hat{R}, \hat{S})\, \Sigma_{\hat{R}\hat{S}}\, J_{DF}(\hat{R}, \hat{S})^{t}\, J_{e_Y}(\hat{r}, \hat{s})^{t}} \qquad (12)$$

Here $R$ and $S$ are used as generic names, over all the designs, for the population score probabilities of X and Y estimated by the log-linear pre-smoothing model in step 1, such as $\hat{p}_j$, $\hat{p}_k$, $\hat{p}_{(12)jk}$, and $\hat{p}_{(21)jk}$. When the sample size is large, $(\hat{R}, \hat{S})$ is asymptotically normally distributed with mean $(R, S)$ and covariance matrix $\Sigma_{\hat{R}\hat{S}}$ of dimension $(JK + JK) \times (JK + JK)$; $\hat{r}$ and $\hat{s}$ are the estimated population score probabilities of X and Y over the target population T. The estimated equating function is a composition of $e_Y$ and the design function (DF), $\hat{e}_Y(x) = e_Y(x; \hat{r}, \hat{s}) = G^{-1}(F(x))$; the DF is a function of $R$ and $S$; $J_{e_Y}(\hat{r}, \hat{s})$ and $J_{DF}(\hat{R}, \hat{S})$ are the Jacobian matrices (Formulas 13 and 14) of the equating function and the design function, respectively. $J_{e_Y}(\hat{r}, \hat{s})$ is a $1 \times (J + K)$ row vector of the first derivatives of the estimated equating function with respect to each of the estimated score probabilities $\hat{r}$ and $\hat{s}$ over the target population T, and $J_{DF}(\hat{R}, \hat{S})$ is a $(J + K) \times (JK + JK)$ matrix of the first derivatives of the DF with respect to each of the output variables from the pre-smoothing procedure:

$$J_{e_Y}(\hat{r}, \hat{s}) = \left( \frac{\partial e_Y}{\partial \hat{r}}, \; \frac{\partial e_Y}{\partial \hat{s}} \right)_{1 \times (J + K)} \qquad (13)$$

$$J_{DF}(\hat{R}, \hat{S}) = \begin{pmatrix} \dfrac{\partial r}{\partial R} & \dfrac{\partial r}{\partial S} \\[2mm] \dfrac{\partial s}{\partial R} & \dfrac{\partial s}{\partial S} \end{pmatrix}_{(J + K) \times (JK + JK)} \qquad (14)$$

Kernel Equating provides an analytic tool to calculate the standard error of equating.
It is known as the delta method (also known as the Taylor series method), a statistical procedure widely used to estimate the variance or standard error of a function of statistical estimates with known asymptotic distributions (Kolen & Brennan, 2004; von Davier, Holland, & Thayer, 2004). In addition to the conditional SEEs at each score point, KE also provides the SEED statistic, the standard error of the equating difference between two KE functions at each score point. Von Davier, Holland, and Thayer (2004) used the SEED to decide whether the equating results of two KE methods are significantly different from each other.

2.4 Equating Error

Equating error reflects the difference between the equated scores estimated from the sample and the equated scores in the population. It consists of two sources of error: random equating error and systematic equating error. Random equating error is the error due simply to sampling. Systematic equating error arises if 1) the equating design is inappropriately executed; 2) the statistical assumptions of an equating method are violated; or 3) the equating procedure is inappropriately implemented, for example, by applying IRT equating to a multidimensional test. It follows from these definitions that the magnitude of the random equating error depends closely on the sample size, while the systematic equating error does not depend on the number of examinees in the equating (Kolen & Brennan, 2004).

2.5 Evaluating the Results of Equating

After equating is conducted, the results of equating can be evaluated with several criteria.
According to Harris and Crouse (1993) and other evaluation studies of KE, the evaluation criteria for equating results include:
1) Standard error of equating conditional on scores;
2) Root Mean Squared Deviation (RMSD) index and "average equating error" index (Klein & Jarjoura, 1985; Livingston, Dorans, & Wright, 1990) for evaluating overall equating accuracy;
3) Conditional equating bias and "average equating bias" (Livingston, 1993);
4) Root Mean Square Error (RMSE) for overall adequacy of equating (Mao, von Davier, & Rupp, 2005);
5) Standard Error of Equating Difference calculated under the KE framework (von Davier, Holland, & Thayer, 2004).

2.5.1 Standard Error of Equating

The Standard Error of Equating (SEE) indicates the amount of random error in equating that is due to the sampling of examinees. There are two ways of calculating SEEs: analytic methods, and computational methods such as bootstrap resampling or other empirical methods.

The delta method is an analytic method relying on asymptotic statistical assumptions. It uses the normal distribution to approximate the probability distribution of a statistical estimator. The assumption of asymptotic normality holds only when the sample size is relatively large. When the sample size is small, the delta method will not be accurate unless a strong normality assumption holds for the population. Using real data from a common-item nonequivalent groups design, Hanson, Zeng, and Kolen (1993) compared the delta method standard errors of equating with the bootstrap standard errors of equating for Levine observed score and true score linear equating. The sample size was over 700. Their results indicate that, compared to the bootstrap SEE, the delta method with a normality assumption overestimated the random equating errors for scores at the higher end and underestimated the random equating errors for scores at the lower end.
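The general shape of a delta method calculation, a Jacobian multiplied on both sides of a covariance matrix, can be illustrated with a deliberately simple case. The sketch below is not the KE standard error: the function of interest is assumed to be just the mean score computed from estimated score probabilities, the covariance matrix is assumed to be the plain multinomial one (no pre-smoothing), and all names and numbers are hypothetical. The mean is chosen because its standard error is known exactly, so the delta method result can be checked.

```python
import math


def delta_method_se(jacobian, cov):
    """Standard error sqrt(J * Sigma * J^T) for a scalar-valued function."""
    size = len(jacobian)
    quad = sum(jacobian[i] * cov[i][j] * jacobian[j]
               for i in range(size) for j in range(size))
    return math.sqrt(quad)


def multinomial_cov(probs, n_examinees):
    """Covariance of sample proportions: (diag(p) - p p^T) / N."""
    return [[((p_i if i == j else 0.0) - p_i * p_j) / n_examinees
             for j, p_j in enumerate(probs)]
            for i, p_i in enumerate(probs)]


# Hypothetical five-point score distribution and sample size.
scores = [0, 1, 2, 3, 4]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
n_examinees = 200

# g(p) = sum_j x_j * p_j, so the Jacobian of g is the vector of scores.
se_delta = delta_method_se(scores, multinomial_cov(probs, n_examinees))

# Direct check: the standard error of a mean is sd / sqrt(N).
mean = sum(x * p for x, p in zip(scores, probs))
variance = sum((x - mean) ** 2 * p for x, p in zip(scores, probs))
se_direct = math.sqrt(variance / n_examinees)
```

Here `se_delta` and `se_direct` agree because the mean is linear in the probabilities. For a nonlinear function such as an equating function the delta method is only a first-order approximation, which is why its accuracy depends on the sample size.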
Lu and Kolen (1994) used the delta method and the bootstrap method to estimate the SEEs of Tucker linear equating for a common-item nonequivalent groups design. They compared the differences between the standard errors derived from the delta method and from the bootstrap method for different sample sizes and different numbers of bootstrap replications. They found that the difference between the standard errors calculated by the delta method and by the bootstrap method becomes larger as the sample size decreases and as the number of bootstrap replications decreases.

The bootstrap method refers to the resampling procedure of repeatedly selecting random samples with replacement from a given sample of size N. The theoretical framework for the bootstrap method and its applications were described in Efron (1982), Efron and Tibshirani (1993), and Kolen and Brennan (2004). Suppose that, in a random equivalent group design, two groups of examinees of sizes $n_1$ and $n_2$ took test forms X and Y, respectively, and that Form Y is equated to Form X using equating method B. Then a typical bootstrap method has the following steps:
1) Draw a random bootstrap sample of size $n_1$ with replacement from the group of examinees taking test form X (size = $n_1$);
2) Draw a random bootstrap sample of size $n_2$ with replacement from the group of examinees taking test form Y (size = $n_2$);
3) Conduct equating on the random bootstrap samples and obtain an equating function;
4) Repeat step 1 through step 3 a large number of times, equating Y to X every time;
5) The equating results at each score point form a distribution. Calculate the standard deviation of the equating results at each score point. The result is the estimated bootstrap standard error of equating conditional on each score point.
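The five steps above can be sketched as follows. This is a minimal illustration under stated assumptions and not the procedure used in the dissertation: linear equating stands in for "equating method B", the two score samples are generated artificially, and the function names, seeds, and number of replications are all hypothetical choices.

```python
import random
import statistics


def linear_coefficients(x_scores, y_scores):
    """Slope and intercept of the linear function equating Y onto X."""
    slope = statistics.pstdev(x_scores) / statistics.pstdev(y_scores)
    intercept = statistics.mean(x_scores) - slope * statistics.mean(y_scores)
    return slope, intercept


def bootstrap_see(x_scores, y_scores, score_points, n_reps=500, seed=0):
    """Bootstrap standard error of equating at each Form Y score point."""
    rng = random.Random(seed)
    results = {y: [] for y in score_points}
    for _ in range(n_reps):
        # Steps 1 and 2: resample each group with replacement.
        boot_x = rng.choices(x_scores, k=len(x_scores))
        boot_y = rng.choices(y_scores, k=len(y_scores))
        # Step 3: equate on the bootstrap samples.
        slope, intercept = linear_coefficients(boot_x, boot_y)
        for y in score_points:
            results[y].append(intercept + slope * y)
    # Step 5: standard deviation over replications (divisor n_reps).
    return {y: statistics.pstdev(vals) for y, vals in results.items()}


# Artificial data: two groups of 150 examinees on 20-item forms.
data_rng = random.Random(1)
x_scores = [sum(data_rng.random() < 0.60 for _ in range(20)) for _ in range(150)]
y_scores = [sum(data_rng.random() < 0.55 for _ in range(20)) for _ in range(150)]

see = bootstrap_see(x_scores, y_scores, score_points=range(21))
```

With linear equating the bootstrap SEE is smallest near the mean of Form Y and grows toward the ends of the score scale, because sampling error in the slope is multiplied by the distance from the mean. The sketch assumes every bootstrap sample has a nonzero standard deviation, which is safe for samples of this size but would need a guard for very small samples.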
Then the bootstrap standard error of this equating procedure conditional on each score level will be:

$$SEE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(e_{X,i}(y_k) - \bar{e}_X(y_k)\big)^2} \qquad (15)$$

where $n$ is the total number of replications; $y_k$ represents the $k$th score on Form Y; $e_{X,i}(y_k)$ is the equated score on Form X corresponding to score $y_k$ in replication $i$; and $\bar{e}_X(y_k)$ is the mean of the equated scores at score $y_k$ over the $n$ replications.

Parshall, Houghton, and Kromrey (1995) used the bootstrap standard error of equating and statistical bias in equating to study the adequacy of equating. Their results indicate that as the sample size decreased, equating bias remained stable but the bootstrap SEE increased substantially. Therefore, they argued for using the bootstrap method instead of the delta method to calculate the SEE for small samples (Tsai, 1995).

Livingston (1993a) compared the standard errors of kernel equating methods with those of traditional equipercentile methods using a common-item nonequivalent groups design. He calculated the random standard error of equating using an empirical method different from the typical bootstrap method. He selected 50 small random samples of size n without replacement from a large population dataset of size N. He then obtained equating results for each of the 50 small samples. The standard deviation of the 50 equated scores around the population criterion equating result at each raw score point is regarded as the conditional standard error of equating at that score point. That is, instead of using the mean of the 50 equated scores at each raw score point ($\bar{e}_X(y_k)$ in Formula 15), he used the equated score from the population criterion. The simulation study in this dissertation follows the procedure described in Livingston (1993) to calculate the empirical standard error of equating. The bootstrap method was applied to the real datasets to calculate the standard error of equating.

2.5.2 Root Mean Squared Deviation (RMSD)

The root mean squared deviation (RMSD) is a measure of overall equating accuracy (Livingston, Dorans, & Wright, 1990; Livingston, 1993; Schmitt, Cook, Dorans, & Eignor, 1990). It can be calculated by:

$$RMSD = \sqrt{\frac{\sum_k n_{y_k} \big(\hat{x}_{y_k} - x_{y_k}\big)^2}{\sum_k n_{y_k}}} \qquad (16)$$

where $x_{y_k}$ is the equated score on Form X corresponding to score $y_k$ using the criterion equating method; $\hat{x}_{y_k}$ is the equated score on Form X corresponding to score $y_k$ using another equating method; and $n_{y_k}$ is the number of observations at each score level of test Y. The RMSD is basically an average of the conditional random equating errors. An alternative summary statistic is the average equating error, which is simply the average of the conditional standard errors of equating over all the score points on test Form Y (Klein & Jarjoura, 1985).

2.5.3 Equating Bias

Equating bias is useful for indicating systematic error in equating. In equating practice, equating bias is often estimated by comparing equating results with an arbitrarily selected sound criterion. Generally, results from equipercentile equating are a good candidate for such a criterion. Yen (1985) suggested using the results from equipercentile equating as a criterion because they are as accurate as IRT-based equating results. Livingston (1993a, 1993b) used the equipercentile equating results for a very large sample as a baseline criterion. Alternatively, the true equating relationship can be found from simulated data. In simulation studies, the population equating relationship is known and can be used as the comparison criterion for calculating equating bias, but the degree to which simulated data can represent real data is questionable.
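As a small illustration of how a summary index compares an equating method against a criterion, the sketch below computes the frequency-weighted RMSD together with a frequency-weighted mean signed difference, included here as one plausible overall summary of bias (that choice is an assumption of this sketch). The function names and the three-point example are hypothetical.

```python
import math


def rmsd(criterion, equated, freqs):
    """Frequency-weighted root mean squared deviation from the criterion."""
    total = sum(freqs)
    squared = sum(n * (e - c) ** 2
                  for c, e, n in zip(criterion, equated, freqs))
    return math.sqrt(squared / total)


def weighted_mean_difference(criterion, equated, freqs):
    """Frequency-weighted mean signed difference from the criterion."""
    total = sum(freqs)
    return sum(n * (e - c)
               for c, e, n in zip(criterion, equated, freqs)) / total


# Hypothetical equated scores at three Form Y score levels.
criterion = [10.0, 11.0, 12.0]   # criterion equating results
equated = [10.5, 11.0, 11.5]     # results of the method being evaluated
freqs = [1, 2, 1]                # number of examinees at each score level

overall_rmsd = rmsd(criterion, equated, freqs)
overall_bias = weighted_mean_difference(criterion, equated, freqs)
```

In this example the signed differences cancel, so the weighted mean difference is zero while the RMSD is not. This is why the two kinds of index are reported side by side rather than treated as interchangeable.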
Using the notation defined above, the equating bias conditional on each score level can be calculated by:

$$\hat{x}_{y_k} - x_{y_k} \qquad (17)$$

The overall bias of equating can be calculated by:

$$\frac{\sum_k n_{y_k} \big(\hat{x}_{y_k} - x_{y_k}\big)}{\sum_k n_{y_k}} \qquad (18)$$

2.5.4 Root Mean Square Error

As described above, SEE and RMSD reflect random equating error and systematic equating error, respectively. Tsai (1995) and Mao, von Davier, and Rupp (2005) adopted the Root Mean Square Error (RMSE) index. Tsai (1995) explained why this statistic takes random equating error and systematic equating error into account simultaneously.

$$RMSE = \sqrt{(\bar{d})^2 + (sd_d)^2} \qquad (19)$$

where $\bar{d}$ is the mean of the equating differences at each score level, and $sd_d$ is the standard deviation of the equating differences between two methods. The RMSE reflects how biased and how accurate the equating results are compared to an equating criterion.

2.5.5 Standard Error of Equating Difference

The SEED calculated in KE can be used to determine whether the equating difference between two KE methods is significant. Von Davier, Holland, and Thayer (2004) used the SEED to decide whether equating bias in a CB design is significantly large. When equating with a CB design, the equating function of the 2SG approach with weights of (1, 1) is unbiased, since the data from the tests taken first are not affected by order effects. If a 2SG method with certain weights is compared with the unbiased 2SG(1, 1) method and their equating difference falls within the range of $\pm 2\,SEED$, then the equating bias of this 2SG method is small enough to be neglected. The standard error of equating then becomes the only statistic to compare when selecting an equating function.

TABLE 5.
KE methods and corresponding traditional equating methods

KE method                         Corresponding traditional method
2SG(.5, .5) KE linear             Traditional SG linear equating
2SG(1, 1) KE linear               Traditional EG linear equating
2SG(.5, .5) KE equipercentile     Traditional SG equipercentile equating
2SG(1, 1) KE equipercentile       Traditional EG equipercentile equating
2SG with other weights            Not available

TABLE 6. All equating methods compared in this study for simulated data

Design and method             Equating        Explanation
2SG design, linear            2SG(.5, .5)     Log-linear smoothing; treat data as two independent groups; weights of (.5, .5) for X and Y
                              2SG(.5, .75)    Log-linear smoothing; treat data as two independent groups; weights of (.5, .75) for X and Y
                              2SG(.6, .5)     Log-linear smoothing; treat data as two independent groups; weights of (.6, .5) for X and Y
                              2SG(.6, .6)     Log-linear smoothing; treat data as two independent groups; weights of (.6, .6) for X and Y
                              2SG(.75, .5)    Log-linear smoothing; treat data as two independent groups; weights of (.75, .5) for X and Y
                              2SG(.75, .75)   Log-linear smoothing; treat data as two independent groups; weights of (.75, .75) for X and Y
                              2SG(.9, .5)     Log-linear smoothing; treat data as two independent groups; weights of (.9, .5) for X and Y
                              2SG(.9, .9)     Log-linear smoothing; treat data as two independent groups; weights of (.9, .9) for X and Y
                              2SG(1, 1)       Log-linear smoothing; treat data as two independent groups; weights of (1, 1) for X and Y
2SG design, equipercentile    2SG(.5, .5)     Log-linear smoothing; treat data as two independent groups; weights of (.5, .5) for X and Y
                              2SG(.5, .75)    Log-linear smoothing; treat data as two independent groups; weights of (.5, .75) for X and Y
                              2SG(.6, .5)     Log-linear smoothing; treat data as two independent groups; weights of (.6, .5) for X and Y
                              2SG(.6, .6)     Log-linear smoothing; treat data as two independent groups; weights of (.6, .6) for X and Y
                              2SG(.75, .5)    Log-linear smoothing; treat data as two independent groups; weights of (.75, .5) for X and Y
                              2SG(.75, .75)   Log-linear smoothing; treat data as two independent groups; weights of (.75, .75) for X and Y
                              2SG(.9, .5)     Log-linear smoothing; treat data as two independent groups; weights of (.9, .5) for X and Y
                              2SG(.9, .9)     Log-linear smoothing; treat data as two independent groups; weights of (.9, .9) for X and Y
                              2SG(1, 1)       Log-linear smoothing; treat data as two independent groups; weights of (1, 1) for X and Y
SG design                     SG_Lin          Linear interpolation; traditional linear equating
                              SG_Equi         Linear interpolation; traditional equipercentile equating
EG design                     EG Linear       Linear interpolation for continuization; traditional linear equating
                              EG Equi         Linear interpolation for continuization; traditional equipercentile equating

Among these methods, the EG linear, EG equipercentile, SG linear, and SG equipercentile equating methods are the corresponding traditional equating methods for the 2SG(1, 1) KE linear, 2SG(1, 1) KE equipercentile, SG KE linear, and SG KE equipercentile methods, respectively.

Chapter III: Methods

3.1 Quantification of Differential Order Effect

This study adopts the definition of DOE as $(\bar{X}_1 - \bar{Y}_1) - (\bar{X}_2 - \bar{Y}_2)$ (Kolen & Brennan, 2004) and builds on it a hypothesis test and an effect size for estimating order effects in a CB design. The following is a derivation of the hypothesis test for the statistical significance of DOE:

$$\begin{aligned}
DOE &= (\hat{\mu}_{X_1} - \hat{\mu}_{Y_1}) - (\hat{\mu}_{X_2} - \hat{\mu}_{Y_2}) = (\hat{\mu}_{X_1} + \hat{\mu}_{Y_2}) - (\hat{\mu}_{X_2} + \hat{\mu}_{Y_1}) \\
&= \frac{\sum X_1}{N_1} + \frac{\sum Y_2}{N_1} - \frac{\sum X_2}{N_2} - \frac{\sum Y_1}{N_2} \\
&= \frac{\sum (X_1 + Y_2)}{N_1} - \frac{\sum (X_2 + Y_1)}{N_2} = \hat{\mu}_{(X_1 + Y_2)} - \hat{\mu}_{(X_2 + Y_1)}
\end{aligned} \qquad (20)$$

where $\hat{\mu}_{(X_1 + Y_2)}$ is the average sum score of $X_1$ and $Y_2$ for sample 1; $\hat{\mu}_{(X_2 + Y_1)}$ is the average sum score of $X_2$ and $Y_1$ for sample 2; $N_1$ is the number of examinees in sample 1; and $N_2$ is the number of examinees in sample 2. Therefore, the hypothesis test for the significance of DOE is equivalent to a two-independent-sample t-test for the mean difference of Sum12 and Sum21. The null hypothesis for DOE becomes $H_0: \mu_{(X_1 + Y_2)} - \mu_{(X_2 + Y_1)} = 0$, and the t test is:

$$t = \frac{DOE}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \qquad (21)$$

where $n_1$ and $n_2$ are the two sample sizes and $s_p$ is the square root of the pooled variance of the two sum scores,

$$s_p = \sqrt{\frac{(n_1 - 1)\, s^2_{(X_1 + Y_2)} + (n_2 - 1)\, s^2_{(Y_1 + X_2)}}{n_1 + n_2 - 2}} \qquad (22)$$

The statistical significance of DOE, however, relies heavily on the sample sizes. To avoid the influence of sample size on the quantification of differential order effects, the effect size of DOE can be calculated:

$$\text{Effect size } d = \frac{Mean_{(X_1 + Y_2)} - Mean_{(Y_1 + X_2)}}{s_p} \qquad (23)$$

3.2 Data

This study uses 2 real datasets and 6 simulated datasets with CB designs. The six simulated datasets are generated in a systematic way with different sizes of DOE.

3.2.1 Real Data

Real data 1: Von Davier, Holland, and Thayer (2004) provided a real dataset from a small field study of an international testing program. In their dataset, both test forms X and Y are number-right scored. They have 75 items and 76 items, respectively, and their correlation is $r(X_1, Y_2) = r(X_2, Y_1) = 0.88$.

TABLE 7. Summary statistics for real data 1

        X1      Y2      X2      Y1      X       Y       Sum12    Sum21
N       143     143     140     140     283     283     143      140
Mean    52.65   51.42   50.64   51.39   51.66   51.41   104.07   102.04
SD      12.41   11.03   13.83   12.18   13.15   11.59   22.72    25.23
Skew    -0.52   -0.37   -0.54   -0.58   -0.55   -0.49   -0.45    -0.57
Kurt    -0.15   -0.64   -0.82   -0.52   -0.50   -0.55   -0.40    -0.67
Min     16      27      19      18      16      18      45       45
Max     74      71      72      71      74      71      142      142
*X and Y are scores for the combined groups; Sum12 is the sum of scores on tests X1 and Y2 for the first group; Sum21 is the sum of scores on tests X2 and Y1 for the second group.
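The DOE, the pooled-variance t statistic, and the effect size described above can be computed directly from summary statistics such as those reported for Sum12 and Sum21 in Table 7. The sketch below is illustrative: the function name is hypothetical, and the t statistic is assumed to take the standard two-independent-sample form with a pooled standard deviation.

```python
import math


def doe_from_summary(mean12, sd12, n1, mean21, sd21, n2):
    """DOE, pooled SD, t statistic, and effect size from sum-score summaries.

    mean12, sd12, n1 describe X1 + Y2 in sample 1;
    mean21, sd21, n2 describe X2 + Y1 in sample 2.
    """
    doe = mean12 - mean21
    pooled_var = ((n1 - 1) * sd12 ** 2 + (n2 - 1) * sd21 ** 2) / (n1 + n2 - 2)
    s_p = math.sqrt(pooled_var)
    # Assumed form: standard pooled two-sample t statistic.
    t_stat = doe / (s_p * math.sqrt(1 / n1 + 1 / n2))
    effect_size = doe / s_p
    return doe, s_p, t_stat, effect_size


# Sum12 and Sum21 columns of Table 7 (real data 1).
doe, s_p, t_stat, effect_size = doe_from_summary(
    mean12=104.07, sd12=22.72, n1=143,
    mean21=102.04, sd21=25.23, n2=140,
)
```

For real data 1 this gives a DOE of 2.03 and an effect size of about 0.08, with a t statistic well below conventional critical values, so the order effects in that dataset are small relative to the spread of the sum scores.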
The differential order effect in this dataset is $DOE = (\bar{X}_1 - \bar{Y}_1) - (\bar{X}_2 - \bar{Y}_2) = 2.03$, which has an effect size of approximately 0.08. The t-test is not significant.

Real data 2: The second real dataset was collected using a CB design for an algebra test. Each of the equating forms has 25 multiple-choice items. Group one has 399 students, who took Form X first and Form Y second, and Group two has 362 students, who took Form Y first and Form X second. Both test forms X and Y are number-right scored, and their total score correlations are $r(X_1, Y_2) = 0.64$ and $r(X_2, Y_1) = 0.74$, respectively.

TABLE 8. Summary statistics for real data 2

         X1      Y2      Y1      X2      X       Y       Sum12    Sum21
N        399     399     362     362     761     761     399      362
Mean     13.04   13.00   12.14   11.84   12.47   12.59   26.04    23.98
SD       3.94    4.35    4.15    4.66    4.33    4.27    7.50     8.22
Skew     -0.22   -0.25   0.25    0.22    -0.03   -0.01   -0.07    0.37
Kurt     0.21    0.40    -0.34   -0.15   -0.06   -0.02   0.19     -0.28
Min      0       0       2       0       0       0       0        4
Max      23      25      23      25      23      25      48       48

*X and Y are scores for combined groups; Sum12 is the sum of scores on tests X1 and Y2 for the first group; Sum21 is the sum of scores on tests X2 and Y1 for the second group.

The differential order effect in this dataset is 2.06, which has an effect size of approximately 0.26.

3.2.2 Simulated Data

In compliance with Davey, Nering, and Thompson's (1997) purpose of simulating realistic item response data, this study made an effort to generate data as close as possible to the first real dataset described earlier. Real data 1 was selected as the target because the two test forms in this dataset have equal test-retest reliabilities, which is an important assumption for linear and equipercentile equating. There are 75 items on each simulated test form. Six population datasets were simulated with different sizes of order effects using a 3-parameter logistic Item Response Theory model (3PL IRT model).
In Lord (1980), a 3PL IRT model takes the form below:

\[ P(\theta) = c + \frac{1 - c}{1 + e^{-1.7a(\theta - b)}} \tag{24} \]

where $\theta$ is the underlying ability to be measured, $a$ is the item discrimination parameter, $b$ is the item difficulty, and $c$ is the item guessing parameter, indicating the probability that a person completely lacking in ability will answer the item correctly.

Each of the six simulated datasets has two samples, each of size 100,000. Each sample takes two tests, X and Y, but in a different order. A 75 by 100,000 item-person response matrix with 0 and 1 scores was generated for each sample using the 3PL IRT model. The scores on each item were then totaled to get an observed test score for each examinee. After the data were simulated for the two independent groups taking the two test forms in different orders, data from the two independent samples were simply combined to form the dataset with a pooled SG design. The designs are shown below:

For a CB design:  sample 1: (X1, Y2);  sample 2: (X2, Y1)
For a SG design:  pooled sample: [X1 Y2; X2 Y1]

However, one drawback of using real data 1 is its lack of item response data. Without the item response block, it is more difficult to estimate the item parameters of the real test items and use the estimated parameters for simulation. In this simulation, the parameter distributions were decided based on empirical experience. To ensure that the generated item discrimination parameter $a$ and item guessing level $c$ are positive, the $a$ parameters were randomly selected from a log-normal distribution, and the $c$ parameters were randomly selected from a beta distribution. Furthermore, in order to make the simulated data more realistic, the means and variances of the distributions of parameters $a$, $b$, and $c$ were adjusted to values that best emulate the first real dataset used in this study.
Specifically, the mean and variance of the log-normal distribution of parameter $a$ were fixed at 1 and 0.12; the mean and variance of the normal distribution of parameter $b$ were fixed at -0.3 and 0.8; and the mean and variance of the beta distribution of parameter $c$ were fixed at 0.25 and 0.008.

Order effects were considered as a second dimension of examinees' underlying abilities when taking the second test, and the size of order effects varies across examinees. Assume that the changes in examinees' performances reflect the changes in their underlying abilities; then,

\[ \theta_{12k} = \theta_{11k} + o_{1k} \quad \text{(sample 1)} \tag{25} \]
\[ \theta_{22k} = \theta_{21k} + o_{2k} \quad \text{(sample 2)} \tag{26} \]

where $k$ indexes examinees; $\theta_{11k}$ denotes the underlying abilities of examinees in sample 1 taking the first test (X1); $\theta_{12k}$ denotes the abilities of examinees in sample 1 taking the second test (Y2); $o_{1k}$ denotes the order effects of examinees in sample 1 taking test X first and Y second; $\theta_{21k}$ denotes the underlying abilities of examinees in sample 2 taking the first test (Y1); $\theta_{22k}$ denotes the abilities of examinees in sample 2 taking the second test (X2); and $o_{2k}$ denotes the order effects of examinees in sample 2 taking test Y first and X second.

It was assumed that $\theta_{11k}$ and $\theta_{12k}$ (or $\theta_{21k}$ and $\theta_{22k}$) follow a bivariate normal distribution with the same standard deviations. The correlation between $\theta_{11k}$ and $\theta_{12k}$ (or $\theta_{21k}$ and $\theta_{22k}$) may not be perfect, since order effects are not constant across examinees. It was set to 0.94 in this study in order to achieve an observed-score correlation of 0.88. With unit variances for both abilities, $o_{1k}$ and $o_{2k}$ both have variances of $2(1 - 0.94) = 0.12$.

Once all the parameters $a$, $b$, $c$, and $\theta$ were randomly selected, the probability of each examinee with a given $\theta$ level answering each item correctly was calculated from the 3PL IRT model. If the probability of a correct response is greater than a random number from a uniform distribution, the item response for a person on a specific item is 1; otherwise it is 0.
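The generation steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the original SAS/MATLAB/Fortran code: it assumes the reported means and variances are those of the log-normal and beta distributions themselves, that each form gets its own set of 75 item parameters, and it uses 2,000 examinees instead of 100,000. All function names are hypothetical.

```python
import numpy as np


def lognormal_params(mean, var):
    """(mu, sigma) of the underlying normal for a log-normal with this mean/variance."""
    sigma2 = np.log(1.0 + var / mean ** 2)
    return np.log(mean) - 0.5 * sigma2, np.sqrt(sigma2)


def beta_params(mean, var):
    """Shape parameters (alpha, beta) of a beta distribution with this mean/variance."""
    k = mean * (1.0 - mean) / var - 1.0
    return mean * k, (1.0 - mean) * k


def draw_items(rng, n_items=75):
    """Draw a, b, c for one form (assumption: forms X and Y are drawn separately)."""
    mu, sigma = lognormal_params(1.0, 0.12)
    alpha, beta = beta_params(0.25, 0.008)
    a = rng.lognormal(mu, sigma, n_items)
    b = rng.normal(-0.3, np.sqrt(0.8), n_items)
    c = rng.beta(alpha, beta, n_items)
    return a, b, c


def p_3pl(theta, a, b, c):
    """Equation 24 evaluated for every examinee (rows) and item (columns)."""
    z = 1.7 * a * (theta[:, None] - b)
    return c + (1.0 - c) / (1.0 + np.exp(-z))


def total_scores(rng, theta, items):
    """Score 1 when P(correct) exceeds a uniform random number, then sum over items."""
    p = p_3pl(theta, *items)
    return (p > rng.uniform(size=p.shape)).sum(axis=1)


def simulate_cb_sample(rng, n, mean_second, first_items, second_items, rho=0.94):
    """Scores on the first and second test for one sample (Equations 25 and 26)."""
    cov = [[1.0, rho], [rho, 1.0]]
    theta = rng.multivariate_normal([0.0, mean_second], cov, size=n)
    return (total_scores(rng, theta[:, 0], first_items),
            total_scores(rng, theta[:, 1], second_items))


rng = np.random.default_rng(2007)
form_x, form_y = draw_items(rng), draw_items(rng)
x1, y2 = simulate_cb_sample(rng, 2000, -0.1, form_x, form_y)  # sample 1: X first
y1, x2 = simulate_cb_sample(rng, 2000, 0.1, form_y, form_x)   # sample 2: Y first
```

The second-occasion means of -0.1 and 0.1 are example values; varying them is what changes the size of the simulated order effect.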
In this study, the effect sizes of differential order effects were controlled to range from 0 to 0.2 in the simulated datasets. In order to meet this restriction and make the simulated data as realistic as possible, different means for the distributions of $\theta_{11}$ and $\theta_{12}$ (or $\theta_{21}$ and $\theta_{22}$) were tried and DOEs were calculated afterwards, until the order effects were within the range and the simulated test scores shared similar descriptive statistics with the test scores in the first real dataset. The distributions and descriptive statistics of the six simulated datasets are provided below. As shown in Table 9 to Table 14, the simulated data have distribution moments similar to those of the first real dataset.

Simulated data 1 with insignificant order effects (DOE = -0.04)

• Sample 1 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{11}} = 0,\ \mu_{\theta_{12}} = 0.01),\ \begin{bmatrix} \sigma^2_{\theta_{11}} = 1 & \sigma_{\theta_{11}\theta_{12}} = .94 \\ \sigma_{\theta_{11}\theta_{12}} = .94 & \sigma^2_{\theta_{12}} = 1 \end{bmatrix} \right)$
• Sample 2 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{21}} = 0,\ \mu_{\theta_{22}} = 0.01),\ \begin{bmatrix} \sigma^2_{\theta_{21}} = 1 & \sigma_{\theta_{21}\theta_{22}} = .94 \\ \sigma_{\theta_{21}\theta_{22}} = .94 & \sigma^2_{\theta_{22}} = 1 \end{bmatrix} \right)$

$a \sim (\mu_a = 1,\ \sigma^2_a = 0.12)$; $b \sim (\mu_b = -0.3,\ \sigma^2_b = 0.8)$; $c \sim (\mu_c = 0.25,\ \sigma^2_c = 0.008)$

TABLE 9. Descriptive statistics for simulated data 1

Test   Min.   Max.   Mean    Std     Skewness   Kurtosis
X1     10     75     52.52   13.78   -0.46      -0.67
Y2     9      75     50.50   13.57   -0.32      -0.81
X2     10     75     50.51   13.59   -0.31      -0.81
Y1     8      75     52.55   13.80   -0.45      -0.68
r(X1, Y2) = r(Y1, X2) ≈ 0.88

[Histograms of score frequencies (scores 0 to 75): X1 (skewness = -0.46), Y2 (skewness = -0.32), Y1 (skewness = -0.45), X2 (skewness = -0.31)]

Simulated data 2 with significant order effects (DOE = -0.58, effect size of DOE = 0.025)

• Sample 1 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{11}} = 0,\ \mu_{\theta_{12}} = -0.025),\ \begin{bmatrix} \sigma^2_{\theta_{11}} = 1 & \sigma_{\theta_{11}\theta_{12}} = .94 \\ \sigma_{\theta_{11}\theta_{12}} = .94 & \sigma^2_{\theta_{12}} = 1 \end{bmatrix} \right)$
• Sample 2 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{21}} = 0,\ \mu_{\theta_{22}} = 0.025),\ \begin{bmatrix} \sigma^2_{\theta_{21}} = 1 & \sigma_{\theta_{21}\theta_{22}} = .94 \\ \sigma_{\theta_{21}\theta_{22}} = .94 & \sigma^2_{\theta_{22}} = 1 \end{bmatrix} \right)$

$a \sim (\mu_a = 1,\ \sigma^2_a = 0.12)$; $b \sim (\mu_b = -0.3,\ \sigma^2_b = 0.8)$; $c \sim (\mu_c = 0.25,\ \sigma^2_c = 0.008)$

TABLE 10. Descriptive statistics for simulated data 2

Test   Min.   Max.   Mean    Std     Skewness   Kurtosis
X1     9      75     52.01   13.71   -0.43      -0.71
Y2     10     75     50.54   14.01   -0.27      -0.89
X2     11     75     51.15   13.90   -0.30      -0.87
Y1     10     75     51.98   13.66   -0.41      -0.73
r(X1, Y2) = r(Y1, X2) ≈ 0.88

[Histograms of score frequencies (scores 0 to 75): X1 (skewness = -0.43), Y2 (skewness = -0.27), Y1 (skewness = -0.41), X2 (skewness = -0.30)]

Simulated data 3 with significant order effects (DOE = 1.41, effect size of DOE = 0.05)

• Sample 1 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{11}} = 0,\ \mu_{\theta_{12}} = 0.05),\ \begin{bmatrix} \sigma^2_{\theta_{11}} = 1 & \sigma_{\theta_{11}\theta_{12}} = .94 \\ \sigma_{\theta_{11}\theta_{12}} = .94 & \sigma^2_{\theta_{12}} = 1 \end{bmatrix} \right)$
• Sample 2 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{21}} = 0,\ \mu_{\theta_{22}} = -0.05),\ \begin{bmatrix} \sigma^2_{\theta_{21}} = 1 & \sigma_{\theta_{21}\theta_{22}} = .94 \\ \sigma_{\theta_{21}\theta_{22}} = .94 & \sigma^2_{\theta_{22}} = 1 \end{bmatrix} \right)$

$a \sim (\mu_a = 1,\ \sigma^2_a = 0.12)$; $b \sim (\mu_b = -0.3,\ \sigma^2_b = 0.8)$; $c \sim (\mu_c = 0.25,\ \sigma^2_c = 0.008)$

TABLE 11. Descriptive statistics for simulated data 3

Test   Min.   Max.   Mean    Std     Skewness   Kurtosis
X1     9      75     52.01   13.71   -0.43      -0.71
Y2     10     75     51.54   13.90   -0.34      -0.84
X2     11     75     50.15   14.00   -0.24      -0.92
Y1     10     75     51.98   13.66   -0.41      -0.73
r(X1, Y2) = r(Y1, X2) ≈ 0.88

[Histograms of score frequencies (scores 0 to 75): X1 (skewness = -0.43), Y2 (skewness = -0.34), Y1 (skewness = -0.41), X2 (skewness = -0.24)]

Simulated data 4 with significant order effects (DOE = -2.75, effect size of DOE = 0.1)

• Sample 1 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{11}} = 0,\ \mu_{\theta_{12}} = -0.1),\ \begin{bmatrix} \sigma^2_{\theta_{11}} = 1 & \sigma_{\theta_{11}\theta_{12}} = .94 \\ \sigma_{\theta_{11}\theta_{12}} = .94 & \sigma^2_{\theta_{12}} = 1 \end{bmatrix} \right)$
• Sample 2 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{21}} = 0,\ \mu_{\theta_{22}} = 0.1),\ \begin{bmatrix} \sigma^2_{\theta_{21}} = 1 & \sigma_{\theta_{21}\theta_{22}} = .94 \\ \sigma_{\theta_{21}\theta_{22}} = .94 & \sigma^2_{\theta_{22}} = 1 \end{bmatrix} \right)$

$a \sim (\mu_a = 1,\ \sigma^2_a = 0.12)$; $b \sim (\mu_b = -0.3,\ \sigma^2_b = 0.8)$; $c \sim (\mu_c = 0.25,\ \sigma^2_c = 0.008)$

TABLE 12. Descriptive statistics for simulated data 4

Test   Min.   Max.   Mean    Std     Skewness   Kurtosis
X1     10     75     50.34   13.50   -0.31      -0.80
Y2     10     75     48.64   13.57   -0.29      -0.84
X2     11     75     51.35   13.27   -0.45      -0.67
Y1     9      75     50.39   13.56   -0.30      -0.81
r(X1, Y2) = r(Y1, X2) ≈ 0.88

[Histograms of score frequencies (scores 0 to 75): X1 (skewness = -0.31), Y2 (skewness = -0.29), Y1 (skewness = -0.30), X2 (skewness = -0.45)]

Simulated data 5 with significant order effects (DOE = -3.76, effect size of DOE = 0.15)

• Sample 1 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{11}} = 0,\ \mu_{\theta_{12}} = -0.1),\ \begin{bmatrix} \sigma^2_{\theta_{11}} = 1 & \sigma_{\theta_{11}\theta_{12}} = .94 \\ \sigma_{\theta_{11}\theta_{12}} = .94 & \sigma^2_{\theta_{12}} = 1 \end{bmatrix} \right)$
• Sample 2 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{21}} = 0,\ \mu_{\theta_{22}} = 0.2),\ \begin{bmatrix} \sigma^2_{\theta_{21}} = 1 & \sigma_{\theta_{21}\theta_{22}} = .94 \\ \sigma_{\theta_{21}\theta_{22}} = .94 & \sigma^2_{\theta_{22}} = 1 \end{bmatrix} \right)$

$a \sim (\mu_a = 1,\ \sigma^2_a = 0.12)$; $b \sim (\mu_b = -0.3,\ \sigma^2_b = 0.8)$; $c \sim (\mu_c = 0.25,\ \sigma^2_c = 0.008)$

TABLE 13. Descriptive statistics for simulated data 5

Test   Min.   Max.   Mean    Std     Skewness   Kurtosis
X1     10     75     50.99   14.07   -0.29      -0.89
Y2     9      75     48.50   13.65   -0.11      -0.88
X2     11     75     52.33   13.36   -0.34      -0.75
Y1     9      75     50.92   14.10   -0.29      -0.88
r(X1, Y2) = r(Y1, X2) ≈ 0.88

[Histograms of score frequencies (scores 0 to 75): X1 (skewness = -0.29), Y2 (skewness = -0.11), Y1 (skewness = -0.29), X2 (skewness = -0.34)]

Simulated data 6

• Sample 1 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{11}} = 0,\ \mu_{\theta_{12}} = -0.2),\ \begin{bmatrix} \sigma^2_{\theta_{11}} = 1 & \sigma_{\theta_{11}\theta_{12}} = .94 \\ \sigma_{\theta_{11}\theta_{12}} = .94 & \sigma^2_{\theta_{12}} = 1 \end{bmatrix} \right)$
• Sample 2 (N = 100000): $\theta \sim N\left( (\mu_{\theta_{21}} = 0,\ \mu_{\theta_{22}} = 0.2),\ \begin{bmatrix} \sigma^2_{\theta_{21}} = 1 & \sigma_{\theta_{21}\theta_{22}} = .94 \\ \sigma_{\theta_{21}\theta_{22}} = .94 & \sigma^2_{\theta_{22}} = 1 \end{bmatrix} \right)$

$a \sim (\mu_a = 1,\ \sigma^2_a = 0.12)$; $b \sim (\mu_b = -0.3,\ \sigma^2_b = 0.8)$; $c \sim (\mu_c = 0.25,\ \sigma^2_c = 0.008)$

TABLE 14. Descriptive statistics for simulated data 6

Test   Min.   Max.   Mean    Std     Skewness   Kurtosis
X1     9      75     52.52   13.78   -0.26      -0.88
Y2     9      75     47.75   13.79   -0.05      -0.96
X2     11     75     52.93   13.24   -0.37      -0.79
Y1     11     75     52.55   13.80   -0.25      -0.89
r(X1, Y2) = r(Y1, X2) ≈ 0.88
[Histograms of score frequencies for simulated data 6 (scores 0 to 75): X1 (skewness = -0.26), Y2 (skewness = -0.05), Y1 (skewness = -0.25), X2 (skewness = -0.37)]

3.3 Analysis

The analysis of real data and simulated data in this study differs slightly. For the two real datasets, the bootstrap method was employed to calculate the standard errors of 14 of the 22 equatings (as listed in Table 15 and Table 16). The equating results were evaluated by SEE and RMSE. For the simulated datasets, empirical standard errors of equating were calculated for the 22 equating methods displayed in Table 6. The equating functions were evaluated by SEE, equating bias relative to the large sample standard, RMSE, and SEED. The computer software SAS, MATLAB, and Compaq Visual Fortran was used to simulate data and conduct the equating procedures.

3.3.1 Equating Methods Applied for Simulated Data

Table 6 lists the names of all the equatings conducted for simulated data in this study and provides detailed explanations for each equating. The results of the traditional equipercentile equating (EG_Equi) on each population dataset were considered as the criterion equating results. All the other equating results were compared to this criterion equating for each population. In this study, all the equatings are from test Form Y to test Form X, i.e., the equating function takes the form of $e_X(y)$, which is a function of score $y$.

3.3.2 Procedure for Estimating Empirical SEE for Simulated Data

Once the population datasets were generated, 500 random samples were selected from each of the four populations without replacement. The estimation of empirical SEE for the simulated datasets followed the procedure below:

1. Randomly select one sample (n = 50) from each of the two independent samples from population 1 without replacement. Selected sample 1 has scores for Form X, which is taken first, and Form Y, which is taken second.
Selected sample 2 has scores for Form X, which is taken second, and Form Y, which is taken first. Data from the two independent samples were simply combined to form a dataset with the pooled single group design.

2. Apply the 22 equatings to the samples selected from the population. When the sample size is greater than 100, two log-linear models were fit to the data for all the KE equating methods. The first log-linear model (model (2, 2, 1)) preserves the first bivariate moment (the correlation of scores on Form X and Form Y) and the first two univariate moments of each variable (mean and standard deviation). The second log-linear model (model (4, 4, 1)) preserves the first bivariate moment and the first four univariate moments of each variable.

3. Return the test-takers to the corresponding population and repeat the sampling 500 times. The 500 replications then build up a conditional distribution of equating results at each score point. The mean of this conditional distribution is the equating result at each score point, and the standard deviation of this conditional distribution is the empirical conditional SEE at each score point.

4. Repeat steps 1 to 3, changing the selected sample size from 50 to 100, 300, 500, and 1000.

5. Repeat the above procedures for simulated data 2 to data 6.

The bandwidth for KE linear equating was set at 200. The weighting parameter $w_X$ or $w_Y$ took values from 0.5 to 1.

3.3.3 Evaluating Equating Results from Simulated Data

For the simulated data, traditional equipercentile equating results with the EG design were considered as the criterion. All the other equating methods were compared to this criterion and were evaluated in terms of Standard Error of Equating, equating bias relative to the large sample standard, Root Mean Square Error, and Standard Error of Equating Difference. For the two real datasets, only bootstrap SEE and RMSE were reported.
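The resampling loop of Section 3.3.2 can be sketched for a single equating method. As a stand-in for the 22 equatings, the sketch below uses traditional linear equating under the EG treatment (first-taken forms X1 and Y1 only); this substitution, the toy populations, and the names `linear_equate` and `replicate_equating` are assumptions made for illustration.

```python
import numpy as np


def linear_equate(x_scores, y_scores, y_points):
    """Traditional linear equating of Form Y score points onto the Form X scale."""
    slope = np.std(x_scores, ddof=1) / np.std(y_scores, ddof=1)
    return np.mean(x_scores) + slope * (y_points - np.mean(y_scores))


def replicate_equating(pop1, pop2, n, y_points, reps=500, seed=0):
    """Return a reps-by-K matrix of equated scores, one row per replication.

    pop1: array (N1, 2) with columns (X1, Y2) for population sample 1.
    pop2: array (N2, 2) with columns (X2, Y1) for population sample 2.
    """
    rng = np.random.default_rng(seed)
    out = np.empty((reps, len(y_points)))
    for j in range(reps):
        # Step 1: draw n examinees from each sample without replacement.
        s1 = pop1[rng.choice(len(pop1), size=n, replace=False)]
        s2 = pop2[rng.choice(len(pop2), size=n, replace=False)]
        # Step 2 (stand-in method): equate Y1 to X1, ignoring second-taken forms.
        out[j] = linear_equate(s1[:, 0], s2[:, 1], y_points)
        # Step 3: every replication samples afresh from the full population,
        # which is equivalent to returning the drawn examinees before redrawing.
    return out


# Toy populations standing in for the simulated (X1, Y2) and (X2, Y1) scores.
toy_rng = np.random.default_rng(1)
pop1 = toy_rng.normal(50, 13, size=(5000, 2)).round()
pop2 = toy_rng.normal(50, 13, size=(5000, 2)).round()
y_points = np.arange(0, 76)
equated = replicate_equating(pop1, pop2, n=50, y_points=y_points, reps=200)
```

Steps 4 and 5 would wrap this call in loops over the sample sizes and over the six population datasets.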
Equating Bias Relative to the Large Sample Standard

To calculate equating bias at each score point, for each of the 22 equatings under each of the six population conditions, the mean of the 500 replications' equating results was subtracted from the criterion equating results (EG_Equi) at each score level (as in formula 17). Conditional equating bias was not reported for simulated data. Instead, the average of the conditional biases over all score levels was calculated and reported in Chapter IV.

Root Mean Square Error (RMSE)

The Root Mean Square Error of each equating compared to the criterion equating is equal to the square root of the sum of the squared average bias and the variance of bias over possible score points:

\[ RMSE = \sqrt{\bar{d}^{\,2} + sd_d^{\,2}} \]

where $\bar{d}$ is the mean of the equating differences and $sd_d$ is the standard deviation of the differences between the equating results of one method and the criterion equating results. It reflects how biased and how accurate the equating results are compared to the population criterion.

Standard Error of Equating (SEE)

The empirical conditional standard error of equating was taken to be the standard deviation of the conditional distribution formed by the equating results for the 500 replications. It can be calculated using the following formula. In Chapter IV, only the average of these conditional SEEs over different score points was reported for each equating method.

\[ SEE(y_k) = \sqrt{\frac{1}{499} \sum_{j=1}^{500} \left( \hat{e}_{X,j}(y_k) - e_X(y_k) \right)^2} \tag{27} \]

where $j = 1$ to 500 indexes the selected samples; $k = 1$ to $K$ indexes the possible score points on Form Y; $\hat{e}_{X,j}(y_k)$ is the equated score from Form Y to Form X for the $j$th replication; and $e_X(y_k)$ is the equated score of X corresponding to score $y_k$ from the population dataset.

In this study, SEED was calculated directly by the KE software.
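The three summaries above can be computed from a matrix of replicated equating results and a criterion vector, as in the following sketch. It follows the verbal definitions: the conditional SEE is the standard deviation across replications, and the RMSE combines the mean and the spread of the conditional differences. The sign of the difference (method minus criterion) and the use of the population form of the variance for $sd_d$ are assumptions, and `evaluate_equating` is a hypothetical name.

```python
import numpy as np


def evaluate_equating(replicated, criterion):
    """Summarize one equating method against the criterion equating.

    replicated: array (reps, K) of equated scores, one row per replication.
    criterion:  array (K,) of criterion equated scores at the same score points.
    """
    replicated = np.asarray(replicated, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    mean_equated = replicated.mean(axis=0)        # mean result at each score point
    cond_see = replicated.std(axis=0, ddof=1)     # SD over replications, per score point
    d = mean_equated - criterion                  # conditional difference (sign assumed)
    # With ddof=0 this equals sqrt(mean(d**2)) over the score points.
    rmse = np.sqrt(d.mean() ** 2 + d.var(ddof=0))
    return {"mean_see": cond_see.mean(), "mean_bias": d.mean(), "rmse": rmse}


# Tiny worked example: two replications, three score points.
summary = evaluate_equating([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]], [1.0, 2.0, 3.0])
```

In the worked example the mean equated scores are 2, 3, and 4, so every conditional difference is 1 and the RMSE equals the mean bias.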
Chapter IV: Results

4.1 Real Data 1

Real data 1 has a DOE of 2.03, which is not statistically significant (t = .713, se = 2.85, p = .476, df = 281), i.e., the order effect can be almost cancelled out by pooling together the two groups of data in this specific example. The effect size of DOE is 0.08. Levene's test of homogeneity of variance (Levene, 1960) is not significant (F = 1.67, p = .197). The best fit model for the KE methods is model (2, 2, 1): $T_X = T_Y = 2$ and $I = L = 1$. The following figures show the observed score distributions for X1, Y2, X2, and Y1 and their fitted distributions.

[Two panels: observed and fitted frequencies against X1 Scores (0 to 75) and Y1 Scores (0 to 76)]
FIGURE 1. Observed score distributions for X1 and Y1 in real data 1.

[Two panels: observed and fitted frequencies against X2 Scores (0 to 75) and Y2 Scores (0 to 76)]
FIGURE 2. Observed score distributions for X2 and Y2 in real data 1.

4.1.1 Selecting the Best Equating Function Using RMSE

All equating methods were compared to the traditional equipercentile equating with an EG design (EG_Equi). The comparison shows that, when DOE is insignificant, 2SG(.5,.5) and SG_KE have similar equating results with almost the smallest SEEs over the whole score scale, but they have bigger RMSE compared to the EG design. Not much difference was found between the equating results of traditional equating and Kernel Equating. No large difference was found between linear and equipercentile equating methods except for traditional EG linear and traditional EG equipercentile.
This is because the sample size for the EG design is only about 70 for each sample in this dataset, which is too small for equipercentile equating. Equating results of 2SG(.75, .75) have relatively small SEE and RMSE. Of the methods compared, it best represents the criterion equating results.

TABLE 15. Evaluation of equating results from real data 1

                 2SG KE                                              SG            EG
                 (.5,.5)   (.5,.75)   (.75,.5)   (.75,.75)   (1,1)   traditional   traditional
Linear
  Mean SEE       0.663     0.839      0.884      1.252       2.381   0.663         2.384
  SD SEE         0.313     0.371      0.42       0.565       1.113   0.313         1.118
  Min. SEE       0.32      0.44       0.425      0.646       1.164   0.32          1.164
  Max. SEE       1.334     1.634      1.776      2.433       4.674   1.334         4.648
  Mean Diff      2.066     1.769      1.229      0.908       -0.403  2.066         -0.418
  RMSE           2.92      2.5        2.1        1.76        1.68    2.92          1.69
Equipercentile
  Mean SEE       0.692     0.833      0.846      1.147       2.196   2.133         3.1
  SD SEE         0.343     0.346      0.343      0.408       0.928   2.241         1.926
  Min. SEE       0.332     0.385      0.419      0.429       0.491   0             0
  Max. SEE       1.384     1.485      1.43       1.714       3.557   6.778         6.821
  Mean Diff      2.29      2.04       1.518      1.262       -0.062  1.369         0
  RMSE           3.09      2.72       2.31       1.98        1.42    2.26          0

*Criterion equating = traditional EG equipercentile equating

The 2SG approach with weights of (1, 1) has the smallest RMSE when taking the EG traditional equipercentile equating function as a baseline. Therefore, the 2SG(1, 1) equipercentile method is the best equating function when using RMSE as an index.

4.1.2 Selecting the Best Equating Function Using SEED

[Plot: equating difference, +2SEED, -2SEED, and zero line across scores 0 to 75]
FIGURE 3. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear and the ± 2SEED confidence interval band around zero line, real data 1.
[Plot: equating difference, +2SEED, -2SEED, and zero line across scores 0 to 75]
FIGURE 4. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile and the ± 2SEED confidence interval band around zero line, real data 1.

Figure 3 and Figure 4 indicate that the differences between the two KE linear and the two KE equipercentile methods using weights of (1, 1) and weights of (0.5, 0.5) are small in comparison with the ± 2SEED band. According to von Davier, Holland, and Thayer (2004), this indicates that the equating bias introduced by order effects is small enough to be ignored. Thus, the best equating function can be selected solely on the basis of the random equating error, i.e., the standard error of equating. In this case, the 2SG linear or equipercentile equating with weights of (.5, .5) will be considered the best. Their equating difference can be tested against SEED again to decide which one to choose.

[Plot: equating difference, +2SEED, -2SEED, and zero line across scores 0 to 75]
FIGURE 5. Equating difference between 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile and the ± 2SEED confidence interval band around zero line, real data 1.
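The comparison drawn in Figures 3 to 5 amounts to checking, score point by score point, whether the difference between two equating functions leaves the ± 2SEED band. A minimal sketch follows; the function name `outside_seed_band` and the example numbers are hypothetical, and the SEED values themselves are assumed to come from the KE software.

```python
import numpy as np


def outside_seed_band(e_a, e_b, seed_values, width=2.0):
    """Return the equating difference and a flag per score point.

    e_a, e_b:     equated scores from two equating functions at the same score points.
    seed_values:  standard error of equating difference at each score point.
    A flag is True where |e_a - e_b| exceeds width * SEED.
    """
    diff = np.asarray(e_a, dtype=float) - np.asarray(e_b, dtype=float)
    flags = np.abs(diff) > width * np.asarray(seed_values, dtype=float)
    return diff, flags


# Made-up values for three score points, for illustration only.
diff, flags = outside_seed_band([10.0, 20.5, 31.0], [10.2, 20.0, 29.0], [0.3, 0.3, 0.4])
share_outside = flags.mean()  # proportion of score points outside the band
```

A small share of flagged points supports treating the two functions as equivalent; a large share, especially where most scores fall, suggests the difference should not be ignored.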
As shown in Figure 5, the difference between the KE linear and the KE equipercentile equating functions falls beyond the 95% confidence intervals along the whole score scale except the lower end. The equating function deviates from a linear function. Therefore, the 2SG equipercentile equating function with weights of (.5, .5) is preferable to the 2SG linear equating function with weights of (.5, .5) (von Davier, Holland, & Thayer, 2004).

4.2 Real Data 2

The second real dataset has a DOE of 2.06. This is significant, as the order effect cannot be cancelled out by pooling together the two groups of data in this example. The effect size of DOE is 0.26. The best fit model for the KE methods is model (2, 2, 1) ($T_X = T_Y = 2$, $I = L = 1$) for group 1 and model (4, 4, 1) ($T_X = T_Y = 4$, $I = L = 1$) for group 2. The following figures show the observed score distributions for X1, Y2, X2, and Y1 and their best-fit log-linear models.

[Two panels: observed and fitted frequencies against X1 Scores (0 to 25) and Y1 Scores (0 to 25)]
FIGURE 6. Observed score distributions for X1 and Y1 in real data 2.

[Two panels: observed and fitted frequencies against X2 Scores (0 to 25) and Y2 Scores (0 to 25)]
FIGURE 7. Observed score distributions for X2 and Y2 in real data 2.

4.2.1 Selecting the Best Equating Function Using RMSE

TABLE 16. Evaluation of equating results from real data 2

                 2SG KE                                              SG            EG
                 (.5,.5)   (.5,.75)   (.75,.5)   (.75,.75)   (1,1)   traditional   traditional
Linear
  Mean SEE       0.205     0.223      0.243      0.296       0.51    0.205         0.51
  SD SEE         0.07      0.068      0.079      0.083       0.143   0.07          0.143
  Min SEE        0.117     0.138      0.145      0.193       0.333   0.117         0.333
  Max SEE        0.341     0.354      0.403      0.454       0.767   0.341         0.767
  Mean Diff      0.749     0.498      0.448      0.198       -0.387  0.774         -0.29
  RMSE           1.007     0.832      0.753      0.638       0.876   1.161         0.673
Equipercentile
  Mean SEE       0.254     0.284      0.25       0.304       0.49    0.382         0.54
  SD SEE         0.114     0.124      0.075      0.077       0.113   0.241         0.274
  Min SEE        0.134     0.166      0.165      0.224       0.343   0             0
  Max SEE        0.484     0.548      0.399      0.463       0.72    0.845         0.96
  Mean Diff      0.671     0.526      0.505      0.362       0.033   0.965         0
  RMSE           0.97      0.857      0.774      0.681       0.624   1.317         0

*Criterion equating = traditional EG equipercentile equating.

In Table 16, the 2SG equipercentile equating with weights of (1, 1) has the smallest RMSE when taking the EG traditional equipercentile equating function as a baseline. Therefore, the 2SG(1, 1) equipercentile method is the best equating function when using RMSE as an index.

4.2.2 Selecting the Best Equating Function Using SEED

[Plot: equating difference, +2SEED, -2SEED, and zero line across scores 0 to 25]
FIGURE 8. Equating difference between 2SG(1, 1) linear and 2SG(.5, .5) linear and the ± 2SEED confidence interval band around zero line, real data 2.

[Plot: equating difference, +2SEED, -2SEED, and zero line across scores 0 to 25]
FIGURE 9. Equating difference between 2SG(1, 1) equipercentile and 2SG(.5, .5) equipercentile and the ± 2SEED confidence interval band around zero line, real data 2.
Figure 8 and Figure 9 indicate that the differences between the two KE linear and the two KE equipercentile methods using weights of (1, 1) and weights of (0.5, 0.5) are beyond the ± 2SEED band in the middle part of the score scale, where most of the scores are distributed. According to von Davier, Holland, and Thayer (2004), this indicates that the equating bias introduced by the use of the data from forms X2 and Y2 cannot be ignored. The best solution would be to discard data from tests taken second, that is, to treat the data collected by a CB design as an EG design. After the weights are decided, the SEED plots can be used again to decide which equating function to choose: the 2SG linear equating with weights of (1, 1) or the 2SG equipercentile equating with weights of (1, 1).

[Plot: equating difference, +2SEED, -2SEED, and zero line across the score scale]
FIGURE 10. Equating difference between 2SG(1, 1) linear and 2SG(1, 1) equipercentile, and the ± 2SEED confidence interval band around zero line, real data 2.

As shown in Figure 10, the difference between the 2SG(1, 1) linear and the 2SG(1, 1) equipercentile equating functions falls beyond the 95% confidence intervals at the lower end and the middle of the score scale. This indicates that the equating function deviates from a linear function. Therefore, the 2SG(1, 1) equipercentile equating function is preferable to the 2SG(1, 1) linear equating function (von Davier, Holland, & Thayer, 2004).

4.3 Simulated Data

All the simulated data can be fitted by a log-linear model of (2, 2, 1) with adequate model fit. Fitting a model with more parameters did not reduce the likelihood ratio chi-square statistics significantly. In addition, the Freeman-Tukey residuals are within the range of (-3, +3) for all the simulated data when fitted with a log-linear model of (2, 2, 1), as in Figure 11.
[Plot: Freeman-Tukey residuals against score]
FIGURE 11. One example of Freeman-Tukey residual plot for POP3.

4.3.1 Model Fit

Various log-linear models were fitted to the simulated sample datasets. The results indicate that, when the sample size is 50, model (2, 2, 1) is the best fit model. When the sample size is 100, 300, 500, or 1000, both model (2, 2, 1) and model (4, 4, 1) have fairly good model fit. In this study, only the equating results of fitting model (2, 2, 1) are reported, since the equating results of fitting model (4, 4, 1) are very similar to those of fitting model (2, 2, 1).

4.3.2 Evaluating the Equating Results by RMSE

As shown in Table 17 and Table 18, the pooled SG and 2SG(.5,.5) approaches under the KE framework have the lowest SEE and RMSE when DOE is almost zero. This indicates that when the order effect can be cancelled out, the pooled SG method or the 2SG(.5,.5) method can both provide optimal equating results. Table 19 and Table 20 show the equating results for population data 2, where DOE has an effect size of 0.025. The 2SG linear and equipercentile equating methods with weights of (.5, .75) for X and Y have the smallest RMSE. When the differential order effect gets larger, as in data 3, where the effect size of DOE is 0.05, the 2SG linear equating methods with weights of (.9, .9) have the smallest RMSE (Table 21 and Table 22).

When the effect size of DOE approaches 0.1, the pooled SG approach and the 2SG(.5, .5) approach are apparently not the best (Table 23 and Table 24). Instead, the 2SG linear equating method with weights of (1, 1) (i.e., the EG KE linear method) or the EG traditional linear method has the smallest RMSE.

Furthermore, in population data 5 and data 6, where the effect size of DOE is around 0.15 and 0.2, the benefit of using weights of (1, 1) in the 2SG approach becomes markedly larger.
As shown in Table 25 to Table 28, the EG KE linear or EG traditional linear methods have much smaller RMSE than those methods which treat data as a single group design.

[TABLES 17 to 20. RMSE, SEE, and bias for each equating method at each sample size, simulated data 1 and 2; table values are illegible in the scanned source.]
... .3... ...-N... .0... .8... N8... 3.... woo... ct... mmzm can”: ..2..- E? 5...... ....N... ......- wwv... .8... ..S... ....m... 8.... 3.... Em m... . .3... ..N... «...... .8... ...... ...... 3...... 2.2. 8.... ....m... mmm ..m... ....m... .... . ... an... ....N... ...... ....N... a... 8.... .m... ...... mmzm cow": 2 . ...- So...- a . ...- ...... a . ...- ..mm... .... . ... 8m... 5... .3... .3... Em ....Z 9...... $2 22 mm... mm. 8.... . ...o... 8.... 8.... mmm 8.... 2.... ...... 2... SN... .3... .8... ...... mm... 9...... E... mmzm .....u: 23.- ...... .- 9......- 2.....- .8..- .02 m . N... 8... . .... $0... .2... Em ..N..... ...N. 8... man ....o. 32 .... mm... ....m. 2... NM... mmm .3... ...... 9...... ...... m. m... on... ...N... «S... cm... 8.... 5..... mmzm own: ...: 3.0.. ...... 5...... $5.. 3..... ...... 5..... .3... .85. Ex... SN 0% 0mm 0mm 0mm 0mm 0mm 0mm 0mm 22.2.68.- _mcoEvm.-r 52.: um 8. 0mm 20.. $853.. $5.333 .523 MKQK ..onu...m...u.m ...-853..” ..N mum/2- 65 mndd mnvd mid comd mend- Smd 02d Nmmd :uvd mvwd Edd mam NS”.— mmdd momd Ed mmvd oovd nomd de odmd wmmd omd mmm 2nd coed 2 Nd mwmd and Evd :de dmd mwvd Ewd awed ”mm—2m cod—Ha .ddd wdvd Ed mdmd dedd- wvvd mNd 5m d novd mood Edd mam 0mm; t. _._ mom; 08.— ooed 2:. band obvd vad mdmd N _ vd mmm :Hmd 3nd momd mmd 22d ddvd mdmd m Gd dmmd nmmd :bd mmSE ddmnc amdd wand mcmd vwmd mdd- wwvd dvmd Sod hmmd Edd bond 2.5 deN «am. Ed; 3d; omwd mood owed oomd nmmd mvcd de mmm mwmd mid .mmd mwmd mm—d Ed mad 3%... mmmd Sod Ed mmEx ddmu: oodd- mmvd Edd vomd on —d- cwmd mdmd mend Smd Gwd mmod 92m m_w.~ vmn; mcvd to; ~24 Sm. d5.— vod hcwd _~d._ Rwd mum mmmd nod mmmd End de Evd wmmd Smd vmmd mdad mFd ”mm—>2 den: mvmd- dmmd vMod- 2:. 53d- wnmd dmmd Sod End mad mgd mam owed mmdd mama mvwd m3; d5. mom; mum; on: now; u: mmm gmd NEd onvd bmmd m—Nd mmvd Nomd owed and mood wdwd mm—zm om“: ...... .....m. 3.3 5.2.. $5.. 66.. 3..... .36. Ga. .53. .ch 0mm 0mm 0mm 0mm 0mm 0mm 0mm 0mm 0mm 35:69:. 22.2.68... 23582561”. 
mv— Om Om 0mm 20.. £853: 953%» fitnmummmsg mnsnm m0\mu.:m.:3m «932.53% .mm mqm/Z. 66 NS...- mmmf «8...- ea...- NN... 3n..- $2.. ..3...- ms...- t...- mm..- Em ....a... ...m... 3...... E... 8.... .3... ......... 2...... 2m... 3..... ...m... mmm 8.... .2. 2.... 8.... ... ... 8.... 3..... 23... 2...... .5... .8... mmzm ......W: 3.....- ma..- «0...... ovm..- 3..... as..- an..- mo...- oow..- 9...- ...N..- Em an. ...... .2. 2... NE... 8.... 5... Em... A...... a... 2.... mmm 2.... SN. 2.... ...... 3..... 8... SM... ...... 8o... .5... SN. mmzm ....muc ......- wmm..- ......- ommé 2.....- 3....- 3m...- E....- 2.....- .m..- 8...... Em ...... an... a... ...... 3..... 2.... ..fi... 3.... 2.... as... am... mmm an... a. as... .3... 0...... .8... 3..... S... .8... :2 N8. mmzm con”: man...- wvm..- 2......- mmm..- 3.....- 9.2.- 3......- 3... .- .8...- ox. .- ....N..- Em 5..” .3... a..." 32 .2. a: .2. ... ... 8.... 9.2 mg... mmm 3..... SN. 3...... .8... SN... 5... 2.... ...... So... on... «2.. mmzm .....u: an..- ..N. .- Rm...- ..... .- 8.....- «R..- 2......- wom..- .... . .- ..m. .- ..N. .- Em .... 8m. .... .2 8.3 83 2.... .... mm... a... com. mmm .8... SN. .3... ..N... am... .3... NR... 0%.. mm... ..m. ...N. mmzm emu: ...: a... .2. .25. 3.2.. ...... .3. 5a. .3. 325 .83. 6mm 0mm 0m...- cmm 0mm 0mm 0mm 0mm 0mm 3:333qu 3:03:55;- ...-«OE..— Om Om 0mm Eon— mnofims. M5833 .52.... ..QO .o\8...m.:3m 532.53% .mm mama...- 67 ...... ......- ...2.- ......o- 2..... .8...- .$...- .8...- .~....- ........- m....- Em ...... .3... ...... $2. ...... ...... ...... 3.... 2.... $2. $2. mm. $2. 3..... ...... 2...... ...... ...... 8.... ...... ...... ...... ...... mmzm .......u.. 2...... .02.- $2.- ......o- m......- $9..- 8...... $......- 2.....- ....- .S..- Em on... ...... ... ...... 9.... $2. 3..... $2. 3...... mm... ...... mm. .2... ...... .... ...... ...... ...... ...... .3... $.... ...... ...... mm... . ...”: ......- ...2.- mm...- $...- ...- .2.- ..v..- ...... .- ......- ..... .- .....- Em ...... .2. 
2... ...... ...... .3... 82. .$... ...... ..2. ...... mm. .52. ...... ...... ...... .2. ...2. ...... .... 3.... ...... ...... mm... .....u: 8.....- 8....- t.....- 2....- ......- ......- No.....- .....- ......- m..... o... Em $..N 8... ...... ...... ...... ... $2.. ...... to... c. . .. .. .... mm. .2... 2.... $2. ..2. ...... ...... ...... ...... ...... ...... .2. mm... .....n: 8.....- $2.- 3....- .8...- m.....- 82.. 2....- .......- .......- .... .- .....- Em .... ....N .8... $3. ...... S... ...... t... ...... ...... ...... mm. $2. E... ...... ...... .$... ...... .... at... ...... ...... .... mmzm own: ..... ...... ...... 5..... ...... ...... ...... ...... ...... ......m ......m 0.. 0.. 0.. 0.. 0.. 0.. 0.. .... 0.. 1325...:- 13223. .- 235809.51”. Om Om 0mm Eon. $332.. Macczwm m...:mu.$&=~.m VKQR ..QKE...M..S.. bufifizw ...N mag..- 68 ...... ......- m..... 8....- ..N..- .8...- N. m. .- to. .- .... .- .... .- ......- Em ....... ...... ...... .R... S... ...... ...... ...... ...... ...... ...... mm. 2.... ...... 5.... 8.... .... 8.... .... .... .... ...... ...... mm... ......n: E... ......- ..:.... .....- ......- ma.... 3....- S...- 8..- ......- .....~- Em ...... ...... ...... ...... 3..... ...... .9... 3..... ...... ...... ...... mm. ...... ...... 3..... 3.... .... ...... ...... ...... ...... a... ...... mm... .....u: .3... E...- E... 2......- m.... ......- Sm. .- ...- ......- .....- ......- Em m... .9... ...... no... ...... 3..... .8... S... ...... 8... .2... mm. 3..... ...... m2... ...... .... ...... ...... .... a... .... .... mm... .....u: ...... 3..... N..... E...- ......- ......- ......- .....- o...- v....- ......- Em .m..~ .... .... .... ...... .8. N. . .. ...... .3... .. ... .... mm. 3..... ...... 9.... ...... ...... v.5... ...... a... .... N... .0... mms... .....u: 3...... ......- vec... ......- m..... .0...- S...- ......- .....- .....- .....- Em .2... N... .2... .... .. ... ...... a... .... ...... ...... .... mm. S... 9.... .... ...... ...... .8... t... .... 
.... .... 2.... mm... cm”: .... ...... ...... 5...: ...-.... ...... ...-... 5..... ...... .85. ...-...... 0.. 0.. .... 0.. o...- o.. .... .... .... ECOE—uEr—L 3:036qu .3054 Om Om 0mm 20.. @3332. M533? .32.... wRQK .o\8.zm.:3m 535.22% .mm mama...- 69 3..... 8.... 2.....- S.....- .....- ......- m....- ......- E...- .....- ......- Em a... .8... ...... 2K... ...... .... ...... ...... ...... ...... 3.... mm. .. .... ...... .... .9... ...... .3... .... ...... .... ...... .... mmsa 8...”: ...... a... .- ......- m.....- ......- va..- t .. .- .....- .... .- .2... E..- Em ...... ...... .... ...... ...... 2.... .5... ...... ...... ...... 8.... mm. ...... .... ...... ...... 8... .3... ...... ...... ... 8... N... mm... ...”: ...... ......- .o.....- .5...- N....- ...-....- ..m..- ......- m...- .....- .E..- Em N... N... ...... N... 0..... .... ...... ...... ...... .8... E... mm. ...... ...... ...... 9...... ...... 9.0... 8... 5... N... ...... ...... mm... .....u: N... .....- =......- m.....- 8...- ...o..- .....- mm... 8.... N...- ..8..- Em .3...- o... .... a... .... ...... ...... t... ...... ...... 8.... mm. =..... ...... .... ...... ...... ...... ...... .... .... .... a... mm... .....nc ...... .....- ......- mfié .....- R...- N....- 2....- S...- S...- ......- Em c... ...... a... .... ...... .... .... 3..... .... .... 8... mm. ...... S... ...... ...... .... ...... a... .... .3. R... 8... mm... ...”: .... a... ...... 5...... ...... a... ...-... ...-...... ...-... ......m ..=..m .... 0.. .... .... .... 0.. o... 0.. .... 22.2.6...- _.:o_.€E-_. 2:582:35 Om Om 0mm 20.. 3.83»... 353%.. m...§m9.m&.§@m ...-ka ENSEMQS. 532.53% .cm mam/D- 7O ......- ....m- .8...- o.....- ......- c....- ......- o..... .3..- o....- .....- Em 5.... ...... 5.... ...-... 5.... .0... ...... ...... ...... ...... ...... mm. ...... ...... ...... ...... ...... .... ...... E... .... t... 5..... me)... ......u: 3......- ..o..- ......- m.....- ......- m...- mm... m..... .3..- .. ...- ......- Em S... 3.... 5.... .... 
Sm... ...... 9.... =...... ...... ...... ...... mm. 3..... ...... ...... ...... .... .... ...... E... ...... ...... ...... mmzm ...”: ......- ........- .8... .N....- ......- .....- ......- RE- .....- ......- .......- Em ...... mm... ...... .... ...... .8... .8... ...... ...... .8... N... mm. ...... 3..... mm... ...... ...... .... ...... 8.... a... .... ......- mm... ...”: ......- .o..... S...- m...... .....N. .....- .....- E...- ...o..- ......- S....- Em ...... ...... ...... a... ...... a... ...... .... ...... .... ...... mm. ...... ...... ...... .... .... ...... a... 8.. .3. .... ...... mm... .....u: S...- S..... ......- v... . - ......- m . .. .- 3....- S... .3..- . ...- ......- Em =...... t... =...... ...... .... .... ...... .... ...... .... 2... mm. 3..... ...... 3.... ...... .... .... .... .... ...... ...... =..... mm... on": .... ...... ...... 5...... ...... ...... ...... 5..... ...... $2... 32.. om. .... .... .... .... .... .... .... .... 3.856.:- _..:o_.=.fih 50:5 Om Om 0mm ...—Om £852.. 353.3 .82.... “.050. ..o\..u.:w.:S-n buEEzm. . hm mama...- 71 owed ...—Qw- wmmd- .mwd- So. T Sm. T NNNN- .NNN- End. 39. T hch- ..E mow. wad can... no... Yo wmvd EH... Sm... mum... wmmd Km... mmm 9.2. gm...” mow... co... mmod mom. .om.~ ...-Nd end omcd 3..-N mmSE coo—n: m5... mom. T Swo- mowd- 33. T mmm. T wmmd- m _ Nd- oovd- mmo. T awed- ...m coo.— Noo. 03.. N _ wd . _m.o mmcd 53.... wow... «mm... 5...... won... mmm mm... Emd mm»... 2...... Sued coo. :m.~ wow-d omm.m god wmnd mmEm , com”: 8... mood- m3..- mmwd- 53. T mom. T m _ NN- mead- VmVN- 3o. T wvoN- ...m 03.. mmm. .9... v.0... 3...... 7%... «and mm... mow... mom... wt... mmm Em... CNN was... om... ..moN Non. mom-N SNN meN 8N Sud ”mm—2x com”: owe-o- SQN- mwmd- coT mo_.m- mmm..- mvmd- omfim- own-N- .moT mvod- 2.5 20d 2...— mmcd 3N.N mum. NS... m2... 3m... wwwd wmo._ mmwd mmm .m... mmd 0..-ed no... wmmd woo. 
The results indicate that the KE methods can approximate their corresponding traditional equating methods. No large differences were found between the KE equating methods and their corresponding traditional equating methods (e.g., KE linear and traditional linear, KE equipercentile and traditional equipercentile). This is consistent with the results of evaluation studies for KE, such as Mao, von Davier, and Rupp (2005) and von Davier, Holland, Livingston, and others (2005).

Compared to the standard error of equating, the equating bias index depends less on sample size. Given the same equating method, the equating bias does not change a great deal as sample size increases. However, the standard error of equating decreases conspicuously as sample size increases. The more data we have, the more information we can use to estimate the equating relationship, and the less equating error there will be. This feature of SEE follows from its calculation formula.

When using RMSE as a means of evaluating equating functions, it was found that: a) When DOE is almost zero, pooling the two samples together or using the 2SG approach with weights of (.5, .5) are the optimal equating methods, with small standard error of equating and small bias; b) As DOE increases, the 2SG methods under the KE framework with different weights can provide optimal equating results with the smallest RMSE.
The weights for the 2SG approach get larger as DOE increases; c) When the size of DOE reaches a certain point, treating data collected in a CB design as an EG design will be the best equating solution. The weights of the 2SG approach will become 1. The equating method could be either 2SG(1, 1) or the traditional linear or equipercentile method.

4.3.3 Evaluating the Equating Results by SEED

Equating differences were compared against their 95% confidence intervals for all the sample size conditions under each population. The last graph in Figure 12 plots the equating differences between the 2SG(.5, .5) linear and 2SG(.5, .5) equipercentile methods for simulated data 1 when sample size is 1000. The straight horizontal line in the middle is the zero line. The equating differences, represented by solid dots, lie around the zero line within the ±2SEED band. The other five graphs present the equating differences between the 2SG equipercentile equating with weights of (.5, .5) and (1, 1) for different sample sizes drawn from simulated data 1. It can be seen from these plots that SEED gets larger when the equating methods are different from each other and when sample size decreases. Among the graphs in Figure 12, the last graph exhibits the smallest SEED, showing that the 2SG methods with the same weighting parameters provide more similar equating results than the 2SG methods with different weighting parameters.

Furthermore, the plots in Figure 12 indicate that under a certain order effect situation, the equating difference stays relatively unchanged, but SEED decreases as sample size increases. Therefore, the significance of the equating difference mostly depends on the sample size. If the equating differences between two methods fall beyond the ±2SEED band when sample size is 500, they must also be out of the band when sample size is 1000.
Conversely, if the equating difference is not significant when sample size is 1000, then it must not be significant when sample size is 500.

More SEED plots are provided in the appendix. Most of the SEED plots are for the differences between the 2SG(.5, .5) method and the 2SG(1, 1) method. The rationale for not comparing the 2SG(1, 1) method with the 2SG method with any weights between 0.5 and 1 is as follows: In equating for a CB design with differential order effect, the 2SG approach with weights of (1, 1) has no equating bias. The 2SG approach with weights of (.5, .5) will have the biggest equating bias. If the equating difference between 2SG(.5, .5) and 2SG(1, 1) is not significant, then the equating difference between 2SG(1, 1) and a 2SG approach with any weights between 0.5 and 1 will not be significant.

All the SEED plots for the simulated datasets indicate that none of the equating differences between methods 2SG(.5, .5) and 2SG(1, 1) under different sample size conditions of population data 1, data 2 and data 3 are significant. Therefore the bias introduced by using data from tests taken second can be ignored. Thus the 2SG approach with weights of (.5, .5) can be selected as the best equating line for simulated data 1, data 2 and data 3 when the effect size of DOE is relatively small.

[FIGURE 12. SEED plots for simulated data 1: six panels plotting the equating difference against raw score, each with the zero line and the +2SEED and -2SEED bands. The plot contents are not recoverable from the source.]
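The band check used to read the SEED plots can be sketched in a few lines of Python. This is an illustration only, not the procedure of the KE software: the function name `outside_seed_band` and the sample values are hypothetical, and the SEED values are assumed to have been computed already at each raw score.

```python
def outside_seed_band(differences, seed, multiplier=2.0):
    """Return the indices of score points whose equating difference
    lies outside the +/- multiplier * SEED band.

    differences: equating differences between two methods, one per raw score
    seed: standard errors of equating difference at the same raw scores
    """
    if len(differences) != len(seed):
        raise ValueError("differences and seed must have the same length")
    return [
        i
        for i, (d, s) in enumerate(zip(differences, seed))
        if abs(d) > multiplier * s
    ]


# Hypothetical values at three raw-score points (not taken from the study).
example_differences = [0.1, -0.5, 0.9]
example_seed = [0.2, 0.2, 0.4]
flagged = outside_seed_band(example_differences, example_seed)
```

An empty result corresponds to the situation described for simulated data 1 to 3, where no difference leaves the band. Because each score point is checked separately, the sketch shares the per-point nature of the SEED plot itself.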
As DOE increases in simulated data 4, the equating difference between methods 2SG(.5, .5) and 2SG(1, 1) falls beyond the 95% confidence interval when sample size is 1000. In this case, the 2SG approach with weights of (1, 1) is preferred to avoid the equating bias introduced by including data from X2 and Y2. This is also the case for data 5 when sample size is 500 and 1000 and for data 6 when sample size is 300, 500 and 1000. Table 29 summarizes the equating functions selected by using SEED plots for different samples under different order effect situations. It shows that the EG design (the 2SG approach with weights of (1, 1)) is more appropriate toward the lower right corner, where both DOE and sample size are larger.

TABLE 29.
Selected equating function based on SEED

DOE       n=50          n=100         n=300         n=500         n=1000
d=0       2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)
d=0.025   2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)
d=0.05    2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)
d=0.1     2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (1, 1)
d=0.15    2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (1, 1)    2SG (1, 1)
d=0.2     2SG (.5, .5)  2SG (.5, .5)  2SG (1, 1)    2SG (1, 1)    2SG (1, 1)

TABLE 30. Selected equating function based on RMSE

DOE       n=50          n=100         n=300         n=500         n=1000
d=0       2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)  2SG (.5, .5)
d=0.025   2SG (.5, .75) 2SG (.5, .75) 2SG (.5, .75) 2SG (.5, .75) 2SG (.5, .75)
d=0.05    2SG (.9, .9)  2SG (.9, .9)  2SG (.9, .9)  2SG (.9, .9)  2SG (.9, .9)
d=0.1     2SG (1, 1)    2SG (1, 1)    2SG (1, 1)    2SG (1, 1)    2SG (1, 1)
d=0.15    2SG (1, 1)    2SG (1, 1)    2SG (1, 1)    2SG (1, 1)    2SG (1, 1)
d=0.2     2SG (1, 1)    2SG (1, 1)    2SG (1, 1)    2SG (1, 1)    2SG (1, 1)

Comparing Table 30 with Table 29, it can be found that the RMSE and SEED statistical indices produce the same results when DOE is almost zero and when DOE is large (effect size > 0.2 in this case). When the effect size of DOE is within a certain small range, the RMSE can provide a more fine-grained equating solution. This is where the weighting method comes into play.

Chapter V: Discussion

5.1 Performance of the KE Methods

The results of this study are consistent with previous studies that compared the KE methods with the traditional equating methods. In general, the KE methods produce results very similar to their corresponding traditional equating methods. These similarities in equating results support the KE method as a promising unified approach to test equating based on a flexible family of equipercentile-like equating functions. All of the classical observed-score equating methods can be incorporated into its framework.
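To illustrate how one family of functions can cover both linear and equipercentile-like equating, the following Python sketch implements Gaussian kernel continuization in the form commonly given for KE (von Davier, Holland, & Thayer, 2004). It is a minimal sketch under stated assumptions, not the software used in this study: the function names, the score distributions, and the bandwidth values are hypothetical, and the presmoothing step is omitted.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mean_and_sd(scores, probs):
    """Mean and standard deviation of a discrete score distribution."""
    mu = sum(p * x for x, p in zip(scores, probs))
    var = sum(p * (x - mu) ** 2 for x, p in zip(scores, probs))
    return mu, math.sqrt(var)


def kernel_cdf(x, scores, probs, h):
    """Gaussian-kernel continuized CDF with bandwidth h (h > 0)."""
    mu, sd = mean_and_sd(scores, probs)
    # Shrinkage factor chosen so the continuized distribution keeps
    # the mean and variance of the discrete distribution.
    a = sd / math.sqrt(sd ** 2 + h ** 2)
    return sum(
        p * normal_cdf((x - a * s - (1.0 - a) * mu) / (a * h))
        for s, p in zip(scores, probs)
    )


def kernel_equate(x, x_scores, x_probs, y_scores, y_probs, h_x, h_y):
    """Equate x to the Y scale by solving G_h(y) = F_h(x) with bisection."""
    target = kernel_cdf(x, x_scores, x_probs, h_x)
    _, sd_y = mean_and_sd(y_scores, y_probs)
    lo = min(y_scores) - 10.0 * (sd_y + 1.0)
    hi = max(y_scores) + 10.0 * (sd_y + 1.0)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, y_scores, y_probs, h_y) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical five-point score distributions for two test forms.
scores = [0, 1, 2, 3, 4]
probs_x = [0.1, 0.2, 0.4, 0.2, 0.1]
probs_y = [0.2, 0.3, 0.3, 0.1, 0.1]

# A small bandwidth gives an equipercentile-like function; a very large
# bandwidth gives a function close to the linear equating line.
equipercentile_like = kernel_equate(3, scores, probs_x, scores, probs_y, 0.6, 0.6)
nearly_linear = kernel_equate(3, scores, probs_x, scores, probs_y, 1000.0, 1000.0)
```

In this sketch the bandwidth is the only setting that changes between the two calls, which is the sense in which the classical methods sit inside one framework.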
The summary statistics in Table 17 to Table 28 indicate that the 2SG(.5, .5) linear method and the SG linear method produce very similar equating results in terms of SEE, equating bias and RMSE. Similarly, the 2SG(1, 1) linear and traditional EG linear equating methods provide equating results very close to each other, as do the 2SG(.5, .5) equipercentile, SG KE equipercentile and traditional SG equipercentile equating methods. The equating differences between the 2SG(1, 1) equipercentile method and the traditional EG equipercentile method are small as well. Although the summary statistics in Table 17 to Table 28 indicate that their equating difference is larger than the differences between the other approximation pairs discussed above, the actual differences between their equating functions are smaller than 1 raw score point for any score point above the chance score, which is not a large difference. Figure A28 to Figure A34 plot the equating differences between the 2SG(1, 1) equipercentile method and the traditional EG equipercentile method for selected cases. The equating differences between these two methods are the biggest in simulated data 6.

KE provides the SEED statistic for examining the equating difference between two KE methods. The usefulness of this statistic is discussed below.

5.2 Effects of the Weighting Method

The overall equating accuracy consists of two parts: random equating error (SEE) and systematic error (equating bias). When a CB design is used to collect data for an equating, the 2SG approach under the KE framework attempts to provide an optimal equating solution with the least overall equating error, which is indicated by the magnitude of RMSE in this study. In the rest of this section, the effect of the weighting method in enhancing overall equating accuracy is discussed in terms of both equating bias and the overall equating error.
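The two-part view of equating error described above can be made concrete with a short Python sketch. It assumes that equated scores at a single raw-score point are available from repeated samples and that a criterion value is known; the function names and numbers are hypothetical, and the way this study aggregated the indices across score points is not reproduced here.

```python
import math


def error_components(estimates, criterion):
    """Return (bias, SEE, RMSE) for replicated equated scores at one score point.

    The standard deviation divides by the number of replications, so that
    RMSE**2 == bias**2 + SEE**2 holds exactly.
    """
    r = len(estimates)
    mean_est = sum(estimates) / r
    bias = mean_est - criterion
    see = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / r)
    rmse = math.sqrt(sum((e - criterion) ** 2 for e in estimates) / r)
    return bias, see, rmse


def weights_with_smallest_rmse(replications_by_weight, criterion):
    """Pick the candidate weight pair whose replications give the smallest RMSE."""
    return min(
        replications_by_weight,
        key=lambda w: error_components(replications_by_weight[w], criterion)[2],
    )


# Hypothetical replications: (.5, .5) is stable but biased,
# (1, 1) is unbiased but more variable.
candidates = {
    (0.5, 0.5): [10.9, 11.1],
    (1.0, 1.0): [9.5, 10.5],
}
selected = weights_with_smallest_rmse(candidates, criterion=10.0)
```

With these made-up numbers the bias of the (.5, .5) weights outweighs their smaller random error, so the (1, 1) weights are selected; with a smaller bias the choice would reverse.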
The study results based on both real and simulated data indicate that the weighting mechanism is effective to some extent. As DOE gets larger, the weights with the smallest RMSE also increase (as indicated in Table 30 for simulated data 2 and data 3). Because random equating error increases as weights increase, the reduction in RMSE must be due to the reduction of equating bias. Therefore, the results of this study demonstrate that the 2SG approach can reduce systematic equating error by adjusting the weights placed on the data from tests taken first. However, the reduction in equating bias is not significant as indicated by the SEED plots (as indicated in Table 29 for simulated data 2 and data 3). The reduction of equating bias is only significant when sample size is large enough and when DOE is big enough. When this happens, the weights in the 2SG approach will be (1, 1), which indicates an EG design.

The reason for the small amount of improvement in terms of RMSE is that, as DOE gets larger, examinees' performance on the second test will be more affected by order effects and will be less accurate. Thus the 2SG approach assigns more weight to the tests taken first to reduce the bias introduced by order effects. The bigger the order effects, the more weight will be put on the tests taken first to reduce bias. However, the more weight on the first tests, the bigger the random equating errors are. Because of this trade-off between random equating error and systematic equating error, when both are considered together, the equating error in terms of RMSE does not seem to be reduced much.

The findings of this study support the 2SG approach as a sensitive approach with the flexibility of using optimal data information as the size of order effects changes. The RMSE index provides more detailed information and can help decide which weights to use.
However, trying every possible weight between 0.5 and 1 to decide the fine-grained weights using the criterion of RMSE involves lengthy calculations. Other possible ways of determining how to treat the data collected by a CB design could be the hypothesis testing of DOE introduced in the method section and the SEED method applied in this study. If the hypothesis test of DOE is not significant, the data collected by a CB design shall be pooled together as an SG design. Otherwise, the data shall be treated as an EG design. The SEED plot method tests the significance of the equating difference between 2SG(.5, .5) and 2SG(1, 1). If the equating difference is not significant, the 2SG(.5, .5) method will be used, i.e., data from the two samples will be pooled together and will be treated as an SG design. Otherwise, if the equating difference is significant, the 2SG(1, 1) method will be used, i.e., the data in a CB design will be treated as an EG design. These two methods may not be as accurate as the RMSE method, but they are simpler to carry out in practice. Further study can investigate how consistent the decisions are when using these three methods to select the best equating design.

Finally, the results of this study suggest that the advantage of collecting data using a CB design over an EG design appears only when the magnitude of DOE is small. When DOE is within a small range, data from the two groups can be pooled together using different weights to reduce the overall equating error. However, when DOE is large, information from tests taken second will make no contribution to improving the overall equating accuracy. On the other hand, this study alerts us to the importance of implementing random sampling and random assignment in a CB design.

5.3 Limitations of This Study

One concern about real data 2 is that test X and test Y have different test-retest reliabilities, e.g., r(X1, Y2) = 0.64, r(X2, Y1) = 0.74.
Effort was made to enhance the reliability of test X and to make it equal to the reliability of test Y. One way was to remove items on test X that had low correlation with the test score of Y2. This attempt was not successful: the reliability of test Y increased by a similar amount as the reliability of test X. As a result, the equatings were conducted on real data 2 disregarding the issue of unequal reliabilities.

The average equating bias reported in this study also has a disadvantage. When all the conditional equating differences are averaged, negative bias at some raw-score levels cancels out positive bias at other raw-score levels.

5.3.1 Arbitrary Nature of the Equating Criterion

In this study, the equating criterion for each population was selected to be the results of traditional equipercentile equating. It might be interesting to regard the results of an IRT-based equating method as the equating criterion for each population. However, from the author's point of view, this would not change the patterns of the equating differences between different methods much, since Lord and Wingersky (1984) found that IRT true score equating and equipercentile observed score equating yield almost indistinguishable results using a sample of size around 3000.

5.3.2 Problem with Simulated Data

Besides the 3PL IRT model, the one-parameter and two-parameter IRT models were also applied to simulate data in this study. Compared to the 1PL or 2PL model, the distributions of data simulated by using the 3PL model better represent the distributions of real data 1 in terms of the minimum observed score level, the mean scores, the skewness and the kurtosis statistics. Although efforts were made to make the simulated data as close as possible to a real dataset, as in many simulation studies, it is uncertain to what extent the simulated data represent real order effects in a real CB design.
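For readers unfamiliar with how number-correct scores are generated from a 3PL model, a minimal Python sketch follows. It is not the simulation code of this study: the item parameters, the standard normal ability distribution, the scaling constant 1.7, and the function names are all assumptions made for illustration, and no order effect is modeled.

```python
import math
import random


def p_correct_3pl(theta, a, b, c, scale=1.7):
    """3PL probability of a correct response.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    scale=1.7 is the usual normal-ogive scaling constant (an assumption here).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def simulate_scores(n_examinees, items, rng):
    """Simulate number-correct scores for examinees with N(0, 1) abilities.

    items: list of (a, b, c) tuples, one per item.
    rng: a random.Random instance, so that runs can be reproduced.
    """
    scores = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        score = sum(
            1 for (a, b, c) in items if rng.random() < p_correct_3pl(theta, a, b, c)
        )
        scores.append(score)
    return scores


# Hypothetical 10-item test; setting c = 0 for every item would give a 2PL model.
example_items = [(1.0, -1.0 + 0.2 * j, 0.2) for j in range(10)]
example_scores = simulate_scores(100, example_items, random.Random(2007))
```

The lower asymptote c keeps the expected score of low-ability examinees above zero, which is consistent with the remark above that the 3PL model reproduced the minimum observed score level of the real data better than the 1PL or 2PL model.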
5.4 Future Study

The 95% confidence interval in the current SEED plot is two times the conditional standard error of equating difference at each raw score level, which means that the current SEED plot conducts an independent t-test at each score level to examine the significance of the equating difference. One drawback of the current SEED plot is that it does not control the family-wise error rate. Since the error rate at each score level is 0.05, the overall error rate across the whole score scale must be larger than 0.05. When the attention is on the equating difference at a particular cut score or within a small score range, it is fine to apply the ±2SEED confidence interval at each score level. Nevertheless, when a statement is needed about the overall equating differences across the whole score scale, a multivariate global test will need to take into account the dependency among score points and to control the family-wise error rate. Future study can explore how to develop such an overall test for the significance of the global equating difference between two equating methods.

APPENDICES

TABLE A1. Standard error of linear equating for real data 1

Data 1  2SG KE     2SG KE     2SG KE     2SG KE     2SG KE     SG Traditional  EG Traditional
X       (.5,.5)    (.5,.75)   (.75,.5)   (.75,.75)  (1,1)      Linear          Linear
0       1.334      1.634      1.776      2.433      4.674      1.334           4.648
1       1.311      1.607      1.745      2.393      4.533      1.311           4.575
2       1.288      1.579      1.715      2.354      4.405      1.288           4.503
3       1.265      1.552      1.685      2.315      4.355      1.265           4.431
4       1.242      1.525      1.655      2.276      4.326      1.242           4.359
5       1.219      1.497      1.624      2.236      4.287      1.218           4.287
6       1.196      1.47       1.594      2.197      4.215      1.195           4.215
7       1.173      1.443      1.564      2.158      4.144      1.172           4.143
8       1.15       1.415      1.534      2.12       4.072      1.149           4.071
9       1.127      1.388      1.504      2.081      4          1.127           3.999
10      1.104      1.361      1.474      2.042      3.928      1.104           3.928
11      1.081      1.334      1.444      2.003      3.857      1.081           3.856
12      1.058      1.307      1.414      1.965      3.786      1.058           3.785
13      1.035      1.28       1.385      1.926      3.715      1.035           3.714
14      1.013      1.253      1.355      1.888      3.643      1.013           3.643
15      0.99       1.227      1.325      1.849      3.573      0.99            3.572
16      0.967      1.2        1.296      1.811      3.502      0.967           3.501
17      0.945      1.174      1.266      1.773      3.431      0.945           3.431
18      0.923      1.147      1.237      1.735      3.361      0.922           3.361
19      0.9        1.121      1.208      1.697      3.291      0.9             3.291
20      0.878      1.095      1.178      1.66       3.221      0.878           3.221
21      0.856      1.068      1.149      1.622      3.151      0.856           3.151
22      0.834      1.042      1.12       1.585      3.082      0.834           3.082
23      0.812      1.017      1.092      1.548      3.013      0.812           3.012
24      0.79       0.991      1.063      1.511      2.944      0.79            2.943
25      0.768      0.965      1.034      1.474      2.875      0.768           2.875
26      0.746      0.94       1.006      1.437      2.807      0.746           2.807
27      0.725      0.915      0.978      1.401      2.739      0.725           2.739
28      0.703      0.89       0.95       1.365      2.671      0.703           2.671
29      0.682      0.865      0.922      1.329      2.604      0.682           2.604
30      0.661      0.841      0.894      1.293      2.537      0.661           2.537
31      0.64       0.816      0.867      1.258      2.471      0.64            2.471
32      0.62       0.793      0.84       1.223      2.405      0.62            2.405
33      0.6        0.769      0.813      1.189      2.34       0.6             2.34
34      0.58       0.746      0.786      1.155      2.275      0.58            2.275
35      0.56       0.723      0.76       1.121      2.211      0.56            2.211
36      0.54       0.7        0.735      1.088      2.148      0.54            2.148

TABLE A1.
Continued

Data 1  2SG KE     2SG KE     2SG KE     2SG KE     2SG KE     SG Traditional  EG Traditional
X       (.5,.5)    (.5,.75)   (.75,.5)   (.75,.75)  (1,1)      Linear          Linear
37      0.521      0.678      0.709      1.056      2.086      0.521           2.086
38      0.503      0.657      0.685      1.024      2.024      0.503           2.024
39      0.485      0.636      0.661      0.992      1.963      0.485           1.963
40      0.467      0.616      0.637      0.962      1.903      0.467           1.903
41      0.45       0.596      0.614      0.932      1.845      0.45            1.845
42      0.433      0.577      0.592      0.903      1.787      0.433           1.787
43      0.418      0.559      0.571      0.875      1.731      0.418           1.731
44      0.403      0.542      0.55       0.848      1.677      0.403           1.677
45      0.389      0.526      0.531      0.822      1.623      0.389           1.623
46      0.376      0.511      0.513      0.797      1.572      0.376           1.572
47      0.364      0.497      0.496      0.774      1.523      0.364           1.523
48      0.353      0.484      0.481      0.752      1.476      0.353           1.476
49      0.344      0.473      0.467      0.732      1.431      0.344           1.431
50      0.336      0.463      0.455      0.714      1.389      0.336           1.389
51      0.33       0.455      0.445      0.697      1.349      0.33            1.349
52      0.325      0.449      0.437      0.683      1.313      0.325           1.313
53      0.322      0.444      0.431      0.671      1.28       0.322           1.28
54      0.32       0.441      0.427      0.661      1.251      0.32            1.251
55      0.321      0.44       0.425      0.653      1.225      0.321           1.225
56      0.323      0.441      0.426      0.648      1.204      0.323           1.204
57      0.327      0.444      0.429      0.646      1.187      0.327           1.187
58      0.333      0.448      0.434      0.646      1.175      0.333           1.175
59      0.34       0.455      0.442      0.649      1.167      0.34            1.167
60      0.349      0.463      0.451      0.654      1.164      0.349           1.164
61      0.359      0.472      0.463      0.662      1.166      0.359           1.166
62      0.37       0.483      0.476      0.672      1.172      0.37            1.172
63      0.383      0.496      0.491      0.684      1.183      0.383           1.183
64      0.397      0.509      0.507      0.699      1.199      0.396           1.199
65      0.411      0.524      0.525      0.715      1.22       0.411           1.219
66      0.426      0.54       0.543      0.734      1.244      0.426           1.244
67      0.443      0.557      0.563      0.754      1.272      0.442           1.272
68      0.459      0.575      0.584      0.776      1.304      0.459           1.304
69      0.477      0.594      0.606      0.799      1.34       0.477           1.34
70      0.495      0.614      0.629      0.824      1.378      0.495           1.379
71      0.513      0.634      0.652      0.85       1.42       0.513           1.42
72      0.532      0.655      0.676      0.877      1.464      0.532           1.464
73      0.551      0.676      0.701      0.905      1.511      0.551           1.511
74      0.571      0.698      0.726      0.934      1.559      0.571           1.56
75      0.591      0.72       0.751      0.964      1.61       0.591           1.61

TABLE A2.
Standard error of equipercentile equating for real data 1
Data 1
    KE          KE          KE          KE          KE          SG              EG
    2SG         2SG         2SG         2SG         2SG         Traditional     Traditional
X   (.5,.5)     (.5,.75)    (.75,.5)    (.75,.75)   (1,1)       Equipercentile  Equipercentile
0   1.218       1.269       1.169       1.272       1.778       0               0
1   1.354       1.432       1.349       1.511       2.328       1.159           1.025
2   1.384       1.477       1.409       1.609       2.664       2.318           2.049
3   1.383       1.485       1.428       1.657       2.896       3.478           3.073
4   1.369       1.478       1.43        1.683       3.068       4.512           3.981
5   1.348       1.464       1.424       1.698       3.2         5.325           4.751
6   1.324       1.445       1.414       1.707       3.303       5.928           5.341
7   1.298       1.424       1.4         1.712       3.383       6.335           5.791
8   1.27        1.401       1.385       1.714       3.444       6.65            6.231
9   1.241       1.378       1.369       1.714       3.49        6.724           6.366
10  1.212       1.353       1.352       1.712       3.523       6.774           6.485
11  1.182       1.328       1.335       1.708       3.545       6.777           6.551
12  1.152       1.303       1.317       1.703       3.556       6.762           6.582
13  1.122       1.278       1.298       1.697       3.557       6.755           6.622
14  1.092       1.252       1.28        1.689       3.551       6.747           6.673
15  1.063       1.227       1.26        1.679       3.536       6.758           6.744
16  1.033       1.201       1.241       1.668       3.515       6.778           6.821
17  1.004       1.176       1.221       1.655       3.486       5.972           6.441
18  0.975       1.15        1.2         1.64        3.451       5.548           6.292
19  0.947       1.125       1.179       1.624       3.41        5.347           6.214
20  0.919       1.1         1.158       1.606       3.363       3.977           5.787
21  0.892       1.075       1.136       1.586       3.311       3.277           5.531
22  0.865       1.05        1.113       1.564       3.254       1.957           5.042
23  0.839       1.026       1.09        1.541       3.193       1.516           4.665
24  0.813       1.001       1.067       1.516       3.127       1.284           4.371
25  0.788       0.977       1.043       1.49        3.058       1.018           4.131
26  0.763       0.953       1.018       1.462       2.987       0.866           3.697
27  0.739       0.929       0.993       1.433       2.912       0.801           3.397
28  0.715       0.905       0.968       1.403       2.837       0.786           3.201
29  0.692       0.881       0.942       1.371       2.759       1.121           3.049
30  0.669       0.858       0.916       1.339       2.681       1.197           2.39
31  0.646       0.834       0.889       1.305       2.603       1.267           2.266
32  0.624       0.811       0.862       1.271       2.524       1.201           2.194
33  0.603       0.788       0.836       1.237       2.446       1.027           2.283
34  0.582       0.765       0.809       1.202       2.368       0.762           2.493
35  0.561       0.742       0.782       1.167       2.292       0.835           2.671
36  0.541       0.72        0.755       1.131       2.217       1.085           2.757
37  0.521       0.697       0.728       1.096       2.143       1.344           2.797
38  0.502       0.675       0.702       1.062       2.071       1.305           2.848
39  0.483       0.654       0.676       1.028       2.001       1.279           2.884
40  0.465       0.633       0.651       0.995       1.934       1.071           3.091
41  0.448       0.613       0.626       0.962       1.869       0.849           3.152
42  0.431       0.594       0.603       0.931       1.807       0.875           2.997
43  0.415       0.575       0.58        0.902       1.748       1.047           2.818
44  0.401       0.558       0.559       0.873       1.693       1.008           2.715
45  0.387       0.542       0.539       0.847       1.641       0.929           2.726
46  0.374       0.527       0.52        0.823       1.593       0.772           2.568
47  0.363       0.513       0.504       0.801       1.548       0.831           2.431
48  0.353       0.502       0.489       0.781       1.508       0.896           2.241
49  0.345       0.492       0.477       0.764       1.472       0.764           2.024
50  0.338       0.484       0.467       0.749       1.441       0.687           1.94
51  0.334       0.478       0.46        0.737       1.413       0.75            1.983
52  0.332       0.474       0.455       0.727       1.389       0.934           1.989
53  0.332       0.473       0.454       0.721       1.369       0.988           1.907
54  0.334       0.473       0.454       0.717       1.353       0.745           1.832
55  0.338       0.476       0.458       0.715       1.339       0.619           1.626
56  0.344       0.48        0.464       0.715       1.328       0.59            1.442
57  0.352       0.487       0.472       0.718       1.319       0.574           1.353
58  0.362       0.495       0.482       0.722       1.312       0.539           1.322
59  0.374       0.504       0.494       0.727       1.306       0.541           1.276
60  0.387       0.515       0.508       0.734       1.299       0.5             1.242
61  0.401       0.526       0.522       0.741       1.293       0.534           1.308
62  0.416       0.538       0.538       0.748       1.285       0.623           1.442
63  0.432       0.551       0.554       0.755       1.276       0.86            1.501
64  0.448       0.563       0.569       0.761       1.265       0.928           1.51
65  0.464       0.575       0.585       0.767       1.251       0.984           1.471
66  0.479       0.586       0.599       0.771       1.234       0.682           1.4
67  0.494       0.596       0.613       0.772       1.212       0.486           1.276
68  0.508       0.604       0.624       0.771       1.185       0.613           1.177
69  0.519       0.608       0.632       0.766       1.152       0.745           1.23
70  0.527       0.609       0.635       0.755       1.109       0.945           1.032
71  0.53        0.603       0.632       0.736       1.054       0.859           1.051
72  0.524       0.585       0.618       0.704       0.98        0.58            1.142
73  0.502       0.549       0.585       0.648       0.876       0.88            1.353
74  0.452       0.479       0.516       0.553       0.717       1.326           1.576
75  0.38        0.385       0.419       0.429       0.491       0               0

TABLE A3.
Standard error of linear equating for real data 2
Data 2
    KE          KE          KE          KE          KE          SG          EG
    2SG         2SG         2SG         2SG         2SG         Traditional Traditional
X   (.5,.5)     (.5,.75)    (.75,.5)    (.75,.75)   (1,1)       Linear      Linear
0   0.341       0.354       0.403       0.454       0.767       0.341       0.767
1   0.318       0.331       0.377       0.425       0.718       0.318       0.718
2   0.296       0.309       0.352       0.397       0.67        0.296       0.67
3   0.274       0.287       0.327       0.37        0.623       0.274       0.623
4   0.252       0.265       0.302       0.343       0.577       0.252       0.577
5   0.231       0.244       0.278       0.317       0.533       0.231       0.533
6   0.211       0.224       0.255       0.293       0.491       0.211       0.491
7   0.191       0.205       0.233       0.27        0.452       0.191       0.453
8   0.173       0.187       0.212       0.249       0.417       0.173       0.417
9   0.156       0.172       0.193       0.23        0.386       0.156       0.387
10  0.141       0.158       0.176       0.215       0.362       0.141       0.362
11  0.129       0.148       0.162       0.203       0.344       0.129       0.344
12  0.121       0.141       0.151       0.195       0.334       0.121       0.334
13  0.117       0.138       0.146       0.193       0.333       0.117       0.333
14  0.118       0.14        0.145       0.196       0.341       0.118       0.341
15  0.124       0.147       0.15        0.204       0.357       0.124       0.357
16  0.134       0.157       0.16        0.216       0.381       0.134       0.381
17  0.147       0.17        0.173       0.232       0.411       0.147       0.411
18  0.163       0.185       0.189       0.251       0.445       0.163       0.445
19  0.18        0.203       0.208       0.272       0.483       0.18        0.483
20  0.199       0.222       0.229       0.295       0.525       0.199       0.525
21  0.219       0.242       0.251       0.32        0.568       0.219       0.568
22  0.24        0.262       0.274       0.346       0.613       0.24        0.613
23  0.261       0.284       0.298       0.373       0.66        0.261       0.66
24  0.283       0.306       0.323       0.4         0.708       0.283       0.708
25  0.305       0.328       0.347       0.428       0.757       0.305       0.757

TABLE A4.
Standard error of equipercentile equating for real data 2
Data 2
    KE          KE          KE          KE          KE          SG              EG
    2SG         2SG         2SG         2SG         2SG         Traditional     Traditional
X   (.5,.5)     (.5,.75)    (.75,.5)    (.75,.75)   (1,1)       Equipercentile  Equipercentile
0   0.484       0.548       0.399       0.456       0.72        0               0
1   0.469       0.532       0.393       0.463       0.67        0.711           0.785
2   0.448       0.498       0.393       0.451       0.624       0.827           0.916
3   0.393       0.433       0.362       0.411       0.575       0.845           0.96
4   0.328       0.36        0.318       0.361       0.525       0.448           0.832
5   0.272       0.295       0.277       0.315       0.479       0.353           0.696
6   0.229       0.247       0.244       0.279       0.439       0.239           0.535
7   0.202       0.216       0.222       0.255       0.404       0.286           0.423
8   0.185       0.2         0.207       0.241       0.378       0.268           0.348
9   0.172       0.191       0.198       0.234       0.358       0.201           0.513
10  0.16        0.184       0.189       0.229       0.347       0.196           0.425
11  0.149       0.177       0.18        0.226       0.343       0.198           0.334
12  0.139       0.171       0.172       0.224       0.346       0.192           0.422
13  0.134       0.167       0.166       0.224       0.358       0.179           0.454
14  0.134       0.166       0.165       0.227       0.377       0.216           0.374
15  0.139       0.17        0.169       0.235       0.402       0.213           0.532
16  0.149       0.177       0.178       0.245       0.431       0.231           0.5
17  0.161       0.186       0.19        0.257       0.459       0.24            0.662
18  0.176       0.197       0.204       0.269       0.486       0.305           0.598
19  0.195       0.213       0.219       0.282       0.511       0.312           0.74
20  0.222       0.239       0.235       0.297       0.538       0.345           0.883
21  0.259       0.276       0.254       0.316       0.566       0.406           0.952
22  0.307       0.326       0.277       0.34        0.592       0.439           0.769
23  0.354       0.379       0.297       0.362       0.61        0.839           0.262
24  0.375       0.413       0.298       0.365       0.609       0.779           0.131
25  0.374       0.418       0.304       0.351       0.605       0.671           0

[Figures A1 through A27: plots on the pages preceding Figure A28. Their captions, axis labels, and plotted values are illegible in the source scan.]

[Figure: panel title "Equating Difference, n=50, POP1"; horizontal axis: Score Y.]
FIGURE A28. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP1, n=50.

[Figure: panel title "Equating Difference, n=100, POP1"; horizontal axis: Score Y.]
FIGURE A29. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP1, n=100.

[Figure: panel title "Equating Difference, n=50, POP4"; horizontal axis: Score Y.]
FIGURE A30. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP4, n=50.
[Figure: panel title "Equating Difference, n=100, POP4"; horizontal axis: Score Y.]
FIGURE A31. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP4, n=100.

[Figure: panel title "Equating Difference, n=300, POP4"; horizontal axis: Score Y.]
FIGURE A32. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP4, n=300.

[Figure: panel title "Equating Difference, n=50, POP6"; horizontal axis: Score Y.]
FIGURE A33. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP6, n=50.

[Figure: panel title "Equating Difference, n=1000, POP6"; horizontal axis: Score Y.]
FIGURE A34. Equating difference between 2SG(1, 1) equipercentile and EG equipercentile, POP6, n=1000.

REFERENCES

Angoff, W.H. (1971). Scales, norms, and equivalent scores. In R.L. Thorndike (Ed.), Educational Measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education. (Reprinted as W.H. Angoff, Scales, Norms, and Equivalent Scores. Princeton, NJ: Educational Testing Service, 1984).

Cochran, W.G., & Cox, G.M. (1957). Experimental Designs (2nd ed.). New York: Wiley.

Compaq Visual Fortran 6.5. (2000). Compaq Computer Corporation.

Cook, L.L., & Eignor, D.R. (1991). IRT Equating Methods. Educational Measurement: Issues and Practice, 10(3), 37-45.

Davey, T., Nering, M.L., & Thompson, T. (1997). Realistic simulation of item response data (ACT Research Report Series 97-4). Iowa City, IA: American College Testing.
von Davier, A.A., Holland, P.W., & Thayer, D.T. (2004). The kernel method of test equating. New York: Springer Verlag.

von Davier, A.A., Holland, P.W., Livingston, S.A., Casabianca, J., Grant, M.C., & Martin, K. (2005). An evaluation of the kernel equating method in a non-equivalent groups design with an external anchor: a special study with pseudo-tests from real test data. Paper presented at the National Council of Measurement in Education, Montreal, Canada.

von Davier, A.A., & Kong, N. (2005). A unified approach to linear equating for the nonequivalent groups design. Journal of Educational and Behavioral Statistics, 30(3), 313-342.

Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans. Philadelphia, PA: Society for Industrial and Applied Mathematics.

Efron, B., & Tibshirani, R.J. (1993). An introduction to the bootstrap (Monographs on Statistics and Applied Probability 57). New York: Chapman & Hall.

Han, N., Li, S., & Hambleton, R.K. (2005). Comparing kernel and IRT equating methods. Paper presented at the National Council of Measurement in Education, Montreal, Canada.

Hanson, B.A., Zeng, L., & Kolen, M.J. (1993). Standard errors of Levine linear equating. Applied Psychological Measurement, 17, 225-237.

Harris, D.J., & Crouse, J.D. (1993). A Study of Criteria Used in Equating. Applied Measurement in Education, 6(3), 195-240.

Holland, P.W., & Thayer, D.T. (1989). The kernel method of equating score distributions. Program statistics research technical report no. 89-84. Access ERIC: Fulltext (142 Reports--Evaluative No. ETS-RR-89-7). New Jersey: Educational Testing Service, Princeton, NJ.

Holland, P.W., & Thayer, D.T. (2000). Univariate and bivariate log linear models for discrete test score distributions. Journal of Educational and Behavioral Statistics, 25, 133-183.

Holland, P.W., Liu, M., & Thayer, D.T. (2005). Exploring the population sensitivity of linking functions to differences in test constructs and reliability using the Dorans-Holland measures, kernel equating and data from the LSAT. Paper presented at the National Council of Measurement in Education, Montreal, Canada.

KE Software (2004). Computer Program. Princeton, NJ: Educational Testing Service.

Klein, L.W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with nonrandom groups. Journal of Educational Measurement, 22, 197-206.

Kolen, M.J. (1981). Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement, 18, 1-11.

Kolen, M.J. (1984). Effectiveness of analytic smoothing in equipercentile equating. Journal of Educational Statistics, 9, 25-44.

Kolen, M.J. (1988). Traditional equating methodology. Educational Measurement: Issues and Practice, 7(4), 29-36.

Kolen, M.J., & Brennan, R.L. (2004). Test Equating: Methods and Practices (2nd ed.). New York: Springer.

Liou, M., & Cheng, P.E. (1995). Asymptotic standard error of equipercentile equating. Journal of Educational and Behavioral Statistics, 20, 259-286.

Liou, M., Cheng, P.E., & Johnson, E.G. (1997). Standard errors of the kernel equating methods under the common-item design. Applied Psychological Measurement, 21(4), 349-369.

Liu, J.H., Allspach, J.R., Feigenbaum, M., Oh, H.J., & Burton, N. (2004). A study of fatigue effects from the new SAT (Research Report 2004-5 & RR-04-46). New York: College Entrance Examination Board, & Princeton, NJ: Educational Testing Service.

Livingston, S.A., Dorans, N.J., & Wright, N.K. (1990). What combination of sampling and equating methods works best? Applied Measurement in Education, 3, 73-95.

Livingston, S.A. (1993a). An empirical tryout of kernel equating (142 Reports--Evaluative No. ETS-RR-93-33). New Jersey: Educational Testing Service, Princeton, NJ.

Livingston, S.A. (1993b). Small sample equating with log-linear smoothing. Journal of Educational Measurement, 30(1), 23-39.

Lord, F.M. (1950). Notes on comparable scales for test scores (Research Bulletin 50-48). Princeton, NJ: Educational Testing Service.

Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Lord, F.M. (1982a). The standard error of equipercentile equating. Journal of Educational Statistics, 7, 165-174.

Lord, F.M. (1982b). Item response theory and equating: A technical summary. In P.W. Holland and D.B. Rubin (Eds.), Test Equating (pp. 141-148). New York: Academic Press.

Lord, F.M., & Wingersky, M.S. (1984). Comparison of IRT true-score and equipercentile observed-score "equatings". Applied Psychological Measurement, 8, 452-461.

Lu, S., & Kolen, M.J. (1994). Bootstrap standard errors and confidence intervals in linear equating. Paper presented at the annual meeting of the American Educational Research Association, New Orleans.

Mao, X., von Davier, A.A., & Rupp, S. (2005). Comparisons of the kernel equating method with the traditional equating methods on Praxis data. Paper presented at the National Council of Measurement in Education, Montreal, Canada.

MATLAB version 7.1 (1984-2005). The MathWorks, Inc.

Montgomery, D.C. (2000). Design and analysis of experiments (5th ed.). New York: Wiley.

Moses, T., Yang, W., & Wilson, C. (2005). Using kernel equating to check the statistical equivalence of nearly identical test editions. Paper presented at the National Council of Measurement in Education, Montreal, Canada.

Moses, T.P., & von Davier, A.A. (2005). A SAS macro for log linear smoothing: Applications and implications. Paper presented at the American Educational Research Association, Montreal, Canada.

Parr, W.C. (1983). A note on the jackknife, the bootstrap and the delta method estimators of bias and variance. Biometrika, 70(3), 719-722.

Parshall, C.G., Houghton, P.D.B., & Kromrey, J.D. (1995). Equating error and statistical bias in small sample linear equating. Journal of Educational Measurement, 32, 37-54.

Qu, Y., & von Davier, A. (2006). Comparison of two approaches for Counter-Balanced design in a Kernel Equating framework. Paper presented at the National Council of Measurement in Education, San Francisco, USA.

Rice, J.A. (1988). Mathematical statistics and data analysis. Monterey, CA: Brooks/Cole.

SAS version 9 (2002). SAS Institute Inc., Cary, NC, USA.

Tsai, T.H. (1998). A comparison of bootstrap standard errors of IRT equating methods for the common item nonequivalent groups design. Unpublished Dissertation. Iowa City: University of Iowa.

Yu, L., Anderson, D.O., & Zeller, K. (2003). Report of the counterbalanced equating study for the Algebra End-Of-Course assessment (Research report SR-2003-56). Princeton, NJ: Educational Testing Service.

Zeng, L., & Cope, R. (1995). Standard error of linear equating for the counterbalanced design. Journal of Educational and Behavioral Statistics, 20(4), 337-348.