This is to certify that the dissertation entitled THE EFFECTS OF CONTENT HOMOGENEITY AND EQUATING METHOD ON THE ACCURACY OF COMMON-ITEM TEST EQUATING, presented by Wen-Ling Yang, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Counseling, Educational Psychology, and Special Education.

THE EFFECTS OF CONTENT HOMOGENEITY AND EQUATING METHOD ON THE ACCURACY OF COMMON-ITEM TEST EQUATING

By

Wen-Ling Yang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

1997

ABSTRACT

THE EFFECTS OF CONTENT HOMOGENEITY AND EQUATING METHOD ON THE ACCURACY OF COMMON-ITEM TEST EQUATING

By Wen-Ling Yang

Often in educational testing and measurement, people use alternate test forms to achieve comparable test scores for measuring growth or ensuring test security. To obtain valid comparisons between groups and to enhance test fairness, they rely on various equating techniques to equate forms of the same test. It is important to evaluate the adequacy of these equating methods and the accuracy of their outcomes. In my dissertation, I studied the effects of test characteristics on the accuracy of equating outcomes when different methods were used to equate two forms of a test. Specifically, I wanted to know whether equating accuracy improves with a test made of content-homogeneous items, whether it improves with an anchor test that is content-representative of its total test, and whether such content effects depend on the particular equating method used. My major goal is to improve test results, which often lead to critical educational decisions.

The data I analyzed are the test results from a professional in-training examination. The score distribution is negatively skewed because the test was written as a minimum-competency examination. In equating practice, such test outcomes receive less attention than they should. The common-item equating design was used because the two groups of examinees taking different forms were not randomly formed or assigned. I used an item-sampling design to create four tests that differ in the content homogeneity of their items and the content representativeness of their anchor items. All the items in these tests are from one overall content domain, but fall into 23 different content areas. Each of the four tests has two forms, and a set of common anchor items is embedded in each form. I applied linear, equipercentile, and two IRT-based equating methods to equate the two forms of each test. By means of the item-sampling designs, I was able to establish two innovative criteria based on true scores for evaluating the accuracy of equating outcomes from these methods.
I also used two other criteria, based on the outcomes of arbitrary equatings, to examine how well equating accuracy is estimated with such criteria. I also elaborated on issues of construct validity and test dimensionality, which are relevant to test equating. Overall, I found that all the equating methods yielded accurate results to a moderate degree. They all produced more accurate results when the anchor items were more representative of the total test, or when the items in a test had homogeneous content. Therefore, to improve equating accuracy, I recommend including anchor items that fully reflect the overall test content. I also found that the IRT-based equating outcomes were more accurate than the outcomes from the other equating methods. However, the differences are small and thus may not have practical significance. If the degree of equating accuracy is critical to the decision-making of a testing program, as in high-stakes examinations, IRT-based equating methods are recommended.

Copyright by
WEN-LING YANG
1997

To my loving parents, for their great expectations for their children's education. To my many dear friends, for their faithful friendships, which comforted me on my long way to reach this goal, far away from home.

ACKNOWLEDGMENTS

I thank my dissertation director, Dr. Richard T. Houang, for all his assistance in generating research ideas, acquiring test data, critiquing my study design and method, and reviewing my analysis results. Most of all, I would like to thank him for being a great mentor to me and making me believe in myself. Dr. Irvin Lehmann is my long-time academic advisor, who always listens to my problems and questions. I enjoyed being his advisee because he is such an understanding professor, who always has good solutions or useful suggestions for me. I owe special thanks to him for his patience and superb academic advice and counseling. I am grateful to Dr. Betsy J. Becker for her encouragement and her thorough review of an earlier draft of my dissertation. Her timely and challenging feedback on my studies enlightened my view of conducting in-depth and precise research. I appreciate the efforts of Dr. Mary Ann Reinhart and Dr. Maria T. Tatto in making time in their busy schedules to attend the proposal and defense meetings for my dissertation. Their valuable comments on practical equating and related research issues are also appreciated. Lastly, I would like to acknowledge the help of Dr. Michael J. Kolen. He not only provided an extended version of a computer program to assist me with equipercentile equating, but also answered, via e-mail, many of my questions concerning practical equating issues.

TABLE OF CONTENTS

LIST OF TABLES .................................................................. xii
LIST OF FIGURES ................................................................ xiv
PART I: COMMON-ITEM EQUATING ISSUES -- AN OVERVIEW ........................ 1
INTRODUCTION .................................................................... 2
CHAPTER 1
RESEARCH PURPOSES AND QUESTIONS ................................................ 5
PART II: REVIEW OF LITERATURE .................................................. 8
CHAPTER 2
CONDITIONS AND GENERAL GUIDELINES FOR EQUATING ................................ 9
Conditions of Equivalency ....................................................... 9
Same Construct ..................................................................
10 Equity .............................................................................. 10 Symmetry .......................................................................... 10 Population Invariance ............................................................. ll Unidimensionality for [RT-Based Equating ................................... ll Equating Guidelines ....................................................................... ll Criteria for Selecting Equating Methods ................................................ 12 Tenability of Model Assumptions .............................................. 13 Applicability of Design and Method ............................................ l4 Equating Accuracy ................................................................ 15 Limitations of Equating ................................................................... 15 CHAPTER 3 TUCKER LINEAR EQUATING ............................................................. 17 Synthetic Population ...................................................................... 17 Model Assumptions ....................................................................... 17 Equating Procedures ...................................................................... 18 Practical Concerns ......................................................................... 21 CHAPTER 4 EQUIPERCENTILE EQUATING ........................................................... 23 Equipercentile Function ................................................................... 23 Frequency Estimation Method ........................................................... 24 Assumptions ....................................................................... 25 Conditional Distributions ........................................................ 26 vii Procedures ......................................................................... 26 Smoothing Techniques .................................................................... 28 Chained Equipercentile Equating ........................................................ 31 CHAPTER 5 IRT-BASED EQUA'TINGS ................................................................... 32 Conceptual Steps of IRT Equating ...................................................... 32 Linear Transformation of IRT Scales ................................................... 33 Fixed-b Method ............................................................................ 35 Characteristic Curve Transformation (Formula) Methods ........................... 36 IRT True Score Equating ................................................................. 37 IRT True Scores ................................................................... 37 Equating True Scores ............................................................ 38 Concurrent Calibration Method ......................................................... 39 Advantages of IRT-Based Equating .................................................... 4O Curvilinear Equating ............................................................. 4O Item-Free and Person-Independent Measures ................................ 40 Practical Appeal .................................................................. 41 IRT Assumption of Test Dimensionality ................................................ 42 Definition of Test Dimensionality ............................................... 43 Definition of Unidimensionality ................................................. 
43 Robustness of IRT Unidimensionality Assumption ........................... 44 CHAPTER 6 ISSUES IN COMMON-ITEM EQUATING ............................................... 46 Effects of Ability Differences and Sampling on Equating ............................ 46 Effect of Ability Differences ..................................................... 46 Representative vs. Matched Sampling .......................................... 47 Characteristics of Anchor Items .......................................................... 48 Anchor Length ..................................................................... 48 Content Representativeness ..................................................... 49 Equating Tests with Skewed Distributions ............................................. 50 CHAPTER 7 EVALUATION OF EQUATING ACCURACY ........................................... 52 Approaches for Evaluating Equating Accuracy ....................................... 52 Estimating Equating Errors ............................................................... 55 Arbitrary Nature of Equating Criteria .................................................. 56 Root-Mean-Squared Deviation (RMSD) ............................................... 57 PART III: METHODOLOGIES AND RESULTS ........................................ 59 CHAPTER 8 DATA, DESIGN AND METHOD ........................................................... 60 Description of Data ........................................................................ 60 Test Content and Format ........................................................ 61 Examinee Groups ................................................................. 62 Appropriateness for Equating ................................................... 63 Research Design ............................................................................ 63 Common-Item Design for Equating ............................................ 66 Manipulation of Content Representativeness .................................. 66 Simple random sampling ................................................ 67 Equal-weight domain random sampling .............................. 68 Pr0portional-weight domain random sampling ...................... 70 Purposeful sampling ..................................................... 7O Controlling test length ................................................... 7O Controlling anchor length ............................................... 72 Equating Methods ......................................................................... 73 Tucker Linear Equating Method ................................................ 73 Frequency-Estimation Equating Method ...................................... 74 IRT-Based Equating Methods .................................................. 74 Criteria for Evaluating Equating Accuracy ............................................. 75 True-Score Based Criteria ...................................................... 76 Arbitrary Criteria ................................................................. 77 Test Dirnensionality ....................................................................... 78 Group Disparity ........................................................................... 79 Construct Validity Issues ................................................................. 80 Research Tools ............................................................................ 81 Research Restrictions and Limitations .................................................. 
83 Data Manipulation ................................................................ 83 Common-item equating design ......................................... 84 Long anchors ............................................................. 84 Unequal test length and anchor length ................................ 84 Interpretation of Equating Accuracy ........................................... 85 Arbitrary nature and limited use of criteria ........................... 85 Issues of Auto-correlation .............................................. 85 Generalization of Results ........................................................ 86 Item format and scoring system ....................................... 86 Skewed score distributions ............................................. 87 Characteristics of examinee populations .............................. 87 Assumptions of equating models ...................................... 87 CHAPTER 9 RESULTS AND DISCUSSIONS ............................................................ 88 Characteristics of Tests and Examinee Groups ........................................ 89 Internal Consistency of Tests ................................................... 89 Analysis of Item Difficulty ....................................................... 91 Analysis of Item-Total Correlation ............................................. 95 Evidence of item sampling effect ....................................... 97 Inspecting anomaly cases ............................................... 97 Characteristics of Anchor Items ................................................. 99 rmmiqm: Index of equating efficiency .............................. 101 ix run“: Index of content representativeness for anchor items .......................................................... 101 Evidence of item sampling effect ..................................... 102 Cautions about auto-correlation and anchor-length effects ....... 103 Inspecting Examinee Group Differences ..................................... 104 Ability differences ...................................................... 104 Years of experience .................................................... 107 Program participation ................................................. 109 Needs of demographic information .................................. 109 Summary ......................................................................... l 12 Score Distributions of Various Test Forms ........................................... 112 Adequacy of 3PL IRT Model for IRT-Based Equatings ............................ 113 Equating Outcomes of IRT-Based Methods .......................................... 113 Estimation of IRT Parameters ................................................. 113 Equated IRT Ability Estimates ................................................. 115 3PL IRT-Based Equivalent True Scores ..................................... 119 Smoothing Equipercentile Equating Outcomes ....................................... 123 Graphical Inspection on Smoothing Results ................................. 123 Evaluation of "Moment Preservation” ........................................ 126 Results of Selecting Smoothing Parameters .................................. 128 Results of Tucker Linear Method ...................................................... 129 Similarities Among Outcomes of Various Equating Methods ...................... 131 Evaluation of Equating Accuracy ....................................................... 
134 Preview of Important Results .................................................. 135 Evaluation Using Raw-145 as a Criterion .................................... 137 Comparing equating accuracy of various methods ................. 137 Comparing equating accuracy for various sampled tests .......... 140 Equating method-test interaction ..................................... 140 Effects of content homogeneity and content representativeness. 141 Evaluating efl'ect of auto-correlation ................................ 142 Reliability and validity evidence on anchor tests ................... 144 Evaluation Using IRT-145 as a Criterion .................................... 145 Accuracy of equating outcomes from various methods ........... 145 Inspecting estimation bias due to IRT-145 .......................... 146 Comparing equating accuracy for various sampled tests ......... 146 Content homogeneity and content representativeness ............. 147 Evaluating effect of auto-correlation ................................. 147 Reliability and validity evidence on anchor tests ................... 148 Evaluation Using EF-long as a Criterion ..................................... 149 Equating accuracy of various methods and auto-correlation .. 149 Equating accuracy for various sampled tests ....................... 150 Estimation Bias due to an Arbitrary Criterion -- FE-short ................. 152 Measures of equating accuracy for EF-short ....................... 152 Bias from the arbitrary nature of EF-short .......................... 153 Bias due to index of equating accuracy .............................. 154 Advantages of Using Multiple Criteria ....................................... 157 Uses of Raw-145 and IRT-145 ....................................... 157 Uses of FE-long and FE-short ........................................ 158 Construct Validity Issues ................................................................ 159 Issues of Test Dimensionality .......................................................... 162 Confirmatory Factor Analysis ................................................. 163 Exploratory Factor Analysis ................................................... 164 Principal Component Analysis ........................................ 165 Principal Factor Analysis .............................................. 165 Maximum Likelihood and a Factor Analyses ...................... 168 Unidirnensionality Assumption for the Test ......................... 168 CHAPTER 10 SUGGESTIONS ............................................................................... 170 Selection of Equating Method ......................................................... 170 Test Construction ......................................................................... 171 Controlling Anchor Length ............................................................. 172 Equal Anchor Length ........................................................... 172 Fewer Anchor Items ............................................................ 173 Multiple Criteria for Evaluating Equating Accuracy ................................. 174 Selecting Representative Index of Equating Accuracy .............................. 175 Issues of Construct Validity ............................................................ 175 Test Dimensionality Issues .............................................................. 176 Applications of Study Design and Techniques ........................................ 
176 APPENDICES APPENDIX A: Distributions of Total Raw Scores for Various Test Forms . . 177 APPENDD( B-l: Score to Score Equivalents by Various Degrees of Smoothing for Sampled Test PW ............................... 182 APPENDIX B-2: Score to Score Equivalents by Various Degrees of Smoothing for Sampled Test SR ................................. 184 APPENDIX B-3: Score to Score Equivalents by Various Degrees of Smoothing for Sampled Test PS ................................. 186 APPENDD( C: Adjusted Correlation Matrix for Evaluating Equating Accuracy Indices of Equating Accuracy after Controlling for Auto- Correlation ............................................................. 188 APPENDIX D: Reliability and Validity Evidence for the Anchor Tests of Four Sampled Tests ......................................................... 189 LIST OF REFERENCES ......................................................................... 190 LIST OF TABLES Table l - Proportional Distribution of Test Items Across the 23 Core Content Areas .. ..61 Table 2 - Number of Items Sampled Under Four Sampling Schemes ........................ 65 Table 3 - Number of Items Sampled from the Original Test by Simple Random Sampling ........................................................................... 69 Tabb 4 - Number of Items Sampled Using Proportional-Weight Domain Random Sampling ............................................................................ 71 Table 5 - Reliability of Sampled Test Forms: Indices of Internal Consistency .............. 90 Table 6 - Number of Items Sampled for Two Test Forms Using Simple Random Sampling .......................................................................... 98 Tabb 7 - Coefficients of Correlations Between Total Scores on Anchor Items, Non- Anchor (Unique) Items, and Total Test ...................................... 100 Table 8 - Significance Test Results for Group Mean Differences on Anchor Tests . 105 Table 9 - Average Item Difficulty for Anchor Items and Unique Items .................... 106 Table 10 - Group Comparisons on "Years of Experience” .................................. 108 Table 11 - Program Participation -- Number of Exarninees from Each In-Training Program ........................................................................... 1 10 Tabb 12 - Results of IRT Parameter Estimation .............................................. 114 Tabb l3 - Comparisons of the Resulting Ability Estimates of Two IRT-Based Equating Methods ............................................................... 117 Tabb l4 - Comparisons of the Resulting True Score Estimates of Two IRT-Based Equating Methods ............................................................... 120 Tabb 15 - Moments for Postsmoothing Outcomes ............................................ 127 Tabb 16 — Summary ofthe Results ofTucker Linear Equating Method ............l30 Tabb 17 - Relationships among Various Equating Outcomes for Different Sampled Tests ................................................................................ 132 Tabb 18 - Accuracy of Equating Outcomes from Various Equating Methods on Different Sampled Tests ........................................................ 136 Tabb 19 - Root-Mean-Squared-Differences for Evaluating Equating Accuracy ..........155 Tabb 20 - Average Equivabnt Scores of Examinee Groups on the Original Test by Test Form and Years of Experience ....................................... 
160

LIST OF FIGURES

Figure 1a - Item Difficulty (p) for Items in Sampled Test Forms (in Ascending Order) ................ 93
Figure 1b - Cumulative Frequency Distributions of Item Difficulty for Sampled Test Forms ................ 94
Figure 2 - Item-Total Correlation (rm) for Sampled Test Forms (in Ascending Order) ................ 96
Figure 3 - Cumulative Frequency Distribution for Number of Programs ................ 111
Figure 4 - Relationship Between the Resulting Ability Estimates of the Two IRT-Based Equating Methods ................ 118
Figure 5 - Relationship Between the Resulting True Score Estimates of the Two IRT-Based Equating Methods ................ 122
Figure 6 - Score to Score Equivalents by Various Degrees of Smoothing for Sampled Test EW ................ 125
Figure 7 - Relationship Between Equating Outcomes of the Tucker Method and the Frequency-Estimation Equipercentile Method ................ 133
Figure 8 - "Test Form" by "Years of Experience" Interaction Effect ................ 161
Figure 9 - Scree Plot of Eigenvalues for Principal Component Analysis ................ 166
Figure 10 - Scree Plot of Eigenvalues for Principal Factor Analysis ................ 167

PART I: COMMON-ITEM EQUATING ISSUES -- AN OVERVIEW

INTRODUCTION

In educational testing, to ensure test security, alternate test forms are often used so that not all examinees need to take the same test at the same time. The need for interchangeable test forms is especially important for licensure examinations and other tests used to make critical decisions. Comparable test forms are also used to measure growth or trends. In theory, alternate test forms are created through careful test construction so that their items will have similar difficulties. In practice, however, the results of test construction are often not satisfactory because test forms are seldom parallel in the strict theoretical sense. A practical strategy for achieving comparable test scores is to establish equivalency between different forms via equating.

Equating procedures are based on the idea of making statistical adjustments in pursuit of four conditions: same construct, equity, symmetry, and population invariance (Hambleton & Swaminathan, 1990; Lord, 1980). By satisfying these conditions, in theory, one can obtain comparable test scores from equated test forms. Equating can be linear or non-linear, depending on how test scores are transformed across forms. Equating models vary substantially in their assumptions and equating functions. Selection of an equating model should take into account the purpose of equating, the theoretical plausibility and applicability of the model, and the characteristics of the examinees and test forms being equated.

Conventional linear equating methods, such as the Tucker linear method, are straightforward and computationally convenient. Therefore, they have been widely used in practice for years. Nevertheless, their results do not always meet all the criteria for equivalent tests. For example, equivalent scores from linear equating can vary across examinee groups and item samples. To overcome such drawbacks of linear equating, different equating models based on item response theory (IRT) have been developed.
The use of IRT-based methods has recently increased in popularity. IRT-based equating methods are especially useful for equating based on the common-item design, where random assignment of examinees is not required. They are often used when the assumptions made by linear equating are not likely to hold (Cook & Eignor, 1991; Crocker & Algina, 1986). Research has shown that IRT-based methods are more robust than linear equating and lead to greater stability when the tests to be equated differ somewhat in content and length (Petersen, Cook, & Stocking, 1983). Despite their theoretical appeal and empirical advantages, IRT-based equating methods are often under scrutiny because of inconsistency in their equating outcomes. Another issue is a possible IRT-based equating method by test interaction (Hills, Subhiyah, & Hirsch, 1988; Petersen, Cook, & Stocking, 1983). In equating practice, there are also concerns about the cost of IRT-based methods and the practical significance of the improvement in equating accuracy they provide.

Dimensionality of a test is an issue relevant to equating accuracy. It is particularly critical for IRT-based equating, which assumes a unidimensional trait (Hambleton & Swaminathan, 1990). An IRT-based equating model that assumes unidimensionality is not likely to work well on a multidimensional test. Test dimensionality may also affect the effectiveness of conventional equating methods, because the conventional approach also assumes unidimensionality, though implicitly (Green, Yen, & Burket, 1989). Often, a broad test domain is defined to encompass a variety of knowledge or skills, and it is less likely that only one trait, or one single dominant trait, influences examinee performance on the test. To ensure equating accuracy, it is therefore crucial to check the IRT assumption of unidimensionality. It is also important to evaluate the robustness of IRT applications when the assumption is violated.

This study addresses practical common-item equating issues concerning equating methods and test characteristics. Four pairs of sampled test forms, varying in their content homogeneity, are equated by the Tucker linear method, the frequency-estimation equipercentile method, and two IRT-based equating methods. The equating results from these methods are evaluated using four types of criteria for equating accuracy. The resulting equating outcomes are compared and discussed, with consideration of the restrictions on this study. Suggestions are made for equating practice and future research.

The major goal of this study is to inform testing practice, leading to improved measures of ability or achievement and more valid comparisons of different groups. The study of the effects of content homogeneity and content representativeness on equating accuracy should improve the precision of equivalent scores, test construction, and test efficiency in the context of common-item equating. The findings and conclusions reached in this study will provide groundwork for future studies. The unique parts of the research design, such as the item-sampling design and the use of multiple innovative criteria for evaluating equating accuracy, should offer insight into improving the estimation of equating accuracy in future studies.
Chapter 1
RESEARCH PURPOSES AND QUESTIONS

To better understand the effectiveness of various equating methods, and the influences of the characteristics of a test and its items on equating accuracy, this study has the following specific goals:

- To estimate and compare the equating accuracy of linear equating, equipercentile equating, and IRT-based equating.
- To investigate the effects of the content homogeneity of test items, and of the content representativeness of anchor items, on the equating accuracy yielded by the various equating methods.
- To assess the effectiveness of various criteria for evaluating equating accuracy.
- To address dimensionality issues related to the test data and to investigate their influence on the IRT-based equating results, such as the robustness of a unidimensional IRT model.
- To inform testing practice and future studies about useful ways of (1) improving the anchor-item equating design, (2) selecting an equating method, and (3) evaluating equating accuracy.

In pursuit of these goals, this study outlined a set of research questions to direct the design, method, and analysis of its equating research. These questions not only reflect specific research interests but also address important issues and concerns about equating practice.

- To what extent do the results of Tucker linear equating, equipercentile equating, and IRT-based equating agree?
- Do equating results depend on the content homogeneity of test items and the content representativeness of anchor items? In other words, does the accuracy of equating improve when the items in a test are content homogeneous? Does it improve when the content of the anchor items becomes more representative of the total test?
- How accurate are the results of the various equating methods, based on these criteria for evaluating equating accuracy: (a) a raw-score-based true-score estimate, (b) an IRT-based true-score estimate, (c) the result of the equipercentile equating method applied to the two forms of a longer and, in theory, more reliable test, and (d) the result of the equipercentile equating method applied to the two forms of a shorter subtest sampled from the longer original in-training test?
- Which criterion, among the various criteria for evaluating equating accuracy, works best for this particular minimum-competency examination?
- Does the IRT assumption of unidimensionality hold for the IRT-based equatings? Mathematically or conceptually, how can the dimensionality of the test be described? Will the outcomes of the IRT-based equatings suggest that the three-parameter logistic (3PL) IRT model is appropriate for a minimum-competency test with a negatively skewed score distribution? Is the IRT model sound in theory?

In the following chapters, the literature on relevant equating issues is first reviewed and summarized, including the conditions of equivalency, equating guidelines, the assumptions and procedures of various equating methods, features of common anchor-item equating, and the estimation of equating accuracy. Then the data, item-sampling schemes, common-item non-equivalent group design, and particular equating methods used in this dissertation are described. After the results of the various equating methods on the different tests are discussed, suggestions are made for equating practice and future research.
PART II: REVIEW OF LITERATURE

Chapter 2
CONDITIONS AND GENERAL GUIDELINES FOR EQUATING

Important requirements of equating, including the conditions of equivalency, general guidelines for conducting equating studies, and the criteria for selecting equating methods, are reviewed in this chapter. The limitations of equating are also discussed.

Conditions of Equivalency

If test Y is to be equated to test X, no matter what equating procedure is chosen, the following conditions must be satisfied to conclude that scores on test X and test Y have been made equivalent (Angoff, 1984; Dorans, 1990; Hambleton & Swaminathan, 1990; Kolen & Brennan, 1995; Lord, 1980; Petersen, Kolen, & Hoover, 1989):

- Both tests measure the same construct.
- The equating achieves equity. That is, for individuals of identical proficiency, the conditional frequency distributions of scores on the two tests are the same.
- The equating transformation is symmetric; that is, the equating of Y to X is the inverse of the equating of X to Y.
- The equating transformation is invariant across sub-groups of the population from which it is derived.

In addition to these conditions, equating using an IRT model also requires the assumption of unidimensionality. The conditions of equivalency are elaborated below.

Same Construct

The requirement of the same construct is a matter of test construction and can be achieved by carefully selecting items that measure the same construct. Formulas relating tests of different constructs to each other can be computed for the purpose of regression or prediction, but it is meaningless to compare tests measuring different constructs. Since equating is a matter of transforming scores for the sake of comparison, it makes no sense for the forms of a test to measure different constructs.

Equity

The condition of equity implies that individuals of the same proficiency obtain the same score, no matter which test is taken. The proficiencies of individuals taking two different tests are usually estimated via their performance on the common items or an anchor test. At every ability level, the conditional frequency distributions on the different forms should be the same, and the corresponding percentile ranks in any given group should be equal for equivalent scores.

Symmetry

The score transformation should be invertible to achieve symmetry. Whether equating from X to Y or from Y to X, the same score on one test should correspond to one given score on the other test. The condition of symmetry requires that the function e_X used to equate a score y on Form Y to the scale of Form X be the inverse of the function e_Y used to equate a score x on Form X to the Form Y scale:

   e_X(y) = e_Y^-1(y)   and   e_Y(x) = e_X^-1(x).

Population Invariance

Equating results should be independent of the unique characteristics of the examinee samples used in the equating process. No matter which groups of examinees are used, the equating results should not change with the characteristics of the particular examinee groups; the results should depend only on the underlying construct measured by the test. Among the various equating models, procedures based on regression inherently fail to achieve this condition, while IRT-based equating is expected to result in population invariance by assigning the same estimated ability score to all examinees at the same ability level. The condition of population invariance is one of the ultimate goals of test equating, and it can be assessed by examining the equivalency of the test forms across sub-groups.
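The symmetry condition, and the observation that regression-based conversions fail it, can be made concrete with a small numeric check. The sketch below is illustrative only and is not part of the original study; the form means, standard deviations, and correlation are made-up values, and Python is used simply for convenience.

```python
import numpy as np

# Hypothetical form statistics (made-up values for illustration).
mu_x, sd_x = 52.0, 8.0   # Form X mean and standard deviation
mu_y, sd_y = 49.0, 7.0   # Form Y mean and standard deviation

def e_x(y):
    """Linear (mean-sigma) conversion placing a Form Y score on the Form X scale."""
    return mu_x + (sd_x / sd_y) * (y - mu_y)

def e_y(x):
    """Linear (mean-sigma) conversion placing a Form X score on the Form Y scale."""
    return mu_y + (sd_y / sd_x) * (x - mu_x)

y = np.arange(0, 76, dtype=float)          # possible Form Y raw scores
print(np.allclose(e_y(e_x(y)), y))         # True: e_y is the inverse of e_x (symmetry)

# Regression of X on Y uses slope r*(sd_x/sd_y); unless r = 1, converting Y to X
# and then back by the two regression lines does not return the original score,
# which is why regression-based conversions violate the symmetry condition.
r = 0.85
pred_x = mu_x + r * (sd_x / sd_y) * (y - mu_y)        # predict X from Y
back_y = mu_y + r * (sd_y / sd_x) * (pred_x - mu_x)   # predict Y back from X
print(np.allclose(back_y, y))              # False: regression is not symmetric
```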
If population invariance is not achieved, one possible reason is that the tests or test forms may not measure the same construct. In this case, the procedures of test construction and the resulting test items should remain under scrutiny.

Unidimensionality for IRT-Based Equating

Unidimensionality is an underlying assumption of equating based on item response theory, although it is not explicitly recognized as a condition of equating. IRT-based equating is more restrictive because it requires unidimensional test items.

Equating Guidelines

When test data from different forms of a test are very similar or very different, equating may not be desired. Rather than reducing errors, inappropriate equating may introduce more error into test scores, and unnecessary equating will increase the cost of testing. Once it is determined that equating is preferred, factors such as feasibility, cost, and any unique testing context should all be considered in carrying out the equating. However, there are no absolutely superior criteria for the selection of an equating design or method (Harris & Crouse, 1993). As a result, arbitrary judgments and decisions are necessary and should be based on equating expertise and experience.

Brennan and Kolen (1987, 1995) proposed a set of rules to guide test equating. They argued that the test content and statistical specifications for tests being equated ought to be defined precisely and be stable over time. In the process of test construction, item statistics should be obtained from pre-testing or a previous use of the test. Each test should be reasonably long, with at least 35 items, and the scoring keys should be consistent. The stems for common items, the alternatives, and the stimulus materials should be identical in the forms to be equated. The characteristics of the examinee groups should be stable over time, too. The groups should be relatively large, with more than roughly 400 examinees. The curriculum, training materials, and field of study should also be stable, and the test items should be administered and secured under standardized conditions.

Brennan and Kolen (1987) also offered a set of ideal suggestions for test equating:

- Embed two sets of common items in the full-length test;
- Make the anchor test at least 1/5 of the total-test length, and have the anchor test mirror the total test in content specification and statistical characteristics;
- Administer at least one link form no earlier than one year in the past, and administer at least one link form in the same month as the form to be equated; and
- Place each common item in approximately the same position in the two forms.

Criteria for Selecting Equating Methods

Usually, an equating method is selected or tailored to accommodate a particular testing situation. If guessing is explicitly encouraged during testing and its effect cannot be overlooked, a fair equating should account for this factor. If accurate equivalent scores are strongly desired by a testing program, it is critical to select an equating method that yields the most accurate equating for that particular test.

Three major aspects to be considered in selecting an equating method are: (1) Are the required underlying assumptions tenable? (2) Is the equating procedure practical? (3) How good is the equating result? (Crocker & Algina, 1986). Common equating methods are compared on each of these three aspects below.

Tenability of Model Assumptions

The premise of model application is that all the underlying assumptions of the selected model hold.
Linear equating assumes that the score distributions of the tests being equated have identical shapes, and it is appropriate when the score distributions differ only in their means and/or standard deviations. Because of this assumption, the derived equivalent scores will have identical percentile ranks. Equipercentile equating requires fewer assumptions than linear equating. It does not assume the same shapes for the score distributions, but determines which scores on the different tests have the same percentile rank (Crocker & Algina, 1986). However, in theory, the equipercentile method is associated with larger errors than linear equating (Lord, 1982a). It is also less practical to apply because it is far more complicated.

Both linear and equipercentile equating assume that the tests being equated measure the same trait and have equal reliability. Given two tests that differ in average difficulty, however, the assumption of equal reliability usually does not hold. In such a case, these two equating methods are likely to yield erroneous results. The results of the two methods also depend on the particular test items used, and so fail to meet the condition of equity. Furthermore, the methods do not meet the requirement of population invariance (Hambleton & Swaminathan, 1990). Unlike these methods, IRT-based equating does not have the same drawbacks and could be a better alternative.

Applicability of Design and Method

The random group design, the single group design with counter-balancing, and the common-item nonequivalent groups design are three common designs used to collect data before equating (Kolen & Brennan, 1995). Equating designs differ in the need for randomly formed examinee groups, in single or multiple test administrations, in test length, and in the examinee sample size needed. Depending on the conditions in real testing situations, feasible equating designs are chosen and corresponding methods are used to equate the test results of different forms. For example, traditional equating may be adequate if examinees are randomly assigned to form groups and each group takes a different test form, if different forms are assigned to examinees randomly, or if the groups take both test forms in randomly assigned orders. Otherwise, IRT-based methods are more appropriate.

The random group design is often desired because each examinee has to take only one form and several forms can be equated at the same time. Nevertheless, this approach requires the test forms to be available and administered at the same time, which is sometimes not practical. One solution to this problem is the use of an anchor design. Either multiple test forms with embedded anchor items (an internal anchor) can be given to different examinee groups, or a third test (an external anchor) can be given to the two examinee groups that take different test forms. Without random assignment, the anchor-score distributions for different sub-populations may be markedly different and the assumption of equity is unlikely to hold (Crocker & Algina, 1986). In this case, the linear and equipercentile methods are likely to yield inaccurate results, while IRT-based equating tends to be more adequate.

Equating Accuracy

One major justification for test equating is its effectiveness, that is, the extent to which the equating method used yields adequately equivalent scores. Nevertheless, because true scores can never be known in practice, perfect equivalency can never be determined in a strict sense. As a consequence, there is no best criterion for evaluating equating accuracy, and there is also no definite procedure for determining which equating method should be used in a given context (Harris & Crouse, 1993). The interest in assessing equating accuracy, then, is to find equating methods that are adequate for a given context. Issues regarding the assessment of equating accuracy are discussed further in Chapter 7.

Limitations of Equating

Test equating cannot solve problems originating in crude and improper test construction. It is meant to compensate when good test-construction practice has failed to yield test forms of the same difficulty level. Both traditional equating and IRT-based equating are primarily designed to remedy minor differences in difficulty between test forms. Cook and Eignor (1991) indicated that no equating method could satisfactorily equate tests that were markedly different in difficulty, reliability, or test content. This perspective raises doubt over the practicality of vertical equating, which transforms scores across levels of achievement onto a single scale. Theoretically and operationally, vertical equating is much more difficult to accomplish than horizontal equating. In addition, vertical equating often results in a lack of test invariance, which can be accounted for by multidimensionality (Skaggs & Lissitz, 1988).

Equal reliability is usually assumed by equating models such as linear equating and equipercentile equating. Due to floor and ceiling effects, however, tests that differ in difficulty are not likely to be equally reliable for all sub-groups of examinees, and the relationship between the tests can be nonlinear (Skaggs & Lissitz, 1986). This implies that observed scores on tests of different difficulty cannot strictly be equated; in such cases, equating is actually carried out in a loose sense. From a pragmatic point of view, however, equating aims to arrive at a conversion equation that approximates an ideal equivalency. Therefore, despite its inherent limitations, test equating is of great use in comparing scores on test forms with minor differences.

Chapter 3
TUCKER LINEAR EQUATING

Linear equating has the appeal of simplicity in terms of score transformation and is used most often with the common-item design (Kolen & Brennan, 1987). Among the many linear methods, Tucker linear equating is one of the most frequently employed.

Synthetic Population

For the common-item design, the Tucker method involves the use of a synthetic population (Braun & Holland, 1982). A synthetic population is usually defined as a combination of the proportionally weighted (proportional to sample sizes) populations of examinees taking the different test forms. Typically, an equating function is viewed as being defined for a single population, so the two examinee populations must be combined into one single population for defining an equating relationship (Kolen & Brennan, 1987).

Model Assumptions

In an anchor-item equating design, suppose examinees in Population 1 take Form X, those in Population 2 take Form Y, and V is the set of anchor items embedded in both forms. To equate scores on Form X to the scale of Form Y, Tucker linear equating requires strong statistical assumptions, as follows (Kolen & Brennan, 1987; Kolen & Brennan, 1995):

1.
The function for the regression of Y on V is also the same for the two populations. 2. The variance of X given V is the same for the two populations, and the variance of Y given V is also the same for the two populations. Under the above assumptions, the linearly transformed scores on one form, yielded by Tucker’s method, will have the same mean and standard deviation as the scores on another form. Because of the assumptions about the variances and regression functions in relation to the two populations, Tucker linear equating is more accurate when examinee groups are similar. Equating Procedures Using the proportional weights to form a synthetic population, Tucker linear equating basically involves the following concepts and procedures (Kobn & Brennan, 1987; Kobn & Brennan, 1995): 1. Find the weights for Populations l and 2 by using these formula: w1=n;/(nl+n2) and w2=nz/(n1+n2), where n1 and n; are the sample sizes of examinees from populations 1 and 2, respectively. 2. Let a; and or; be the regression slopes for the populations. For Population 1, 19 outx IV)=ollx.V)/a; (V) and al(Y IV)=01(Y.V)/ai (V). (3.1) and for population 2, a20 (4.4) (Berry & Lindgren, 1990); where (l) f(x,v) is the joint distribution of total score and common-item score, and it represents the probability of earning a score of x on Form X and a score of v on common items V, (2) for all x and v, f(x,v)ZOand &,.,)f(x,v) = 1, and (3) hv(v) represents the marginal distribution of scores on the common items, which is the probability that V=v and equals 2,, f (x,v) (Kobn & Brennan, 1995). Procedures Frequency-estimation equipercentile equating defines the distributions for the synthetic population on Forms X and Y as follows: f,(x) = w:f,(x)+W2f2(X) and 8,0) = wig.(y)+)v282(y). (4.5) where W1+W2=1. 27 To arrive at expressions composed of direct probability estimates of earning various scores, for the distributions for the synthetic population, the first main task is to express f 2(x) and g,( y) in quantities for which direct estimates are available. This can be achieved as follows (Kobn & Brennan, 1995): (1) By definition of density, f (xlv) = f (x, v)! h (v) , therefore f2(x.V) = f2(XIV)h2(V) and 810W) = 81(YIV)hl(V)- (4.6) (2) By assumption of identical conditional distributions, f,(x.v) = f,(XIV)h2(V) and g.(y.v) = 82(YIV)ht(V). (4.7) (3) By summing over common-item scores, there follow the marginal distributions: f2(x) = 2.f2(x.V) = 2, f1(llV)h2(V) and 810’) = 2. 310.1!) = 2. 82(yIV)ht(V). (4.8) The distributions for the synthetic population therefore can be expressed in quantities that can be directly estimated from the data. The equations are f,(x) = wtmx) + W22. f,(x|v)hz(v) and g.(y) = m2. 82(YIV)ht(V)+ w: 32o). (4.9) where w, +w2 =1. 28 By summing f ,(x) and g,( y) over values of x and y respectively, the cumulative distributions F;(x) and G, ( y) can be derived. Define P, as the percentile rank function for Form X and Q, as the percentib rank function for Form Y, then P," and Q? are the percentib functions. For frequency estimation method, thus, the equipercentile function for the synthetic population is ey’ (x) = Q;‘[p,(x)]. Smoothing Techniques Equipercentile equating is often not sufficiently precise due to sampling error. The lack of precision is typically indicated by irregular sample score distributions and equipercentib relationships (Kobn & Brennan, 1995). 
To obtain more accurate equating results, various smoothing methods have being used on an empirical base to produce smoothed estimates of the population score distributions that are supposed to have less estimation error than the sample score distributions (Kobn, 1991). Typical smoothing approaches include (1) presmoothing, such as the polynomial log-linear method (Holland & Thayer, 1987; Holland & Thayer, 1989; Rosenbaum & Thayer, 1987; Kolen, 1991) and the strong true score method (Lord, 1965; Hanson, 1991; Kolen & Brennan, 1995), and (2) postsmoothing, such as the cubic splines method ( Kolen & Iarjoura, 1987; Kolen & Brennan, 1995). Rather than providing better descriptions for the score distributions, often the goal of smoothing is to improve the accuracy in estimating the population score distributions. In such case, it is important to select smoothing methods that improve the precision in estimation but do not introduce substantial bias into the smoothing process (Kobn, 1991). 29 It was found that both presmoothing and postsmoothing methods improve estimation of equipercentile equivalents to a similar degree. More specifically, smoothing in equipercentile equating can be expected to produce a modest decrease in mean-squared equating error when compared to unsmoothed equipercentib equating (Hanson, 1991; Hanson, Zeng, & Colton, 1994; Kobn and Brennan, 1995). Since postsmoothing directly smoothes the equipercentile relationship, it is more direct than presmoothing, which smoothes the score distributions. Because there is no statistical test for the fit of the postsmoothing method, Kolen and Brennan (1995) suggested applying and evaluating various degrees of smoothing to avoid adding equating error. Specifically, the graphs of the raw-to-raw equivalents for the various degrees of smoothing should be examined to find the relationship that is smooth but does not depart too much from the unsmoothed equivabnts. Standard error bands could be constructed to facilitate the evaluation. In addition, the moments of the equated raw scores should be examined to study the similarity among the moments. Kobn and Brennan also ofi‘ered recommendations for smoothing in scab-score equating, when raw scores are converted to scale scores for the sake of interpretation or presentation. Overall, the smoothing process requires judgments that are dependent on the sample sizes, distribution shapes, numbers of items, and other rebvant characteristics of a testing program (Kobn & Brennan, 1995). The cubic spline postsmoothing method fits a curve to the equipercentile relationship (Kobn & Jarjoura, 1987). It is designed to increase equating precision with frequency estimation method of equipercentile equating, for the common-item non- equivalent group design. For integer scores, x, , the spline function is, 30 3.1"):Vor‘l’Vlilx-xt)+v2.'(x"X.-)2+v3,-(x-x,-)3, x. Sxtasequence, the instructions or practices of difi‘erent programs might vary slightly and tIII-ls cause the disparity among the examinee performances. To determine the adequacy of equating methods, which usually required groups with same or similar ability, it was also important to examine the degree of disparity between the non-equivabnt groups. The need was especially true in this study because the Q’3§arninee groups taking the two test forms were not randomly sampbd or assigned. “reform availabb demographic data, such as program participation and years of prerience, were analyzed in this study to help determine the degree of disparity. 
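To make the piecewise form above concrete, the following sketch evaluates such a cubic spline at arbitrary scores. It is a minimal illustration, not the CIPE implementation: the knots, the coefficient table, and the function name are made-up, and in practice the coefficients v_0i through v_3i would come from the spline-fitting step.

```python
import numpy as np

def spline_value(x, knots, coef):
    """Evaluate d(x) = v0 + v1*(x - xi) + v2*(x - xi)**2 + v3*(x - xi)**3,
    where xi is the left knot of the interval containing x and coef[i] holds
    (v0, v1, v2, v3) for that interval."""
    i = np.searchsorted(knots, x, side="right") - 1   # interval index for each score
    i = np.clip(i, 0, len(coef) - 1)
    dx = x - knots[i]
    v0, v1, v2, v3 = coef[i].T
    return v0 + v1 * dx + v2 * dx**2 + v3 * dx**3

# Made-up integer-score knots and one row of coefficients per interval.
knots = np.array([0.0, 1.0, 2.0, 3.0])
coef = np.array([
    [0.20, 1.00, 0.010, -0.0010],
    [1.21, 1.02, 0.008, -0.0005],
    [2.23, 1.03, 0.006, -0.0002],
])
scores = np.array([0.0, 0.5, 1.0, 2.75])
print(spline_value(scores, knots, coef))   # smoothed equivalents at those scores
```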
Qonstruct Validity Issues In this study, an examinee’s professional ability was in part dependent on the fixaminee’s professional experience. Logically, the more years of professional experience, the more likely an examinee would score higher on the test. If the test forms of the test Were truly equated, the resulting equivabnt scores of both examinee groups taking difi'erent test forms should demonstrate such effect of professional experience. Therefore, this study compared the average equivabnt scores of the examinee groups after equating to study the construct validity of the test. 81 For the sake of compbteness and convenience, this study used the equivabnt scores produced by frequency-estimation equipercentib method for the original test to do the group comparisons after equating. This set of equivalent scores was compbte because it involved all the items in the original test, and it was convenient because it already had been made availabb earlier in the study for equating accuracy. Specifically, the construct validity of the test was studied by investigating the effects of test form, years of experience, and their interaction on the exarninees’ performance. Rmarch Tools The IRT calibration program chosen for the analyses of this study was the aq‘ranced version of PC BILOG, BILOG-MG. One advantage of using BILOG is that ‘3 ILOG yields marginal maximum likelihood (ML) estimates and the number of b arameters estimated does not increase with the increasing number of examinees. Qcmpared to BILOG, LOGIST simultaneously maximizes the joint likelihood function for “he estimates of item and examinee parameters. However, the joint maximum likelihood (JML) estimates are likely to become inconsistent when the numbers of examinees or items increase (Baker, 1990; Misbvy & Stocking, 1989). Consequently, BILOG should yield more consistent results than LOGIST in such a situation. PC-BILOG uses the estimated posterior 0 distribution to establish the location and Inetric for the 0 scab (Baker, 1990). In an earlier simulation study, Yen (1987) found that BILOG largely yielded more precise estimates of individual item parameters than LOGIST. She also expected the improvement in accuracy of BILOG to increase if sample size decreased substantially. In terms of estimating item and test characteristic functions, Yen (1987) found that the relative effectiveness of the two programs depended on test 82 length. Specifically, BILOG yielded more precise estimates than LOGIST on equating shorter test forms with ten items; However, the two programs yielded very similar estimates when longer tests with 20 and 40 items respectively were equated. Mislevy and Stocking (1989) also found that BILOG would yield more reasonable results when tests are shorter or the samples are smaller. Based on a Bayesian framework, BILOG imposes prior distributions on all item parameters of the 3PL model. If the prior information is not appropriate for the data, item parameter estimates are biased (Baker, 1990). In addition to BILOG, SAS for Unix and Excel spreadsheets were also used in this st‘l-lciy to assist with various equatings, as well as data management and other statistical a-‘~'I--':llyses. The equipercentile equatings were facilitated by an extended version of the Qommon Item Program for Equating (CIPE) (Hanson, Zeng, & Kolen, 1995), which uses the frequency estimation method described by Kolen and Brennan (1995). 
The extended CIPE program, CIPE300 Plus, is written in FORTRAN and has the capacity to handle long test forms with more than 200 items. It uses the cubic spline method (Kolen & Jarjoura, 1987; Kolen & Brennan, 1995) to post-smooth the resulting equipercentile relationship. Up to eight user-specified smoothing parameters are allowed to manipulate the degree of smoothing (Hanson, Zeng, & Kolen, 1995). Essentially, the smoothing parameter controls the average squared standardized difference between the smoothed and the unsmoothed equating outcomes. After the frequency-estimation equipercentile equatings were applied, the CIPE output of the unsmoothed equivalents and their corresponding standard errors was graphed using Excel, along with the other sets of smoothed equivalents yielded by the various smoothing parameters. The various graphs of the smoothed equivalents were inspected for their smoothness and compared to the unsmoothed equivalents. In addition, to evaluate the smoothing requirement of "moment preservation" (Kolen & Brennan, 1995), the four moments -- mean, standard deviation, skewness, and kurtosis -- of the unsmoothed and smoothed equivalents for the entire examinee population were also computed using SAS. The moments of the smoothed equivalents were compared to those of the unsmoothed equivalents to identify the best smoothing parameter, that is, the one yielding a smooth function that did not depart too much from the unsmoothed equivalents.

Research Restrictions and Limitations

The rich context of the test data analyzed in this study rendered opportunities for item sampling and data manipulation. With such advantages, this study was able to address a variety of research questions in depth. In addition, the complexity of the data enriched the study design and helped expand the scope of research. However, the secondary nature of the data also restricted the study design in some ways and limited the generalizability of the study results. Restrictions on this study, and limitations of the study results caused by the data and the design used to accommodate the data, are briefly described below.

Restrictions Due to the Secondary Data

The secondary data used in this study was limited in the sense that it was collected before the study design was conceived. As a consequence, any manipulations before or during the data-collection process were not possible. The design of this study was therefore restricted by the nature of this secondary data. Typical consequences of such restrictions, and the consequent study limitations due to these restrictions, are discussed below.

Equating design. The test forms of the in-training test were linked by a set of common anchor items, embedded in the forms and given to non-equivalent examinee groups. Therefore, the anchor-item design for equating was the only option for equating the test forms in this study.

Long anchor tests. Most of the items in the test were anchor items. As a result, the item sampling in this study naturally resulted in sampled test forms (PS-A, PS-B, PW-A, PW-B, EW-A, and EW-B) containing more anchor items than unique items. Given such long anchor tests, it was difficult for this study to evaluate the accuracy of equating in a situation where there were only a few anchor items. Such a restriction of long anchors might have caused the study result that all the equating methods yielded similarly satisfactory equating outcomes. Given the long anchors, all of the equating methods were likely to yield accurate outcomes.
As a consequence, true differences among the equating accuracy of the various methods could not be detected or differentiated.

Differential anchor lengths. The number of items available for item sampling in each of the 23 core content areas was limited. In addition, the various item-sampling schemes had different demands on the types and amounts of test items. As a result, it was difficult for this study to obtain sampled test forms that all had the same number of items or anchor items. Although this study intentionally created sampled test forms that had similar test lengths and ensured that each sampled test form had a sufficient number of anchor items, there was still a slight chance that the equating results were influenced by the differential anchor lengths. If an effect of anchor length did exist, it was likely to confound with the effect of content homogeneity and the effect of content representativeness of anchor items. Therefore, the study findings about the effects of content homogeneity and content representativeness should be interpreted with special caution regarding the confounding effect due to differential anchor lengths.

Limitations of the Evaluation Criteria

The criteria used for evaluating equating accuracy, and the index incorporated in this study to measure the accuracy of equating, had inherent limitations. These limitations are summarized below.

Arbitrary criteria. The arbitrary nature of the two criteria for evaluating equating accuracy based on the results of the equipercentile equating method was self-evident. The major drawback of using these criteria was that they did not address equating accuracy directly. Only the consistency between the criteria and the results of the other equating methods was measured. Therefore, the evaluation outcomes based on these two criteria should be interpreted with caution. The other two criteria used in this study were based on the "pseudo true scores." They were only appropriate when the examinee population and the testing occasion were considered fixed. This is because the "pseudo true scores" estimated the true scores, at a particular point in time and for the particular examinee population of this study, on the complete set of the 145 common anchor items in the overall item pool. As a result of assuming such a "pseudo true score" was the true score, the two criteria were only valid in an equating context where the examinees were from a population the same as or similar to the one in this study and tested under a circumstance the same as or similar to the one in this study.

Auto-correlation. The Pearson's r used to estimate the accuracy of equating was inflated by the artifact of auto-correlation. The auto-correlation was caused by the overlap between the items from the sampled test and the items from the criterion used for evaluating equating accuracy. By excluding the overlapping items from the sampled test and then correlating the remaining items with the criterion, this study attempted to control the influence of the auto-correlation. The magnitude of the resulting Pearson's r, after controlling the auto-correlation, was used to measure the impact of the auto-correlation and to determine whether the problem of over-estimating equating accuracy was substantial. This strategy for controlling auto-correlation, however, did not completely eliminate the influence of the auto-correlation. Despite this difficulty, the strategy was nevertheless a useful way of improving the study of the effectiveness of the various equating methods and the effect of content representativeness on equating accuracy.
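As one way to picture this control, the sketch below computes an accuracy index twice: once with all the items of a sampled form and once after dropping the items the form shares with the criterion. It is a minimal sketch under stated assumptions; the response matrix, the criterion, and the overlap indices are all hypothetical placeholders rather than the study data.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# hypothetical 0/1 response matrix: 500 examinees x 60 items of a sampled form
responses = rng.integers(0, 2, size=(500, 60))
# criterion built partly from the first 40 items, so it overlaps with the form
criterion = responses[:, :40].sum(axis=1) + rng.normal(0, 2, 500)
overlap = np.arange(40)  # indices of the items shared with the criterion

full_scores = responses.sum(axis=1)
reduced_scores = np.delete(responses, overlap, axis=1).sum(axis=1)

r_full, _ = pearsonr(full_scores, criterion)        # inflated by auto-correlation
r_reduced, _ = pearsonr(reduced_scores, criterion)  # overlap removed
print(f"r with overlap: {r_full:.3f}; r without overlap: {r_reduced:.3f}")
```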
Generalizability of the Study Results

Various sources of limitations on generalizing the study results are discussed below. They include the characteristics of the test items, the test, the groups of examinees, and the particular equating models used.

Item format and scoring. In this study, all the test items had a multiple-choice format and were dichotomously (right or wrong) scored. Logically, the research findings based on these items should not be generalized to an equating context where test forms have non-multiple-choice items (e.g., short-answer items or extended-response items) or where a non-dichotomous scoring scheme (e.g., a partial credit system) is used. Test items of different formats are likely to induce differential examinee responses; therefore, they usually require different scoring keys or rubrics to provide adequate interpretation of the examinee responses. Equating involving non-dichotomously scored items entails different equating models with different assumptions than the ones used in this study. However, the featured design of this study, such as the manipulation of content representativeness and the use of the true-score based criteria for evaluating equating accuracy, is useful for improving the design of other research involving items of different formats or scoring systems. The results and findings of this study also offer useful insights for equating contexts with slight variations.

Score distribution. One important feature of the test data analyzed in this study is that the test was written for a minimum competency examination. The test therefore had a negatively skewed score distribution, since the examinees were expected to answer most of the items correctly. The four sampled tests also had negatively skewed score distributions. The study results based on such skewed score distributions should be generalized, with care, only to other testing situations that have similar score distributions.

Examinee population. The particular examinee population studied in this research also limited the generalization of the study results. As this study focused on a group of professionals in a medical field from a number of in-training programs, the results of this study should not be generalized to other subject populations that differ from the one in this study.

Equating models. The IRT-based equatings in this study assumed unidimensionality for the test forms being equated. Therefore, the equating results should not be generalized to other testing contexts where multi-dimensionality prevails. In addition, because the equatings were based on a 3-PL IRT model, which accounted for the chance of guessing, generalization of the study findings should be limited to contexts where the 3-PL model applies. Similarly, generalization of the results from any other equating method should also take into account the particular assumptions made by that method.

Chapter 9

RESULTS AND DISCUSSIONS

To facilitate an inspection of the characteristics of the four sampled tests, this chapter first summarizes the outcomes of the reliability studies, item analyses, and correlation studies among the total scores on anchor items, unique items, and the total test. Then, an examination of the examinee group differences is presented. Before presenting the main equating outcomes, the score distributions of the various test forms are discussed. The adequacy of the 3PL IRT model, on which the IRT-based equatings are based, is also discussed.
Intermediate and final equating outcomes are presented and discussed in the following order: (1) results from the various equating methods, using a raw-score-based true-score criterion, (2) equating results yielded by an IRT-based true-score criterion, (3) equating results produced by a criterion based on the outcome of the equipercentile method on equating the two original test forms, and (4) results yielded by an arbitrary criterion based on the outcomes of the equipercentile method on equating the sampled test forms. These results are compared to explore the effectiveness of various criteria for evaluating equating accuracy.

At the end, this chapter addresses important issues relevant to the adequacy of test equating and the assumptions of IRT-based equating. These issues relate to the construct validity and dimensionality of a test. Investigation outcomes for the construct validity of the original test, and the adequacy of the method used to equate the two original test forms, are summarized. Empirical results and theoretical elaboration on issues of test dimensionality are also presented and discussed.

Characteristics of Tests and Examinee Groups

In inspecting the characteristics of the original and sampled tests: (a) the reliability of these tests was studied by measuring their internal consistency, (b) item difficulties and item-total correlations were computed and examined, and (c) the content homogeneity of the items in the same test, and the content representativeness of the anchor items in a test, were addressed. Also evaluated were the possible ability difference between the two examinee groups and its influence on the test equating results.

Internal Consistency of Tests

The reliability of internal consistency, measured by Cronbach's α, is .894 for both original test forms. This indicates the adequacy of the in-training test. All of the sampled test forms created by item sampling also show internal consistency. As shown in Table 5, the values of Cronbach's α range from .658 to .774 across the sampled test forms. Although the numbers seem small, these indices suggest moderate reliability for these achievement tests. Typically, an achievement test places less emphasis on item homogeneity than an attitude or personality questionnaire does. In addition, an ability test usually has items that fall within wider ranges of item difficulty and item discrimination than an attitude questionnaire does.

Table 5 - Reliability of Sampled Test Forms: Indices of Internal Consistency

  Item Sampling Scheme                          Sampled Test Form   Cronbach's α
  Simple Random Sampling                        SR-A                0.658
                                                SR-B                0.713
  Equal-Weight Domain Random Sampling           EW-A                0.684
                                                EW-B                0.690
  Proportional-Weight Domain Random Sampling    PW-A                0.662
                                                PW-B                0.691
  Purposeful Sampling                           PS-A                0.774
                                                PS-B                0.768

Internal consistency of an achievement test is therefore often lower than that of an attitude questionnaire, because of the lower correlations among individual items. The measures of internal consistency for the sampled test forms are smaller than those for the original test forms. This is partly because there are fewer items in the sampled test forms. Comparing across the eight sampled test forms (see Table 5), as expected, the two forms based on the purposeful sampling have the highest internal consistency. The Cronbach's α is .774 for PS-A and .768 for PS-B. This is because all the items in PS-A or PS-B are from only three core content areas. The other six sampled test forms have similar internal consistency.
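For reference, the internal-consistency index reported in Table 5 can be computed from an examinees-by-items matrix of 0/1 scores as in the minimal sketch below; the response matrix is randomly generated for illustration, not the study data, so the resulting alpha will be near zero.

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for an examinees-by-items matrix of 0/1 item scores."""
    k = responses.shape[1]                         # number of items
    item_vars = responses.var(axis=0, ddof=1)      # variance of each item
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
demo = rng.integers(0, 2, size=(1000, 60))  # hypothetical 1000 examinees x 60 items
print(round(cronbach_alpha(demo), 3))       # near zero because responses are random
```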
Moreover, the two test forms from the same test are similar in their internal consistency. This provides some justification for using the Tucker linear method. As discussed in the section on equating methods in Chapter 8, the Tucker method requires equal reliability of the test forms being equated.

Item Analyses Results

The results of the item analyses, including analyses of item difficulty and item-total correlation, provide useful information on the sampled tests and their test items. The classical item difficulty (p) of item j, in a test taken by a group of N examinees, can be technically defined as

p_j = \frac{\text{number of examinees with a correct response on item } j}{N}.

In words, the difficulty of an item is the proportion of examinees that answer the item correctly (Crocker & Algina, 1986). Analyses of item difficulty revealed that the items in all four sampled tests had moderate difficulties on average. Across the various test forms, the average item difficulties range from 0.688 to 0.759.

In Figure 1a, individual items in the various sampled test forms are sorted by their item difficulties in ascending order. The patterns of the graphed item difficulties for the two forms of each sampled test are similar, and the patterns across the various sampled tests are also alike. Unlike a report of average item difficulty, the graphs in Figure 1a give a closer look at the variations in item difficulty within and across test forms. They better describe how these sampled test forms resembled or differed from one another. Overall, they suggest that the sampled test forms had similar difficulties. For each of the test forms, the difficulties of the items spread rather evenly between 0.4 and 1. Such a range of item difficulty (from medium to high) is expected, because it is typical for a test written for a minimum competency examination.

Figure 1b presents the cumulative frequency distributions of item difficulties. Items within a test form were sorted into ten intervals by their difficulties to summarize the item-difficulty distribution of that test. The resulting distributions in Figure 1b show that all the sampled test forms had easy, moderate, and difficult items. Across test forms, the distributions look similar. This suggests that the sampled tests were similar in their difficulties, as also suggested by Figure 1a. Figure 1b also presents the mean and standard deviation of item difficulties for each sampled test form. It also includes significance test results for the differences between the average item difficulties of each pair of sampled test forms. The equal-variance two-tailed Student's t-test was used to examine such mean differences. For all of the four sampled tests, none of the mean differences was statistically significant at the .05 level. These small and statistically non-significant differences attest to the adequacy of the sampled test forms for an equating study. Typically, equating is only used to adjust for minor differences in item difficulty between different forms of a sampled test (Cook & Eignor, 1991).
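A compact way to reproduce this kind of item analysis is sketched below: classical difficulties as proportions correct, plus an equal-variance two-tailed t-test comparing the mean difficulties of two forms. The response matrices are randomly generated placeholders, not the study data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
form_a = rng.integers(0, 2, size=(1100, 69))  # hypothetical 0/1 responses, Form A
form_b = rng.integers(0, 2, size=(1100, 69))  # hypothetical 0/1 responses, Form B

# classical item difficulty: proportion of examinees answering each item correctly
p_a = form_a.mean(axis=0)
p_b = form_b.mean(axis=0)
print(f"mean p (A) = {p_a.mean():.3f}, sd = {p_a.std(ddof=1):.3f}")
print(f"mean p (B) = {p_b.mean():.3f}, sd = {p_b.std(ddof=1):.3f}")

# equal-variance two-tailed t-test on the two sets of item difficulties
t, p_value = ttest_ind(p_a, p_b, equal_var=True)
print(f"t = {t:.3f}, p = {p_value:.3f}")
```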
Figure 1a - Item Difficulty (p) for Items in Sampled Test Forms (in Ascending Order); one panel per sampled test form (SR-A, SR-B, EW-A, EW-B, PW-A, PW-B, PS-A, PS-B)

Figure 1b - Cumulative Frequency Distributions of Item Difficulty for Sampled Test Forms; each panel reports the mean and standard deviation of item difficulty, and the pairwise t-test results are all non-significant at the .05 level

Specifically, the standard deviations of the item difficulties range from 0.145 to 0.153 over the various test forms, indicating similar item difficulties. Combined with the patterns found in the distribution plots, these standard deviations suggest that the item difficulties for the various sampled test forms spread in a similar way. The small standard deviations also suggest that the difficulties of the items within a form were not far apart.

Item-Total Correlations

For each sampled test form, the items generally correlated positively and moderately with their total test. This provides evidence of homogeneity (in examinees' responses) for items from the same test form. Figure 2 presents graphs of item-total correlations in ascending order for the various sampled test forms. For each form, most of the item-total correlation coefficients spread between .10 and .50. The average item-total correlations range from .199 to .293 across the various forms. The standard deviations of the item-total correlations are between 0.103 and 0.126. These figures suggest that the overall patterns of item-total correlations were not too different across test forms. The average item-total correlations for the sampled test forms (SR-B, EW-B, PW-B, and PS-B) created from the original Book B were higher than those for their counterparts sampled from the original Book A. The equal-variance two-tailed Student's t-test was used to examine the mean differences for each pair of sampled test forms. The significance test results show that none of the mean differences was statistically significant at the .05 level. For each sampled test, its two forms had similar item-total correlations.
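The item-total correlations discussed here are ordinary Pearson correlations between each 0/1 item score and the total score on the form; a minimal sketch, with randomly generated placeholder responses rather than the study data, is shown below. (Some analysts instead correlate each item with the total excluding that item; the document does not specify, so the plain item-total version is used.)

```python
import numpy as np

def item_total_correlations(responses):
    """Pearson correlation of each item with the total score on the form."""
    total = responses.sum(axis=1)
    return np.array([np.corrcoef(responses[:, j], total)[0, 1]
                     for j in range(responses.shape[1])])

rng = np.random.default_rng(3)
demo = rng.integers(0, 2, size=(1100, 69))   # hypothetical 0/1 responses
r_it = item_total_correlations(demo)
print(f"mean r = {r_it.mean():.3f}, sd = {r_it.std(ddof=1):.3f}, "
      f"items with negative r: {np.where(r_it < 0)[0].tolist()}")
```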
Figure 2 - Item-Total Correlation (r_it) for Items in Sampled Test Forms (in Ascending Order); each panel reports the mean and standard deviation of the item-total correlations for that form

Effect of item sampling. Among the eight sampled test forms, PS-A and PS-B had the highest average item-total correlations (see Figure 2). This is because the two forms were created by purposeful sampling, which sampled items from only three of the 23 core content areas. As a result of the sampling scheme, items in PS-A or PS-B were expected to correlate with one another to a higher degree than the items in the other six test forms. The largest average item-total correlations of PS-A and PS-B provide evidence for the effect of item sampling. Test forms EW-A, EW-B, PW-A, and PW-B all have items from each of the 23 core content areas. This fact could have contributed to the smaller average item-total correlation coefficients for these forms (than those for PS-A and PS-B). Also, it explains the similarities between the average item-total correlations of sampled tests EW and PW. It is plausible that the variations in average item-total correlation across the various test forms are in part due to the four item-sampling schemes incorporated into the study design. These item-sampling schemes successfully manipulated the content homogeneity (or heterogeneity) of the items in the sampled tests.

Items with negative item-total correlations. As shown in Figure 2, some sampled test forms had a few items that correlated negatively with their corresponding total tests. These anomaly cases were identified, and their item-total correlations (for the original and sampled test forms) were examined. Overall, the examination results show that, across the various sampled test forms, the anomaly cases were not always the same items. Anomaly items in the original and sampled test forms are listed in Table 6.

Table 6 - Items with Negative Item-Total Correlations

  Book A:  (#27), (#52), #95, #206        Book B:  #66, #139
  SR-A:    (#27), (#52), #78              SR-B:    None
  EW-A:    (#116)                         EW-B:    #189
  PW-A:    (#27), (#52), #70              PW-B:    #66, #139, #143
  PS-A:    #78                            PS-B:    None

Note. Items in parentheses are anchor items.

As shown in Table 6, while some anchor items correlated negatively with one form of a test, they did not correlate negatively with the other form of the same test. For instance, anchor items #27 and #52 had negative correlation coefficients in SR-A, but they did not correlate negatively with SR-B. None of the items in SR-B had negative item-total correlations. Similarly, #27 and #52 had negative correlation coefficients in PW-A but not in PW-B. Item #116 had a negative correlation coefficient in EW-A but not in EW-B.
Therefore, to keep the set of 145 anchor items from the original test intact, these anomaly anchor items were retained for the subsequent equating studies. Fortunately, the magnitudes of these negative correlation coefficients were mostly less than .05. Table 6 also shows that some of the anomaly items in the sampled test forms were indeed anomaly items in the original test forms. Specifically, items #27 and #52 had negative item-total correlation coefficients in the original Book A, and they also had negative coefficients in two of the sampled test forms (SR-A and PW-A). In addition, items #139 and #66 correlated negatively with Book B, and they also correlated negatively with one sampled test form (PW-B).

Based on the above findings, it may seem reasonable to exclude these anomaly items from the study. However, these anomaly items were not always anomalies across test forms. In addition, not all the anomaly items in the sampled test forms correlated negatively with the original test forms. Items #70, #78, and #116 are examples of such cases for sampled test forms PW-A, SR-A, and EW-A respectively. Similarly, item #143 is an example for PW-B, and item #189 is an example for EW-B. In part, the inconsistency in the anomaly cases across different test forms can be explained by the item-sampling schemes used in this study. Such inconsistency can be part of the item-sampling outcomes, since the item-sampling schemes manipulated the content homogeneity of the sampled items in the various test forms. Therefore, the second reason the anomaly items were kept for further analyses was to maintain such an effect of item sampling. It was determined that the impacts of the negative item-total correlations were not serious, since the negative correlations all had small magnitudes.

Characteristics of Anchor Items

The characteristics of the anchor items in the original and sampled test forms were analyzed with different types of correlation coefficients. These analyses examined the relationship between the total score on the anchor items and the total score on the non-anchor (unique) items, and the relationship between the total score on the anchor items and the total test score. The results are summarized in Table 7.

Table 7 - Coefficients of Correlations Between Total Scores on Anchor Items, Non-Anchor (Unique) Items, and Total Test

  Test Form              r(anchor, unique)   r(anchor, total)   r(unique, total)
  Original  Book A            .754**              .981**             .866**
  Test      Book B            .736**              .979**             .859**
  Sampled   SR-A              .503**              .861**             .873**
  Test      SR-B              .537**              .863**             .889**
            EW-A              .439**              .939**             .721**
            EW-B              .488**              .939**             .758**
            PW-A              .443**              .924**             .752**
            PW-B              .482**              .925**             .778**
            PS-A              .486**              .968**             .690**
            PS-B              .451**              .968**             .660**

Note. ** Significance level less than 0.01 (two-tailed).

Correlation between anchor items and unique items. The correlation between the total score on the anchor items and the total score on the unique items of a test, r(anchor, unique), provides a further check on test composition. In addition to providing empirical evidence of item homogeneity, this type of correlation can also be used to indicate efficiency for test equating. Budescu (1985) argued that the larger r(anchor, unique) is, the more precisely the parameters will be estimated for the combined group in equating. Table 7 shows that all of the indices of equating efficiency for the various test forms (see the first column of correlation coefficients) are statistically significant. Their values range from .44 to .54.
This suggests that the anchor items and non-anchor items from the same test form were similar, and that the common-item equating in this study would be efficient. Moreover, the total scores on the unique items also correlate strongly and significantly with the scores on the total test forms (see the third column of correlation coefficients in Table 7). This provides more evidence of adequate test composition.

Correlation between anchor items and total test. Scores on the anchor tests also correlate significantly with the total test scores to a considerable degree (see the second column of correlation coefficients in Table 7). The values of such correlations, r(anchor, total), range from .861 to .968 for the various sampled tests. They show that the anchor items in the sampled tests were content representative of their total tests. However, these coefficients of r(anchor, total) were inflated by auto-correlation. The anchor test was correlated with itself when r(anchor, total) was computed, since the anchor test was embedded in the total test. Despite the auto-correlation effect, from the perspective that these anchor items were an integral (inseparable) part of the total test, the coefficients of r(anchor, total) still provide a sensible measure of equating efficiency (Budescu, 1985). Therefore, r(anchor, total) was used in this study as an index of content representativeness for the anchor test. Concerns about the influence of auto-correlation will be further addressed later in this section.

Effect of item sampling on content representativeness. The various item-sampling schemes used in this study were intended to manipulate the content representativeness of the anchor items in the sampled tests. The differences in r(anchor, total) across the various sampled tests provide evidence of an item-sampling effect for these schemes. Table 7 clearly shows that the various anchor tests were more or less representative of their corresponding total tests, despite the fact that all of the values of r(anchor, total) are large. This item-sampling effect improves the chance that the subsequent studies of the content-representativeness effect on equating accuracy will be valid. As shown in Table 7, r(anchor, total) decreases as the content specificity of the sampled test changes. The pattern of the changes for test form A is: from .968 (PS) to .939 (EW) to .924 (PW) to .861 (SR). The pattern for form B is: from .968 (PS) to .939 (EW) to .925 (PW) to .863 (SR). These two patterns show that test forms A and B had identical trends of decrease in r(anchor, total). Forms PS-A and PS-B had the most content-representative anchor items, because the purposeful item-sampling scheme drew items from only three core content areas. As a result, the items in PS-A and PS-B were more homogeneous in their content, which logically led to the larger r(anchor, total). Forms SR-A and SR-B had the least representative anchor items. This is because their randomly sampled items were from 19 and 20 core content areas respectively (see Table 3). Not only were the items in SR-A and SR-B less content homogeneous, the content variations across items in SR-A and SR-B were also less predictable, due to the random sampling of items. The magnitudes of r(anchor, total) for EW-A and PW-A were similar. For EW-B and PW-B, the magnitudes were also alike. This suggests that there were no substantial differences between the tests created by the equal-weight domain random sampling and the proportional-weight domain random sampling, in terms of anchor-item content representativeness.

Auto-correlation and differential anchor lengths. As mentioned earlier, r(anchor, total) was inflated by auto-correlation.
Given the same test length, a long anchor test would inflate the magnitude of r(anchor, total) more than a short anchor test would. As explained in the section on research limitations in Chapter 8, due to the limited availability of original test items in each of the 23 core content areas, and the different demands on test characteristics of the four item-sampling schemes, the item-sampling design resulted in varying anchor lengths for the different sampled tests. The percentage of anchor items was 50% for SR, 61% for EW, and 66% for PW. The two forms of PS had slightly different percentages: 75% for PS-A and 79% for PS-B. As a consequence of the differential anchor lengths, the impact of auto-correlation might differ across the sampled tests. Part of the stronger anchor-total correlations for PS-A and PS-B is attributable to their longer anchor lengths. Similarly, shorter anchor length partly accounts for the weaker correlations for SR-A and SR-B. Such an anchor-length effect could affect the findings about the content representativeness of anchor items. Therefore, r(anchor, total) may not be sufficient for estimating how representative the anchor items were of the total test.

Despite the limited empirical findings described above, the differential content representativeness of the various anchor tests is plausible in theory. It is backed up by the particular content structure of the original test and the item-sampling design. In addition, although the anchor lengths were different across test forms, all of the anchor tests were controlled to be sufficiently long for test equating. The fact that all of the anchor tests are long is expected to lessen the impact of the differential lengths on equating accuracy. In each sampled test form, at least 50% of the items were anchor items (see Table 2). Such a high percentage far exceeded the commonly recommended anchor lengths for adequate test equating (Angoff, 1984).

The differences between the two examinee groups were studied with consideration of the ability, years of experience, and program participation of the examinees.

Group performance on anchor items. The examinee groups performed slightly differently on the anchor items. The group taking Book B (mean = 107.721) scored slightly higher than the group taking Book A (mean = 105.457) on the 145 anchor items. The difference between the group means was statistically significant at the .05 level (t = 3.987, df = 2,239, p = .0001). There were similar group differences across the four pairs of sampled test forms. Table 8 summarizes the statistical test results for the group mean differences on anchor items. Average item difficulties were computed for the anchor items and unique items separately to further examine the group differences. Table 9 summarizes these results. Across the various sampled tests, there were slightly larger values of item difficulty on the anchor items.

Table 8 - Statistical Test Results for Group Mean Differences on Anchor Items

Table 9 - Average Item Difficulties for Anchor Items and Unique Items

If more demographic information were available, in-depth group differences and their potential influences on dimensionality (defined empirically) and equating accuracy could be more thoroughly examined. The data analyzed in this study are secondary data, from which only two demographic variables (years of experience and program participation) were available. This restricted the investigation of group differences in this study.
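The group comparison on the 145 anchor items reported above relies on an independent-samples t-test; a minimal sketch of that computation, with a standardized mean difference as an effect size, is given below. The score vectors and group sizes are randomly generated stand-ins, not the study data.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)
# hypothetical anchor-item total scores (out of 145) for the two examinee groups
group_a = rng.normal(105.5, 13.0, size=1100)
group_b = rng.normal(107.7, 13.0, size=1141)

t, p = ttest_ind(group_a, group_b, equal_var=True)
pooled_sd = np.sqrt(((len(group_a) - 1) * group_a.var(ddof=1) +
                     (len(group_b) - 1) * group_b.var(ddof=1)) /
                    (len(group_a) + len(group_b) - 2))
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd  # effect size of the group difference
print(f"t = {t:.3f}, p = {p:.4f}, d = {cohens_d:.3f}")
```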
Summary

In short, the sampled test forms were reliable in terms of their internal consistency. The items within the forms had moderate difficulties, and the anchor items were representative of the total tests. The studies of the demographic attributes of the two examinee groups indicate between-group similarities. Although there was a slight difference between the two examinee groups in their ability, the difference was not serious.

Score Distributions of Various Test Forms

The score distributions of the two original test forms, presented in APPENDIX A, were both negatively skewed. This is because the original test was written for a minimum competency examination. After the sampled test forms were created for this study, the score distributions of these test forms were also examined (included in APPENDIX A). Like the original Book A and Book B, all of the sampled test forms had more or less negatively skewed distributions. Such properties of skewed score distributions for the four sampled tests are taken into account in the subsequent equating studies, for the discussions and interpretations of the study results.

Adequacy of the 3PL IRT Model for IRT-Based Equatings

The outcomes of the two IRT-based equating methods were based on a three-parameter logistic (3PL) IRT model. The IRT model incorporated a guessing parameter to account for the likely guessing factor in the minimum competency examination. Grounded in the 3PL IRT model, the IRT-based equating outcomes therefore gained a logical and theoretical advantage of taking into account the chance of guessing. In addition, the satisfactory equating outcomes of the two IRT-based equating methods (presented later in this chapter) provide empirical evidence of the adequacy of the underlying 3PL IRT model. Combining the theoretical appeal with the empirical evidence of adequacy, this study concludes that the 3PL IRT model was appropriate for the IRT-based equatings conducted in this study, where test forms with negatively skewed score distributions were equated.

Equating Outcomes of IRT-Based Methods

The outcomes of IRT parameter estimation and the equating outcomes of the two IRT-based methods are summarized below. The equating results of the two IRT-based methods are found to be very similar.

Estimation of IRT Parameters

The results of fitting a 3PL IRT model are summarized in Table 12. Over the various test forms, for both the IRT-based linear transformation method and the fixed-b method, the intermediate outcomes of parameter estimation showed small variation in item discrimination.

Table 12 - Results of IRT Parameter Estimation (means and standard deviations of the item difficulty, item discrimination, and guessing parameter estimates for each sampled test form, by calibration group)
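For reference, the 3PL item response function underlying these estimates gives the probability of a correct response as a function of ability θ and the item parameters a (discrimination), b (difficulty), and c (guessing); the minimal sketch below uses the common 1.7 scaling constant and hypothetical parameter values.

```python
import numpy as np

def p_3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response given ability theta."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# hypothetical item: moderate discrimination, fairly easy, some guessing
print(round(p_3pl(theta=0.0, a=0.45, b=-1.0, c=0.17), 3))
```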
Overall, the estimated values of the average item discrimination parameters were close, and the standard deviations, relative to the means, were also small. In addition, the estimation of the guessing parameter yielded quite similar results. However, the variation in item difficulty across sampled test forms seemed large. When alternate forms of each sampled test were calibrated separately, as required by the IRT-based linear transformation method, the resulting average anchor-item difficulties differed across the various sampled tests. In particular, the anchor items of PS-A and PS-B had greater average item difficulties than those of the other three pairs of test forms. These differences are attributed to the item sampling in this study, which sampled different numbers of items from different numbers of core content areas. For each of the sampled tests, however, the average anchor-item difficulties of its two forms were not too far apart. This lends some evidence of similar ability for the two examinee groups.

Equivalent Ability Estimates

The equivalent ability estimates yielded by the IRT-based linear transformation method and the fixed-b equating method correlated strongly. This indicates similarities between the outcomes of the two methods. Across the various sampled tests, all of the Pearson's rs were statistically significant and had values close to 1. Therefore, the two IRT-based methods did not differ much in determining an individual examinee's standing in the entire examinee group.

To further study how similar the results of the two IRT-based equating methods were, for each sampled test, the means and standard deviations of the two sets of resulting ability estimates were compared. The mean difference between the ability estimates of the two IRT-based methods was also tested for its significance. As shown in Table 13, the average ability estimates of the two methods seem to differ only slightly, and the standard deviations are very similar. However, the dependent-samples t-tests show that the outcomes of the two methods were significantly different (p < .001), no matter which pair of sampled test forms was equated. To control the total error rate, which is likely to increase with the number of hypothesis tests, a more conservative significance level (α = 0.01) was chosen for the t-tests instead of the conventional α = 0.05. Overall, the test results suggest that the outcomes of the two IRT-based methods were not as close as represented by the Pearson's rs. Although the statistical tests suggest significant differences, it should be noted that the large t-values in Table 13 are partly due to the small standard errors of the mean differences and the large sample size, and hence more degrees of freedom. The effect sizes across the four tests are all very small (less than 0.5), implying practical insignificance.

Graphing the resulting equivalent ability estimates of one IRT-based method against those of the other method, the scatter plots in Figure 4 illustrate the positive relationships between the outcomes of the two methods across tests (differing in their item homogeneity). While the fairly solid straight lines in the plots suggest strong linear relationships, the slight thickness and coarseness of these lines indicate that the relationships were not as nearly perfect as indicated by the Pearson's rs. Overall, at the two ends of the ability scale, the outcomes of the two methods were more similar than at the middle range of the ability scale.
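A minimal sketch of this comparison of paired ability estimates, using a dependent-samples t-test and a standardized mean difference as the effect size, is given below; the two arrays are randomly generated stand-ins for the estimates from the two methods, not the study estimates.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(5)
theta_linear = rng.normal(0.0, 1.0, size=2241)                      # hypothetical linear-transformation estimates
theta_fixed_b = theta_linear + rng.normal(0.01, 0.05, size=2241)    # hypothetical fixed-b estimates

t, p = ttest_rel(theta_linear, theta_fixed_b)
diff = theta_fixed_b - theta_linear
effect_size = diff.mean() / diff.std(ddof=1)   # standardized mean difference of the paired estimates
r = np.corrcoef(theta_linear, theta_fixed_b)[0, 1]
print(f"r = {r:.4f}, t = {t:.2f}, p = {p:.3g}, effect size = {effect_size:.3f}")
```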
Table 13 - Comparisons of the Resulting Ability Estimates of the Two IRT-Based Equating Methods

Figure 4 - Relationship Between the Resulting Ability Estimates of the Two IRT-Based Equating Methods (one scatter plot per sampled test: equal-weight domain random sampling, proportional-weight domain random sampling, purposeful sampling, and simple random sampling, plotting the linear transformation estimates against the fixed-b estimates). Note. (1) θ is the examinee ability; (2) ** indicates a significance level less than .01.

Equivalent True Score Estimates

Applying the following formula (Lord, 1980) to the outcomes of the 3PL IRT-based equatings, true score estimates were obtained:

\hat{\tau} = \sum_{i=1}^{n} p_i(\theta) = \sum_{i=1}^{n} \left\{ c_i + \frac{1 - c_i}{1 + \exp[-1.7 a_i (\theta - b_i)]} \right\},   (9.1)

where \hat{\tau} is the estimated true score, p_i(\theta) is the probability of getting item i correct given examinee ability \theta, n is the number of items, a_i is the item discrimination for item i, b_i is the item difficulty for item i, and c_i is the pseudo-chance level (guessing) for item i (Hambleton & Swaminathan, 1990).

As expected, for each sampled test, the correlation between the resulting true-score estimates from the two IRT-based equating methods was fairly strong and statistically significant. The Pearson's r ranged from .976 to .999 across the four sampled tests, indicating nearly perfect relationships between the outcomes of the two methods. These findings are similar to those based on the correlations for the ability estimates. Therefore, the two IRT-based methods used in this study were similar in determining an individual examinee's standing in the entire examinee group.

Table 14 shows that the average true score estimates of the two IRT-based equating methods were very similar. So were their standard deviations. However, regardless of the pair of sampled test forms being equated, the dependent-samples t-test reveals a significant difference between the outcomes of the two methods (α = 0.01 and p < .001). According to the results of these significance tests, across the various sampled tests, the relationships between the two IRT-based methods were not as nearly perfect as suggested by the Pearson's rs.

Table 14 - Comparisons of the Resulting True Score Estimates of the Two IRT-Based Equating Methods
Nevertheless, the large t-values and significant test results in Table 14 can be attributed to the small standard errors of the mean differences and the large sample size. In addition, the effect sizes for the differences across the four tests are all very small (less than 0.25). Thus, the differences between the two methods in estimating the true scores might not have practical significance.

The scatter plots in Figure 5 illustrate the delicate relationship between the two IRT-based equating methods, by graphing the resulting true score estimates of the fixed-b method against the resulting estimates of the linear transformation method. These plots are more revealing than the plots in Figure 4 (which are based on the resulting ability estimates) in showing the differences between the two equating methods. While the scattered data points form a fairly solid straight line for sampled tests EW and SR respectively, the data points for PW and PS clearly show more than one line. The two separate lines in the plots for PW and PS suggest that the resulting true score levels of one method did not correspond to the levels of the other method on a one-to-one basis. On either of the two sampled tests (PW or PS), when one IRT-based equating method was used, a group of examinees might receive the same scores, but the same group of examinees might receive different scores when the other IRT-based method was used for equating. An inspection of the resulting estimates of equivalent true scores yielded by the two equating methods confirmed this speculation. In addition, the formations of the data points for tests PW and PS in Figure 5 seem to be linear yet slightly elliptical. For each test, the two separate lines shown in the scatter plot not only look slightly curvilinear but also curve in opposite directions. They suggest that the outcomes of the two IRT-based methods were more similar for cases receiving scores near or at the two ends of the true-score scale (than for the cases in the middle range of the scale). The slightly non-linear relationships between the two IRT-based equatings on PW and PS suggest that using Pearson's r to summarize or compare the outcomes of different equating methods could be misleading. Graphical displays contrasting the equivalent scores from different equating methods are recommended to improve the comparisons.

Figure 5 - Relationship Between the Resulting True Score Estimates of the Two IRT-Based Equating Methods (one scatter plot per sampled test, plotting the linear transformation estimates against the fixed-b estimates). Note. (1) \hat{\tau} is the equivalent true score estimate; (2) ** indicates a significance level less than .01.

Smoothing Equipercentile Equating Outcomes

This study used the frequency estimation method (Kolen & Brennan, 1995) to conduct equipercentile equating.
To increase equating precision, after obtaining the frequency-estimation equipercentile equivalent scores, this study applied the cubic spline postsmoothing method (Kolen & Jarjoura, 1987; Kolen & Brennan, 1995) to smooth the equivalent scores. A total of eight smoothing parameters were specified (s = .01, .05, .10, .20, .30, .50, .75, and 1) for postsmoothing. These parameters yielded smoothed equivalent scores differing in their degree of smoothing (Hanson, Zeng, & Kolen, 1995). That is, they controlled the amount of the average squared standardized difference between the smoothed and the unsmoothed equating outcomes.

Evaluating Smoothness

The resulting eight sets of smoothed equivalent scores were inspected graphically and statistically to determine which of the eight smoothing parameters resulted in the least amount of smoothing required for a smooth equipercentile equating function. For the graphical inspection, each of the eight sets of smoothed equivalent scores was graphed with the set of unsmoothed equivalent scores, and a standard error band was constructed around the unsmoothed equating outcomes to facilitate visual inspection. The adequacy of the various smoothed equating outcomes was judged in part by their smoothness and their deviations from the unsmoothed equating outcomes, as shown in such graphs. When there was more than one adequate smoothing parameter, the judgment was made with consideration of the large sample size of this study and the numbers of items in the sampled test forms.

Figure 6 presents a set of eight graphs for one sampled test (EW) to illustrate these graphical inspection techniques. The graphs depict the changes (before and after a smoothing parameter is applied) in the relationship between the equivalent scores on sampled test forms EW-A and EW-B, as the degree of smoothing varies. The same type of graphs for the other sampled tests (PS, PW, and SR) are included in APPENDIX B. In the case of smoothing the frequency estimation outcomes for sampled test EW, any value of the smoothing parameter equal to or greater than .10 would result in smoothed equating outcomes that are too far from the unsmoothed equating outcomes (more than one standard error of the unsmoothed equivalent scores). This is illustrated in the graphs for smoothing parameters s = .10, s = .20, s = .30, s = .50, s = .75, and s = 1.0 in Figure 6. These graphs show that, using any one of these six parameters, some smoothed outcomes (between the scores 37 and 40 on the EW-B scale) would fall outside the standard error band around the unsmoothed outcomes. Figure 6 also shows that, using these parameters, there would be larger overall differences between the smoothed and unsmoothed outcomes (than when the parameters s = .01 and s = .05 were used). Therefore, both s = .01 and s = .05 were more appropriate for postsmoothing.

Figure 6 - Score-to-Score Equivalents for Various Degrees of Smoothing, Sampled Test EW (smoothed and unsmoothed equivalents with a standard error band, one panel per smoothing parameter)
n _- m ... S x. v 7 m no“ N O 0 u y no. u m _ m w o N . . m. m. w. ... 1 n6 m m. ...... 1 ~ My . 0 .~ I — m 804qu nocuOOEmlol m .2. ouxm. can» ofimlol o nouum vunccnum Ha ...... y n; w uouum Uuovcnum an ...... en; ..u conuooEmchT. M vonuooanglal .M . - 8 hufiucog e N am auaucooull N u my 3 mm Qm u 2. no 8 mm 3 mm Om u - N. no 8 nn N- m. m N- m m m r n..- m r m T m. s m - m . _. f _ a w m V . u v n6. 3 . n6. M . .M . o v o N . w .. M r . m. ,. - 3 m 3 m. .. u S ...... _ m. . .. f _ m 89°"sz omfiooEmlol % a 8m ouzm. 3585+ w nouum vuoneaum Z ...... . w. m nouum 33:85 3 ...... - n. u 005028 a: If: M conuooamno lcl M 3333 - a fl». 3353' . N am 126 and s=.05 were more appropriate for postsmoothing. Compared to the graph for s=.01, the graph for s=.05 suggests slightly smoother outcomes. Although s=.05 would result in slightly higher degree of smoothing, given the large sample size and number of items for EW, the amount of smoothing required by s=.05 should be scarcely more than the amount required by s=.01. E l . [,1 I | E l' .. In addition to the graphical inspection, the four moments -- mean, standard deviation, skewness, and kurtosis -- of the resulting smoothed equivalent scores were estimated to evaluate the smoothing requirement of “moment preservation” (Kolen & Brennan, 1995). The estimation outcomes for the moments of the eight sets of equivalent scores are summarized in Table 15. The moments of the smoothed equivalent scores were compared to the moments of the unsmoothed equivalent scores such that the most appropriate smoothing parameter could be identified. An appropriate smoothing parameter will result in a smooth equipercentile equating function that does not depart too much from the unsmoothed equating outcomes. Using the evaluation outcome for sampled test EW as an example, the assessment procedure of “moment preservation” is briefly illustrated. As shown in Table 15, for sampled test EW, the four moments of the smoothed equivalent scores resulted from s=.05 are more similar to the moments of the unsmoothed equivalent scores than those from s=.01. Combining this finding with the information from Figure 6, where smoothness and deviations from the unsmoothed outcomes of the smoothed outcomes were examined, s=.05 was therefore chosen to produce final smoothed equipercentile- 127 Table 15 - Moments for Postsmoothing Outcomes Smoothing A A A A Sampled Test Parameter u o Skewness Kurtosis Unsmoothed 42.2699 6.7032 -0.5084 0.0696 S=.01 42.2745 6.7038 -0.4997 0.0529 s=.05 42.2739 6.7060 -0.5006 0.0583 PS S=.10 42.2739 6.7078 -0.5017 0.0628 S=.20* 42.2731 6.7096 -0.5030 0.0690 S=.30 42.2728 6.7108 -0.5040 0.0733 S=.50 42.2730 6.7140 -0.5049 0.0674 S=.75 42.2769 6.7197 -0.5 121 0.0489 S=1.00 42.2796 6.7237 -0.5223 0.0518 Unsmoothed 51.0025 5.7939 -0.5517 0.6772 S=.01 51.0043 5.7973 -0.5465 0.6675 S=.05 * 51.0031 5.7 973 -0.5474 0.6724 EW Sz. 10 51.0031 5.7974 -0.5478 0.6733 S=.20 51.0029 5.7972 -0.5467 0.6678 S=.30 51.0029 5.7 972 -0.5467 0.6678 S=.50 51 .0027 5.7956 -0.5499 0.6679 S=.75 51.0035 5.7963 -0.5550 0.6784 S=1.00 51.0035 5.7963 -0.5550 0.6784 Unsmoothed 42.8099 5.5660 -0.3986 . 
Although some other smoothing parameters, such as s = .10 and s = .50, seemed to yield moments more similar to those of the unsmoothed outcomes, they were not appropriate for smoothing the equated scores on EW because some of their smoothed outcomes would fall outside the standard error band.

It should be noted that the smoothing requirement of "moment preservation" also requires the moments of the equated scores on one form of a test to be close to those on the other form of the same test (Kolen & Brennan, 1995). This property is desired for both the random group equating design and the common-item non-equivalent group design. However, for the non-equivalent group design used in this study, it is much more difficult to examine this property, and the interpretation will not be as clear as for the random group design (M. J. Kolen, personal communication, May 6, 1997). This study therefore did not directly assess "moment preservation" on one form for the particular population taking the other form, because of missing data. In addition, the moments in this study depended on the particular assumption made by the frequency-estimation method used for the equipercentile equating. The frequency-estimation method assumes that, for both forms of a test, the conditional distribution of total score given each common-item score is the same in both populations.

Selection of Smoothing Parameters

Despite the difficulty in assessing "moment preservation" across test forms, the graphical inspection of the smoothing results and the evaluation of "moment preservation" within test forms provide useful information for assessing the effectiveness of the various smoothing parameters. Judgments were made about the relative estimation errors caused by the eight smoothing parameters. Taking into account all the information, the following smoothing parameters were selected and used for the four sampled tests, respectively, to improve the equated scores resulting from equipercentile equating: (a) s = .05 for sampled tests EW and PW, (b) s = .20 for sampled test PS, and (c) s = .01 for sampled test SR. The final equipercentile equivalent scores yielded by the above smoothing parameters appeared to be smooth and were not too far from the unsmoothed results (see Figure 6 and APPENDIX B). Their four moments (see Table 15) were also close to those of the unsmoothed equivalent scores. Without introducing substantial bias into the smoothing process, the use of these smoothing parameters improved the precision of the equipercentile equating in estimating the equivalent scores (Kolen, 1991).
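A minimal sketch of the "moment preservation" check, computing the four moments of an unsmoothed and a smoothed set of equivalent scores for comparison, is given below; the score vectors are randomly generated placeholders rather than the study's equated scores.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def four_moments(scores):
    """Mean, standard deviation, skewness, and (excess) kurtosis of equated scores."""
    return (scores.mean(), scores.std(ddof=1),
            skew(scores), kurtosis(scores))  # kurtosis() returns excess kurtosis by default

rng = np.random.default_rng(6)
unsmoothed = rng.normal(51.0, 5.8, size=2240)
smoothed = unsmoothed + rng.normal(0.0, 0.05, size=2240)  # stand-in for a postsmoothed version

for label, s in [("unsmoothed", unsmoothed), ("smoothed", smoothed)]:
    m, sd, sk, ku = four_moments(s)
    print(f"{label:>10}: mean={m:.4f} sd={sd:.4f} skew={sk:.4f} kurtosis={ku:.4f}")
```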
Results of Tucker Linear Method

For each of the four sampled tests, the Tucker linear method found an equating equation that transformed scores on one form to a set of new scores comparable to the scores on the other form. Important intermediate outcomes of the Tucker method and the resulting equating equations are summarized in Table 16. As reviewed in Chapter 3, the four Tucker equating equations presented in Table 16 were derived by defining a synthetic population, assuming equal conditional variances and the same linear regression functions for the two populations, and estimating the means and variances for the synthetic population (Kolen & Brennan, 1987; Kolen & Brennan, 1995). Using these resulting Tucker equations, equivalent scores were established for the two forms of each sampled test.

[Table 16 - Summary of the Results of the Tucker Linear Equating Method: intermediate statistics (synthetic-population weights, means, standard deviations, and regression coefficients of total score on anchor score for each population) and the resulting equating equations for the four sampled tests.]
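For reference, the Tucker synthetic-population computations can be sketched as follows. This follows the formulas in Kolen and Brennan (1995) for the common-item nonequivalent-groups design; the function name, the NumPy-based implementation, and the default equal synthetic weights are illustrative assumptions, not the exact setup behind Table 16.

```python
import numpy as np

def tucker_linear(x, v1, y, v2, w1=0.5):
    """Tucker linear equating of Form X scores to the Form Y scale.

    x, v1: total and common-item (anchor) scores for the group taking Form X.
    y, v2: total and common-item (anchor) scores for the group taking Form Y.
    w1:    weight of the Form X population in the synthetic population.
    Returns (slope, intercept) of the equating line l(x) = slope * x + intercept.
    """
    x, v1, y, v2 = (np.asarray(a, dtype=float) for a in (x, v1, y, v2))
    w2 = 1.0 - w1
    # Regression slopes of total score on anchor score within each group.
    g1 = np.cov(x, v1, ddof=1)[0, 1] / np.var(v1, ddof=1)
    g2 = np.cov(y, v2, ddof=1)[0, 1] / np.var(v2, ddof=1)
    d_mu = v1.mean() - v2.mean()
    d_var = np.var(v1, ddof=1) - np.var(v2, ddof=1)
    # Synthetic-population means and variances of X and Y.
    mu_sx = x.mean() - w2 * g1 * d_mu
    mu_sy = y.mean() + w1 * g2 * d_mu
    var_sx = np.var(x, ddof=1) - w2 * g1 ** 2 * d_var + w1 * w2 * (g1 * d_mu) ** 2
    var_sy = np.var(y, ddof=1) + w1 * g2 ** 2 * d_var + w1 * w2 * (g2 * d_mu) ** 2
    slope = np.sqrt(var_sy / var_sx)
    return slope, mu_sy - slope * mu_sx
```

Applying the returned slope and intercept to each Form X raw score yields equivalent scores on the Form Y scale; varying w1 shows how sensitive the line is to the choice of synthetic population.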
Similarities Among Outcomes of Various Equating Methods

The equating results of the Tucker method and the other equating methods are compared in this section. The positive and strong relationships among these results (see the underlined correlation coefficients in Table 17) indicate similarities among the various equating outcomes: individual examinees were ordered in a similar way, regardless of the equating method used. Comparisons between the outcomes of the two IRT-based methods have been discussed previously, and considerable similarities were found.

Using the same strategies -- Pearson's r, the dependent-samples t-test, and scatter plots -- the results of the Tucker method and the frequency-estimation equipercentile method are compared. The large Pearson's rs between the outcomes of these two methods in Table 17 suggest that the methods yielded almost identical rank orders for individual examinees. For each of the four sampled tests, the correlation is nearly perfect (r > .999). The scatter plots in Figure 7 further confirm the similarities. Except for the plot for sampled test SR, each of the scatter plots in Figure 7 clearly shows a single narrow straight line, indicating great resemblance of the Tucker equating outcomes to the outcomes of the equipercentile method. The plot for sampled test SR shows that the outcomes of the Tucker and equipercentile methods were similar when the resulting equivalent scores were in the middle range or at the high end of the score scale, but differed slightly at the low end (between scores 20 and 30). That is, the equating outcomes of the two methods were very similar for examinees with medium or higher scores and differed only slightly for examinees with very low scores.

[Table 17 - Relationships Among Various Equating Outcomes for Different Sampled Tests: Pearson correlations among the equivalent scores produced by the Tucker, equipercentile, and two IRT-based methods for each sampled test. All of the correlation coefficients are significant at α=.01.]

[Figure 7 - Relationship Between Equating Outcomes of the Tucker Method and the Frequency-Estimation Equipercentile Method: scatter plots for sampled tests EW, PW, PS, and SR, each with r > .999. Note. (1) e is the resulting equivalent score; (2) ** significance level less than .01.]

The relationships between the IRT-based equating outcomes and the non-IRT equating outcomes were not as strong as the relationship between the outcomes of the two IRT-based methods. They were also less strong than the relationship between the outcomes of the Tucker and equipercentile methods. The Pearson's rs between the IRT-based and the non-IRT equating outcomes ranged from .944 to .973 (see the underlined correlation coefficients in Table 17). This finding reflects the logical differences between the IRT-based equating approach and the conventional equating approach.

Evaluation of Equating Accuracy

The accuracy of equating outcomes was evaluated using four different criteria. An index of equating accuracy was computed by correlating the resulting equivalent scores from different methods with each of these criterion scores: (a) total raw scores on the 145 anchor items (Raw-145), (b) IRT-estimated true scores on the 145 anchor items (IRT-145), (c) resulting equivalent scores of the frequency-estimation equipercentile method on equating the two original test forms (FE-long), and (d) resulting equivalent scores of the equipercentile method on equating the sampled test forms (FE-short).

The last two criteria, FE-long and FE-short, were arbitrary criteria for evaluating equating accuracy; however, FE-long was expected to be more reliable. FE-short was used to facilitate an examination of the evaluation bias caused by using an arbitrary criterion. By correlating the outcomes of the other three equating methods with the outcomes of the equipercentile method on the sampled tests, this study examined the estimation bias due to the arbitrary nature of FE-short.

Preview of Major Findings

Before presenting the detailed findings from the studies of equating accuracy, this section first previews selected important results to highlight the major findings.
In brief, using Raw-145 and IRT-145, this study found that the IRT-based equating outcomes were more accurate than the outcomes of the linear and equipercentile methods. Although the differences between the estimated equating accuracy of the IRT-based methods and that of the non-IRT-based methods were small, significance tests concluded that they were statistically significant at α=.05. However, the statistically significant but small improvement of the IRT-based methods in equating accuracy might not have practical significance.

Among the various sampled tests, equating results for sampled test PS were often the most accurate, regardless of the equating method used. The twofold implications of this finding, for improving equating accuracy in common-item equating practice, are:

• It is important to include anchor items that are representative of the total test in content.

• It is also useful to construct test forms with items that are more homogeneous in their content, or to limit the content coverage of test forms to a small number of topics.

In addition, the findings from FE-long and FE-short confirmed that the use of an arbitrary criterion would lead to erroneous assessments of equating accuracy, as concluded in the literature (Dorans & Kingston, 1985; Harris & Crouse, 1993).

Table 18 summarizes the estimates of equating accuracy for the outcomes from the various equating methods on the different sampled tests, using three different evaluation criteria -- Raw-145, IRT-145, and FE-long. The evaluation results of equating accuracy using FE-short are included in Table 17 (see the bordered Pearson's correlation coefficients). Details of the analysis outcomes on equating accuracy are presented below. First, the estimation of equating accuracy using Raw-145 is discussed. The results based on IRT-145 follow. Then, results from FE-long are examined, followed by an inspection of the results from FE-short.

Table 18 - Accuracy of Equating Outcomes from Various Equating Methods on Different Sampled Tests

                                               Criterion for Evaluating Equating Accuracy
Equating Method                 Sampled Test   Raw-145   IRT-145   FE-long
Tucker Linear Method            SR             0.832     0.819     0.884
                                EW             0.859     0.829     0.880
                                PW             0.860     0.839     0.883
                                PS             0.892     0.898     0.903
Equipercentile Equating Method  SR             0.832     0.819     0.884
                                EW             0.858     0.829     0.880
                                PW             0.859     0.838     0.882
                                PS             0.892     0.898     0.903
IRT-Based Linear                SR             0.856     0.860     0.897
Transformation Method           EW             0.877     0.870     0.893
                                PW             0.845     0.839     0.864
                                PS             0.894     0.917     0.897
IRT-Based Fixed-b Method        SR             0.854     0.858     0.896
                                EW             0.873     0.865     0.889
                                PW             0.870     0.867     0.888
                                PS             0.895     0.916     0.898

Note. Raw-145 and IRT-145 are "pseudo true score" criteria. All of the indices of equating accuracy (the Pearson's rs between the criterion scores and the resulting equivalent scores of an equating method) are significant at α=.01.

Evaluation Using the Raw-145 Criterion

The total raw scores on all of the 145 common anchor items (Raw-145) in the original item pool were treated as "pseudo true scores" of individual examinees. Therefore, Raw-145 could be used as one type of criterion for evaluating equating accuracy. Specifically, to study the equating accuracy of the IRT-based methods, the equivalent true scores estimated by the two IRT-based equating methods were correlated with Raw-145. Raw-145 was also correlated with the equivalent scores resulting from the Tucker method and the equipercentile method to estimate the accuracy of those equating outcomes. The resulting Pearson's rs indicated the degrees of accuracy for the equating outcomes of these four methods.
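For reference, an accuracy index of this kind reduces to a single correlation per method. A minimal sketch, assuming a dictionary of equivalent-score arrays keyed by method name (the names and data structure are hypothetical):

```python
import numpy as np
from scipy import stats

def accuracy_indices(criterion, equated_by_method):
    """Pearson's r between criterion scores (e.g., Raw-145) and each method's equivalent scores."""
    criterion = np.asarray(criterion, dtype=float)
    return {method: stats.pearsonr(criterion, np.asarray(scores, dtype=float))[0]
            for method, scores in equated_by_method.items()}
```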
These evaluation outcomes of equating accuracy are summarized in the first numeric column of Table 18.

Equating accuracy of various equating methods. Using Raw-145 as a criterion, the indices of equating accuracy ranged from .832 to .895 across sampled tests and equating methods. Overall, all four equating methods yielded moderately accurate results for the four sampled tests. For each sampled test, the accuracy of the equating outcomes from the four methods differed slightly. The outcomes of the IRT-based methods were consistently more accurate than those of the non-IRT methods, regardless of the sampled test forms being equated. The only exception occurred on sampled test PW, where the Pearson's rs of the Tucker method (r=.860) and the equipercentile method (r=.859) were slightly larger than that of the IRT-based linear transformation method (r=.845). Therefore, these differences were tested for statistical significance.

Suppose the equating accuracy of two equating methods, Y and Z, is compared, and let X be the criterion Raw-145. The significance test statistic appropriate for the dependent samples in this study is

t = \frac{(r_{xy} - r_{xz})\sqrt{(n-3)(1 + r_{yz})}}{\sqrt{2\left(1 - r_{xy}^{2} - r_{xz}^{2} - r_{yz}^{2} + 2\,r_{xy}\,r_{xz}\,r_{yz}\right)}} ,    (9.2)

where x is the Raw-145 criterion score (the total raw score on all of the 145 common anchor items), y is the resulting equivalent score of method Y, z is the resulting equivalent score of method Z, and n is the sample size (Hinkle, Wiersma, & Jurs, 1979). In essence, the statistic r_xy represents the estimated equating accuracy of method Y, and r_xz represents the estimated equating accuracy of method Z. The statistic r_yz estimates the relationship between the equating outcomes of methods Y and Z. The underlying distribution of this test statistic is the Student's t-distribution with n - 3 degrees of freedom. The critical values of the test statistic for this study are ±1.962, because all the tests are non-directional, the level of significance is set at α=.05, and there are 1,146 degrees of freedom.

For sampled tests SR and EW, where the IRT-based methods had slightly larger indices of equating accuracy than the non-IRT methods, the significance tests found all those differences significant (|t| > 1.962). For sampled test PS, although the IRT-based methods also had slightly larger indices of equating accuracy than the non-IRT methods, the significance tests found no significant differences because the differences were so small. For sampled test PW, whether the IRT-based methods had larger or smaller indices of equating accuracy than the non-IRT methods, none of the differences was significant.

Across sampled tests, the Tucker method and the equipercentile method had almost identical indices of equating accuracy; statistically, none of the differences between the two methods was significant. This suggests that the two methods produced equally accurate outcomes when Raw-145 was used as the criterion, a finding that coincides with the similarities previously found between the two methods (as shown in Figure 7).

In summary, there is a clear pattern across sampled tests in Table 18 showing that the IRT-based equating outcomes were more accurate than the outcomes of the Tucker linear and equipercentile methods. Although the improvements of the IRT-based equating methods were not large, they were statistically significant.
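A sketch of Equation 9.2 in code form, with the two-tailed critical value computed from the t-distribution (the function and argument names are illustrative):

```python
import numpy as np
from scipy import stats

def dependent_r_test(r_xy, r_xz, r_yz, n, alpha=0.05):
    """t test for the difference between two dependent correlations (Equation 9.2).

    r_xy, r_xz: accuracy indices of methods Y and Z against the criterion X.
    r_yz:       correlation between the two methods' equivalent scores.
    n:          sample size; the statistic has n - 3 degrees of freedom.
    """
    numerator = (r_xy - r_xz) * np.sqrt((n - 3) * (1 + r_yz))
    denominator = np.sqrt(2 * (1 - r_xy**2 - r_xz**2 - r_yz**2 + 2 * r_xy * r_xz * r_yz))
    t = numerator / denominator
    critical = stats.t.ppf(1 - alpha / 2, df=n - 3)  # two-tailed critical value
    return t, critical
```

With 1,146 degrees of freedom and α=.05, the critical value returned is approximately 1.962, matching the value cited above.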
While such small improvements may not have practical significance for equating on some occasions, they can be very valuable on others, especially when there is a strong demand for precise equated scores, as in high-stakes certification examinations. Information on the degree of equating accuracy will help in deciding which equating method to use for a particular testing program in a particular equating context.

Equating accuracy for various sampled tests. Using the criterion Raw-145, this study also found that both the conventional and the IRT-based methods worked best on equating PS-A and PS-B. The average index of equating accuracy was .893. All methods but the IRT-based linear transformation method yielded the least satisfactory results on equating SR-A and SR-B, and the average accuracy was .839. The results of statistical significance tests indicated that, regardless of the equating method used, the equating outcomes for sampled test PS were significantly more accurate than the outcomes for SR. In addition, although the IRT-based linear transformation method had its least accurate result on equating PW-A and PW-B (r=.845), this outcome was not significantly different from the outcome of the same method on equating SR-A and SR-B. Overall, all four equating methods produced their least accurate outcomes for SR.

For sampled test PW, the average equating accuracy of the four methods was .858, and it was .867 for EW. The accuracy of the equating outcome from any of the methods, except the IRT-based linear transformation method, on equating PW-A and PW-B was not significantly different from the outcome of the same method on equating EW-A and EW-B. The linear transformation method yielded a slightly more accurate outcome for EW than for PW. In summary, regardless of the equating method used, the overall equating outcomes for EW and PW were equally accurate; they were less accurate than the outcomes for PS but more accurate than the outcomes for SR.

Method-test interaction. Given the above findings and conclusions about the equating accuracy for the various sampled tests and of the various equating methods, it is plausible that overall there was no method-test interaction in estimating the accuracy of equating. That is, the relative equating accuracy of the different equating methods did not depend on the particular test forms being equated, and the equating accuracy for different sampled tests was independent of the particular equating method used. As mentioned earlier, regardless of the equating method used, the equating results for sampled test PS were always the most accurate and the results for SR were always the least accurate. The twofold interpretations of these findings are:

• By means of the item-sampling designs, PS-A and PS-B had the items that were the most homogeneous in content, and SR-A and SR-B had the items that were the least homogeneous in content. The findings therefore suggest that an equating outcome based on a set of more content-homogeneous items is likely to be more accurate than one based on a set of less content-homogeneous items.

• Also because of the item sampling, PS-A and PS-B had the anchor items that were the most representative of their total tests in content, and SR-A and SR-B had the anchor items that were the least representative of their total tests.
The above findings suggest that equating outcomes on a test containing content-representative anchor items are likely to be more accurate than equating outcomes on a test containing less content-representative anchor items.

In short, equating accuracy depends on the content homogeneity of the test items in the forms being equated. Equating accuracy also varies with the content representativeness of the anchor items embedded in the forms being equated. To improve equating accuracy, a testing program can use test forms composed of content-homogeneous items, where that is realistic. For common-item equating, it is also important to include anchor items that adequately mirror the total test in content.

Artifact of auto-correlation. The Pearson's rs used to represent the degrees of equating accuracy were contaminated by an artifact of auto-correlation. The auto-correlation was caused by the fact that the sampled tests overlapped with the criteria for evaluating equating accuracy on some items. In the cases where Raw-145 or IRT-145 was used as a criterion, all of the anchor items embedded in the various sampled tests overlapped with part of the criterion, because these sampled anchor items were subsets of the "anchor universe" containing all 145 anchor items. Because of this overlap of items, the resulting Pearson's rs were inflated and the resulting indices of equating accuracy were overestimated.

To study the impact of auto-correlation on the estimation of equating accuracy, the degrees of equating accuracy were re-estimated by excluding the anchor items from the sampled tests and then correlating the resulting IRT-estimated true scores on these reduced sampled tests with the scores of Raw-145. The correlation outcomes are summarized in APPENDIX C. This strategy for controlling the artifact of auto-correlation, however, was not applicable in the cases where the Tucker linear method or the frequency-estimation equipercentile method was used for equating.

In the case of the IRT-based equating outcomes, it is feasible and convenient to drop the anchor items and obtain a set of new total scores, because the IRT item parameters were estimated on an individual item basis. This advantage of IRT item calibration makes test revision (adding or dropping items) easy and convenient. The two non-IRT equating methods, which are based on observed test scores, do not have the same flexibility: their resulting equivalent test scores can only be interpreted as a whole, that is, when the test forms being equated are kept intact. In addition, the re-equating that would be necessary after dropping the anchor items is time-consuming and laborious. Although it is technically possible to obtain difference scores between the resulting equivalent total scores and the subtotal scores based on only the common anchor items, such difference scores are neither practical nor logically sound. Therefore, the strategy for controlling the artifact of auto-correlation was not used with the Tucker linear method and the frequency-estimation equipercentile method.

The magnitudes of the previously inflated indices of equating accuracy attenuated only slightly (less than .01) after controlling for the auto-correlation (see the bordered and bolded Pearson's rs, before and after the adjustment for auto-correlation, in APPENDIX C). The slight attenuation was not statistically significant. In addition, for the various equating methods and sampled tests, the rank-order patterns of these indices remained the same as before the adjustment for auto-correlation.
These similarities between the indices computed before and after the adjustment for auto-correlation suggest that the impact of the auto-correlation on the estimation of equating accuracy was not serious or substantial. The unadjusted indices of equating accuracy remained valid, and the findings, discussions, and conclusions based on these unadjusted indices were retained. It should be noted that the strategy used for controlling the artifact of auto-correlation did not completely eliminate its influence: part of the artifact originated in the IRT parameter estimation process and was therefore not easy to control. Nevertheless, despite this difficulty, the strategy provides a useful alternative for improving studies of the effectiveness of various equating methods and the effect of content representativeness.

Reliability and validity evidence for anchor items. Validity and reliability evidence for the anchor tests embedded in the sampled tests provides a sound basis for validating the findings from the equating studies and for reaching plausible conclusions. By excluding the non-anchor items from the sampled tests, and then correlating the resulting IRT-estimated true scores on these reduced sampled tests (containing the sampled anchor items only) with the scores of Raw-145, this study examined the reliability and validity of the anchor items embedded in the various sampled tests. The investigation outcomes are summarized in APPENDIX D.

Raw-145, a type of "pseudo true score", was regarded as a similar but more reliable measure than the score on the anchor test. This is because the anchor items included in the four sampled tests were all subsets of the "anchor universe" from which the "pseudo true score" was computed. The "anchor universe" contained all 145 anchor items available for this study, substantially more than the number of items in the sampled anchor tests. Therefore, the correlation coefficient between Raw-145 and the score on the anchor test could provide concurrent validity evidence for the anchor test. Moreover, from the perspective that a correlation coefficient was computed between an observed score (the score on the anchor test) and its true score (the "pseudo true score"), the Pearson's r also represented the reliability of the anchor test. The statistically significant Pearson's rs in APPENDIX D provide strong evidence of validity and reliability for the anchor tests in the four sampled tests. The average validity (reliability) was .895 for the anchor test embedded in PS; it was .875, .859, and .856 for the anchor tests of EW, PW, and SR, respectively.

Evaluation Using the IRT-145 Criterion

The second criterion for evaluating equating accuracy, IRT-145, was based on the same 145 common anchor items as the first criterion: it was the IRT-estimated true score on the 145 anchor items. Unlike Raw-145, IRT-145 was not susceptible to the drawback of being person-dependent and item-dependent. Using IRT-145 as a criterion, Pearson's rs were computed to measure the degree of accuracy as before. A summary of the evaluation outcomes of equating accuracy using IRT-145, for the various equating methods and sampled tests, is included in Table 18 (see the second numeric column). As shown in the table, these outcomes were very similar to the evaluation outcomes resulting from Raw-145. More details of the evaluation outcomes from IRT-145 are presented below, and similarities and differences between the outcomes from IRT-145 and Raw-145 are discussed.
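For reference, an IRT-estimated true score of the kind used for IRT-145 is the sum of the item response functions evaluated at an examinee's ability estimate. A minimal sketch under the 3PL model (the array names and the 1.7 scaling constant are assumptions made for illustration):

```python
import numpy as np

def irt_true_score(theta, a, b, c, D=1.7):
    """Expected number-correct (true) score on a set of items under the 3PL model.

    theta:   ability estimate(s); scalar or array of examinees.
    a, b, c: arrays of item discrimination, difficulty, and pseudo-guessing parameters.
    D:       scaling constant; 1.7 approximates the normal-ogive metric.
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))[:, None]   # shape (examinees, 1)
    p = c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))         # P(correct) for each item
    return p.sum(axis=1)                                             # sum over the items
```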
Equating accuracy of various equating methods. Using the IRT-based criterion, the estimated equating accuracy ranged from .819 to .917 across equating methods and sampled tests. As with Raw-145, IRT-145 also indicated that all four methods yielded moderately accurate results, and that the two IRT-based equating methods yielded significantly more accurate results than the Tucker linear method and the equipercentile method. The only exception is that the accuracy of the outcomes of the IRT-based linear transformation method on equating PW-A and PW-B was no different from that of the Tucker and equipercentile methods. This exception, however, did not affect the overall conclusion that the IRT-based methods had more accurate outcomes than the non-IRT methods.

Potential bias of the IRT-based criterion. As mentioned earlier, one concern about using an IRT-based criterion is that such a criterion may overestimate the equating accuracy of the IRT-based equating methods. In this study, however, IRT-145 did not seem to favor the IRT-based equating methods: overall, IRT-145 did not systematically produce larger indices of equating accuracy for the IRT-based methods than for the non-IRT methods. Contrasting the indices of equating accuracy yielded by Raw-145 and IRT-145 in Table 18, this study found that only half of the time did IRT-145 yield slightly larger indices than Raw-145 for the IRT-based equating outcomes; the other half of the time, Raw-145 yielded slightly larger indices than IRT-145 for the outcomes from the IRT-based methods. In addition, most of these small differences between IRT-145 and Raw-145 were not statistically significant. Thus, IRT-145 was not biased toward overestimating the equating accuracy of the IRT-based methods.

Using IRT-145, this study found that all of the equating methods worked best on equating sampled test forms PS-A and PS-B. For sampled test PS, the average index of equating accuracy across the equating methods was .907 (see Table 18). On equating SR-A and SR-B, however, all but the IRT-based linear transformation method produced their least accurate results; the average index of equating accuracy over the methods was .839 for SR. Combining the results of the statistical significance tests for the differences between the equating accuracy for sampled tests PS and SR, it was found that, regardless of the equating method used, the equating outcomes for PS were significantly more accurate than the outcomes for SR. Although the IRT-based linear transformation method yielded its least accurate result when equating PW-A and PW-B, the index of equating accuracy obtained for SR (.860) was not significantly larger than the index for PW (r=.839).

For sampled test PW, the average equating accuracy of the four methods was .846, and it was .848 for EW. The accuracy of the equating outcome from any of the methods, except the IRT-based linear transformation method, on equating PW-A and PW-B was not significantly different from the outcome of the same method on equating EW-A and EW-B. The linear transformation method yielded a slightly more accurate outcome for EW than for PW. In summary, regardless of the equating method used, the equating outcomes for EW and PW were equally accurate most of the time; they were less accurate than the outcomes for PS but more accurate than the outcomes for SR. Combined with all the other findings, this finding led to the same conclusion as Raw-145 -- there was no method-test interaction in estimating the accuracy of equating.
Overall, using IRT-145 as a criterion for evaluating equating accuracy led to findings that are very similar to those yielded by Raw-145, and the conclusions reached by these two true-score-based criteria are fully consistent.

Content effects. Given the above findings and conclusions about the equating accuracy of the various methods and for the different sampled tests, IRT-145 also indicated that equating accuracy depended on the content homogeneity of a set of test items and on the content representativeness of the anchor items. The finding suggests that when an anchor test is more content-representative of its total test, the equating result for that test is likely to be more accurate, regardless of the equating method used.

Artifact of auto-correlation. Like Raw-145, the estimates of equating accuracy yielded by IRT-145 are susceptible to the artifact of auto-correlation, due to the overlap of items between the sampled tests and the criterion tests. Therefore, the same strategy used to control the impact of the auto-correlation for the equating results based on Raw-145 was applied here. The resulting Pearson's rs, adjusted for the artifact of auto-correlation, are included in APPENDIX C. Compared with their corresponding unadjusted Pearson's rs, they show only a trivial amount of attenuation in magnitude (less than .01). In addition, the rank-order patterns of these indices of equating accuracy are essentially the same before and after the adjustment. These findings suggest that the impact of the auto-correlation on the estimation of equating accuracy was not serious, and the conclusions based on the unadjusted indices of equating accuracy should remain valid.

Reliability and validity evidence for anchor items. Regarding IRT-145 as a "pseudo true score", the concurrent validity and reliability of the anchor tests of the various sampled tests were estimated in a way similar to the validity and reliability studies described earlier, where Raw-145 was regarded as the "pseudo true score". Using IRT-145, the evidence of validity and reliability for the anchor tests was found to be satisfactory. These assessment outcomes are recorded in APPENDIX D, together with the outcomes from Raw-145. The large positive Pearson's rs between the "pseudo true scores" and the scores on the anchor tests, shown in APPENDIX D, provide evidence of reliability and validity for each of the four sampled tests. These measures of validity (reliability) range from .839 to .917 and are all statistically significant at α=.01. On average, the validity (reliability) of the anchor test embedded in sampled test PS is .917; the average validity (reliability) measures are .868, .854, and .860 for the anchor tests of EW, PW, and SR, respectively. This evidence of reliability and validity improves the chance that the research outcomes of this study are valid.

Evaluation Using the FE-long Criterion

The two original test forms -- Book A and Book B -- had many more items than the sampled test forms. These two forms were equated by the frequency-estimation equipercentile method, and the equating outcome was regarded as a more reliable criterion (FE-long) for evaluating equating accuracy because it was based on a test similar to, but longer than, the sampled tests. Using FE-long, the accuracy of the various equating methods on equating the two forms of each sampled test was evaluated.
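As background for the FE-long criterion, the core equipercentile idea is to match percentile ranks across forms. The sketch below shows only that basic mapping for two observed score distributions; it is not the frequency-estimation variant (which conditions on the anchor score to build synthetic-population distributions) or the postsmoothing actually used in this study.

```python
import numpy as np

def equipercentile_map(x_scores, y_scores, x_points):
    """Map Form X score points to the Form Y scale by matching percentile ranks.

    x_scores, y_scores: observed scores on Form X and Form Y.
    x_points:           Form X score points to convert.
    """
    x_sorted = np.sort(np.asarray(x_scores, dtype=float))
    y_sorted = np.sort(np.asarray(y_scores, dtype=float))
    # Percentile rank of each Form X point in the Form X distribution.
    ranks = np.searchsorted(x_sorted, np.asarray(x_points, dtype=float),
                            side="right") / x_sorted.size
    # Form Y score at the same percentile rank (empirical quantile).
    return np.quantile(y_sorted, np.clip(ranks, 0.0, 1.0))
```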
As before, the Pearson's r between the criterion score of FE-long and the resulting equivalent score on the sampled test was computed to represent the degree of equating accuracy of a particular equating method on a particular sampled test. The evaluation outcomes of equating accuracy are summarized in Table 18 (see the third numeric column).

Equating accuracy of various equating methods. Using FE-long, all the outcomes of the four equating methods were again estimated to be moderately accurate. The equating accuracy of the various methods for the different sampled tests ranged from .864 to .903. Across the sampled tests, the average index of equating accuracy for the IRT fixed-b method was .893 (see Table 18), very close to the index for the IRT-based linear transformation method (r=.888). The Tucker linear method and the equipercentile equating method had the same degree of accuracy on average (r=.887). However, compared to Raw-145 and IRT-145, FE-long often resulted in larger indices of equating accuracy. The larger Pearson's rs are in part attributable to a worsening artifact of auto-correlation: the impact of the auto-correlation was more serious here than when the other criteria were used, because there was much more overlap between FE-long and the sampled tests. The sampled tests were all subsets of FE-long.

Moreover, using FE-long, the outcomes of the IRT-based equating methods were not always estimated to be more accurate than the outcomes from the non-IRT methods. This is somewhat different from the findings based on Raw-145 and IRT-145. Even when the outcomes of the IRT-based equating methods appeared to be more accurate, their improvements over the other methods were often not significant, or not as large as when the other criteria were used to estimate equating accuracy. The artifact of auto-correlation may explain such differences between the results from FE-long and the results from the previous two criteria: the worsening auto-correlation associated with FE-long could produce similar indices of equating accuracy, in which case the dependent-samples t-test could not detect the real differences. This explanation is supported by the fact that the indices of equating accuracy resulting from FE-long (shown in Table 18) had the narrowest range (.039) and the smallest standard deviation (.01). For Raw-145, the range was .063 and the standard deviation was .021; the evaluation outcomes of IRT-145 had the widest range (.098) and the largest standard deviation (.033). Ideally, FE-long should be a more reliable criterion and thus provide an alternative way to study equating accuracy. However, this advantage of FE-long was undermined by its inherent problem of auto-correlation and by its nature as an arbitrarily selected criterion.

Equating accuracy for various sampled tests. The evaluation results from FE-long show that, regardless of the equating method used, the equating results for sampled test PS were the most accurate among the results for all four sampled tests. On equating PS-A and PS-B, there were no statistically significant differences among the equating accuracy of the four methods; the average equating accuracy across the methods was r=.900. Across the methods, the average accuracy for SR was r=.890. However, on equating SR-A and SR-B, the outcomes of the IRT-based methods were slightly but significantly more accurate than the outcomes of the non-IRT methods. Across the four sampled tests, the two IRT-based methods had their least accurate outcomes on equating PW-A and PW-B.
For PW, while the outcome of the IRT-based linear transformation method was slightly (but significantly) less accurate than the outcomes of the non-IRT methods, significance test results also show that the outcome of the IRT-based fixed-b method was no different from the outcomes of the Tucker and equipercentile methods. Across the sampled tests, the non-IRT methods had their least accurate outcomes on equating EW-A and EW-B. For EW, the outcomes of the non-IRT methods were slightly (but significantly) less accurate than that of the IRT-based linear transformation method; however, significance test results also show that the outcomes of the Tucker and equipercentile methods were no different from the outcome of the IRT-based fixed-b method.

In summary, many of the findings based on the criterion FE-long are not consistent with the findings based on the first two true-score-based criteria. The improvement of the IRT-based equating methods over the non-IRT methods in equating accuracy is not clearly confirmed. In addition, the effect of content homogeneity of test items and the effect of content representativeness of the anchor test on the estimates of equating accuracy are not clear. Although the equating outcomes for PS always seemed more accurate than the outcomes for the other three sampled tests, regardless of the method used for equating its two forms, this advantage of PS was not always statistically significant. The patterns of the accuracy indices across the sampled tests were not clear, because the resulting indices of equating accuracy from FE-long had similar values. The inconsistencies between these findings and the previous findings are in part attributable to the more serious auto-correlation underlying the criterion FE-long. They can also be partly accounted for by the vulnerability of FE-long, which was an arbitrarily selected criterion for evaluating equating accuracy.

One implication of these conclusions is that the accuracy of an arbitrary criterion itself is important, and should receive special attention, in evaluating the effectiveness of other equating outcomes. It is common practice to use some arbitrary criterion for evaluating equating accuracy; however, the estimation of equating accuracy based on an arbitrary criterion is often biased because of the subjectivity of the particular criterion used. Therefore, it is one of the particular interests of this study to investigate the potential bias due to the arbitrary nature of a criterion for evaluating equating accuracy.

Evaluation Using the FE-short Criterion

The arbitrary criterion being studied was the outcome of the equipercentile method on equating the two forms of a sampled test (FE-short). Such an arbitrary criterion was established for each of the four sampled tests so that the relative equating accuracy of the other three equating methods (the Tucker method and the two IRT-based methods) could be estimated. In addition to the index of equating accuracy (Pearson's r), the root-mean-squared deviation (RMSD) statistic was used as a second measure of equating accuracy. The RMSD was appropriate for the estimation of equating accuracy in this case because the outcomes of the equating methods being evaluated and the criterion FE-short were all based on the same sampled test; as a result, the resulting equivalent scores from the three methods and the criterion score were on the same scale.

Accuracy measured by the Pearson's r. Using FE-short, the indices of equating accuracy were computed for the Tucker method and the two IRT-based methods.
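The two accuracy measures used with FE-short can be computed as below, assuming the criterion scores and a method's equivalent scores are on the same scale, as noted above (the variable names are illustrative):

```python
import numpy as np
from scipy import stats

def fe_short_accuracy(criterion, equated):
    """Pearson's r and root-mean-squared deviation of equated scores from a criterion equating."""
    criterion = np.asarray(criterion, dtype=float)
    equated = np.asarray(equated, dtype=float)
    r = stats.pearsonr(criterion, equated)[0]
    rmsd = float(np.sqrt(np.mean((equated - criterion) ** 2)))
    return r, rmsd
```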
The resulting indices are presented in Table 17 (see the bordered Pearson's rs). In summary, these indices differed somewhat from the indices resulting from the previous three criteria. While the previous indices suggest moderate equating accuracy of the various methods on the different sampled tests, the indices produced by FE-short suggest a much higher degree of equating accuracy. The indices based on FE-short ranged from .944 to 1.00, while the indices based on FE-long ranged from .864 to .903, those based on Raw-145 ranged from .832 to .895, and those based on IRT-145 ranged from .819 to .917. These findings suggest potential bias due to the use of an arbitrary criterion.

Averaged across the sampled tests, the mean accuracy of the outcomes from the IRT-based linear transformation method and the fixed-b method was .961 and .964, respectively. The similarity between the two IRT-based methods is consistent with the previous findings produced by the other criteria. Based on FE-short, a finding dramatically different from the previous findings is that the outcome of the Tucker linear method was significantly more accurate than the outcomes of the IRT-based methods. The indices of equating accuracy for the Tucker method over the various sampled tests were all close to 1, suggesting nearly perfect accuracy. This finding also suggests that FE-short was biased in evaluating equating accuracy. Arbitrarily selected as a criterion, FE-short overestimated the accuracy of the outcome from the Tucker method; relatively speaking, this non-IRT-based criterion may have underestimated the outcomes from the IRT-based equating methods. This conclusion is compatible with those reached in the literature about bias against IRT-based equating outcomes. Despite the drawbacks discussed above, FE-short still found the equating results on PS to be the most accurate among the equating results on all four sampled tests, regardless of the equating method used. This finding provides some evidence for the effect of content representativeness of the anchor test.

Accuracy measured by the RMSD statistic. The evaluation outcomes of equating accuracy measured by the RMSD statistic are summarized in Table 19. Overall, these outcomes agreed with the previous findings about equating accuracy measured by the Pearson's r, although there were still small discrepancies between the outcomes yielded by the RMSD and the Pearson's r. The RMSDs for the various equating methods on the different sampled tests in Table 19 indicate that the equating outcomes of the Tucker method were more accurate than the outcomes of the two IRT-based methods in every case. This is consistent with the conclusion reached by the Pearson's rs about the differences between the equating methods. Across the four sampled tests, the RMSD for the Tucker method ranged from .102 to .177, while the RMSDs for the IRT-based methods ranged from 1.871 to 2.314, indicating that the IRT-based equating outcomes deviated more from the criterion equating (FE-short) than the outcome of the Tucker method.

Table 19 - Root-Mean-Squared Differences for Evaluating Equating Accuracy

                                            Sampled Test
Equating Method                             PS      EW      PW      SR
Tucker Linear Method                        0.109   0.102   0.177   0.170
IRT-Based Linear Transformation Method      1.871   1.955   2.314   2.019
IRT-Based Fixed-b Method                    1.875   1.994   1.972   2.018

Different from the finding based on the Pearson's rs, the RMSDs suggest that only when the IRT-based equating methods were used were the equating results for sampled test PS more accurate than those for EW, PW, and SR.
The Tucker method yielded a more accurate result for EW than for PW, SR, and PS. In addition, when the IRT-based linear transformation method and the Tucker method were used, the equating results for PW were the least accurate among the results for all sampled tests, whereas the IRT-based fixed-b method yielded its least accurate result for SR.

Comparing the estimates resulting from the Pearson's r and the RMSD, it is clear that using different statistics to represent equating accuracy may lead to somewhat different estimates. Therefore, when assessing the accuracy of equating outcomes, it is important to know how well a particular index of equating accuracy serves its purpose. The nature of the statistics used to represent the degree of equating accuracy should be taken into account when interpreting the estimation results.

This study did not compute the RMSD statistic for all the equating outcomes and relied mainly on the Pearson's r to represent equating accuracy. The reason is that the resulting equivalent scores from different methods and the various criterion scores were not on the same scale most of the time; in such cases, the Pearson's r provides an efficient and direct way to study the accuracy of equating outcomes. Although it is possible to transform the resulting equivalent scores and put these scores and the criterion scores onto the same scale, this study decided not to apply such a transformation because (a) score transformation may introduce more errors, and (b) score transformation may complicate the interpretations and implications of the analysis outcomes. In addition, from a practical perspective, transformed scores usually require additional explanations and justifications. Therefore, this study chose to use the Pearson's r to keep the resulting equivalent scores precise, the estimation outcomes straightforward, and the interpretations of the analysis results direct. Taking into account the limitations of the Pearson's r, such as the issue of auto-correlation, this study presents and discusses the outcomes of equating accuracy with caution.

The use of multiple criteria for evaluating equating accuracy in this study proved to be very informative. The comparisons among the resulting evaluation outcomes of the various criteria provided an opportunity to study thoroughly the effectiveness of the various equating methods and the effect of content homogeneity on equating accuracy.

Reliability and appropriateness of Raw-145 and IRT-145. The criteria Raw-145 and IRT-145 were both computed using the 145 common anchor items. These anchor items show adequate internal consistency: Cronbach's α was .866 for the raw scores on the 145 items, and .869 when the item scores were standardized to have unit variances (n=2,241). The evidence of internal consistency suggests that the criteria Raw-145 and IRT-145 were reliable. Raw-145 and IRT-145 also correlated positively and strongly with each other (r=.982, significant at α=.01). They were appropriate for evaluating the equating accuracy of the four methods in this study because they were conceptually "pseudo true scores". Raw-145 and IRT-145 also complemented each other in improving the estimation of equating accuracy: on the one hand, Raw-145 did not overestimate the equating accuracy of the outcomes from the IRT-based methods, and instead provided conservative estimates of equating accuracy; on the other hand, IRT-145 was not susceptible to the problem of being person-dependent and item-dependent.
By incorporating both Raw-145 and IRT-145, the assessment of equating accuracy in this study was less prone to bias. Overall, these two criteria yielded very similar estimates of equating accuracy.

Drawbacks of FE-long and FE-short. Both of the other two criteria -- FE-long and FE-short -- produced evaluation outcomes that lacked interpretable patterns and were largely inconsistent with the outcomes from Raw-145 and IRT-145. This finding reflects the drawback of FE-long and FE-short as subjective and arbitrary criteria. FE-long was expected to be a more reliable criterion for evaluating the equating accuracy of the four methods on the sampled tests, but its assessment outcomes were influenced by serious auto-correlation and thus deviated considerably from those of Raw-145 and IRT-145. No better than FE-long, FE-short led to conclusions that were dramatically different from those of Raw-145 and IRT-145. Despite the inability of FE-long and FE-short to produce precise estimates of equating accuracy, one implication of the findings about these flawed criteria is that it is critical to take into account the estimation errors that accompany an arbitrary criterion.

In summary, the use of multiple criteria, and the comparison of their resulting assessment outcomes, guarded the estimation of equating accuracy from being biased by a single arbitrary criterion. The results of this strategy also cast valuable insight, for equating practice and future research, on selecting appropriate criteria for evaluating equating accuracy.

Construct Validity Issues

The test of the professional examination analyzed in this study was written to measure an examinee's professional ability (knowledge or skills). The professional ability of an examinee partly depends on the examinee's professional experience: in theory, the more years of professional experience an examinee has accumulated, the more likely the examinee is to score higher on the test. Such an effect of professional experience should exist for the two examinee groups taking different forms of a test after the test scores from the different forms are equated. Therefore, after the test forms were equated, the construct validity of the test could be investigated by comparing the resulting average equivalent scores of the two examinee groups.

Based on this scenario, this study investigated the construct validity of the professional in-training test. Specifically, using a set of equivalent scores on the original test, the effect of test form, the effect of years of experience, and the interaction between these effects on examinee performance were studied. Since equivalent scores for the original test had been obtained by the equipercentile equating method in the previous analyses of equating accuracy, for the sake of completeness and convenience this set of equivalent scores was used for the group comparisons. The group means of the equivalent scores, by test form and by years of experience, are summarized in Table 20. These group means are graphed in Figure 8 to facilitate inspection of the interaction effect of test form and years of experience. In summary, there is evidence of construct validity for the equated original test forms, and the equating outcomes were determined to be adequate.
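The form-by-experience analysis summarized in Table 20 is a standard 2 x 3 ANOVA. A sketch using statsmodels, assuming a data frame with columns score (equipercentile-equivalent score), form, and years; the column names and the use of Type II sums of squares are assumptions made for illustration:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def form_by_experience_anova(df: pd.DataFrame) -> pd.DataFrame:
    """Test the main effects of test form and years of experience and their interaction.

    A non-significant form-by-years interaction is what the construct-validity
    argument in the text requires.
    """
    model = smf.ols("score ~ C(form) * C(years)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)  # ANOVA table with F and p for each term
```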
If there were a test form by experience interaction, the test would lack construct validity, or the statistical adjustment made via the equipercentile equating was inadequate.

Table 20 - Average Equivalent Scores of Examinee Groups on the Original Test by Test Form and Years of Experience

Test Form   Years of Experience   Mean      Std. Dev.   n
Book A      1                     133.374   17.682      380
            2                     147.116   15.087      352
            3                     155.816   13.185      359
Book B      1                     137.826   16.366      409
            2                     150.435   15.262      367
            3                     157.761   13.148      361
Total                             146.740   17.682      2228

2x3 ANOVA result for the "form x years" interaction: F=1.27, p=.281, df(form x years)=2, df(error)=2,222, α=.05

Years of Experience   Mean      Std. Dev.   n
1                     135.682   17.147      789
2                     148.810   15.256      719
3                     156.792   13.193      720

Test Form   Mean      Std. Dev.   n
Book A      145.192   18.060      1091
Book B      148.226   17.188      1137

Note. (1) The equivalent scores were obtained by the equipercentile equating method. (2) The few cases missing information about years of experience were excluded from the analysis.

[Figure 8 - "Test Form" by "Years of Experience" Interaction Effect: mean equivalent scores by years of experience, plotted separately for Book A and Book B. Note. (1) The equivalent scores were obtained by the equipercentile equating method, using all the items from the original test. (2) The few cases missing information about years of experience were excluded from the analysis.]

As shown in Figure 8, there is no crossed interaction between test form and years of experience. The result of a significance test for the interaction effect further indicates that there was no statistically significant interaction at α=.05 (F=1.27, p=.281). Moreover, the means plot in Figure 8 and the group means presented in Table 20 show that the more experienced group always had a higher average score than the less experienced group(s), no matter which test form was taken. Multiple comparisons using the Tukey and Scheffé tests further indicated significant differences among the groups differing in years of experience (p=.000 for all of the possible comparisons). All of these findings suggest that the equated test forms had construct validity.

As shown in Table 20, regardless of their professional experience, the group taking Book B always scored higher on average than the group taking Book A. The significance test for the group difference further indicates that the two groups taking different test forms differed significantly (p=.000) in their test performance at a significance level of α=.05. This finding is not surprising, since it was found in previous chapters that the two examinee groups were slightly different in their abilities.

Issues of Test Dimensionality

Assuming unidimensionality, the equatings based on the 3PL IRT model were conducted and their outcomes were found satisfactory. However, because there were 23 core content areas nested within the single overall content domain underlying the sampled tests, whether the IRT assumption of unidimensionality held for the equatings in this study is not clear. If there were indeed more than one trait underlying the test being studied, then, regardless of the violation of the unidimensionality assumption, the satisfactory IRT-based equating outcomes of this study would indicate robustness of the 3PL IRT model. Nevertheless, in such a case, equatings based on multidimensional IRT models may be good alternatives.
Therefore, to better understand the nature of the test being studied and to probe its impact on the equating practice, this study explored the dimensionality issues with a manageable subset of the data. Specifically, confirmatory and exploratory factor analyses were used to investigate the dimensionality of a 45-item sub-test (all of the items were common anchor items). To avoid complicating the investigation with too many items or too many underlying factors (there were potentially 23 factors corresponding to the 23 core content areas), this study focused on this small sub-test. To include as many examinees as possible, the sub-test contained anchor items only. Responses of the entire examinee population (n=2,241) on these 45 anchor items were analyzed. In theory, there were three distinct factors underlying the 45-item sub-test, because all of the items were drawn from sampled test PS, and PS covered only three of the 23 core content areas. The outcomes of the factor analyses are summarized and discussed below.

Confirmatory Factor Analyses

Considering the content structure of the sub-test, the following models were appropriate for confirmatory factor analyses: (a) a model with three underlying factors, (b) a model with three first-order factors and one second-order factor, (c) a model with one overall factor only, and (d) a model with three single-factor sub-models, each dealing with items from the same core content area (14 of the 45 items were from one area, another 12 were from a second area, and the remaining items were from the third area).

For dichotomously scored items, the Pearson product-moment correlation based on normal scores is biased (inconsistent), and the standard errors of the parameter estimates yielded by the generalized least squares (GLS) method are not correct because of the wrong formula used (Jöreskog & Sörbom, 1993). Jöreskog and Sörbom (1993) recommended that a tetrachoric correlation be estimated for each pair of dichotomous items and that the resulting correlation matrix be analyzed by the generally weighted least squares (WLS) method, using LISREL. Therefore, following these recommendations and using LISREL, tetrachoric correlations were estimated and used for the factor analyses. The inverse of the estimated asymptotic covariance matrix of these tetrachoric correlation coefficients was used as the weight matrix for the WLS method.

Results from the chi-square tests of overall model-data fit suggest that none of the theory-driven confirmatory factor analysis models fit the data. That is, the content structures specified in the various factor models for the sub-test were significantly different from the content structure of the sub-test implied by the actual data. However, it should be noted that the chi-square test is very sensitive to large sample sizes. Given the large sample size in this study, the test statistic was very likely to have a large value, and as a result the test was more likely to show a significant difference between the theoretical model and the observed model.
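A tetrachoric correlation can be estimated by maximum likelihood from a pair of items' 2 x 2 response table. The sketch below is a minimal stand-in for the PRELIS estimation referred to above, with thresholds fixed at the values implied by the marginal proportions; the 0.5 cell adjustment and the function name are assumptions made for illustration.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

def tetrachoric_r(item1, item2):
    """Maximum likelihood estimate of the tetrachoric correlation between two 0/1 items."""
    item1, item2 = np.asarray(item1), np.asarray(item2)
    # 2 x 2 table of joint response frequencies (0.5 added to guard against empty cells).
    n00 = np.sum((item1 == 0) & (item2 == 0)) + 0.5
    n01 = np.sum((item1 == 0) & (item2 == 1)) + 0.5
    n10 = np.sum((item1 == 1) & (item2 == 0)) + 0.5
    n11 = np.sum((item1 == 1) & (item2 == 1)) + 0.5
    n = n00 + n01 + n10 + n11
    # Thresholds on the latent standard normal variables, from the marginal proportions.
    t1 = stats.norm.ppf((n00 + n01) / n)   # P(item1 = 0)
    t2 = stats.norm.ppf((n00 + n10) / n)   # P(item2 = 0)

    def neg_loglik(rho):
        cov = [[1.0, rho], [rho, 1.0]]
        p00 = stats.multivariate_normal.cdf([t1, t2], mean=[0.0, 0.0], cov=cov)
        p01 = stats.norm.cdf(t1) - p00
        p10 = stats.norm.cdf(t2) - p00
        p11 = 1.0 - p00 - p01 - p10
        probs = np.clip([p00, p01, p10, p11], 1e-12, None)
        return -np.dot([n00, n01, n10, n11], np.log(probs))

    result = minimize_scalar(neg_loglik, bounds=(-0.999, 0.999), method="bounded")
    return result.x
```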
Exploratory Factor Analyses

Exploratory factor analyses were conducted to further explore the dimensionality of the sub-test. The tetrachoric correlation matrix estimated previously by PRELIS was used as input data for the various exploratory factor analysis models. Using the SAS statistical package on Unix, the following factor analyses were conducted: principal component analysis, principal factor analysis, maximum likelihood factor analysis, and alpha factor analysis. In summary, these factor analyses suggest that there was more than one factor underlying the sub-test and that the dimensionality issues were complex.

Principal component analysis. The results of the principal component analysis suggest one dominant component underlying the 45 items, although sixteen components were retained under the eigenvalues-greater-than-one criterion. The first component had an eigenvalue of 6.902, which accounted for 15.3% of the standardized variance of the correlation matrix. Each of the other components, however, had an eigenvalue less than 1.55 and explained less than 3% of the variance. The first principal component appeared to be much more important than any of the other components, despite the fact that multiple components were required to provide an adequate summary of the data. In addition, all of the 45 items had positive loadings (ranging from .174 to .681) on the first principal component. The scree plot of eigenvalues in Figure 9 provides visual evidence of the single dominant factor.

Principal factor analysis. The principal factor analysis used the squared multiple correlations as the prior communality estimates. As a result, the total eigenvalue of the correlation matrix was reduced to 9.714 and the average eigenvalue was 0.216. By the default "proportion" criterion (SAS Institute Inc., 1989), eight factors were retained. The first factor (eigenvalue 6.173) explained 63.6% of the variance, while each of the other factors accounted for no more than 8%. The resulting pattern of principal factors was similar to the pattern of principal components, and the factor loadings of the items on the first factor were all positive. The resulting scree plot is presented in Figure 10.
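In outline, the principal component figures reported above come from an eigendecomposition of the 45 x 45 item correlation matrix. A minimal sketch, assuming R is that (tetrachoric) correlation matrix as a NumPy array:

```python
import numpy as np

def principal_components(R):
    """Eigenvalues, number of components with eigenvalue > 1, and first-component loadings.

    For a correlation matrix the eigenvalues sum to the number of items, so
    eigenvalue / n_items is the proportion of standardized variance explained.
    """
    eigvals, eigvecs = np.linalg.eigh(R)          # ascending order for symmetric matrices
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    n_retained = int(np.sum(eigvals > 1.0))       # eigenvalues-greater-than-one criterion
    proportion_first = eigvals[0] / R.shape[0]
    loadings_first = eigvecs[:, 0] * np.sqrt(eigvals[0])
    if loadings_first.sum() < 0:                  # eigenvector sign is arbitrary
        loadings_first = -loadings_first
    return eigvals, n_retained, proportion_first, loadings_first
```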
APPENDIX A

[Appendix A figures: frequency distributions of total raw scores for the sampled test forms, each annotated with its mean and standard deviation; the plots themselves are not recoverable from the scanned pages.]

APPENDIX B

[Appendix B figures: score-to-score equivalents by various degrees of smoothing for the sampled tests, plotting scores on one form against their equivalents on the other form under the identity, unsmoothed, and smoothed equatings with standard-error bands; the plots themselves are not recoverable from the scanned pages.]
APPENDIX C

[Appendix C table: estimated correlations among the equating-accuracy measures across criteria and equating methods; the table and its notes are not recoverable from the scanned pages.]

APPENDIX D

Reliability and Validity Evidence for the Anchor Tests of Four Sampled Tests

                                      "Pseudo True Score" Validity/Reliability (Pearson's r)
                                            Raw-145        IRT-145
IRT-Based Linear                EW           0.877          0.870
Transformation Method           PS           0.894          0.917
                                PW           0.846          0.839
                                SR           0.857          0.861
IRT-Based Fixed-b Method        EW           0.873          0.865
                                PS           0.895          0.916
                                PW           0.871          0.868
                                SR           0.855          0.859

Note. (1) The validity/reliability measure is the Pearson's r between the "pseudo true score" and the resulting IRT true score estimates on a sampled test containing anchor items only. (2) All of the Pearson correlation coefficients were significant at α = .01.

LIST OF REFERENCES

Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.

Baker, F. B. (1990). Some observations on the metric of PC-BILOG results. Applied Psychological Measurement, 14, 139-150.

Baker, F. B., & Al-Karni, A. (1991). A comparison of two procedures for computing IRT equating coefficients. Journal of Educational Measurement, 28, 147-162.

Berk, R. H. (1982). Discussion of item response theory. In P. Holland & D. B. Rubin (Eds.), Test equating. New York: Academic Press.

Berry, D. A., & Lindgren, B. W. (1990). Statistics: Theory and methods. Belmont, CA: Brooks/Cole.

Braun, H. I., & Holland, P. W. (1982). Observed-score test equating: A mathematical analysis of some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 9-49). New York: Academic Press.

Brennan, R. L., & Kolen, M. J. (1987). Some practical issues in equating. Applied Psychological Measurement, 11, 279-290.

Budescu, D. (1985). Efficiency of linear equating as a function of the length of the anchor test. Journal of Educational Measurement, 22, 13-20.

Camilli, G., Wang, M., & Fesq, J. (1995).
The effects of dimensionality on equating the Law School Admission Test. Journal of Educational Measurement, 32, 79-96.

Cook, L. L., & Eignor, D. R. (1983). Practical considerations regarding the use of item response theory to equate tests. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 175-195). Vancouver, British Columbia: Educational Research Institute of British Columbia.

Cook, L. L., & Eignor, D. R. (1991). An NCME instructional module on IRT equating methods. Educational Measurement: Issues and Practice, 10, 37-45.

Cook, L. L., & Petersen, N. S. (1987). Problems related to the use of conventional and item response theory equating methods in less than optimal circumstances. Applied Psychological Measurement, 11, 225-244.

Cook, L. L., Eignor, D. R., & Schmitt, A. P. (1988). (Research Rep. No. RR-88-52). Princeton, NJ: Educational Testing Service.

Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Chicago: Holt, Rinehart and Winston, Inc.

Dorans, N. J. (1990). Equating methods and sampling designs. Applied Measurement in Education, 3, 3-17.

Dorans, N. J., & Kingston, N. M. (1985). The effects of violations of unidimensionality on the estimation of item and ability parameters and on item response theory equating of the GRE verbal scale. Journal of Educational Measurement, 22, 249-262.

Efron, B. (1982). The jackknife, the bootstrap and other resampling plans. Philadelphia, PA: Society for Industrial and Applied Mathematics.

Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap (Monographs on Statistics and Applied Probability 57). New York: Chapman & Hall.

Eignor, D. R., Stocking, M. L., & Cook, L. L. (1990). Simulation results of effects on linear and curvilinear observed- and true-score equating procedures of matching on a fallible criterion. Applied Measurement in Education, 3, 37-52.

Green, D. R., Yen, W. M., & Burket, G. R. (1989). Experiences in the application of item response theory in test construction. Applied Measurement in Education, 2, 297-312.

Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.

Hambleton, R. K., & Cook, L. L. (1977). Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 14, 75-96.

Hambleton, R. K., & Swaminathan, H. (1990). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff Publishing.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Hanson, B. A. (1991). A comparison of bivariate smoothing methods in common-item equipercentile equating. Applied Psychological Measurement, 15, 391-408.

Hanson, B. A., Zeng, L., & Colton, D. (1994). A comparison of presmoothing and postsmoothing methods in equipercentile equating (ACT Research Report 94-4). Iowa City, IA: American College Testing.

Hanson, B. A., Zeng, L., & Kolen, M. J. (1995, October). (Available from Michael Kolen, ACT, 2255 N. Dubuque Street, Iowa City, IA 52243).

Harris, D. J., & Crouse, J. D. (1993). A study of criteria used in equating. Applied Measurement in Education, 6, 195-240.

Hills, J. R., Subhiyah, R. G., & Hirsch, T. M. (1988). Equating minimum-competency tests: Comparisons of methods. Journal of Educational Measurement, 25, 221-231.

Hinkle, D. E., Wiersma, W., & Jurs, S. G. (1979). Applied statistics for the behavioral sciences. Chicago: Rand McNally.

Holland, P. W., & Thayer, D. T. (1987). Notes on the use of log-linear models for fitting discrete probability distributions (Technical Report No. 87-79). Princeton, NJ: Educational Testing Service.

Holland, P. W., & Thayer, D. T. (1989). The kernel method of equating score distributions (Technical Report No. 89-84). Princeton, NJ: Educational Testing Service.
Jarjoura, D., & Kolen, M. J. (1985). Standard errors of equipercentile equating for the common item nonequivalent populations design. Journal of Educational Statistics, 10, 143-160.

Jöreskog, K. G., & Sörbom, D. (1989). LISREL 7: A guide to the program and applications. Chicago: Scientific Software International, Inc.

Jöreskog, K. G., & Sörbom, D. (1993). LISREL 8. Chicago: Scientific Software International, Inc.

Jöreskog, K. G., & Sörbom, D. (1995). PRELIS: A program for multivariate data screening and data summarization. Chicago: Scientific Software International, Inc.

Kendall, M., & Stuart, A. (1977). The advanced theory of statistics (4th ed., Vol. 1). New York: Macmillan.

Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with nonrandom groups. Journal of Educational Measurement, 22, 197-206.

Kolen, M. J. (1981). Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement, 18, 1-10.

Kolen, M. J. (1991). Smoothing methods for estimating test score distributions. Journal of Educational Measurement, 28, 257-282.

Kolen, M. J., & Brennan, R. L. (1987). Linear equating models for the common-item nonequivalent-populations design. Applied Psychological Measurement, 11, 263-277.

Kolen, M. J., & Brennan, R. L. (1995). Test equating: Methods and practices. New York: Springer-Verlag.

Kolen, M. J., & Harris, D. J. (1990). Comparison of item preequating and random groups equating using IRT and equipercentile methods. Journal of Educational Measurement, 27, 27-39.

Kolen, M. J., & Jarjoura, D. (1987). Analytic smoothing for equipercentile equating under the common item nonequivalent populations design. Psychometrika, 52, 43-59.

Lawrence, I. M., & Dorans, N. J. (1990). Effect on equating results of matching samples on an anchor test. Applied Measurement in Education, 3, 19-36.

Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1981). An investigation of item bias in a test of reading comprehension. Applied Psychological Measurement, 5, 159-173.

Livingston, S. A. (1993). Small-sample equating with log-linear smoothing. Journal of Educational Measurement, 30, 23-39.

Livingston, S. A., Dorans, N. J., & Wright, N. K. (1990). What combination of sampling and equating methods works best? Applied Measurement in Education, 3, 73-95.

Lord, F. M. (1965). A strong true score theory with applications. Psychometrika, 30, 239-270.

Lord, F. M. (1977). Practical applications of item characteristic curve theory. Journal of Educational Measurement, 14, 117-138.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.

Lord, F. M. (1982a). The standard error of equipercentile equating. Journal of Educational Statistics, 7, 165-192.

Lord, F. M. (1982b). Item response theory and equating: A technical summary. In P. Holland & D. B. Rubin (Eds.), Test equating. New York: Academic Press.

Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of Educational Measurement, 17, 179-193.

Marco, G. L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14, 139-160.

Marco, G. L., Petersen, N. C., & Stewart, E. E. (1983). A test of the adequacy of curvilinear score equating models. In D. J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing. New York: Academic Press.

Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item analysis and test scoring with binary logistic models. Mooresville, IN: Scientific Software.

Mislevy, R. J., & Stocking, M. L. (1989). A consumer's guide to LOGIST and BILOG. Applied Psychological Measurement, 13, 57-75.

Parshall, C. G., Houghton, P. D., & Kromrey, J. D. (1995).
Equating error and statistical bias in small sample linear equating. Journal of Educational Measurement, 32, 37-54.

Petersen, N. S., Cook, L. L., & Stocking, M. L. (1983). IRT versus conventional equating methods: A comparative study of scale stability. Journal of Educational Statistics, 8, 137-156.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement. New York: ACE/Macmillan.

Raju, N. S., Bode, R. K., Larsen, V. S., & Steinhaus, S. (1986, April). Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco.

Raju, N. S., Edwards, J. E., & Osberg, D. W. (1983, April). Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal.

Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.

Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1988). Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25, 193-203.

Rosenbaum, P. R., & Thayer, D. T. (1987). Smoothing the joint and marginal distributions of scored two-way contingency tables in test equating. British Journal of Mathematical and Statistical Psychology, 40, 43-49.

SAS Institute Inc. (1989). SAS/STAT user's guide (Version 6, 4th ed.). Cary, NC: SAS Institute Inc.

Schmitt, A. P., Cook, L. L., Dorans, N. J., & Eignor, D. R. (1990). Sensitivity of equating results to different sampling strategies. Applied Measurement in Education, 3, 53-71.

Skaggs, G., & Lissitz, R. W. (1986). IRT test equating: Relevant issues and a review of recent research. Review of Educational Research, 56, 495-529.

Skaggs, G., & Lissitz, R. W. (1988). Effect of examinee ability on test equating invariance. Applied Psychological Measurement, 12, 69-82.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.

Wang, T., & Kolen, M. J. (1994). A quadratic curve equating method to equate the first three moments in equipercentile equating (ACT Research Report 94-2). Iowa City, IA: American College Testing.

Wingersky, M. S., & Lord, F. M. (1984). An investigation of methods for reducing sampling error in certain IRT procedures. Applied Psychological Measurement, 8, 347-364.

Wingersky, M. S., & Barton, M. A. (1982). LOGIST user's guide. Princeton, NJ: Educational Testing Service.

Yen, W. M. (1980). The extent, causes and importance of context effects on item parameters for two latent trait models. Journal of Educational Measurement, 17, 297-311.

Yen, W. M. (1983). Tau-equivalence and equipercentile equating. Psychometrika, 48, 353-369.

Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125-145.

Yen, W. M. (1985). Paper presented at the meeting of the American Educational Research Association, Chicago.

Yen, W. M. (1987). A comparison of the efficiency and accuracy of BILOG and LOGIST. Psychometrika, 52, 275-291.

Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (1996). BILOG-MG: Multiple-group IRT analysis and test maintenance for binary items. Chicago: Scientific Software International.