EVALUATING EQUATING RESULTS IN THE NON-EQUIVALENT GROUPS WITH ANCHOR TEST DESIGN USING EQUIPERCENTILE AND EQUITY CRITERIA

By

Minh Quang Duong

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Measurement and Quantitative Methods

2011

ABSTRACT

EVALUATING EQUATING RESULTS IN THE NON-EQUIVALENT GROUPS WITH ANCHOR TEST DESIGN USING EQUIPERCENTILE AND EQUITY CRITERIA

By

Minh Quang Duong

Testing programs often use multiple forms of the same test to control item exposure and to ensure test security. Although test forms are constructed to be as similar as possible, they often differ. Test equating techniques are the statistical methods used to adjust scores obtained on different forms of the same test so that they are comparable and can be used interchangeably.

In this study, the performance of four commonly used equating methods under the nonequivalent groups with anchor test (NEAT) design was examined: the frequency estimation equipercentile method (FE), the chain equipercentile method (CE), the item response theory (IRT) true score method (TS), and the IRT observed score method (OS). Four criteria were used to evaluate the equating results: the equipercentile criterion (EP), the full equity criterion (E), the first-order equity criterion (E1), and the second-order equity criterion (E2). Simulated data were used under various conditions of form and group differences.

Several major findings were obtained. When the distributions used to simulate ability for the groups were equal, the four methods produced similar results, regardless of the criterion used. When group differences existed in the generating distributions, the results produced by different methods diverged substantially under the EP, E, and E1 criteria; the differences were small under the E2 criterion. In general, the OS method outperformed the others with regard to the EP and E criteria. The TS method performed best with regard to the E1 criterion, followed by the OS, CE, and FE methods. Of the two observed score methods (FE and CE), both of which were outperformed by the two IRT methods, CE produced much better results, close to those of the two IRT methods. The FE method produced the worst results, regardless of the criterion used.

It was also found that test form difference had clear effects on all methods, regardless of the criterion used: larger differences between test forms led to worse equating results. While the two IRT methods were not clearly affected by group differences in the generating distributions, the two observed score equating methods were. Larger group differences produced worse equating results for the CE and FE methods, and the impact of group differences was much stronger for the FE method than for the CE method. Group and form interaction effects were not found for the IRT methods; they were present for the FE and CE methods, although those effects were small. When evaluated with the E2 criterion, the four equating methods produced results that were no better than those obtained by using raw scores from the test forms directly, without equating.

These results are discussed in more detail, and some recommendations are made for equating practice. Limitations of the study and suggestions for further research are also presented.
Copyright by
MINH QUANG DUONG
2011

DEDICATIONS

To my parents, who gave me my life, raised me with great love, taught me the value of education, and encouraged me to pursue further education in order to be a better person. And to my beloved wife, LIEN KIM NGUYEN, who sacrificed her teaching career to come with me to the US and has always been by my side every step of the way, sharing all the ups and downs during our time at MSU and beyond.

ACKNOWLEDGEMENTS

There are many people who have given me much more than I can ever possibly repay. I would like to express my deep gratitude to my adviser and dissertation chair, Dr. Mark Reckase. My first meeting with him, which turned out to be one of the most important conversations of my life, inspired me to pursue advanced studies in psychometrics. Without his rich guidance, tremendous support, insightful comments, and great patience, this dissertation would not have been possible and my doctoral studies would never have been completed. I could not have asked for a better adviser.

My special thanks go to Dr. Tenko Raykov, who over the last six years has become my mentor. I would like to thank him for his strong support, both academic and emotional, during my time at MSU. From my first days at MSU, Dr. Richard Houang has been not only a good mentor but also a great friend. I am grateful to him for the many things he has done for me. I will miss the conversations I had with him about educational measurement. I thank Dr. Alexander von Eye for his helpful comments and great support of my dissertation. I am privileged and proud to have him on my dissertation committee.

My sincere gratitude goes to Dr. Richard Prawat, Chair of the CEPSE department, and to Dr. Karen Klomparens, Dean of the MSU Graduate School, for their great support. With all my heart, I would like to thank Dr. Christopher Wheeler for everything he has done for me. It was Dr. Wheeler who built the bridge connecting me to my success today. Finally, I thank the many other professors, graduate students, and friends who have enriched my life at MSU.

TABLE OF CONTENTS

LIST OF TABLES .......................................................... x
LIST OF FIGURES ......................................................... xi

CHAPTER 1
INTRODUCTION ............................................................ 1
1.1. Test equating ...................................................... 1
1.2. Evaluating equating results ........................................ 2
1.3. Concerns regarding equating criteria ............................... 4
1.4. The approach taken: equating definition and equating criterion ..... 5
1.4.1. Equipercentile definition ........................................ 6
1.4.2. Equity definition ................................................ 6
1.4.3. Equipercentile criterion and equity criteria ..................... 7
1.5. Motivation ......................................................... 8
1.6. Purpose of the study and research questions ........................ 9
1.6.1. Purpose .......................................................... 9
1.6.2. Research questions ............................................... 9
1.7. Research expectations ............................................. 10
1.8. Significance of the study ......................................... 10
1.9. Additional notes .................................................. 11
1.10. Overview of the dissertation ..................................... 11

CHAPTER 2
LITERATURE REVIEW ...................................................... 13
2.1. Test equating ..................................................... 13
2.2. The nonequivalent groups with anchor test (NEAT) design ........... 15
2.3. Equipercentile OSE methods under the NEAT design .................. 17
2.3.1. General framework ............................................... 17
2.3.2. Frequency estimation equipercentile equating method (FE) ........ 19
2.3.3. Chain equipercentile equating method (CE) ....................... 20
2.4. Presmoothing score distributions using log-linear models .......... 22
2.5. Item response theory (IRT) equating methods under the NEAT design . 24
2.5.1. Three-parameter logistic (3PL) model ............................ 24
2.5.2. IRT scale linking ............................................... 25
2.5.3. IRT true score equating method (TS) ............................. 26
2.5.4. IRT observed score equating method (OS) ......................... 28
2.6. Equating criteria ................................................. 29
2.6.1. Equipercentile criterion ........................................ 30
2.6.2. Equity criteria ................................................. 31
2.7. Summary of related research ....................................... 33
2.7.1. Prior research on comparing equating methods .................... 33
2.7.2. Prior research using equipercentile and equity criteria ......... 36
2.7.3. Summary ......................................................... 37

CHAPTER 3
RESEARCH METHOD ........................................................ 39
3.1. Purpose of the study and research questions ....................... 39
3.2. Overall research design ........................................... 40
3.2.1. General framework ............................................... 40
3.2.2. Data source ..................................................... 41
3.2.3. IRT model ....................................................... 41
3.2.4. Fixed factors ................................................... 41
3.2.5. Varied factors .................................................. 42
3.2.6. Simulation conditions ........................................... 43
3.2.7. Equating methods ................................................ 44
3.2.8. Replications .................................................... 44
3.3. Test form generation .............................................. 44
3.4. Data simulation ................................................... 44
3.5. Equipercentile equating procedures ................................ 47
3.6. IRT equating procedures ........................................... 48
3.6.1. Calibration ..................................................... 48
3.6.2. Scale linking ................................................... 48
3.6.3. Equating ........................................................ 48
3.7. Procedures for assessing criteria ................................. 49
3.7.1. Equating criteria ............................................... 49
3.7.2. Population score distributions .................................. 49
3.7.3. Evaluation indices .............................................. 51
3.8. Simulation steps within each condition ............................ 54

CHAPTER 4
RESULTS ................................................................ 56
4.1. Review of research purpose and questions .......................... 56
4.2. Review of evaluation indices ...................................... 57
4.3. General framework for presenting the results ...................... 57
4.4. Overall comparison among methods .................................. 58
4.4.1. Index EP ........................................................ 60
4.4.2. Index E ......................................................... 61
4.4.3. Index E1 ........................................................ 63
4.4.4. Index E2 ........................................................ 64
4.5. Effects of group and form factors on the performance of the FE method .. 66
4.5.1. Group effects for the FE method ................................. 70
4.5.2. Form effects for the FE method .................................. 70
4.5.3. Group and form interaction effects for the FE method ............ 72
4.6. Effects of group and form factors on the performance of the CE method .. 73
4.6.1. Group effects for the CE method ................................. 73
4.6.2. Form effects for the CE method .................................. 73
4.6.3. Group and form interaction effects for the CE method ............ 77
4.7. Effects of group and form factors on the performance of the TS method .. 77
4.7.1. Group effects for the TS method ................................. 77
4.7.2. Form effects for the TS method .................................. 81
4.7.3. Group and form interaction effects for the TS method ............ 81
4.8. Effects of group and form factors on the performance of the OS method .. 81
4.8.1. Group effects for the OS method ................................. 85
4.8.2. Form effects for the OS method .................................. 85
4.8.3. Group and form interaction effects for the OS method ............ 85
4.9. To equate or not to equate? ....................................... 85
4.10. Summary .......................................................... 86

CHAPTER 5
SUMMARY AND DISCUSSIONS ................................................ 89
5.1. Brief overview of the study ....................................... 89
5.2. Summary of major findings ......................................... 90
5.2.1. Overall performance ............................................. 90
5.2.2. Effects of form difference ...................................... 91
5.2.3. Effects of group difference ..................................... 91
5.2.4. Interaction effects of form difference and group difference ..... 92
5.2.5. To equate or not to equate? ..................................... 92
5.3. Discussion of the results ......................................... 92
5.3.1. Overall performance ............................................. 92
5.3.2. Effects of form difference ...................................... 94
5.3.3. Effects of group difference ..................................... 94
5.3.4. Interaction effects of form difference and group difference ..... 96
5.3.5. To equate or not to equate? ..................................... 96
5.3.6. Order effect of a-parameter difference .......................... 97
5.3.7. Unusually high index values for the CE method ................... 97
5.4. Recommendations ................................................... 98
5.4.1. Recommendation on the selection of equating methods in the NEAT design .. 98
5.4.2. Recommendation on the communication of equating results ......... 99
5.5. Limitations ....................................................... 99
5.6. Directions for future research ................................... 100

APPENDIX .............................................................. 103

REFERENCES ............................................................ 123

LIST OF TABLES

Table 2.1. The NEAT design ............................................. 16
Table 3.1. Descriptive statistics of item parameters of three initial blocks .. 45
Table 3.2. Illustrative example: x, y, ye, and cumulative distributions .. 53
Table 4.1. ANOVA results for the FE method for each index .............. 67
Table 4.2. ANOVA results for the CE method for each index .............. 74
Table 4.3. ANOVA results for the TS method for each index .............. 78
Table 4.4. ANOVA results for the OS method for each index .............. 82
Table 4.5. Summary of major results .................................... 87
Table A1. Repeated ANOVA results for index EP ......................... 103
Table A2. Repeated ANOVA results for index E .......................... 104
Table A3. Repeated ANOVA results for index E1 ......................... 105
Table A4. Repeated ANOVA results for index E2 ......................... 106
Table A5. Means of index EP for five equating methods in all conditions .. 107
Table A6. Means of index E for five equating methods in all conditions .. 110
Table A7. Means of index E1 for five equating methods in all conditions .. 113
Table A8. Means of index E2 for five equating methods in all conditions .. 116
Table A9. Comparison of results obtained from using fixed and random test forms .. 119

LIST OF FIGURES

Figure 3.1: Illustrative example of the area between the cumulative distribution functions of X and Ye .. 53
Figure 4.1: Means of index EP for FE, CE, TS, and OS methods in all conditions .. 60
Figure 4.2: Means of index E for FE, CE, TS, and OS methods in all conditions .. 62
Figure 4.3: Means of index E1 for FE, CE, TS, and OS methods in all conditions .. 63
Figure 4.4: Means of index E2 for FE, CE, TS, OS, and IE methods in all conditions .. 65
Figure 4.5: Means of index EP for FE method in all conditions .. 68
Figure 4.6: Means of index E for FE method in all conditions .. 68
Figure 4.7: Means of index E1 for FE method in all conditions .. 69
Figure 4.8: Means of index E2 for FE method in all conditions .. 69
Figure 4.9: Means of index EP for CE method in all conditions .. 75
Figure 4.10: Means of index E for CE method in all conditions .. 75
Figure 4.11: Means of index E1 for CE method in all conditions .. 76
Figure 4.12: Means of index E2 for CE method in all conditions .. 76
Figure 4.13: Means of index EP for TS method in all conditions .. 79
Figure 4.14: Means of index E for TS method in all conditions .. 79
Figure 4.15: Means of index E1 for TS method in all conditions .. 80
Figure 4.16: Means of index E2 for TS method in all conditions .. 80
Figure 4.17: Means of index EP for OS method in all conditions .. 83
Figure 4.18: Means of index E for OS method in all conditions .. 83
Figure 4.19: Means of index E1 for OS method in all conditions .. 84
Figure 4.20: Means of index E2 for OS method in all conditions .. 84
Figure B: Comparing equating results from two directions in two selected cases .. 121

CHAPTER 1
INTRODUCTION

This introductory chapter presents the foundations of this study. Major points include the context and nature of the problem, the approach used to address the problem, the purpose of the study, the specific research questions to be answered, the research expectations, and the significance of this study for test equating research and practice.

1.1. Test equating

In many testing programs, alternative forms of the same test are used in different administrations to maintain test security. For example, the SAT exam is given at several administrations each year with different forms. In developing various forms of the same test, test developers use test specifications to ensure that alternative forms are similar in content and statistical characteristics. Despite test developers' efforts, it is almost inevitable that differences among test forms exist to some degree unless they are identical. As a result, one test form may be easier or more difficult than others, and some test takers might have advantages or disadvantages simply because they are administered a relatively easy or difficult test form.

In order to maintain test fairness, scores obtained from different test forms should not be used before some adjustment is made to ensure score comparability (i.e., that scores are on the same scale). The adjustment process is called test equating, or simply equating. Equating is often defined as a statistical process used to adjust scores on alternative test forms so that the scores can be used interchangeably (Kolen & Brennan, 2004). If equating is successfully performed, test fairness is maintained and it becomes possible to compare examinees or to measure their growth (Angoff, 1971; Petersen, Kolen, & Hoover, 1989).

In general, a test equating process consists of two important components: an equating design and one or more equating methods.
An equating design is a plan for collecting equating data; for that reason, it is sometimes called a data collection design. The most commonly used design is the non-equivalent groups with anchor test (NEAT) design. In this design, two test forms that share some common items, called anchor items, are administered to two samples from two, usually distinct, populations of test takers (von Davier, Holland, & Thayer, 2004a). If the total score includes the score on the anchor items, the anchor is called internal; if the anchor score is not included in the total score, the anchor is called external. This design is also called the common-item nonequivalent groups design (Kolen & Brennan, 2004). Other common designs are the single group design and the random groups design.

In each equating design, different equating methods can be used. An equating method is a framework for deriving the equating function, which places scores of one test form on the scale of another test form. Equating methods can be broadly classified into two groups: observed score equating (OSE) methods and item response theory (IRT) methods. The OSE methods are usually referred to as traditional methods. Another way to classify equating methods is based on the assumed relationship between scores on the two test forms being equated. Within this framework, an equating method is classified as either linear or equipercentile, depending on whether the relationship between scores on the two forms is assumed to be linear or non-linear.

1.2. Evaluating equating results

Given the importance of equating in making scores comparable, which in turn has crucial impacts on decision making, it is critical that equating results be evaluated for accuracy. Evaluation results are also useful in helping psychometricians compare and select appropriate equating procedures in a specific situation.

Evaluating equating requires a criterion or criteria against which equating accuracy can be judged. A variety of criteria have been proposed and used in research and practice (for a detailed review, see Harris & Crouse, 1993). Traditionally, equating results from a very large sample are often used as benchmarks to evaluate other equating procedures (e.g., see Holland, Sinharay, von Davier, & Han, 2008; Livingston & Kim, 2010; Puhan, Moses, Grant, & McHale, 2009; Sinharay & Holland, 2007). However, comparing one method to another assesses the similarity between them, not necessarily the accuracy of the method being evaluated. In addition, the selection of the method used on the large sample to obtain the criterion is arbitrary: any equating method can be used to produce the criterion, and because different methods are likely to yield different results, this approach does not seem reasonable. According to Harris and Crouse (1993), large sample equating procedures do not necessarily provide the true equating results to which other methods should be compared.

Another popular equating criterion is the standard error of equating (SEE), which is defined as the standard deviation of equated scores over many hypothetical replications of an equating process on samples from a target population of test takers (Kolen & Brennan, 2004). Assessing the SEE often involves drawing random samples from the same population under the same set of conditions. Other statistical techniques, such as bootstrap methods, can be used to assess the SEE for a single equating. Equating processes with smaller SEE are preferred.
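To make the bootstrap idea concrete, the following is a minimal sketch, not taken from the dissertation: it resamples examinees with replacement, re-estimates a simple equipercentile transformation on each replication, and takes the standard deviation of the equated score at each score point as the SEE. The function names and the single-population simplification (no anchor) are this sketch's own assumptions.

```python
# Bootstrap SEE sketch: resample examinees, re-equate, take the SD of
# the equated scores at each Form Y score point over replications.
import numpy as np

def equipercentile(y_scores, x_scores, y_points):
    """Map each value in y_points onto the X scale by matching percentile ranks."""
    y_ranks = np.searchsorted(np.sort(y_scores), y_points, side="right") / len(y_scores)
    return np.quantile(x_scores, np.clip(y_ranks, 0, 1))

def bootstrap_see(x_sample, y_sample, y_points, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    reps = np.empty((n_boot, len(y_points)))
    for b in range(n_boot):
        xb = rng.choice(x_sample, size=len(x_sample), replace=True)
        yb = rng.choice(y_sample, size=len(y_sample), replace=True)
        reps[b] = equipercentile(yb, xb, y_points)
    # SEE at each score point = SD of equated scores over bootstrap replications
    return reps.std(axis=0, ddof=1)
```

Smaller SEE values across the score range indicate a more stable equating, which is the sense in which SEE-based comparisons are made in the studies cited next.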
Several studies have used the SEE to evaluate equating results (e.g., see Cui & Kolen, 2008; Hanson, Zeng, & Kolen, 1993; Liu, Schulz, & Yu, 2008; Lord, 1982a, 1982b; Wang, Hanson, & Harris, 2000; Zeng, Hanson, & Kolen, 1994). However, the use of the SEE as a means of comparing different equating methods has been criticized because it accounts only for random error due to sampling examinees from the population and ignores other sources of error (Harris & Crouse, 1993).

Cross-validation and replication are also frequently used to evaluate equating results. Cross-validation applies the equating transformation obtained from one sample to another, independent sample. Replication recalculates the equating transformation on another sample. Both methods use results from two different applications to check the stability of equating results. Examples of research of this kind are the studies conducted by Holmes (1982) and Kolen (1981).

Circular equating, which equates a test form to itself through a chain of equatings, is another commonly used equating criterion. Traditionally, the circular equating criterion is intended to assess systematic error. Ideally, the final result must be an identity (i.e., a score is transformed to an identical score). Many studies have used this criterion to evaluate equating results (e.g., see Gafni & Melamed, 1990; Han, Kolen, & Pohlmann, 1997; Klein & Jarjoura, 1985; Lord & Wingersky, 1984; Philips, 1985; Puhan, 2010; Skaggs, 2005; Wang, Hanson, & Harris, 2000).

Cross-validation, replication, and circular equating share the same drawback as the SEE. The stability they assess may not be an appropriate criterion for choosing an equating method, because an incorrect equating procedure may produce more stable equating relationships than correct procedures (Lord & Wingersky, 1984). Kolen and Brennan (1987) recommended that circular equating be used with considerable caution. Wang, Hanson, and Harris (2000) showed in their simulation study that the accuracy of equating methods cannot be determined by circular equating because this criterion does not take into account the systematic error (bias) embedded in the equating.

1.3. Concerns regarding equating criteria

As presented above, most widely used equating criteria have shortcomings. Although many equating criteria have been proposed and used, no single criterion is unambiguously preferable to the others. For many years, researchers have recognized that there is a problem in evaluating equating because no definitive criterion exists (Harris & Crouse, 1993). Using different criteria may lead to different conclusions about equating adequacy in a given context (Skaggs, 1990). Kolen (1990) indicated that there is no universally agreed upon equating criterion. This does not mean that all criteria are equally problematic; some criteria can be better than others in a specific situation. However, the lack of a common equating criterion makes it difficult to compare results across equating studies. Even the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), which state that technical information should be provided on the accuracy of equating, provide no guideline on how the adequacy of equating should be assessed.

1.4. The approach taken: equating definition and equating criterion

When implementing an equating evaluation process, it is crucial to consider the adopted definition of equating. The goal of equating evaluation must be to assess the extent to which the definition of equating holds.
In other words, equating criteria should be closely linked to what it means for two test forms to be equated. Theoretically, equating is often defined as a statistical process that adjusts scores on multiple test forms so that the scores are comparable and interchangeable (Kolen & Brennan, 2004; Petersen, Kolen, & Hoover, 1989). Defined this way, equating offers little guidance for selecting an appropriate criterion. What does it mean to say scores are comparable and interchangeable? How does one determine whether scores are actually comparable after being equated? In order to select a correct and fair criterion, one needs a definition that can be applied operationally. In other words, an operational definition of equating is necessary for criterion selection, and to be useful, such a definition must specify what comparability and interchangeability mean.

This study focused on two operational definitions of equating that have been proposed in the literature: the equipercentile definition proposed by Angoff (1971), and the equity definition proposed by Lord (1980).

1.4.1. Equipercentile definition

According to Angoff (1971), "two scores, one on form X and the other on form Y (where X and Y measure the same function with the same degree of reliability), may be considered equivalent if their corresponding percentile ranks in any given group are equal" (p. 563). This statement is commonly regarded as Angoff's equipercentile definition of equating (Harris & Crouse, 1993). The equipercentile definition implies that the distributions of scores on the two test forms in a population should be identical after equating (Kolen & Brennan, 2004, p. 12). The equipercentile definition is also labeled the definition of observed-score equating.

1.4.2. Equity definition

The equity definition of equating, also called the definition of true-score equating, was proposed by Lord (1980): "if an equating of tests X and Y is to be equitable to each applicant, it must be a matter of indifference to applicants at every given ability level θ whether they are to take test X or test Y" (p. 195). Equity requires that, for every θ, the conditional distributions of scores on the two test forms be identical after equating.

The two definitions are not unrelated. If one applies the equipercentile definition to a group of examinees at the same ability level θ, it is equivalent to the equity definition (Divgi, 1981). In addition, both definitions are based on score distributions. The difference is that the equipercentile definition focuses on marginal score distributions while the equity definition is defined on conditional score distributions.

Lord (1980) proved that equity is never satisfied unless the test forms being equated are perfectly reliable or strictly parallel, in which case equating is unnecessary. In practice, test forms are neither perfectly reliable nor strictly parallel; in other words, equity is unlikely to be fully satisfied in practice. Nevertheless, equity can be considered a gold standard for evaluating equating in the sense that it represents an ideal equating. Since full equity is unlikely to be satisfied in practice, some weaker versions of equity have been proposed. Two popular weakened versions are the first-order equity (Divgi, 1981) and the second-order equity (Morris, 1982).
These equity definitions require only that the first-order moment (i.e., expected value) or the second-order moment (i.e., variance or standard deviation) of the two conditional score distributions be the same, respectively. Kolen, Hanson, and Brennan (1992) argued that second-order equity should be nearly satisfied in order for the two test forms being equated to be used interchangeably.

1.4.3. Equipercentile criterion and equity criteria

Four equating criteria can be formulated from the two basic operational definitions of equating. Let X, Y, and Ye represent the score on Form X (the old form), the score on Form Y (the new form), and the equated Form Y score, respectively. Also, let x, y, and ye represent particular values of X, Y, and Ye. The equipercentile criterion compares two marginal score distributions: one for Ye and one for X. The full equity criterion compares the two conditional score distributions of Ye and X at a specific value of the latent ability θ. The first-order and second-order equity criteria compare the means and the variances (or standard deviations) of those two conditional distributions. Note that the equity-based criteria must be evaluated at all levels of θ in the range of interest.

Different criteria are appropriate for different equating definitions (or purposes). For example, if the equating purpose is to obtain the same marginal distributions of scores on the two test forms after equating, the equipercentile criterion would be appropriate. If the goal is to obtain the same conditional distributions on the two test forms after equating, the full equity criterion would be more suitable.
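Stated compactly, using only the notation already introduced (X, Ye, and θ), the four criteria can be written as follows. This is a restatement of the verbal definitions above, not a display taken from the dissertation; all four are understood to hold on the target population.

```latex
\text{EP (equipercentile):} \quad F_{Y_e}(x) = F_X(x) \quad \text{for all } x
\text{E (full equity):} \quad P(Y_e \le x \mid \theta) = P(X \le x \mid \theta) \quad \text{for all } x \text{ and } \theta
\text{E1 (first-order equity):} \quad E(Y_e \mid \theta) = E(X \mid \theta) \quad \text{for all } \theta
\text{E2 (second-order equity):} \quad SD(Y_e \mid \theta) = SD(X \mid \theta) \quad \text{for all } \theta
```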
1.5. Motivation

Test equating is an important task in many testing programs for making scores from alternative test forms comparable. It is crucial that equating results be evaluated on criteria appropriate to the predetermined equating purposes. Several factors motivated this study:

• There is an urgent need to evaluate equating results properly, using appropriate and fair criteria that are directly linked to the adopted definition of equating. In order to ensure test fairness, the evaluation process for equating must itself be correct and fair.
• Although equity is the most important aspect of equating (Lord, 1980), equity-based criteria have rarely been used in equating research and practice. In fact, no research using the full equity criterion to evaluate equating results had been reported, at least up to the time this dissertation was written.
• Many testing programs currently use both equipercentile OSE methods and IRT equating methods, so it is necessary to compare their performance. Although many studies comparing equating methods have been conducted, most compared only methods of the same kind (i.e., either OSE or IRT equating), and they led to different conclusions. Research directly comparing equipercentile OSE and IRT equating methods is sparse.
• The NEAT design is the most popular equating design used in practice. However, not much research has been conducted to compare equipercentile OSE and IRT equating methods in this design, especially using equity criteria.

1.6. Purpose of the study and research questions

1.6.1. Purpose

The primary purpose of this study was to use equipercentile and equity-based criteria to evaluate the performance of four commonly used equating methods under the NEAT design. Specifically, those equating methods are (see Chapter 2 for more details):

• the presmoothed frequency estimation equipercentile method (FE)
• the presmoothed chain equipercentile method (CE)
• the IRT true score equating method (TS)
• the IRT observed score equating method (OS)

In addition, the identity equating (i.e., no equating) method was also used, to examine the conditions under which no equating may be preferred. The performance of these equating methods was investigated under various conditions of differences between test forms and differences between groups of test takers.

1.6.2. Research questions

In particular, this study aimed to address the following research questions:

Question 1: Overall, how do these equating methods compare to one another in terms of the equipercentile and equity criteria?
Question 2: How do test form differences affect equating results for each method?
Question 3: How do group differences affect equating results for each method?
Question 4: Are there interaction effects between test form differences and group differences for each method?
Question 5: Under what conditions is identity equating preferred to the other methods?

1.7. Research expectations

Kolen and Brennan (2004) stated that each equating method tends to function optimally under certain situations. Therefore, it was expected that the investigated methods would perform differently relative to different criteria. Specifically, it was expected that:

• FE, CE, and OS perform relatively well under the equipercentile criterion.
• TS produces the most accurate results under the first-order equity criterion.
• OS performs better than TS when the equipercentile criterion is used.
• All equating methods perform similarly when test form differences and group differences are small.
• When form differences and group differences are large, equating results, regardless of the method used, are worse than when these differences are negligible.
• Identity equating is preferred when form differences and/or group differences are very large.

1.8. Significance of the study

Given the lack of research using equipercentile and equity criteria to evaluate equating results, and the scarcity of studies comparing equipercentile OSE and IRT equating methods in the NEAT design, this study was initiated to fill that gap. It was hoped that this study would make significant contributions to the research literature by providing an alternative perspective on how to evaluate equating results in a way that is well aligned with the equating purposes of specific contexts. In addition, results from this study would provide more comprehensive guidance for practitioners in selecting appropriate methods based on their adopted purposes. It was also expected that this study would inform equating practice by suggesting the sizes of form difference and group difference that can cause a specific equating method to perform well or poorly relative to various criteria.

1.9. Additional notes

• In this study, the equating direction was from Form Y to Form X. In other words, Form Y was the new form and Form X was the base (old) form.
• The anchor used in this study was internal, which means the anchor score was included in the total score.
• Although the words 'definition', 'purpose', and 'property' have different meanings in ordinary usage, they are used interchangeably in this dissertation in phrases such as 'equating definition', 'equating purpose', and 'equating property'. All of them refer to what is supposed to be accomplished by equating.
• The term 'equating criterion' is used frequently in this dissertation. In general, it means a property of equating that should hold for the equating results to be considered accurate. For example, the equity criterion refers to the equity property proposed by Lord (1980). Note that the equating criterion used in this study is a different concept from the commonly used statistical criterion.

1.10. Overview of the dissertation

The rest of this dissertation is organized as follows.

• Chapter 2 presents the theoretical background relevant to the study. Major topics include the NEAT design, the equipercentile OSE and IRT equating methods used in the study, equating criteria, and a review of relevant research.
• Chapter 3 presents the research design and methodology. Detailed steps are laid out, including the overall framework, research factors, procedures, and evaluation criteria.
• Results are presented in Chapter 4 for all research conditions, with a focus on addressing the proposed research questions.
• The last chapter, Chapter 5, summarizes the main findings and discusses their practical implications. Limitations of the study, current issues, and future steps are also discussed in this closing chapter.

CHAPTER 2
LITERATURE REVIEW

In this chapter, theoretical issues relevant to this study are discussed. The chapter begins with test equating and its two main components: equating designs and equating methods. Details about a specific design, the NEAT design, which was used in this study, are discussed next. After that, the four equating methods used in this study (frequency estimation equipercentile equating, chain equipercentile equating, IRT true score equating, and IRT observed score equating) are examined. The equipercentile definition and the equity definition, along with their corresponding criteria, are the next topics. The chapter concludes with a summary of prior research relevant to this study.

2.1. Test equating

Testing programs often use multiple forms of the same test for a variety of reasons. For example, in situations such as college admission, people can take the test at different times. If the same questions were used at each administration, they would become known, and people taking the test at a later administration would have an advantage. Thus, using multiple forms of a test maintains test fairness and security. Another example is a situation where it is necessary to use a pretest and a posttest (e.g., to measure growth). The main reason for using different forms of a test is to ensure that a test taker's score is a current measure of his or her competence and not a measure of the ability to recall questions on a previously administered form. Furthermore, using multiple alternative forms of the same test also serves to provide broad content coverage.

Although multiple forms are created to have similar characteristics, it is unlikely that test forms are exactly equivalent. For this reason, some examinees may have advantages or disadvantages by taking an easy or a difficult form. To ensure test fairness, scores on different test forms must be adjusted by a process commonly referred to as equating. Whenever alternative forms are used, equating is performed to place scores from different test forms on the same scale. Equating is commonly defined as a statistical process for adjusting scores on different test forms to account for unintended form-to-form differences such that the scores can be considered comparable (Kolen & Brennan, 2004).
When test forms are equated, a group (or population) of test takers to whom the equating relationship is supposed to apply must be identified (Braun & Holland, 1982). This group is usually called the target population in the equating literature.

An equating process consists of two major components: an equating design and an equating method. The equating design is a framework for collecting equating data. Common equating designs include: (a) the single group design, where the two test forms being equated are given to a single group randomly drawn from a population that is also the target population; (b) the random groups design, where the two forms are administered to two groups of test takers randomly drawn from a target population; and (c) the nonequivalent groups with anchor test (NEAT) design, where two test forms that share a set of common items, called an anchor, are given to two groups from two populations that usually differ in the level of the ability measured by the test. This study focused on the third design, which is discussed in the next section.

Under each equating design, various methods can be used. Equating methods can be classified into two categories: (a) observed-score equating (OSE) methods and (b) IRT equating methods. OSE methods, which are also called traditional methods, operate on empirical observed scores and can be further grouped into two kinds depending on the hypothesized equating relationship: linear methods specify a linear relationship between scores on the two test forms being equated, while equipercentile methods determine a non-linear relationship between them. This study used two equipercentile OSE methods that are widely used in practice: the frequency estimation equating method and the chain equating method. Details about these methods follow.

Unlike traditional OSE methods, IRT methods are not conducted directly on empirical scores. These methods are based on IRT models, which hypothesize a relationship between an examinee's latent ability, represented by θ, and the probability of his or her answering a specific test item correctly. IRT equating is conducted on either true scores or observed scores generated by the adopted models. Two common IRT equating methods were investigated in this study: IRT true score equating and IRT observed score equating. Details of these methods are reviewed in subsequent sections. Further details about equating designs and methods can be found in Holland and Dorans (2006), Kolen and Brennan (2004), von Davier, Holland, and Thayer (2004a), and Petersen, Kolen, and Hoover (1989).

2.2. The nonequivalent groups with anchor test (NEAT) design

Various equating designs can be used to collect data for equating. One of the most popular is the nonequivalent groups with anchor test (NEAT) design (von Davier, Holland, & Thayer, 2004a). This design is also called the common-item nonequivalent groups design (Kolen & Brennan, 2004); in this dissertation, the term 'NEAT' is adopted.

In this design (see Table 2.1), the two test forms to be equated, Form X and Form Y, are administered to two groups (i.e., samples) of test takers, group 1 and group 2, from two different populations, P and Q, respectively. The two test forms share a subset of items usually called the anchor (denoted A in Table 2.1). The sets of non-common (i.e., unique) items of Form X and Form Y are labeled XU and YU, respectively. A is an internal anchor if its score is included in the total score.
Otherwise, it is an external anchor. Note that in the NEAT design, as presented in Table 2.1, scores on XU are not observed for the population Q sample, and scores on YU are not observed for the population P sample.

Table 2.1. The NEAT design

Population   Sample   XU             A   YU
P            1        ✓              ✓   Not observed
Q            2        Not observed   ✓   ✓

The anchor A is used to adjust for differences between the two groups in terms of the abilities or skills relevant to the test. In other words, A serves to remove group differences in order to increase equating accuracy. It is recommended that the anchor be representative of the test forms being equated in content and statistical characteristics (see Sinharay & Holland, 2007; Kolen & Brennan, 2004); that is, the anchor should be a mini-version of the test forms. When the groups differ substantially, the anchor may fail to adjust for group differences, and in such situations equating may not be accurate.

In the NEAT design, the target population T is a mixture of P and Q and can be written as

T = wP + (1 - w)Q    (2.1)

The mixture is determined by the weight w given to population P. Theoretically, w can be any number between 0 and 1: when w = 1, T = P, and when w = 0, T = Q. In most cases, w is the ratio of the sample size of the group from P to the sum of the sample sizes of the two groups (Angoff, 1971). In the equating literature, this mixture is also called the synthetic population.

The NEAT design is widely used in many testing programs, for several reasons. The first is that this design requires only one test form to be administered per test date. In many testing situations, it is not possible to give more than one test form at the same administration because of test security and disclosure concerns; in such situations, the NEAT design is a good choice. Another reason is that, with external anchors, non-common items can be disclosed after the test date without compromising future test forms. The ability to disclose test items is important for many testing programs, as some states require disclosure of test items. The ability to deal with groups of test takers of different abilities is another advantage of the NEAT design, because the groups taking the test at different administrations tend to be self-selected and thus usually differ in systematic ways (Petersen, Kolen, & Hoover, 1989).

Various equating methods can be employed in the NEAT design. This dissertation focused on four non-linear methods: two equipercentile OSE methods and two IRT equating methods. Details of these methods are discussed in the following sections.

2.3. Equipercentile OSE methods under the NEAT design

2.3.1. General framework

Equating methods can be classified into two major categories: observed score equating (OSE) methods and IRT equating methods. Among OSE methods, equipercentile equating methods are the most important (von Davier, Holland, & Thayer, 2004a), and they are widely used in testing practice (Brennan, 2010). This study focused on two popular equipercentile OSE methods.

Equipercentile OSE methods focus on the distributions of observed scores on the two test forms being equated; they equate the quantiles of those score distributions. In other words, in equipercentile OSE, scores on two forms are considered equivalent if their corresponding percentile ranks in a given group are equal (Angoff, 1971).
The equipercentile equivalent of a Form Y score on the scale of Form X is found by first obtaining the percentile rank on Form Y of a score y, and then finding the score x on Form X associated with that percentile rank. Formally, the general equipercentile equating framework can be described as follows. Let X and Y represent scores on Form X and Form Y, respectively, with Y equated to the scale of X. The equipercentile equating transformation is a function from the scale of possible values of Y to the scale of X, that is, from y to x. The transformation \( \varphi(y) \) equates the quantiles of the two population distributions of X and Y using their cumulative distribution functions, \( F_Y(y) \) and \( F_X(x) \), on the target population T. The transformation function is

\[ \varphi(y) = F_X^{-1}\left( F_Y(y) \right) \tag{2.2} \]

where \( F_X^{-1} \) is the inverse function of \( F_X \). Because actual scores are discrete rather than continuous, a procedure is required to approximate a continuous score distribution. This procedure is called continuization (von Davier, Holland, & Thayer, 2004a). Several such methods have been used, including linear interpolation and Gaussian kernel smoothing (for more details, see Kolen & Brennan, 2004; von Davier, Holland, & Thayer, 2004a).

From Equation 2.2, in order to conduct equipercentile equating, the two cumulative distribution functions of the scores on the two test forms on the target population must be defined. In the NEAT design, because XU is missing on Q and YU is missing on P (see Table 2.1), some assumptions have to be made to compute \( F_X(x) \) and \( F_Y(y) \). Two popular equipercentile equating methods for the NEAT design were investigated in this study; they are discussed next.

2.3.2. Frequency estimation equipercentile equating method (FE)

The frequency estimation equipercentile equating method (FE) consists of two steps. The first step is to estimate the score distribution of each test form on a target population T, which is usually a synthetic population as defined in (2.1). The second step is to derive the equating function by applying the equipercentile framework (2.2) to the estimated score distributions from the first step. The score distributions of the two test forms in the target population are estimated as

\[ f_X(x) = w \, f_{X.P}(x) + (1-w) \, f_{X.Q}(x) \tag{2.3} \]
\[ f_Y(y) = w \, f_{Y.P}(y) + (1-w) \, f_{Y.Q}(y) \tag{2.4} \]

where f represents a population frequency distribution and w is the weight given to population P. Because of the structure of the NEAT design (see Table 2.1), \( f_{X.Q}(x) \) and \( f_{Y.P}(y) \) are not available from the observed data. Therefore, statistical assumptions are needed to obtain the score distributions in the target population. The FE method assumes that the conditional distributions of X and Y, given the anchor score A, are population independent. That is,

\[ f_{X|A.P}(x \mid a) = f_{X|A.Q}(x \mid a) \tag{2.5} \]
\[ f_{Y|A.P}(y \mid a) = f_{Y|A.Q}(y \mid a) \tag{2.6} \]

where a is a particular value of A. Combining (2.3) with (2.5) and (2.4) with (2.6), it follows that

\[ f_X(x) = w \, f_{X.P}(x) + (1-w) \sum_a f_{X|A.P}(x \mid a) \, f_{A.Q}(a) \tag{2.7} \]
\[ f_Y(y) = w \sum_a f_{Y|A.Q}(y \mid a) \, f_{A.P}(a) + (1-w) \, f_{Y.Q}(y) \tag{2.8} \]

where \( f_{A.P}(a) \) and \( f_{A.Q}(a) \) are the marginal distributions of A in P and Q, respectively. All quantities on the right-hand sides of (2.7) and (2.8) are observable in the NEAT design. From \( f_X(x) \) and \( f_Y(y) \), the cumulative distributions \( F_X(x) \) and \( F_Y(y) \) can be derived for Form X and Form Y, and equipercentile equating is then applied to them using (2.2).

Although the FE method is theoretically appealing, it has been found to produce larger equating bias than the other methods in the NEAT design, especially when group differences are substantial (e.g., see Holland, Sinharay, von Davier, & Han, 2008; Wang, Lee, Brennan, & Kolen, 2008). One reason for this disadvantage may be that the assumptions about the missing data made by the FE method are too strong and do not always hold in practical situations (Sinharay & Holland, 2010).
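The two FE steps translate directly into code. The following is a minimal sketch, not the dissertation's implementation: the function names, the frequency-table inputs, and the use of linear interpolation for continuization are this sketch's own assumptions.

```python
# FE sketch for the NEAT design. n_xa is the observed (X, A) frequency
# table from group 1 (population P); n_ya is the observed (Y, A) table
# from group 2 (population Q). Rows index total score, columns anchor score.
import numpy as np

def fe_synthetic_distributions(n_xa, n_ya, w):
    p_xa = n_xa / n_xa.sum()          # joint (X, A) distribution in P
    p_ya = n_ya / n_ya.sum()          # joint (Y, A) distribution in Q
    fA_P = p_xa.sum(axis=0)           # marginal of A in P
    fA_Q = p_ya.sum(axis=0)           # marginal of A in Q
    # Conditionals X|A in P and Y|A in Q; FE assumes these are population
    # invariant, equations (2.5) and (2.6)
    fX_given_A = p_xa / np.where(fA_P > 0, fA_P, 1)
    fY_given_A = p_ya / np.where(fA_Q > 0, fA_Q, 1)
    # Equations (2.7) and (2.8): synthetic-population marginals
    fX = w * p_xa.sum(axis=1) + (1 - w) * fX_given_A @ fA_Q
    fY = w * fY_given_A @ fA_P + (1 - w) * p_ya.sum(axis=1)
    return fX, fY

def equipercentile(fY, fX):
    """Equation (2.2) on discrete distributions, with linear interpolation
    as a crude continuization. Returns the equated value for each Y score."""
    FY, FX = np.cumsum(fY), np.cumsum(fX)
    return np.interp(FY, FX, np.arange(len(fX)))
```

`fe_synthetic_distributions` implements (2.7) and (2.8) directly, and `equipercentile` then applies (2.2); a production implementation would use a more careful continuization than plain interpolation of the cumulative distributions.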
2.3.3. Chain equipercentile equating method (CE)

The chain equipercentile equating method (CE) (Dorans, 1990; Kolen & Brennan, 2004) is another popular OSE method used in the NEAT design. It is also called the Design V method (Angoff, 1971; Braun & Holland, 1982; Harris & Kolen, 1990). The CE method consists of three sequential steps. In the first step, the Form Y score y is equated to the anchor score a in population Q using the equipercentile method, resulting in the equating function

\[ \varphi_{YA.Q}(y) = F_{A.Q}^{-1}\left( F_{Y.Q}(y) \right) \tag{2.9} \]

where \( F_{A.Q}^{-1} \) is the inverse cumulative distribution function of A in Q and \( F_{Y.Q} \) is the cumulative distribution function of Y in Q. In the second step, the anchor score a is equated to the Form X score x in population P, producing the equating function

\[ \varphi_{AX.P}(a) = F_{X.P}^{-1}\left( F_{A.P}(a) \right) \tag{2.10} \]

where \( F_{X.P}^{-1} \) is the inverse cumulative distribution function of X in P and \( F_{A.P} \) is the cumulative distribution function of A in P. Finally, Y is equated to X through a chain of the two equipercentile equating functions:

\[ \varphi(y) = \varphi_{AX.P}\left( \varphi_{YA.Q}(y) \right) = F_{X.P}^{-1}\left( F_{A.P}\left( F_{A.Q}^{-1}\left( F_{Y.Q}(y) \right) \right) \right) \tag{2.11} \]

In comparison with the FE method, the CE method is easier and less computationally intensive to implement because it does not require the joint distribution of the total score and the anchor score (Kolen & Brennan, 2004): it can use the marginal distributions of X and A for the examinees taking Form X and the marginal distributions of Y and A for the examinees taking Form Y. However, the CE method has theoretical shortcomings. The method involves equipercentile equating between a long test (the total test) and a short test (the anchor); theoretically, test forms of unequal lengths, and thus unequal reliabilities, cannot be equated in the sense that their scores can be used interchangeably. Another problem is that the CE method does not clearly determine the target population (Braun & Holland, 1982): it consists of two equating procedures performed on two different groups, but it is not clear how the groups are combined. On the other hand, the CE method does not require equivalent groups, so it can be helpful when group differences exist. Equating research has found that the CE method produces smaller bias than the FE method when group differences are large (e.g., see Holland, Sinharay, von Davier, & Han, 2008; Wang, Lee, Brennan, & Kolen, 2008).
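The chain of (2.9) through (2.11) can be sketched in the same style as the FE example above. Again, this is illustrative only; the marginal-distribution inputs and the linear-interpolation continuization are this sketch's assumptions.

```python
# CE sketch: two equipercentile links composed into one function.
# fY_Q and fA_Q are the marginal distributions of Y and A in group 2;
# fA_P and fX_P are those of A and X in group 1 (hypothetical inputs).
import numpy as np

def chain_equipercentile(fY_Q, fA_Q, fA_P, fX_P):
    # Step 1 (eq. 2.9): equate each Y score to the anchor scale in Q
    y_on_a = np.interp(np.cumsum(fY_Q), np.cumsum(fA_Q), np.arange(len(fA_Q)))
    # Step 2 (eq. 2.10): evaluate F_{A.P} at the (generally non-integer)
    # anchor equivalents from step 1
    FA_P_at = np.interp(y_on_a, np.arange(len(fA_P)), np.cumsum(fA_P))
    # Step 3 (eq. 2.11): invert F_{X.P} to finish the chain
    return np.interp(FA_P_at, np.cumsum(fX_P), np.arange(len(fX_P)))
```

Note how only marginal distributions enter the computation, which is the sense in which CE is less demanding than FE.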
2.4. Presmoothing score distributions using log-linear models

In the NEAT design, there are two observed bivariate score distributions: one for the pair (X, A) on Form X and the other for the pair (Y, A) on Form Y. These distributions are obtained from the samples of examinees taking the two forms. Sample score distributions are usually irregular, particularly at the extremes of the score range. The irregularities are primarily due to random error in sampling examinees from the population of test takers. This may result in unstable and inaccurate equating functions, especially when the sample sizes are small (Liou & Cheng, 1995). To mitigate these effects, smoothing the sample score distributions prior to equating is often recommended (Hanson, 1991; Kolen & Brennan, 2004; Rosenbaum & Thayer, 1987; van der Linden & Wiberg, 2010). This process is called presmoothing, since it is conducted prior to equating. The purpose of presmoothing is to smooth out some of the sampling variability in order to produce more stable score distribution estimates; the smoothed distributions are then used to equate the test forms. It has been found that presmoothing can reduce equating error for small samples. When the samples are large, presmoothing may not produce a large improvement, but it may be a useful way to remove undesired roughness in the sample score distributions (Hanson, Zeng, & Colton, 1994; Livingston, 1993; Livingston & Feryok, 1987).

Various presmoothing methods are available to psychometricians. Among the popular models are the log-linear models, the beta binomial models, and the four-parameter binomial models. The log-linear models were used in this study because they are very flexible, in the sense that they can potentially fit a wide class of bivariate distributions. These models are discussed in more detail in Holland and Thayer (2000).

The log-linear models considered in this study are those used to produce a smoothed version of a bivariate distribution of total test score and anchor score, such as (X, A) for Form X or (Y, A) for Form Y. Assume that the possible values of X and A are \( x_i \) (i = 1, ..., I) and \( a_j \) (j = 1, ..., J), respectively. The vector of observed bivariate frequencies, \( \mathbf{n} = (n_{11}, \ldots, n_{IJ})' \), sums to the total sample size, N. The following log-linear model can be used to fit a bivariate distribution to the observed distribution of (X, A):

\[ \log_e(p_{ij}) = \beta_0 + \sum_{c=1}^{C} \beta_{xc} \, x_i^c + \sum_{d=1}^{D} \beta_{ad} \, a_j^d + \sum_{e=1}^{E} \sum_{f=1}^{F} \beta_{xa_{ef}} \, x_i^e a_j^f \tag{2.12} \]

where \( p_{ij} \) is the expected joint probability of the score pair \( (x_i, a_j) \) (\( x_i \) on X, \( a_j \) on A), \( \beta_0 \) is a normalizing constant that forces the expected probabilities \( p_{ij} \) to sum to 1, and the remaining \( \beta \)'s are free parameters estimated in the model-fitting process.

This model produces a smoothed bivariate distribution that preserves C moments of the marginal (univariate) distribution of X, D moments of the marginal (univariate) distribution of A, and a number of cross moments of the bivariate (X, A) distribution determined by E and F. For example, a model with C = D = 2 and E = F = 1 (denoted model 2211) preserves the first two univariate moments (i.e., mean and standard deviation) of X and A as well as the first cross moment (i.e., covariance) between X and A. The observed bivariate (Y, A) distribution can be fit by a log-linear model in the same manner.
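As a sketch of how the 2211 model can be fit, the log-linear model (2.12) can be estimated as a Poisson regression on the flattened frequency table, one standard computational route for maximum likelihood fitting of log-linear models. The function name and the use of statsmodels are this sketch's assumptions, not the dissertation's software.

```python
# Bivariate log-linear presmoothing sketch, model 2211 (C = D = 2,
# E = F = 1). n_xa is the observed (X, A) frequency table.
import numpy as np
import statsmodels.api as sm

def presmooth_2211(n_xa):
    I, J = n_xa.shape
    x = np.repeat(np.arange(I), J).astype(float)   # total-score value per cell
    a = np.tile(np.arange(J), I).astype(float)     # anchor-score value per cell
    # Columns: intercept, x, x^2, a, a^2, x*a -- preserving two moments of
    # each margin plus the covariance, as described for model 2211
    design = np.column_stack([np.ones_like(x), x, x**2, a, a**2, x * a])
    fit = sm.GLM(n_xa.ravel(), design, family=sm.families.Poisson()).fit()
    fitted = fit.fittedvalues.reshape(I, J)        # smoothed expected counts
    return fitted / fitted.sum()                   # smoothed joint probabilities
```

The renormalized fitted cell means are the smoothed joint probabilities that would then feed the FE or CE computations sketched earlier.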
2.5. Item response theory (IRT) equating methods under the NEAT design

Item response theory (IRT) equating methods are used in many testing programs. In this section, the two commonly used IRT equating methods employed in this study are discussed.

2.5.1. Three-parameter logistic (3PL) model

IRT consists of a family of probabilistic models that relate an examinee's proficiency level θ to the probability of answering an item within a particular category (Lord, 1980). For dichotomously scored items, there are only two response categories, correct and incorrect. Various IRT models have been developed for dichotomously scored items as well as for polytomously scored items. A general and commonly used IRT model for dichotomous items is Birnbaum's three-parameter logistic (3PL) model (Lord & Novick, 1968). Under the 3PL model, the probability that an examinee with latent ability \( \theta_j \) gives a correct response, \( u_{ij} = 1 \), to item i is

\[ p_{ij} = p(u_{ij} = 1 \mid \theta_j; a_i, b_i, c_i) = c_i + \frac{1 - c_i}{1 + \exp\left[ -D a_i (\theta_j - b_i) \right]} \tag{2.13} \]

where \( a_i \) is the item discrimination parameter, \( b_i \) is the item difficulty parameter, \( c_i \) is the item guessing parameter, and D is a scaling constant equal to 1.7. In practice, the item and examinee parameters are estimated from data (i.e., examinees' responses to test items).

2.5.2. IRT scale linking

When the NEAT design is used, IRT item and ability parameters are typically estimated separately for the two test forms, resulting in two different ability scales. However, in order to perform IRT applications, the parameters must be on the same scale. This problem is solved by a process called scale linking, or simply linking. In the 3PL model, the two scales X and Y have a linear relationship,

\[ \theta_X = S \theta_Y + I \tag{2.14} \]

If the 3PL model holds perfectly, the parameters of the common items have the following relationships:

\[ a_{Xi} = \frac{a_{Yi}}{S} \tag{2.15} \]
\[ b_{Xi} = S b_{Yi} + I \tag{2.16} \]
\[ c_{Xi} = c_{Yi} \tag{2.17} \]

where i indexes a common item. If the model held perfectly and the item parameters were known, the true linking coefficients S and I could be obtained from any one of the common items. In practice, equations (2.15), (2.16), and (2.17) are not satisfied for all common items, so a linking process is needed to estimate S and I. Various linking procedures are available (see Kolen & Brennan, 2004, for more details on IRT linking methods). Four linking methods are often used in research and practice: mean/sigma (Marco, 1977), mean/mean (Loyd & Hoover, 1980), Haebara (Haebara, 1980), and Stocking-Lord (Stocking & Lord, 1983). In this study, the Stocking-Lord method was used. This method estimates S and I by minimizing the difference between the test characteristic curves for the anchor associated with the two sets of anchor item parameter estimates obtained from the two separate calibrations, one for each form. After S and I are estimated, equations (2.14) through (2.17) can be applied to the ability estimates and all non-common item parameter estimates of Form Y to place them on the scale of Form X. Once the item parameter estimates of the two forms are on the same scale, equating can be conducted.
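A minimal sketch of (2.13) and of the Stocking-Lord criterion follows; it is illustrative rather than a reproduction of the software used in the study, and the grid, optimizer, and function names are this sketch's own choices.

```python
# 3PL response function (eq. 2.13) and a Stocking-Lord linking sketch.
# anchor_x / anchor_y are (n_items, 3) arrays of (a, b, c) estimates for
# the anchor items from the two separate calibrations.
import numpy as np
from scipy.optimize import minimize

D = 1.7  # scaling constant

def p3pl(theta, a, b, c):
    """Equation (2.13): probability of a correct response under the 3PL model."""
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def stocking_lord(anchor_x, anchor_y, theta_grid=np.linspace(-4, 4, 41)):
    ax, bx, cx = anchor_x.T
    ay, by, cy = anchor_y.T

    def loss(coef):
        S, I = coef
        # Put the Form Y anchor parameters on the Form X scale
        # (equations 2.15 and 2.16), then compare anchor test
        # characteristic curves over the theta grid
        tcc_x = p3pl(theta_grid[:, None], ax, bx, cx).sum(axis=1)
        tcc_y = p3pl(theta_grid[:, None], ay / S, S * by + I, cy).sum(axis=1)
        return np.sum((tcc_x - tcc_y) ** 2)

    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x  # (S, I)
```

Only the anchor items enter the loss, matching the description above: the coefficients that make the two anchor test characteristic curves agree most closely are taken as S and I.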
The following sections briefly present the two commonly used IRT equating methods that were used in this study.

2.5.3. IRT true score equating method (TS)

In the 3PL model, the number-correct true scores on Form X and Form Y associated with ability θ are defined through their test characteristic functions (Lord, 1980) as

\tau_X(\theta) = \sum_{i:X} p_i(\theta; a_i, b_i, c_i)    (2.18)
\tau_Y(\theta) = \sum_{j:Y} p_j(\theta; a_j, b_j, c_j)    (2.19)

where the summations are over the items in Form X and Form Y, respectively, and p represents the probability of a correct response as presented in (2.13). In the 3PL model, very low true scores are not attainable because p(\theta) \to c as \theta \to -\infty. A similar problem occurs at the all-correct true score, which is approached only as \theta \to +\infty. Therefore, the ranges of true scores on Form X and Form Y are

\sum_{i:X} c_i < \tau_X < K_X    (2.20)
\sum_{j:Y} c_j < \tau_Y < K_Y    (2.21)

where K_X and K_Y are the total numbers of items on Form X and Form Y, respectively.

In IRT true score (TS) equating, the true number-correct score on one form associated with a given θ is considered to be equivalent to the true score on the other form associated with the same θ (Kolen & Brennan, 2004). Mathematically, \tau_X(\theta) and \tau_Y(\theta), as computed from (2.18) and (2.19) with the same θ value, are considered to be equivalent. TS equating can be conducted in three steps (Kolen & Brennan, 2004, p. 176):

1. Specify a true score \tau_Y on Form Y, with \sum_{j:Y} c_j < \tau_Y < K_Y.
2. Find the value θ that corresponds to \tau_Y.
3. Find the true score on Form X, \tau_X, that is associated with the θ obtained in step 2.

The second step, which requires solving (2.19) for θ, needs an iterative procedure such as Newton-Raphson, as presented in Kolen and Brennan (2004). Pairing the two true scores associated with the same θ value across a range of θ's produces a true-score equating table. This table is then applied in practice to observed number-correct scores. Since true scores are not the same as observed scores, this step does not have a sound theoretical justification (Lord, 1980).

When using TS equating with observed scores, a procedure is needed for equating scores outside the range of possible true scores described in equations (2.20) and (2.21). Lord (1980) and Kolen (1981) proposed ad hoc procedures to handle this problem. Kolen's procedure, which was used in this study, is as follows:

1. Set a score of 0 on Form Y equal to a score of 0 on Form X.
2. Set a score of \sum_{j:Y} c_j on Form Y equal to a score of \sum_{i:X} c_i on Form X.
3. Apply linear interpolation to find equivalents between these points.
4. Set a score of K_Y on Form Y equal to a score of K_X on Form X.

Because TS equating is theoretically population invariant and straightforward to implement, it has been widely used in equating research and practice.

2.5.4. IRT observed score equating method (OS)

The IRT observed score equating (OS) method consists of two steps. The first step is to estimate the distributions of observed number-correct scores on Form X and Form Y in the target population T. The second step is to conduct traditional equipercentile equating on these estimated distributions. For Form X, the recursion formula presented in Lord and Wingersky (1984) can be used to obtain the conditional distribution of observed scores at each θ value. Define f_r(x|\theta) as the distribution of number-correct scores over the first r items for examinees with ability θ, and p_r as the probability of those examinees answering the r-th item correctly. For r > 1, the recursion formula is as follows (Kolen & Brennan, 2004):

f_r(x \mid \theta) = f_{r-1}(x \mid \theta)(1 - p_r),                                      x = 0
f_r(x \mid \theta) = f_{r-1}(x \mid \theta)(1 - p_r) + f_{r-1}(x - 1 \mid \theta) p_r,     0 < x < r
f_r(x \mid \theta) = f_{r-1}(x - 1 \mid \theta) p_r,                                       x = r
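As an illustration of the recursion (reusing the hypothetical p3pl helper from the linking sketch above), the following computes f(x | θ) for a whole test and then a marginal score distribution by averaging over a discrete approximation to the θ distribution of the target population; the grid-and-weights quadrature is a simplification.

```python
import numpy as np

def lord_wingersky(theta, a, b, c):
    """Conditional number-correct score distribution f(x | theta) for a
    test whose items have 3PL parameters a, b, c, built up one item at
    a time with the recursion above (starting from zero items)."""
    f = np.array([1.0])                 # distribution over zero items
    for ai, bi, ci in zip(a, b, c):
        p = p3pl(theta, ai, bi, ci)     # probability of a correct answer
        g = np.zeros(len(f) + 1)
        g[:-1] += f * (1.0 - p)         # item r answered incorrectly
        g[1:] += f * p                  # item r answered correctly
        f = g
    return f                            # f[x] = Pr(X = x | theta)

# Marginal distribution in the target population: weight the conditional
# distributions by a normalized N(0, 1) density evaluated on a theta grid.
grid = np.linspace(-4.0, 4.0, 41)
w = np.exp(-grid ** 2 / 2.0)
w /= w.sum()
# fx = sum(wk * lord_wingersky(t, a, b, c) for wk, t in zip(w, grid))
```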
... > .25, using Form Y with similar b-parameters as Form X and higher a-parameters resulted in better results.

4.6. Effects of group and form factors on the performance of the CE method

All procedures that were conducted for the FE method were also used with the other methods, including the ANOVA model presented in (4.1). The ANOVA results for the CE method are displayed in Table 4.2. The cell means are presented in Figures 4.9 through 4.12 and in Tables A5 through A8 in the Appendix.

4.6.1. Group effects for the CE method

From Table 4.2, it is clear that the group factor had statistically significant effects on all four indices. A closer look at the graphs reveals more details. Results for the indices EP, E, E1, and E2 for the CE method are presented in Figures 4.9, 4.10, 4.11, and 4.12, respectively. Those figures have different shapes from those for the FE method. They show that the curves go up slightly from left to right: when the groups became more different, the equating results from the CE method became worse. However, the effects of group difference on EP and E1 were slightly stronger than on E, and much stronger than on E2. In summary, although the effects of the group factor were found to be significant, they were smaller for the CE method than for the FE method.

4.6.2. Form effects for the CE method

The ANOVA table (Table 4.2) clearly indicates that the form factors had statistically significant effects on all indices for the CE method. A review of Figures 4.9 through 4.12 provides more information. For EP (Figure 4.9) and E1 (Figure 4.11), there are some high index values indicating unsatisfactory results in some conditions. Those conditions were when group difference existed and Form Y was much easier than Form X (b = -1.2) while at the same time having higher a-parameters (a = 2).

Table 4.2. ANOVA results for the CE method for each index

Index  Source  SS       df    MS      F       Sig.
EP     µ       120.54   4     30.14   338.23  < .0001
       a       24.20    2     12.10   135.83  < .0001
       b       39.75    6     6.62    74.35   < .0001
       a*b     51.51    12    4.29    48.18   < .0001
       µ*a*b   59.03    80    0.74    8.28    < .0001
       Error   458.41   5145  0.09
       Total   753.45   5249
E      µ       34.04    4     8.51    38.98   < .0001
       a       177.44   2     88.72   406.43  < .0001
       b       101.92   6     16.99   77.81   < .0001
       a*b     32.25    12    2.69    12.31   < .0001
       µ*a*b   35.55    80    0.44    2.04    < .0001
       Error   1123.11  5145  0.22
       Total   1504.31  5249
E1     µ       146.89   4     36.72   394.57  < .0001
       a       50.05    2     25.03   268.90  < .0001
       b       55.45    6     9.24    99.29   < .0001
       a*b     67.21    12    5.60    60.17   < .0001
       µ*a*b   90.00    80    1.12    12.09   < .0001
       Error   478.86   5145  0.09
       Total   888.45   5249
E2     µ       6.34     4     1.59    3.45    0.0081
       a       489.32   2     244.66  531.69  < .0001
       b       66.91    6     11.15   24.23   < .0001
       a*b     50.74    12    4.23    9.19    < .0001
       µ*a*b   37.11    80    0.46    1.01    0.4597
       Error   2367.49  5145  0.46
       Total   3017.91  5249

[Figure 4.9: Means of index EP for the CE method in all conditions]
[Figure 4.10: Means of index E for the CE method in all conditions]
[Figure 4.11: Means of index E1 for the CE method in all conditions]
[Figure 4.12: Means of index E2 for the CE method in all conditions]

Except for those conditions, the form effects on EP and E1 were not very strong or clear. Nevertheless, from those figures, equating similar forms (in terms of a- and b-parameters) appeared to produce good results. For E (Figure 4.10) and E2 (Figure 4.12), the same observation can be made. For both indices, the best results were obtained when forms were similar (a = 1, b = 0). When Form Y a-parameters were higher than those of Form X, the results were better than when Form Y had lower a-parameters.

4.6.3. Group and form interaction effects for the CE method

The ANOVA table (Table 4.2) indicates that significant group-form interaction effects were found for all but the E2 index.
However, from Figures 4.9 through 4.12, except for the few surprisingly high index values mentioned previously, it appears that the group-form interaction effects were not strong. The interaction effects that were found may have been produced by those spikes.

4.7. Effects of group and form factors on the performance of the TS method

The ANOVA results for the TS method are displayed in Table 4.3. The cell means are presented in Tables A5 through A8 in the Appendix. They are also graphically displayed in Figures 4.13 through 4.16.

4.7.1. Group effects for the TS method

The ANOVA table (Table 4.3) clearly indicates that the group factor did not have a statistically significant effect on any of the indices for the TS method. This result can be verified by the graphical presentation in Figures 4.13 through 4.16. Although the shapes and the index values for the curves in those figures are not the same, they share one common pattern: the curves do not change as the group difference changes. In other words, group difference had no effect on the equating results of the TS method as evaluated by the four indices.

Table 4.3. ANOVA results for the TS method for each index

Index  Source  SS       df    MS      F       Sig.
EP     µ       0.52     4     0.13    1.18    0.3195
       a       4.75     2     2.38    21.42   < .0001
       b       19.45    6     3.24    29.22   < .0001
       a*b     4.73     12    0.39    3.56    < .0001
       µ*a*b   4.20     80    0.05    0.47    0.9981
       Error   570.62   5145  0.11
       Total   604.28   5249
E      µ       0.84     4     0.21    1.61    0.1699
       a       237.21   2     118.61  901.90  < .0001
       b       116.31   6     19.39   147.41  < .0001
       a*b     19.40    12    1.62    12.30   < .0001
       µ*a*b   3.98     80    0.05    0.38    0.9975
       Error   676.60   5145  0.13
       Total   1054.35  5249
E1     µ       1.31     4     0.33    1.82    0.1223
       a       15.24    2     7.62    42.45   < .0001
       b       6.92     6     1.15    6.43    < .0001
       a*b     4.77     12    0.40    2.22    0.0089
       µ*a*b   10.58    80    0.13    0.74    0.9618
       Error   923.20   5145  0.18
       Total   962.02   5249
E2     µ       3.26     4     0.82    1.48    0.2066
       a       507.11   2     253.55  458.98  < .0001
       b       216.22   6     36.04   65.23   < .0001
       a*b     47.89    12    3.99    7.22    < .0001
       µ*a*b   4.93     80    0.06    0.11    0.9999
       Error   2842.26  5145  0.55
       Total   3621.67  5249

[Figure 4.13: Means of index EP for the TS method in all conditions]
[Figure 4.14: Means of index E for the TS method in all conditions]
[Figure 4.15: Means of index E1 for the TS method in all conditions]
[Figure 4.16: Means of index E2 for the TS method in all conditions]

4.7.2. Form effects for the TS method

As seen in Table 4.3, the form effects were statistically significant. The interaction effects of a and b were also significant. More details are obtained from the figures. Similar conclusions can be drawn for E and E2 (see Figures 4.14 and 4.16): the best results were obtained when forms were similar (in terms of both a- and b-parameters), and when Form Y a-parameters were higher than those of Form X, the results were better than otherwise. For EP (Figure 4.13), the best results were obtained when the two forms had similar b-parameters, unless Form Y a-parameters were smaller than those of Form X. For E1 (Figure 4.15), the pattern was more complicated. Nevertheless, using similar forms still led to the best results. When the a-parameters on the two forms differed, the results were worse no matter which form had the higher a-parameters.
4.7.3. Group and form interaction effects for the TS method

The ANOVA did not find significant group-form interaction effects (see Table 4.3). Figures 4.13 through 4.16 display similar patterns in all group conditions, indicating that for the TS method there was no interaction effect between the group and form factors.

4.8. Effects of group and form factors on the performance of the OS method

The ANOVA results for the OS method are displayed in Table 4.4. The cell means are presented in Tables A5 through A8 in the Appendix and in Figures 4.17 through 4.20.

Table 4.4. ANOVA results for the OS method for each index

Index  Source  SS       df    MS      F       Sig.
EP     µ       0.41     4     0.10    1.81    0.1233
       a       1.15     2     0.58    10.18   < .0001
       b       4.81     6     0.80    14.13   < .0001
       a*b     2.58     12    0.21    3.79    < .0001
       µ*a*b   1.44     80    0.02    0.32    0.9979
       Error   291.70   5145  0.06
       Total   302.09   5249
E      µ       0.76     4     0.19    1.78    0.1292
       a       185.12   2     92.56   866.21  < .0001
       b       67.61    6     11.27   105.45  < .0001
       a*b     11.84    12    0.99    9.23    < .0001
       µ*a*b   2.34     80    0.03    0.27    0.9981
       Error   549.78   5145  0.11
       Total   817.45   5249
E1     µ       1.12     4     0.28    1.55    0.1836
       a       36.79    2     18.39   102.37  < .0001
       b       25.15    6     4.19    23.33   < .0001
       a*b     16.43    12    1.37    7.62    < .0001
       µ*a*b   10.52    80    0.13    0.73    0.9651
       Error   924.48   5145  0.18
       Total   1014.49  5249
E2     µ       1.42     4     0.35    1.56    0.1832
       a       413.51   2     206.76  909.30  < .0001
       b       100.22   6     16.70   73.46   < .0001
       a*b     23.52    12    1.96    8.62    < .0001
       µ*a*b   3.07     80    0.04    0.17    0.9999
       Error   1169.87  5145  0.23
       Total   1711.61  5249

[Figure 4.17: Means of index EP for the OS method in all conditions]
[Figure 4.18: Means of index E for the OS method in all conditions]
[Figure 4.19: Means of index E1 for the OS method in all conditions]
[Figure 4.20: Means of index E2 for the OS method in all conditions]

4.8.1. Group effects for the OS method

Like the TS method, results from the OS method were not affected by the group factor. Table 4.4 clearly shows that the group effects were not statistically significant. In Figures 4.17 through 4.20, which present results for the OS method, the curves do not change shape when moving from one group condition to another. Therefore, it is clear that group difference had no effect on the equating results of the OS method as evaluated by the four indices.

4.8.2. Form effects for the OS method

As seen in Table 4.4, the form effects were statistically significant. The interaction effects of a and b were also significant. More details can be obtained from a review of Figures 4.17, 4.18, and 4.20. When similar forms were used, the best results were produced. When Form Y a-parameters were higher than those of Form X, the results were better than otherwise. For EP (Figure 4.17), using forms with similar b-parameters produced the best results as long as Form Y a-parameters were not smaller than those of Form X.

4.8.3. Group and form interaction effects for the OS method

It is clear from Table 4.4 that the group-form interaction effects were not statistically significant. Figures 4.17 through 4.20 display similar patterns in all group conditions, indicating that for the OS method there was no interaction effect between the group and form factors.

4.9. To equate or not to equate?
The IE method (i.e., identity equating, or no equating) was used in this study to determine whether there was any condition among those studied in which no equating would be the best choice. This analysis grew out of a concern that equating can sometimes introduce more error than it removes. In such a case, using the observed score Y directly, instead of the equated score Ye, would be better. To address this issue, the IE was treated as a regular equating method and its index values were calculated for all conditions. Those results are included in Tables A5 through A8 in the Appendix. The IE results were also used in the ANOVA procedure comparing equating methods presented in Table 4.1.

As previously mentioned, the IE performed poorly in terms of the EP, E, and E1 indices. Therefore, in all conditions used in this study, if equating results were evaluated using the EP, E, or E1 index, the IE should not be recommended; doing equating was clearly better than not equating at all. Even using the FE method, the worst of the methods examined in this study, would have led to better results than using the IE method.

However, there is an exception. As discussed previously, the IE surprisingly produced comparable values for E2; in fact, the IE method produced the best results in terms of the E2 index. A closer look at the cell means, presented in Tables A5 through A8, reveals that the IE outperformed the other methods when Form Y had smaller a-parameters than Form X. Therefore, if the purpose of equating is to satisfy second-order equity, which is associated with E2, then not equating would be preferred.

4.10. Summary

Table 4.5 presents some of the main results. When groups were similar, all four methods performed similarly. When groups became distinct, results produced by different methods differed to various degrees. The overall performance of the four investigated methods is ranked for each index: the best method is ranked number 1, the next number 2, and so on. Based on the results presented above, the OS method was ranked best on three indices (EP, E, and E2), while the TS method outperformed the others in terms of the index E1. The FE method was ranked last due to its higher index values. The CE method, although ranked third on all indices, performed fairly well, coming close to the two IRT methods.

Table 4.5. Summary of major results

                                      Factor Effects (**)
       Overall Performance (*)   Form              Group             Form * Group
Index  FE  CE  TS  OS            FE  CE  TS  OS    FE  CE  TS  OS    FE  CE  TS  OS
EP     4   3   2   1             N   Y   Y   Y     Y   Y   N   N     N   Y   N   N
E      4   3   2   1             Y   Y   Y   Y     Y   Y   N   N     Y   Y   N   N
E1     4   3   1   2             Y   Y   Y   Y     Y   Y   N   N     Y   Y   N   N
E2     4   3   2   1             Y   Y   Y   Y     Y   Y   N   N     Y   N   N   N

Note. (*) Numbers represent rankings, where 1 is the best and 4 is the worst. (**) Y: yes (significant); N: no (non-significant).

Table 4.5 also summarizes the results on the effects of form difference and group difference, as well as their interaction effects. Under the Factor Effects heading, a Y (yes) indicates a significant effect and an N (no) indicates a non-significant effect. Form difference was found to have effects on equating results under all indices except for EP in the case of the FE method.
In general, equating similar forms tended to produce the best results. Effects of group difference were found for the OSE methods (FE and CE) but not for the IRT methods (TS and OS). Small interaction effects of form and group difference were found for the FE method on the equity indices (E, E1, and E2).

This chapter presented the results obtained from this study. Those results are discussed in more detail in the next chapter, where practical implications of the obtained results are also addressed. In addition, the next chapter presents some perceived limitations of the study, along with related issues and recommendations for further studies.

CHAPTER 5
SUMMARY AND DISCUSSION

In this final chapter, the overall structure of the study is briefly reviewed, followed by a summary of the major findings. A discussion of the results is provided next. The subsequent sections are reserved for recommendations, limitations of the study, and ideas for further research.

5.1. Brief overview of the study

Testing programs often use multiple test forms of a single test due to test security and item exposure concerns. Despite efforts to make them parallel, these forms are usually not parallel, and their scores cannot be used interchangeably before being adjusted to be comparable. Equating is a statistical process for making scores from different test forms of the same test comparable. However, this definition does not explicitly state what it means for scores to be comparable. Therefore, an operational definition of equating is needed, and equating results must be evaluated by criteria directly linked to the adopted operational definition. This study used criteria derived from two common operational definitions of equating to evaluate results from several equating methods in the NEAT design.

The two operational definitions of equating used in this study are called the equipercentile and the equity definitions in the literature. The equipercentile definition requires that the distributions of scores on the two test forms being equated be the same after equating. The equity definition requires that, conditioning on ability θ, the distributions of scores on the two forms be the same after equating. Four evaluation indices based on the two definitions were used: (1) the equipercentile index EP, (2) the full equity index E, (3) the first-order equity index E1, and (4) the second-order equity index E2.
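In symbols, writing e(y) for the function equating Form Y scores to the Form X scale, the two definitions, and the first- and second-order weakenings of equity that E1 and E2 target, can be stated as follows; the exact indices built on these properties are the ones defined in Chapter 3.

```latex
% Equipercentile definition: e(Y) and X have the same distribution in T
F_{e(Y)}(x) = F_X(x) \quad \text{for all } x
% Full equity: identical conditional distributions given ability
f\!\left(e(Y) \mid \theta\right) = f\!\left(X \mid \theta\right) \quad \text{for all } \theta
% First-order equity: identical conditional means
E\!\left[e(Y) \mid \theta\right] = E\!\left[X \mid \theta\right] \quad \text{for all } \theta
% Second-order equity: identical conditional standard deviations
SD\!\left[e(Y) \mid \theta\right] = SD\!\left[X \mid \theta\right] \quad \text{for all } \theta
```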
Four commonly used equating methods were evaluated: (1) the presmoothed frequency estimation equipercentile equating (FE), (2) the chain equipercentile equating (CE), (3) the IRT true score equating (TS), and (4) the IRT observed score equating (OS). In addition, identity equating (i.e., no equating) (IE) was also employed to determine whether there is any situation in which not equating at all is better than equating. The IRT 3PL model was used to simulate the data and to compute the evaluation indices.

Two factors were varied in this study. The first factor was the difference between the test forms, which was manipulated by changing the a- and b-parameters of the new form. The second factor was the difference in ability between the two groups taking the two test forms; this factor was manipulated by changing the population mean of the ability θ for the group that took the new form. A summary and discussion of the major results, centering on the research questions presented in Chapter 1, are provided next.

5.2. Summary of major findings

Detailed results were presented in the previous chapter. Some major results are summarized here.

5.2.1. Overall performance

• When groups were similar in the ability measured by the test, the four methods produced similar results, as evaluated by the values of all four indices. When group difference increased, the results produced by different methods diverged, especially in terms of the EP, E, and E1 indices. The difference in terms of the E2 index was not large even when groups became dissimilar.
• In general, the OS method outperformed the others with regard to the EP and E indices across all study conditions. The TS method produced the smallest values of the E1 index in almost all conditions. However, the difference between these two IRT methods was small. Surprisingly, the IE method produced the best results in terms of the E2 index, although the results from the OS method were close.
• Between the two OSE methods, the CE method produced much better results, and they were close to those from the two IRT methods.
• The FE method produced the worst results. Its values for the EP, E, and E1 indices were far higher than those of the other three methods.

5.2.2. Effects of form difference

• Form difference had clear effects on all methods under all indices. When test forms were dissimilar (the new form was either easier or more difficult than the old form), the equating results became worse.

5.2.3. Effects of group difference

• For the two IRT methods, group difference did not have clear effects. When the groups became dissimilar, the index values produced by those methods did not change.
• The two OSE methods (FE and CE) were clearly affected by group difference. Larger group differences led to larger index values. The impact of group difference was much stronger for the FE method than for the CE method.

5.2.4. Interaction effects of form difference and group difference

• No group-form interaction effects were found for the IRT methods.
• Although significant group-form interaction effects were found for the FE and CE methods, they had small magnitudes.

5.2.5. To equate or not to equate?

• The identity equating (IE) method produced very large values of the EP, E, and E1 indices compared to those of the four investigated methods.
• For the E2 index, the IE method produced values either equal to or better than those of the other methods.

5.3. Discussion of the results

In this section, the obtained results are discussed in more detail. Again, the discussion is organized around the research questions.

5.3.1. Overall performance

The finding that all four equating methods produced similar results when groups were similar was not unexpected. Research has already shown that different equating methods tend to lead to comparable results when the groups taking the forms come from the same or similar populations (Kolen & Brennan, 2004; Sinharay & Holland, 2007; Wang et al., 2008). When groups are distinct, group difference may produce confounding effects with form difference, making equating, which is supposed to adjust for form difference, more complicated. Different methods behave differently in these situations. Each method makes, either implicitly or explicitly, some assumptions that may be violated to various degrees when groups are different, leading to different results. It is reasonable that the two IRT methods were found to perform well compared to the two OSE methods.
The same IRT model was used for data simulation, for IRT equating, and for producing the population distributions of scores that were used to compute the index values. This gave the two IRT methods an advantage.

It was also expected that the TS method would outperform the others with regard to the E1 index. The TS method is based on matching the true scores on the two forms that share the same θ. The true score is in fact the expected score at a given θ, and matching two expected values at the same θ is exactly the first-order equity property evaluated by the index E1. In other words, the purpose of TS equating coincides with the property associated with the E1 index. That explains why the TS method was found to be the best method for satisfying first-order equity.

The OS method is equipercentile equating applied to two distributions that were produced in the same way as the population distributions of scores on the two forms used to compute the index EP. Therefore, it is understandable that the OS method was found to be the best with regard to the EP index. Perhaps for the same reason, the OS method performed well with regard to the full equity index E, which was calculated based on the same model used by the OS method.

Between the two OSE methods, research has found that the CE method tends to produce better results than the FE method, especially when groups differ (Wang et al., 2008). This result was confirmed again in this study.

5.3.2. Effects of form difference

Equating is supposed to adjust for unintended form difference. The degree to which this adjustment can be made depends on how large the difference is. The form difference cannot grow arbitrarily large with equating still able to adjust for it; there must be some point at which equating can no longer adjust for the form difference, simply because the difference is too large. Research has found that although equating is used to adjust for form difference, it works best when the forms are similar, and larger form differences tend to result in larger equating errors (Kolen & Brennan, 2004; von Davier et al., 2004b). This was confirmed again in this study. In an equating study using evaluation indices associated with the equipercentile and equity definitions, Tong and Kolen (2005) also found that the evaluation index values increased when form difference increased.

5.3.3. Effects of group difference

In the IRT framework, the item parameters are assumed to be population invariant; that is, they are assumed to remain unchanged across different examinee populations. The TS method is conducted using only item parameters. Thus, its results are not affected by group difference.

The OS method uses the estimated θ distribution (from empirical data) of the target population and the assumed IRT model to produce two marginal score distributions, and then conducts a regular equipercentile equating on those distributions. The evaluation indices were calculated using the distributions of X and Y as presented in Section 3.7 of Chapter 3. Those distributions are also produced from the assumed IRT model and the theoretical θ distribution of the target population. The difference between the two θ distributions, one used in the OS method and one used in computing the indices, is that the former is estimated from empirical data and the latter is theoretically hypothesized. The latter was also used to simulate the data in this study. Therefore, it is reasonable to assume that those two θ distributions are close.
This, together with the fact that the same IRT model was used in the OS method and in the computation of the indices, may be one reason why group differences had no clear effect on the OS results. More research is needed to shed more light on this issue.

For the two OSE methods, the results were affected by group differences, but to different degrees: the CE method was affected less than the FE method. Although the CE method consists of two equating steps, from Y to A and from A to X, it does not make any strong assumptions. The FE method, on the other hand, makes a strong assumption about the equality of the conditional distributions in the two involved populations (Section 2.3.2 in Chapter 2). When group difference is substantial, this assumption may not hold. This can be illustrated as follows. Let f_G(\theta) be the distribution of θ and f_{XA|\theta.G}(x, a \mid \theta) be the joint conditional distribution of X and A in a population G. The distribution of X conditional on A in G is

f_{X|A.G}(x \mid a) = \frac{\int f_{XA|\theta.G}(x, a \mid \theta) \, dF_G(\theta)}{\int f_{A|\theta.G}(a \mid \theta) \, dF_G(\theta)}    (5.1)

It is obvious from equation (5.1) that f_{X|A.G}(x \mid a) depends on f_G(\theta), which means that f_{X|A.G}(x \mid a) is not likely to be population invariant. In other words, it is likely that

f_{X|A.P}(x \mid a) \neq f_{X|A.Q}(x \mid a)    (5.2)
f_{Y|A.P}(y \mid a) \neq f_{Y|A.Q}(y \mid a)    (5.3)

if P and Q are different. Therefore, the assumptions made in the FE method (i.e., equations (2.5) and (2.6) in Chapter 2) may not hold. This may explain why the FE method performed poorly compared to the CE method when groups were different.
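A small numerical sketch of this argument, reusing the hypothetical lord_wingersky and p3pl helpers from the sketches in Chapter 2: it computes f_{X|A.G}(x | a) by quadrature over a θ grid, under the simplifying assumption that X and A are conditionally independent given θ (as for an external anchor). Evaluating the function under two different θ distributions shows directly that the conditional distribution shifts with the population, as (5.2) asserts.

```python
import numpy as np

def x_given_a(a_score, x_items, a_items, grid, w):
    """f(x | A = a) in a population whose theta distribution is
    approximated by weights w on grid; implements (5.1) by quadrature,
    assuming X and A are conditionally independent given theta."""
    num, den = 0.0, 0.0
    for t, wt in zip(grid, w):
        fa = lord_wingersky(t, *a_items)[a_score]          # f(a | theta)
        num = num + wt * fa * lord_wingersky(t, *x_items)  # f(x | theta)
        den += wt * fa
    return num / den

grid = np.linspace(-4.0, 4.0, 41)
wP = np.exp(-grid ** 2 / 2.0); wP /= wP.sum()              # P: theta ~ N(0, 1)
wQ = np.exp(-(grid - 1.0) ** 2 / 2.0); wQ /= wQ.sum()      # Q: theta ~ N(1, 1)
# For any fixed anchor score a0, x_given_a(a0, X_ITEMS, A_ITEMS, grid, wP)
# differs from the same call with wQ, illustrating (5.2); X_ITEMS and
# A_ITEMS are placeholder (a, b, c) parameter tuples.
```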
5.3.4. Interaction effects of form difference and group difference

In the NEAT design, there are two sources of difference that need to be adjusted. The equating process is supposed to adjust for form difference; group difference is supposed to be adjusted for by the set of common items (the anchor). The two sources of difference are confounded and may create interaction effects on equating results. In this study, such interaction effects were essentially not found. This can be explained by the quality of the anchor. The anchor was created in such a way that it was a mini-version of the two test forms in terms of statistical characteristics, with a reasonable length (i.e., one third of the total test). In other words, the anchor used in this study was fairly ideal. As a result, it performed well in adjusting for group differences. This may explain why the interaction effects between form difference and group difference were either small (in the case of the OSE methods) or not found (in the case of the IRT methods) in this study.

5.3.5. To equate or not to equate?

The IE method was used in this study to see whether there was any condition in which not equating would be a good solution. The IE produced large values for the indices EP, E, and E1. Therefore, if those indices are used to evaluate equating results, the IE is not recommended; in the conditions of this study, equating is always better than not equating at all.

When the index E2 is used to evaluate the equating results, the IE method may be a good choice, because it produced E2 values either equal to or better than those of the other methods. However, as recommended by Harris and Crouse (1993), second-order equity should not be used alone but in combination with first-order equity. If this recommendation is followed, the IE method is no longer preferred, because it produced very large values of the E1 index, which is associated with first-order equity. Combining E1 and E2 would render the IE method unacceptable.

5.3.6. Order effect of a-parameter difference

In several cases, it appears that the direction of the a-parameter difference between the two forms affects the equating results. In particular, in those cases the index values tend to be smaller when Form Y a-parameters are larger than those of Form X. To investigate this issue further, two special cases were selected and equating was conducted in both directions: from Y to X, and from X to Y. The results are presented in Figure B in the Appendix. For each case, two panels are presented, one for each equating direction, and they appear to be near mirror images of each other. From these results, it seems that equating a form with larger a-parameters (i.e., a more reliable form) to a form with smaller a-parameters (i.e., a less reliable one) results in smaller errors than conducting equating in the opposite direction. More research clearly needs to be conducted to shed light on this issue.

5.3.7. Unusually high index values for the CE method

As seen in Figures 4.9 and 4.11, there are some unusually high values of EP and E1 for the CE method. Those spikes are associated with µ > 0, a = 2, and b = -1.2. In other words, in those conditions the new form was much easier and more reliable than the old form but was taken by a more able group (group Q). That combination might have produced a very large difference between the score distributions of the two forms, which may explain the spiked values. More research is needed to further understand the reasons for those unusual values.

5.4. Recommendations

The results of this study have some practical implications for equating, especially in the NEAT design. Some recommendations on selecting appropriate equating methods in the NEAT design and on communicating equating results follow.

5.4.1. Recommendations on the selection of equating methods in the NEAT design

• Group difference should be assessed before selecting an equating method. The magnitude of the group difference can be determined by comparing the scores of the two groups on the common items (a small sketch of such a screen follows this list).
• If groups are similar, any of the FE, CE, TS, or OS methods can be used.
• When groups are different, the FE method is not recommended. The results obtained in this study show that even when the group difference is one fourth of a standard deviation, the index values for the FE method exceed one score point, which suggests that it should not be used. The IRT methods, especially the OS method, are highly recommended. If satisfying first-order equity is the priority, the TS method is the best choice. The CE method can also be used if the group difference is not too large. The use of an IRT method may require some strong assumptions, such as unidimensionality. Some researchers have argued that in practical situations tests are multidimensional (Reckase, 1985; Reckase & McKinley, 1991). However, the unidimensional model is believed to be somewhat robust to violations of the unidimensionality assumption (Reckase, 1979; Thissen, Wainer, & Thayer, 1994). Therefore, unless there are strong reasons to switch to multidimensional models, unidimensional IRT equating is suitable.
• When form difference is substantial, it is recommended that several methods be used and their results compared before the final decision is made.
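As a minimal sketch of the screening step in the first recommendation (illustrative names; the study itself prescribes only "compare the two groups on the common items"):

```python
import numpy as np

def anchor_effect_size(anchor_p, anchor_q):
    """Standardized mean difference between the two groups' number-correct
    scores on the common items: a quick screen for group nonequivalence
    before an equating method is chosen."""
    anchor_p, anchor_q = np.asarray(anchor_p), np.asarray(anchor_q)
    n_p, n_q = len(anchor_p), len(anchor_q)
    pooled_var = (((n_p - 1) * anchor_p.var(ddof=1)
                   + (n_q - 1) * anchor_q.var(ddof=1)) / (n_p + n_q - 2))
    return (anchor_p.mean() - anchor_q.mean()) / np.sqrt(pooled_var)
```

By the results above, a difference of even a quarter of a standard deviation on this scale is already enough to argue against the FE method.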
5.4.2. Recommendations on the communication of equating results

• Equating reports should explicitly state the operational definition of equating that was adopted and how it was determined whether the results were accurate relative to the selected definition.
• The Standards (AERA, APA, & NCME) should clarify how equating accuracy is to be evaluated and require that this information be reported to clients by the organizations that conduct equating.

5.5. Limitations

Any study has limitations, and this dissertation is no exception. Several perceived limitations are listed as follows:

• Only simulated data were used in this study. Although using simulated data allows factor manipulation, the applicability of the obtained results to real data remains somewhat unclear.
• Some important factors that have been found to have significant impacts on equating, especially in the NEAT design, were not studied. Among these are anchor characteristics, test length, sample size, IRT linking method, and presmoothing technique.
• The study depended heavily on an IRT model. The 3PL model was used to simulate the data and to produce the population distributions of scores on the two test forms for calculating the evaluation indices. This may have given the IRT methods an advantage. In addition, the selected IRT model was assumed to be true, so the conclusions are limited to situations in which the IRT model fits the data.
• Within each condition, the two test forms were fixed across replications, while in practice test forms change constantly. The main purpose of fixing the test forms in each condition was to eliminate random errors due to sampling items for test forms in each replication. To determine whether using randomly generated forms would have led to different results, additional simulations were conducted for four extreme conditions in which the form and group differences were largest. For each replication, Form X and Form Y were randomly created by sampling their item parameters from the corresponding distributions according to the condition specifications. The results from using random forms, along with those from using fixed forms, are presented in Table A9 in the Appendix. The notable differences in the results from the two approaches suggest that using random forms should be considered in future research.
• This study assumed that tests are unidimensional (i.e., measuring a single latent ability). In practice, tests tend to be multidimensional; for example, a mathematics test can measure both mathematical and verbal abilities. When tests are multidimensional, an equating framework that takes the multidimensionality into account should be used.

5.6. Directions for future research

Some directions for future research are as follows:

• Extend the current study to examine the effects of other important factors, such as anchor characteristics, test length, sample size, IRT linking method, presmoothing technique, and other equating methods such as linear equating, local equating (van der Linden, 2010), kernel equating (von Davier et al., 2004a), and modified FE (Wang & Brennan, 2009).
• Investigate the effects of equating direction, to see whether equating a more reliable form to a less reliable form results in smaller errors than the opposite direction.
• Investigate the possible relationship between the framework used in this study and the concept of population invariance in equating (Holland & Dorans, 2006).
• Extend the study to multidimensional tests (Reckase, 2009).
• Apply the current research framework to real data.

APPENDIX

Table A1. Repeated ANOVA results for index EP (a)

                                                              Adjusted Sig.
Source           SS         df     MS        F          Sig.(b)    G-G        H-F(c)
Between conditions
µ                707.45     4      176.86    439.72     < .0001
a                1012.00    2      506.00    1258.04    < .0001
b                6674.74    6      1112.46   2765.82    < .0001
µ*a              5.21       8      0.65      1.62       0.1141
µ*b              55.32      24     2.31      5.73       < .0001
a*b              793.24     12     66.10     164.35     < .0001
µ*a*b            46.74      48     0.97      2.42       < .0001
Error            2069.40    5145   0.40
Within conditions
method           89735.51   4      22433.88  58849.10   < .0001    < .0001    < .0001
method*µ         2967.17    16     185.45    486.47     < .0001    < .0001    < .0001
method*a         3395.39    8      424.42    1113.36    < .0001    < .0001    < .0001
method*b         23259.46   24     969.14    2542.28    < .0001    < .0001    < .0001
method*µ*a       66.70      32     2.08      5.47       < .0001    < .0001    < .0001
method*µ*b       203.52     96     2.12      5.56       < .0001    < .0001    < .0001
method*a*b       2452.11    48     51.09     134.01     < .0001    < .0001    < .0001
method*µ*a*b     118.52     192    0.62      1.62       < .0001    0.0004     0.0003
Error (method)   7845.31    20580  0.38
Total            141407.80  26249

Note. (a) The multivariate tests are significant at the same level. (b) Greenhouse-Geisser Epsilon = 0.4225. (c) Huynh-Feldt Epsilon = 0.4312.

Table A2. Repeated ANOVA results for index E (a)

                                                              Adjusted Sig.
Source           SS         df     MS        F          Sig.(b)    G-G        H-F(c)
Between conditions
µ                327.89     4      81.97     65.13      < .0001
a                950.82     2      475.41    377.73     < .0001
b                8079.97    6      1346.66   1069.97    < .0001
µ*a              10.59      8      1.32      1.05       0.3941
µ*b              50.05      24     2.09      1.66       0.0231
a*b              911.60     12     75.97     60.36      < .0001
µ*a*b            34.07      48     0.71      0.56       0.9935
Error            6475.51    5145   1.26
Within conditions
method           77813.63   4      19453.41  15631.10   < .0001    < .0001    < .0001
method*µ         2087.78    16     130.49    104.85     < .0001    < .0001    < .0001
method*a         3837.49    8      479.69    385.43     < .0001    < .0001    < .0001
method*b         21392.96   24     891.37    716.23     < .0001    < .0001    < .0001
method*µ*a       108.89     32     3.40      2.73       < .0001    0.0021     0.0020
method*µ*b       217.11     96     2.26      1.82       < .0001    0.0038     0.0035
method*a*b       2410.33    48     50.22     40.35      < .0001    < .0001    < .0001
method*µ*a*b     104.49     192    0.54      0.44       1.0000     1.0000     1.0000
Error (method)   25612.55   20580  1.24
Total            150425.73  26249

Note. (a) The multivariate tests are significant at the same level. (b) Greenhouse-Geisser Epsilon = 0.3195. (c) Huynh-Feldt Epsilon = 0.3260.

Table A3. Repeated ANOVA results for index E1 (a)

                                                              Adjusted Sig.
Source           SS         df     MS        F          Sig.(b)    G-G        H-F(c)
Between conditions
µ                715.01     4      178.75    417.35     < .0001
a                1142.13    2      571.07    1333.32    < .0001
b                6922.09    6      1153.68   2693.60    < .0001
µ*a              8.08       8      1.01      2.36       0.0157
µ*b              52.14      24     2.17      5.07       < .0001
a*b              824.59     12     68.72     160.44     < .0001
µ*a*b            61.92      48     1.29      3.01       < .0001
Error            2203.63    5145   0.43
Within conditions
method           95408.05   4      23852.01  64200.70   < .0001    < .0001    < .0001
method*µ         3083.11    16     192.69    518.66     < .0001    < .0001    < .0001
method*a         3143.25    8      392.91    1057.56    < .0001    < .0001    < .0001
method*b         22349.48   24     931.23    2506.52    < .0001    < .0001    < .0001
method*µ*a       88.74      32     2.77      7.46       < .0001    < .0001    < .0001
method*µ*b       215.94     96     2.25      6.05       < .0001    < .0001    < .0001
method*a*b       2522.67    48     52.56     141.46     < .0001    < .0001    < .0001
method*µ*a*b     122.78     192    0.64      1.72       < .0001    < .0001    < .0001
Error (method)   7645.93    20580  0.37
Total            146509.53  26249

Note. (a) The multivariate tests are significant at the same level. (b) Greenhouse-Geisser Epsilon = 0.4033. (c) Huynh-Feldt Epsilon = 0.4116.
Table A4. Repeated ANOVA results for index E2 (a)

                                                              Adjusted Sig.
Source           SS         df     MS        F          Sig.(b)    G-G        H-F(c)
Between conditions
µ                22.98      4      5.75      3.99       0.0031
a                1393.21    2      696.61    483.39     < .0001
b                418.29     6      69.71     48.38      < .0001
µ*a              13.97      8      1.75      1.21       0.2872
µ*b              45.26      24     1.89      1.31       0.1432
a*b              182.99     12     15.25     10.58      < .0001
µ*a*b            8.27       48     0.17      0.12       1.0000
Error            7414.32    5145   1.44
Within conditions
method           379.63     4      94.91     280.15     < .0001    < .0001    < .0001
method*µ         179.06     16     11.19     33.03      < .0001    < .0001    < .0001
method*a         531.04     8      66.38     195.94     < .0001    < .0001    < .0001
method*b         244.42     24     10.18     30.06      < .0001    < .0001    < .0001
method*µ*a       24.27      32     0.76      2.24       < .0001    0.0032     0.0030
method*µ*b       84.73      96     0.88      2.61       < .0001    < .0001    < .0001
method*a*b       370.52     48     7.72      22.79      < .0001    < .0001    < .0001
method*µ*a*b     41.39      192    0.22      0.64       1.0000     0.9978     0.9980
Error (method)   6972.07    20580  0.34
Total            18326.42   26249

Note. (a) The multivariate tests are significant at the same level. (b) Greenhouse-Geisser Epsilon = 0.4964. (c) Huynh-Feldt Epsilon = 0.5066.

Table A5. Means of index EP for five equating methods in all conditions

             µ = 0                             µ = 0.25
a    b      FE    CE    TS    OS    IE        FE    CE    TS    OS    IE
0.5  -1.2   0.54  0.43  0.50  0.40  6.35      1.45  0.46  0.52  0.39  5.91
     -0.8   0.45  0.40  0.47  0.36  4.59      1.38  0.42  0.54  0.38  4.25
     -0.4   0.42  0.36  0.46  0.33  3.10      1.38  0.39  0.47  0.34  2.91
      0     0.44  0.40  0.44  0.36  2.34      1.35  0.41  0.49  0.37  2.34
      0.4   0.44  0.37  0.47  0.37  2.72      1.44  0.41  0.49  0.36  2.91
      0.8   0.46  0.40  0.51  0.38  4.19      1.36  0.42  0.48  0.37  4.50
      1.2   0.44  0.38  0.52  0.35  6.12      1.28  0.41  0.56  0.35  6.49
1    -1.2   0.48  0.42  0.52  0.38  9.08      1.44  0.46  0.51  0.38  8.77
     -0.8   0.45  0.39  0.42  0.34  6.54      1.42  0.41  0.42  0.35  6.36
     -0.4   0.48  0.35  0.34  0.31  3.63      1.42  0.38  0.35  0.31  3.56
      0     0.41  0.34  0.31  0.29  0.51      1.39  0.35  0.33  0.31  0.50
      0.4   0.34  0.27  0.33  0.27  2.77      1.46  0.30  0.34  0.29  2.75
      0.8   0.47  0.35  0.44  0.32  5.89      1.45  0.39  0.41  0.31  5.93
      1.2   0.47  0.37  0.57  0.34  8.73      1.50  0.38  0.55  0.35  8.86
2    -1.2   0.50  0.43  0.54  0.39  10.85     1.45  1.54  0.53  0.40  10.57
     -0.8   0.41  0.37  0.45  0.33  7.99      1.41  0.44  0.45  0.35  7.92
     -0.4   0.40  0.35  0.40  0.33  4.65      1.42  0.43  0.41  0.33  4.74
      0     0.37  0.28  0.34  0.25  2.15      1.41  0.36  0.34  0.25  2.17
      0.4   0.35  0.31  0.41  0.30  3.60      1.42  0.31  0.42  0.29  3.39
      0.8   0.42  0.33  0.48  0.31  7.24      1.45  0.37  0.48  0.30  7.07
      1.2   0.45  0.39  0.65  0.37  10.51     1.41  0.41  0.61  0.36  10.53

Note. Boldface represents the equating method that produces the smallest value of EP in a specific condition.

Table A5. (cont'd)

             µ = 0.50                          µ = 0.75
a    b      FE    CE    TS    OS    IE        FE    CE    TS    OS    IE
0.5  -1.2   1.83  0.48  0.47  0.40  5.47      2.33  0.59  0.53  0.39  5.05
     -0.8   1.87  0.52  0.49  0.36  3.93      2.38  0.56  0.43  0.36  3.64
     -0.4   1.89  0.44  0.44  0.34  2.75      2.38  0.55  0.44  0.34  2.64
      0     1.89  0.49  0.47  0.37  2.38      2.35  0.55  0.46  0.39  2.45
      0.4   1.90  0.45  0.49  0.39  3.11      2.32  0.58  0.50  0.38  3.30
      0.8   1.83  0.47  0.50  0.38  4.79      2.39  0.54  0.50  0.38  5.03
      1.2   1.90  0.44  0.52  0.35  6.81      2.37  0.51  0.50  0.33  7.07
1    -1.2   1.89  0.57  0.53  0.41  8.39      2.37  0.64  0.49  0.41  7.98
     -0.8   1.94  0.51  0.42  0.36  6.12      2.44  0.67  0.42  0.36  5.85
     -0.4   1.81  0.45  0.33  0.29  3.45      2.39  0.66  0.34  0.30  3.31
      0     1.91  0.44  0.37  0.34  0.48      2.43  0.69  0.34  0.32  0.46
      0.4   1.87  0.41  0.35  0.30  2.72      2.39  0.56  0.31  0.26  2.66
      0.8   1.92  0.48  0.42  0.33  5.90      2.42  0.62  0.47  0.33  5.81
      1.2   1.87  0.47  0.58  0.35  8.91      2.40  0.57  0.54  0.34  8.85
2    -1.2   1.91  1.41  0.62  0.45  10.19     2.44  1.31  0.58  0.44  9.73
     -0.8   1.99  0.95  0.46  0.36  7.75      2.43  0.90  0.45  0.38  7.49
     -0.4   1.90  0.55  0.42  0.33  4.75      2.37  0.74  0.44  0.35  4.70
      0     1.84  0.49  0.39  0.31  2.19      2.42  0.69  0.38  0.28  2.21
      0.4   1.91  0.48  0.37  0.27  3.16      2.40  0.64  0.40  0.27  2.95
      0.8   1.92  0.50  0.53  0.30  6.82      2.37  0.64  0.39  0.29  6.49
      1.2   1.90  0.51  0.63  0.36  10.40     2.33  0.63  0.58  0.33  10.13

Note.
Boldface represents the equating method that produces the smallest value of EP in a specific condition.

Table A5. (cont'd)

             µ = 1.00
a    b      FE    CE    TS    OS    IE
0.5  -1.2   2.71  0.67  0.46  0.40  4.67
     -0.8   2.72  0.66  0.45  0.36  3.40
     -0.4   2.68  0.59  0.45  0.36  2.57
      0     2.75  0.68  0.45  0.37  2.53
      0.4   2.72  0.66  0.48  0.37  3.47
      0.8   2.72  0.64  0.52  0.38  5.21
      1.2   2.69  0.80  0.57  0.37  7.25
1    -1.2   2.69  0.82  0.50  0.41  7.55
     -0.8   2.74  0.80  0.40  0.37  5.55
     -0.4   2.75  0.84  0.34  0.31  3.15
      0     2.76  0.75  0.35  0.32  0.43
      0.4   2.78  0.75  0.39  0.31  2.58
      0.8   2.76  0.74  0.49  0.33  5.67
      1.2   2.76  0.59  0.54  0.33  8.70
2    -1.2   2.82  1.24  0.65  0.51  9.23
     -0.8   2.83  0.87  0.52  0.43  7.17
     -0.4   2.85  0.79  0.50  0.41  4.58
      0     2.79  0.94  0.38  0.29  2.22
      0.4   2.81  0.87  0.59  0.35  2.75
      0.8   2.75  0.82  0.46  0.31  6.11
      1.2   2.76  0.80  0.61  0.33  9.74

Note. Boldface represents the equating method that produces the smallest value of EP in a specific condition.

Table A6. Means of index E for five equating methods in all conditions

             µ = 0                             µ = 0.25
a    b      FE    CE    TS    OS    IE        FE    CE    TS    OS    IE
0.5  -1.2   1.27  1.10  1.14  1.00  6.38      1.46  1.12  1.16  1.00  5.96
     -0.8   1.15  1.03  1.07  0.93  4.67      1.39  1.05  1.12  0.95  4.34
     -0.4   1.11  0.99  1.05  0.89  3.24      1.39  1.01  1.06  0.90  3.05
      0     1.10  1.00  1.01  0.90  2.51      1.35  1.00  1.05  0.90  2.52
      0.4   1.14  1.01  1.06  0.92  2.87      1.45  1.02  1.07  0.92  3.05
      0.8   1.24  1.11  1.17  1.01  4.27      1.37  1.11  1.14  0.99  4.57
      1.2   1.38  1.23  1.30  1.11  6.14      1.31  1.22  1.30  1.09  6.51
1    -1.2   0.84  0.75  0.80  0.67  9.08      1.44  0.77  0.80  0.68  8.77
     -0.8   0.69  0.60  0.61  0.53  6.54      1.42  0.60  0.61  0.54  6.36
     -0.4   0.60  0.46  0.45  0.41  3.63      1.42  0.48  0.46  0.41  3.56
      0     0.43  0.36  0.32  0.30  0.53      1.39  0.36  0.34  0.32  0.52
      0.4   0.37  0.30  0.35  0.30  2.77      1.46  0.33  0.36  0.31  2.75
      0.8   0.67  0.54  0.61  0.49  5.89      1.45  0.55  0.57  0.47  5.93
      1.2   0.93  0.81  0.96  0.75  8.73      1.51  0.79  0.90  0.72  8.86
2    -1.2   1.02  0.92  0.97  0.83  10.85     1.46  1.73  0.95  0.82  10.57
     -0.8   0.87  0.78  0.81  0.70  8.01      1.42  0.80  0.80  0.69  7.94
     -0.4   0.78  0.70  0.73  0.64  4.72      1.43  0.72  0.71  0.62  4.80
      0     0.75  0.65  0.69  0.58  2.27      1.41  0.67  0.67  0.56  2.28
      0.4   0.76  0.68  0.75  0.63  3.65      1.43  0.67  0.75  0.61  3.45
      0.8   0.95  0.84  0.95  0.77  7.24      1.48  0.85  0.93  0.74  7.08
      1.2   1.33  1.20  1.36  1.09  10.51     1.49  1.16  1.28  1.04  10.53

Note. Boldface represents the equating method that produces the smallest value of E in a specific condition.

Table A6. (cont'd)

             µ = 0.50                          µ = 0.75
a    b      FE    CE    TS    OS    IE        FE    CE    TS    OS    IE
0.5  -1.2   1.83  1.14  1.14  1.01  5.53      2.33  1.20  1.16  1.00  5.12
     -0.8   1.87  1.10  1.09  0.94  4.03      2.38  1.10  1.05  0.93  3.75
     -0.4   1.89  1.03  1.03  0.90  2.90      2.38  1.06  1.03  0.89  2.79
      0     1.89  1.02  1.02  0.89  2.55      2.35  1.04  1.01  0.89  2.61
      0.4   1.90  1.03  1.05  0.92  3.24      2.32  1.06  1.03  0.89  3.42
      0.8   1.83  1.11  1.12  0.97  4.85      2.39  1.11  1.10  0.95  5.08
      1.2   1.90  1.21  1.24  1.06  6.82      2.37  1.20  1.20  1.02  7.08
1    -1.2   1.89  0.82  0.81  0.70  8.39      2.37  0.87  0.79  0.69  7.98
     -0.8   1.94  0.67  0.61  0.54  6.12      2.44  0.80  0.61  0.55  5.85
     -0.4   1.81  0.53  0.43  0.39  3.45      2.39  0.71  0.44  0.39  3.31
      0     1.91  0.45  0.38  0.34  0.49      2.43  0.70  0.35  0.32  0.47
      0.4   1.87  0.43  0.37  0.32  2.72      2.39  0.58  0.33  0.28  2.66
      0.8   1.92  0.62  0.57  0.47  5.90      2.42  0.74  0.59  0.46  5.81
      1.2   1.87  0.83  0.89  0.69  8.91      2.40  0.87  0.84  0.65  8.85
2    -1.2   1.91  1.61  1.00  0.85  10.19     2.44  1.52  0.99  0.84  9.73
     -0.8   1.99  1.14  0.80  0.69  7.76      2.43  1.10  0.79  0.69  7.50
     -0.4   1.90  0.78  0.69  0.59  4.81      2.37  0.88  0.70  0.59  4.75
      0     1.84  0.73  0.69  0.59  2.30      2.42  0.83  0.67  0.56  2.31
      0.4   1.91  0.76  0.71  0.59  3.23      2.40  0.85  0.73  0.58  3.02
      0.8   1.92  0.92  0.93  0.72  6.83      2.37  0.99  0.83  0.70  6.50
      1.2   1.91  1.19  1.26  1.00  10.40     2.33  1.22  1.19  0.96  10.13

Note.
Boldface represents the equating method that produces the smallest value of E in a specific condition.

Table A6. (cont'd)

             µ = 1.00
a    b      FE    CE    TS    OS    IE
0.5  -1.2   2.71  1.21  1.10  0.99  4.75
     -0.8   2.72  1.14  1.04  0.91  3.52
     -0.4   2.68  1.06  1.01  0.87  2.71
      0     2.75  1.08  0.98  0.86  2.68
      0.4   2.72  1.09  0.99  0.86  3.57
      0.8   2.72  1.13  1.08  0.91  5.26
      1.2   2.69  1.25  1.20  0.99  7.26
1    -1.2   2.69  1.00  0.80  0.70  7.55
     -0.8   2.74  0.90  0.59  0.54  5.55
     -0.4   2.75  0.87  0.44  0.40  3.15
      0     2.76  0.75  0.36  0.33  0.44
      0.4   2.78  0.76  0.41  0.33  2.58
      0.8   2.76  0.83  0.60  0.45  5.67
      1.2   2.76  0.86  0.83  0.63  8.70
2    -1.2   2.82  1.47  1.04  0.89  9.23
     -0.8   2.83  1.07  0.83  0.72  7.18
     -0.4   2.85  0.93  0.74  0.62  4.63
      0     2.79  1.00  0.66  0.55  2.30
      0.4   2.81  1.00  0.85  0.62  2.83
      0.8   2.75  1.09  0.87  0.71  6.13
      1.2   2.76  1.31  1.18  0.93  9.74

Note. Boldface represents the equating method that produces the smallest value of E in a specific condition.

Table A7. Means of index E1 for five equating methods in all conditions

             µ = 0                             µ = 0.25
a    b      FE    CE    TS    OS    IE        FE    CE    TS    OS    IE
0.5  -1.2   0.63  0.47  0.31  0.51  6.38      1.45  0.46  0.20  0.42  5.96
     -0.8   0.60  0.43  0.27  0.46  4.67      1.38  0.36  0.16  0.34  4.34
     -0.4   0.52  0.41  0.19  0.37  3.23      1.38  0.43  0.16  0.36  3.05
      0     0.44  0.40  0.22  0.40  2.51      1.35  0.41  0.13  0.31  2.52
      0.4   0.50  0.38  0.23  0.41  2.87      1.44  0.41  0.15  0.35  3.05
      0.8   0.58  0.44  0.17  0.40  4.27      1.36  0.45  0.17  0.39  4.57
      1.2   0.60  0.49  0.16  0.45  6.14      1.28  0.48  0.11  0.38  6.51
1    -1.2   0.43  0.33  0.15  0.25  9.08      1.44  0.29  0.23  0.29  8.77
     -0.8   0.27  0.21  0.20  0.24  6.54      1.42  0.28  0.19  0.29  6.36
     -0.4   0.38  0.19  0.18  0.23  3.63      1.42  0.25  0.17  0.20  3.56
      0     0.18  0.13  0.07  0.08  0.53      1.39  0.22  0.16  0.16  0.52
      0.4   0.10  0.10  0.08  0.07  2.77      1.46  0.22  0.07  0.08  2.75
      0.8   0.27  0.16  0.16  0.17  5.89      1.45  0.33  0.12  0.16  5.93
      1.2   0.43  0.40  0.15  0.29  8.73      1.50  0.43  0.14  0.26  8.86
2    -1.2   0.37  0.35  0.32  0.40  10.85     1.45  1.65  0.28  0.43  10.57
     -0.8   0.32  0.31  0.22  0.28  8.01      1.41  0.24  0.28  0.33  7.94
     -0.4   0.19  0.18  0.20  0.27  4.72      1.42  0.28  0.21  0.27  4.80
      0     0.32  0.24  0.18  0.24  2.27      1.41  0.29  0.12  0.18  2.28
      0.4   0.26  0.23  0.29  0.36  3.65      1.43  0.26  0.24  0.30  3.44
      0.8   0.38  0.42  0.32  0.49  7.24      1.47  0.43  0.29  0.43  7.08
      1.2   0.71  0.69  0.43  0.73  10.51     1.45  0.68  0.45  0.70  10.53

Note. Boldface represents the equating method that produces the smallest value of E1 in a specific condition.

Table A7. (cont'd)

             µ = 0.50                          µ = 0.75
a    b      FE    CE    TS    OS    IE        FE    CE    TS    OS    IE
0.5  -1.2   1.83  0.52  0.28  0.51  5.53      2.33  0.48  0.11  0.35  5.12
     -0.8   1.87  0.49  0.17  0.38  4.03      2.38  0.51  0.29  0.49  3.75
     -0.4   1.89  0.42  0.22  0.41  2.90      2.38  0.52  0.17  0.36  2.78
      0     1.89  0.45  0.16  0.34  2.55      2.35  0.51  0.19  0.35  2.61
      0.4   1.90  0.47  0.25  0.42  3.24      2.32  0.58  0.10  0.26  3.42
      0.8   1.83  0.46  0.14  0.35  4.85      2.39  0.55  0.22  0.39  5.08
      1.2   1.90  0.57  0.14  0.41  6.82      2.37  0.63  0.17  0.36  7.08
1    -1.2   1.89  0.42  0.32  0.38  8.39      2.37  0.53  0.17  0.30  7.98
     -0.8   1.94  0.42  0.21  0.25  6.12      2.44  0.62  0.17  0.28  5.85
     -0.4   1.81  0.38  0.09  0.17  3.45      2.39  0.62  0.10  0.17  3.31
      0     1.91  0.35  0.17  0.17  0.49      2.43  0.66  0.15  0.16  0.47
      0.4   1.87  0.35  0.13  0.13  2.72      2.39  0.53  0.08  0.06  2.66
      0.8   1.92  0.48  0.05  0.12  5.90      2.42  0.63  0.14  0.09  5.81
      1.2   1.87  0.55  0.26  0.26  8.91      2.40  0.65  0.18  0.23  8.85
2    -1.2   1.91  1.52  0.43  0.49  10.19     2.44  1.39  0.34  0.46  9.73
     -0.8   1.99  0.99  0.29  0.34  7.76      2.43  0.89  0.23  0.34  7.50
     -0.4   1.90  0.47  0.16  0.20  4.81      2.37  0.68  0.24  0.27  4.75
      0     1.84  0.44  0.26  0.30  2.30      2.42  0.66  0.17  0.18  2.31
      0.4   1.91  0.48  0.17  0.26  3.23      2.40  0.66  0.19  0.19  3.02
      0.8   1.92  0.60  0.23  0.33  6.83      2.37  0.73  0.24  0.40  6.50
      1.2   1.90  0.77  0.38  0.61  10.40     2.33  0.88  0.41  0.61  10.13

Note.
Boldface represents the equating method that produces the smallest value of E1 in a specific condition.

Table A7. (cont'd)

             µ = 1.00
a    b      FE    CE    TS    OS    IE
0.5  -1.2   2.71  0.61  0.25  0.46  4.75
     -0.8   2.72  0.62  0.21  0.39  3.51
     -0.4   2.68  0.57  0.15  0.32  2.71
      0     2.75  0.67  0.19  0.33  2.68
      0.4   2.72  0.65  0.16  0.29  3.57
      0.8   2.72  0.59  0.18  0.27  5.26
      1.2   2.69  0.78  0.17  0.29  7.26
1    -1.2   2.69  0.77  0.22  0.33  7.55
     -0.8   2.74  0.77  0.17  0.26  5.55
     -0.4   2.75  0.82  0.10  0.18  3.15
      0     2.76  0.73  0.19  0.18  0.44
      0.4   2.78  0.74  0.15  0.12  2.58
      0.8   2.76  0.75  0.20  0.11  5.67
      1.2   2.76  0.65  0.23  0.20  8.70
2    -1.2   2.82  1.31  0.47  0.53  9.23
     -0.8   2.83  0.85  0.37  0.44  7.18
     -0.4   2.85  0.72  0.31  0.33  4.63
      0     2.79  0.93  0.24  0.23  2.30
      0.4   2.81  0.89  0.33  0.18  2.83
      0.8   2.75  0.91  0.35  0.43  6.13
      1.2   2.76  1.04  0.30  0.48  9.74

Note. Boldface represents the equating method that produces the smallest value of E1 in a specific condition.

Table A8. Means of index E2 for five equating methods in all conditions

             µ = 0                             µ = 0.25
a    b      FE    CE    TS    OS    IE        FE    CE    TS    OS    IE
0.5  -1.2   1.15  1.19  1.28  1.04  0.30      1.27  1.21  1.33  1.10  0.29
     -0.8   1.07  1.11  1.20  0.99  0.24      1.24  1.16  1.27  1.05  0.24
     -0.4   1.05  1.07  1.18  0.98  0.35      1.15  1.09  1.19  1.00  0.36
      0     1.07  1.07  1.14  0.96  0.49      1.10  1.09  1.18  1.00  0.52
      0.4   1.10  1.12  1.19  1.01  0.58      1.15  1.12  1.21  1.02  0.62
      0.8   1.19  1.22  1.34  1.12  0.61      1.19  1.20  1.30  1.09  0.67
      1.2   1.34  1.36  1.51  1.24  0.59      1.32  1.35  1.51  1.24  0.67
1    -1.2   0.59  0.61  0.78  0.59  0.95      0.85  0.65  0.77  0.59  0.98
     -0.8   0.47  0.47  0.56  0.42  0.63      0.68  0.47  0.52  0.39  0.65
     -0.4   0.25  0.29  0.32  0.24  0.32      0.54  0.30  0.33  0.25  0.34
      0     0.08  0.11  0.13  0.11  0.05      0.37  0.14  0.09  0.07  0.05
      0.4   0.13  0.12  0.20  0.14  0.22      0.18  0.11  0.18  0.13  0.24
      0.8   0.45  0.42  0.55  0.39  0.43      0.24  0.37  0.49  0.36  0.46
      1.2   0.78  0.77  1.03  0.73  0.60      0.52  0.72  0.96  0.70  0.63
2    -1.2   0.84  0.84  0.98  0.74  1.57      0.58  0.48  0.96  0.72  1.62
     -0.8   0.70  0.70  0.78  0.62  1.20      0.92  0.71  0.75  0.58  1.27
     -0.4   0.63  0.62  0.66  0.56  0.87      0.84  0.60  0.62  0.52  0.92
      0     0.62  0.61  0.67  0.55  0.60      0.65  0.60  0.66  0.54  0.63
      0.4   0.65  0.65  0.75  0.53  0.41      0.53  0.62  0.77  0.56  0.41
      0.8   0.80  0.78  1.01  0.67  0.43      0.53  0.74  0.99  0.68  0.41
      1.2   1.16  1.13  1.51  1.02  0.66      0.77  1.05  1.40  0.96  0.65

Note. Boldface represents the equating method that produces the smallest value of E2 in a specific condition.

Table A8. (cont'd)

             µ = 0.50                          µ = 0.75
a    b      FE    CE    TS    OS    IE        FE    CE    TS    OS    IE
0.5  -1.2   1.46  1.22  1.29  1.07  0.29      1.63  1.25  1.34  1.12  0.29
     -0.8   1.31  1.16  1.24  1.04  0.25      1.56  1.15  1.18  0.99  0.26
     -0.4   1.27  1.10  1.16  0.98  0.38      1.48  1.11  1.15  0.98  0.41
      0     1.22  1.09  1.15  0.98  0.55      1.32  1.05  1.12  0.96  0.59
      0.4   1.19  1.11  1.17  0.99  0.67      1.33  1.08  1.16  0.99  0.73
      0.8   1.20  1.18  1.28  1.08  0.74      1.27  1.14  1.22  1.02  0.82
      1.2   1.31  1.29  1.43  1.18  0.75      1.35  1.22  1.37  1.14  0.85
1    -1.2   1.17  0.68  0.75  0.57  1.01      1.42  0.69  0.77  0.60  1.03
     -0.8   1.01  0.49  0.53  0.41  0.68      1.28  0.52  0.52  0.40  0.69
     -0.4   0.80  0.30  0.33  0.25  0.35      1.06  0.32  0.32  0.25  0.36
      0     0.62  0.13  0.07  0.05  0.05      0.92  0.16  0.07  0.05  0.05
      0.4   0.44  0.12  0.19  0.13  0.25      0.69  0.08  0.17  0.12  0.27
      0.8   0.26  0.35  0.49  0.35  0.49      0.56  0.30  0.51  0.37  0.52
      1.2   0.40  0.65  0.93  0.67  0.67      0.41  0.62  0.86  0.62  0.72
2    -1.2   0.75  0.49  0.96  0.73  1.66      0.99  0.50  0.98  0.74  1.67
     -0.8   0.79  0.56  0.74  0.57  1.31      0.99  0.55  0.72  0.54  1.34
     -0.4   1.07  0.59  0.61  0.49  0.97      1.30  0.54  0.57  0.46  1.00
      0     0.84  0.58  0.62  0.52  0.65      1.07  0.53  0.61  0.49  0.68
      0.4   0.64  0.61  0.73  0.55  0.40      0.82  0.60  0.73  0.56  0.41
      0.8   0.38  0.71  1.01  0.70  0.38      0.51  0.71  0.88  0.64  0.36
      1.2   0.43  0.96  1.40  0.95  0.63      0.21  0.93  1.30  0.91  0.62

Note.
Boldface represents the equating method that produces the smallest value of E2 in a specific condition.

Table A8. (cont'd)

             µ = 1.00
a    b      FE    CE    TS    OS    IE
0.5  -1.2   1.86  1.23  1.26  1.06  0.29
     -0.8   1.76  1.13  1.17  0.99  0.28
     -0.4   1.67  1.06  1.12  0.95  0.45
      0     1.58  1.04  1.08  0.93  0.64
      0.4   1.51  1.04  1.09  0.94  0.79
      0.8   1.41  1.12  1.19  1.01  0.90
      1.2   1.43  0.43  1.33  1.10  0.95
1    -1.2   1.71  0.71  0.77  0.60  1.03
     -0.8   1.51  0.50  0.52  0.41  0.70
     -0.4   1.36  0.33  0.31  0.24  0.37
      0     1.16  0.14  0.06  0.04  0.04
      0.4   0.93  0.11  0.23  0.16  0.29
      0.8   0.77  0.29  0.51  0.36  0.56
      1.2   0.62  0.61  0.84  0.60  0.77
2    -1.2   1.32  0.56  0.99  0.75  1.67
     -0.8   1.27  0.57  0.71  0.53  1.36
     -0.4   1.22  0.61  0.54  0.43  1.03
      0     1.34  0.50  0.57  0.45  0.70
      0.4   1.06  0.57  0.81  0.60  0.42
      0.8   0.77  0.68  0.92  0.67  0.35
      1.2   0.38  0.85  1.32  0.94  0.62

Note. Boldface represents the equating method that produces the smallest value of E2 in a specific condition.

Table A9. Comparison of results obtained from using fixed and random test forms

Condition (µ = 1.00; a = 0.5, b = -1.2)
         Fixed forms                  Random forms
Index    FE    CE    TS    OS         FE    CE    TS    OS
EP       2.71  0.67  0.46  0.40       2.56  0.76  0.54  0.46
E        2.71  1.21  1.10  0.99       2.62  1.09  1.02  1.10
E1       2.71  0.61  0.25  0.46       2.89  0.73  0.31  0.43
E2       1.86  1.23  1.26  1.06       1.76  1.44  1.50  1.23

Condition (µ = 1.00; a = 0.5, b = 1.2)
         Fixed forms                  Random forms
Index    FE    CE    TS    OS         FE    CE    TS    OS
EP       2.69  0.80  0.57  0.37       2.90  0.85  0.65  0.32
E        2.69  1.25  1.20  0.99       2.92  1.35  1.12  1.04
E1       2.69  0.78  0.17  0.29       2.54  0.84  0.21  0.33
E2       1.43  0.43  1.33  1.10       1.35  0.52  1.23  1.22

Condition (µ = 1.00; a = 2, b = -1.2)
         Fixed forms                  Random forms
Index    FE    CE    TS    OS         FE    CE    TS    OS
EP       2.82  1.24  0.65  0.51       2.92  1.34  0.68  0.49
E        2.82  1.47  1.04  0.89       2.76  1.42  1.09  0.94
E1       2.82  1.31  0.47  0.53       2.74  1.47  0.54  0.60
E2       1.32  0.56  0.99  0.75       1.25  0.61  0.92  0.79

Condition (µ = 1.00; a = 2, b = 1.2)
         Fixed forms                  Random forms
Index    FE    CE    TS    OS         FE    CE    TS    OS
EP       2.76  0.80  0.61  0.33       2.79  0.78  0.64  0.37
E        2.76  1.31  1.18  0.93       2.84  1.40  1.13  0.91
E1       2.76  1.04  0.30  0.48       3.01  1.12  0.34  0.56
E2       0.38  0.85  1.32  0.94       0.41  0.79  1.45  0.89

[Figure B: Comparing equating results from two directions in two selected cases. Top panels: CE method, index E2, condition µ = 0 (X equated to Y; Y equated to X). Bottom panels: TS method, index E, condition µ = 1 (Y equated to X; X equated to Y).]

REFERENCES

Albano, A. (2010). Equate: Statistical methods for test score equating (R package version 1.1-1).

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Educational Research Association.

Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12(4), 383-407.

Braun, H. I., & Holland, P. W. (1982). Observed-score test equating: A mathematical analysis of some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 9-49). New York: Academic Press.

Brennan, R. L. (2010). Assumptions about true-scores and populations in equating. Measurement: Interdisciplinary Research and Perspectives, 8(1), 1-3.