This is to certify that the dissertation entitled

A COMPARISON BETWEEN THE VERTICAL SCALING OF TESTS SENSITIVE TO MULTIPLE DIMENSIONS USING COMMON-ITEM AND COMMON-GROUP DESIGNS

presented by Jing Yu has been accepted towards fulfillment of the requirements for the PhD degree in Measurement and Quantitative Methods.

Major Professor's Signature / Date

MSU is an affirmative-action, equal-opportunity employer.

A COMPARISON BETWEEN THE VERTICAL SCALING OF TESTS SENSITIVE TO MULTIPLE DIMENSIONS USING COMMON-ITEM AND COMMON-GROUP DESIGNS

By

Jing Yu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

DEPARTMENT OF COUNSELING, EDUCATIONAL PSYCHOLOGY AND SPECIAL EDUCATION

2007

Abstract

A COMPARISON BETWEEN THE VERTICAL SCALING OF TESTS SENSITIVE TO MULTIPLE DIMENSIONS USING COMMON-ITEM AND COMMON-GROUP DESIGNS

by Jing Yu

Three methods of item response theory (IRT) linking, namely common-item, common-group, and a combination of common-item and common-group (referred to as common-common) linking designs, were compared using real testing data from an English as a second language (ESL) examination program. The methods are considered "vertical scaling" rather than "equating" for two reasons. First, the test was designed to examine three different traits of English ability, and multidimensional IRT and factor analyses of the testing data confirm that the test is multidimensional. Second, the two test forms are not at the same difficulty level: the averaged difficulty parameters differed by about 0.5, 1.0, or 1.5 standard units, so the linking is considered vertical. The effects of test length and of the difference in averaged difficulty level were also analyzed. For practical reasons, the anchor test used in the common-item linking design could not represent all the dimensions of the test forms. The original data contained dichotomous responses from about 30,000 individuals on 130 items. For the evaluation of each linking design, a sub-sample of cases and responses was selected. The linking designs were evaluated by calculating the standard error of equating and by comparing the examinees' scores and item parameters before versus after equating. Results of the analyses indicate that common-group and common-common linking designs can serve as adequate alternatives to the well-recognized common-item design. Longer test forms work better for item parameter estimation and have smaller standard errors of equating.
When the ability of the group does not match the difficulty level of the assigned form, the common-item design has a slightly smaller standard error of equating than the common-group and common-common designs.

Copyright by Jing Yu 2006

ACKNOWLEDGMENTS

First, I want to give all the glory and honor to my Lord, Jesus Christ. He hears prayers, and has led me through each step of my PhD study. My special appreciation goes to my academic advisor, Dr. Mark D. Reckase; I enjoyed being his advisee over the past four years. He is one of the most knowledgeable and respected experts in our field today, and not only his knowledge and skill in psychometrics but also his humility, efficiency, and patience have had a great impact on me. Dr. Amy D. Yamashiro has been a wonderful supervisor in my part-time job for about three years. Through her support, I gained access to the data for this dissertation study. I appreciate her being an excellent mentor who led me into the field of language testing. I appreciate Dr. Kimberly S. Maier and Dr. Yeow Meng Thum for making time in their busy schedules to attend the proposal and defense meetings for my dissertation. Their valuable comments on the design and analyses of the dissertation study are highly appreciated. Finally and most importantly, I want to thank my dear husband, Jin Wang, for his care and support over the years. I also thank my parents for building up my love for the truth and my analytical way of thinking, characteristics that are important in my academic life.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1. INTRODUCTION
I. EQUATING OR SCALING
II. PURPOSES AND RESEARCH QUESTIONS
CHAPTER 2. LITERATURE REVIEW
I. IRT AND IRT VERTICAL SCALING DESIGNS
A. The Strength and Limitations of IRT in Test Equating
B. IRT in Vertical Scaling
II. ISSUES OF MULTIDIMENSIONALITY IN IRT EQUATING
III. ISSUES IN VERTICAL SCALING
A. Issues of structure or dimension shift
B. Issues of DIF
C. Difficulty difference between the forms
D. The effect of test length
IV. EVALUATING THE ERRORS OF EQUATING
A. Comparing the parameters' true values and those obtained through equating
B. Standard error of equating
CHAPTER 3. METHODS
I. DATA DESCRIPTION
A. The Items' Unidimensional and Multidimensional IRT Parameters
The Items ’ Unidimensional and Multidimensional IRT Parameters ................................ 22 B. Factor analysis ............................................................................................................. 23 C. Goodness of fit ............................................................................................................. 23 II. EQUATING DESIGNS ............................................................................................ 24 A. Common-Group Equating ............................................................................................. 25 B. Common-Item Equating ............ . ................................................................................... 26 C. Common-group and common-item combined equating ................................................. 27 D. Data Analysis and Evaluation of Diflerent Designs .............. . ...................................... 29 III. STANDARD ERROR OF EQUATING .................................................................................. 30 CHAPTER 4. RESULTS - - - - - ........... - - 32 I. PARAMETERS, DIMENSIONS AND MODEL FIT ANALYSIS ...................................................... 32 II. ITEM SELECTION FOR EACH DESIGN ................................................................................ 40 vi III. REGRESSIONS BETWEEN THE TWO SETS OF ITEM PARAMETERS ................................ 46 IV. REGRESSION BETWEEN THE “REAL SCORES” AND EQUATED SCORES ......................... 67 IV. STANDARD ERRORS OF EQUATING ......................................................................... 79 CHAPTER 5. CONCLUSIONS AND DISCUSSIONS ....... . - - 86 I. DIMENSIONALITY AND MODEL FIT ......................... _ .............................................. 8 6 A. Unidimensional or Multidimensional Structure ................................................... 86 B. IRT Goodness of fit ............................................................................................ 87 ll. CORRELATION BETWEEN “REAL” AND EQUATED ITEM PARAMETERS ...................... 89 A. Scale indeterminacy in equating ........................................................................ 90 B. Errors in parameter estimate of IRT equating .................................................... 92 C. Evaluating errors in parameter estimation ........................................................ 93 III. IRT ABILITY ESTIMATE .......................................................................................... 94 A. Scatter plots of IRT score estimate ..................................................................... 94 B. Square Root of the Average squared dtflerence .................................................. 95 C. Standard Error of Equating ............................................................................... 97 IV. THE EFFECTS OF EQUATING DESIGN, TEST LENGTH AND DIFFICULTY DIFFERENCE ..... 98 A. The eflects of equating designs .......................................................................... 98 B. The efiécts of test length .................................................................................... 99 C. The efi'ects of form difliculty diflkrence .............................................................. 99 D. Future directions ............................................................................................. 100 APPENDIX 1. MATLAB CODE (1) - - - - - 103 APPENDIX 2. MATLAB CODE (2) -- 106 REFERENCES ...... - -- - 110 vii List of Tables Table 2.1. 
Table 2.1. References on Difficulty Difference between Forms
Table 3.1. Number of Unique Items
Table 4.1. IRT Parameter Estimates for All Items
Table 4.2. MIRT Parameter Estimates for All Items
Table 4.3. Dichotomous Factor Analysis of All Items with Oblique Rotation
Table 4.4. Percentage of the Highest Loading Items on One Dimension/Factor
Table 4.5. Percentage of Items Fitting the Model
Table 4.6. Difference between the Averaged Item Difficulty
Table 4.7. Item/Common-Item Numbers for Each Design
Table 4.8. Linking Item Fit for the Equating Design of 120 Item Tests (df = 20)
Table 4.9. Linking Item Fit for the Equating Design of 96 Item Tests (df = 20)
Table 4.10. Linking Item Fit for the Equating Design of 72 Item Tests (df = 20)
Table 4.11. Correlations Between the Two Sets of Parameters
Table 4.12. ANOVA Significance of the Correlation Coefficients
Table 4.13. Slope of the Regression Function
Table 4.14. Intercept of the Regression Function
Table 4.15. Correlation Coefficients between "Real Score" vs. Equated Score
Table 4.16. Adjusted Averaged Squared Difference
Table 4.17. Averaged Standard Error between Scores -1.0 and 1.0

List of Figures

Figure 3.1. Three designs when total items = 120
Figure 3.2. Three designs when total items = 96
Figure 3.3. Three designs when total items = 72
Figure 4.1. Item parameters for test length 120, difficulty difference 0.5
Figure 4.2. Item parameters for test length 120, difficulty difference 1.0
Figure 4.3. Item parameters for test length 120, difficulty difference 1.5
Figure 4.4. Item parameters for test length 96, difficulty difference 0.5
Figure 4.5. Item parameters for test length 96, difficulty difference 1.0
Figure 4.6. Item parameters for test length 96, difficulty difference 1.5
Figure 4.7. Item parameters for test length 72, difficulty difference 0.5
Figure 4.8. Item parameters for test length 72, difficulty difference 1.0
Figure 4.9. Item parameters for test length 72, difficulty difference 1.5
Figure 4.10. "Real" vs. mean of the equated scores, test length = 120
Figure 4.11. "Real" vs. mean of the equated scores, test length = 96 items
Figure 4.12. "Real" vs. mean of the equated scores, test length = 72 items
Figure 4.13. Standard error, 120 items, difficulty difference = 0.5
Figure 4.14. Standard error, 120 items, difficulty difference = 1.0
Figure 4.15. Standard error, 120 items, difficulty difference = 1.5
Figure 4.16. Standard error, 96 items, difficulty difference = 0.5
Figure 4.17. Standard error, 96 items, difficulty difference = 1.0
Figure 4.18. Standard error, 96 items, difficulty difference = 1.5
Figure 4.19. Standard error, 72 items, difficulty difference = 0.5
Figure 4.20. Standard error, 72 items, difficulty difference = 1.0
Figure 4.21. Standard error, 72 items, difficulty difference = 1.5

Chapter 1. Introduction

I. Equating or Scaling

IRT (item response theory) models gain their flexibility by making strong statistical assumptions, which likely do not hold precisely in real testing situations. For this reason, studying the robustness of the models to violations of the assumptions, as well as studying the fit of the IRT model, is a crucial aspect of IRT applications. (Kolen and Brennan, 2004, p. 156)

Equating the scores of two test forms has been in practice for at least half a century. In the early stages of the development of score equating, the similarity of forms in terms of structure or reliability was emphasized. Such requirements are reflected in the early works in educational measurement of Lord (1980), Angoff (1971), and Lord and Novick (1968). Kolen and Brennan (2004) summarized five desirable properties of equating relationships between the forms or between the equated scores: symmetry of the equating transformation; the same specifications for the two test forms; the equity property, which holds when examinees with a given true score have the same distribution of converted scores on Form X as they would on Form Y; the observed score equating property in observed score equating; and the group invariance property, which means that the same equating relationship is found using different groups of examinees. If all requirements are met, the forms are strictly parallel and need not be equated. What has never been clear is the degree to which these requirements must be fulfilled before equating can be performed. In recent years, as the demand for measuring achievement has increased, psychometricians have been challenged to put scores from test forms of differing content, structure, or difficulty level on the same score scale. To differentiate traditional equating methods from the more recently developed ones that link two forms that obviously differ in structure or difficulty, three names have been applied to the statistical process. Equating refers to the traditional approach in which the five properties are largely met. The process is called vertical scaling when two forms differ in difficulty level while the test structures are believed to be similar; vertical scaling is most often used to assess growth, such as growth in math achievement between grades. Linking usually refers to the statistical process for putting scores from tests that differ in both difficulty and content on the same scale.
A typical example is identifying the equivalent ACT score for an SAT I (V+M) composite score (Kolen and Brennan, 2004). In this dissertation, the two forms studied follow the same test specifications but are at different difficulty levels, so the linking method should be categorized as vertical scaling. For convenience, however, the process is sometimes simply called equating in this dissertation.

As has often happened in the history of science, practical issues challenge theories and technologies to improve. Such is the case for test equating. When IRT was first developed, it required that all items in a test measure the same trait or ability, or the same combination of traits or abilities. In practice, however, it is well acknowledged that multiple skills are required to determine the correct answers to many test items. Multidimensional item response theory (MIRT) was developed to quantitatively analyze test structure. It has also been applied to equate tests that do not meet the unidimensionality assumption required by other procedures. However, MIRT models are more sophisticated and difficult to apply in practice. IRT models are robust and can tolerate multidimensionality to a certain degree, but it is necessary to consider in what situations, and to what degree, the simplicity of unidimensional IRT will reasonably model the data (Goldstein and Wood, 1989; Wang, 1985).

In recent years, especially after the implementation of NCLB (No Child Left Behind), accurate and accountable methods have been required to measure growth in achievement. Measurement instruments that differ in specifications need to be put on the same scale to enable accurate growth evaluation. Recently developed IRT software, such as BILOG-MG and ICL, has integrated vertical equating features so that two forms of different difficulty levels can be put on the same scale. These developments make it possible to link forms that differ from each other in test structure, examinee population, test difficulty, and so on. However, the validity of such scoring methods, and their accountability for measuring student achievement and growth, still remain to be evaluated.

II. Purposes and Research Questions

The testing data studied in this dissertation exhibit the challenges of IRT equating mentioned above: the data have a multidimensional structure, and difficulty differences exist between the forms, so the equating is vertical by nature. Less-studied equating designs are applied to explore their feasibility and to find optimum solutions for today's testing practice. Among the equating designs compared in this dissertation, common-item equating is the one most often seen in the equating literature; in today's well-recognized equating textbook by Kolen and Brennan (2004), the common-group and common-common equating designs are not even mentioned. Common-item equating links the two test forms through items that appear in both forms. The seldom-used methods, common-group equating and combined common-group/common-item equating, will undergo a detailed examination in this dissertation, and the comparison between the common-item and common-group equating designs in vertical scaling will be discussed. Common-group equating here refers to an equating design with three groups of examinees who take different test forms: Group 1 takes test form 1, Group 2 takes test form 2, and Group 3 takes both forms (form 1 and form 2 share no common items).
The data collected from all three groups are used to concurrently calibrate the test item parameters and the examinees' ability scores with the unidimensional IRT three-parameter logistic (3-PL) model using marginal maximum likelihood (MML) estimation. Another relatively obscure equating method, combined common-group/common-item equating, is also applied in this study. This equating design combines the common-group and common-item designs; a detailed description of this method is found in Chapter 3. All equating was done as multiple-group concurrent estimation of item parameters with the unidimensional 3-PL IRT model, using marginal maximum likelihood (MML) estimation. The research questions, to be answered by comparing the results of the three equating designs, are:

1. What is the difference in item parameter calibration or ability score calculation between the three equating designs?
2. Which design is more advantageous at different test lengths: 36, 48 and 60 items?
3. Which design is more advantageous when the average difficulty difference between the exams is 0.5, 1.0, or 1.5 units?

Standard errors of equating are calculated to evaluate the quality of the equating designs, and the item parameters and examinees' ability scores obtained through equating are compared with their real values (the "real values" will be defined in Chapter 3). IRT model fit and practical issues of equating design are also discussed.

Chapter 2. Literature Review

I. IRT and IRT Vertical Scaling Designs

A. The Strength and Limitations of IRT in Test Equating

Before IRT models were widely used in testing practice, several equating designs based on true score theory were developed and applied. Kolen and Brennan (2004) thoroughly describe these methods: equivalent-group equating, non-equivalent-group equating, linear equating, equipercentile equating, and others. Compared to equating methods based on classical test theory, IRT models have the advantage that they model examinee responses at the item level instead of the total score level. IRT models are now widely used in almost all aspects of psychometrics, such as item banking, scoring, differential item functioning (DIF) analysis, and adaptive testing. Increasingly powerful computer software for IRT models has been developed for the expanding range of IRT applications. Due to the simplicity of the IRT models and the availability of software, progressively more equating or scaling is now performed with IRT.

Despite its strengths, IRT makes strong statistical assumptions, which are hard to meet precisely in real testing situations. The two major assumptions, which are related, are local independence and unidimensionality. Local independence means that the answer to one question is not related in any way to the answers to other questions. Lord (1980) stated it as Lazarsfeld's assumption of local independence, described as: "if we know the examinee's ability, any knowledge of his success or failure on other items will add nothing to the determination (of θ), if it did add something, then performance on the items in question would depend in part on some trait other than θ ..." Lord (1980) further described this assumption in a mathematical statement "that the probability of success on all items is equal to the product of the separate probabilities of success" (p. 19). This assumption cannot hold when testlets (as in a reading test where items are grouped according to reading passages) are included in an exam.
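Lord's verbal statement has a standard algebraic form. Writing P_i(θ) for the probability of a correct response to item i and u_i ∈ {0, 1} for the scored response (this notation is assumed here, not quoted from the dissertation), local independence says that

P(U_1 = u_1, \dots, U_n = u_n \mid \theta) = \prod_{i=1}^{n} P_i(\theta)^{u_i}\,[1 - P_i(\theta)]^{1 - u_i},

so that, for an all-correct response pattern, the joint probability reduces to the product of the separate probabilities of success, as in Lord's statement.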
By unidimensionality is meant that all the test items measure the same type of knowledge or ability, or the same combination of knowledge and abilities; put in the context of Lord (1980), a single θ is measured by the test. When multidimensionality exists, more complicated IRT models are needed to accurately express the mathematical relationship between θ and the item response pattern. Reckase (1997, p. 271) stated that "The number of skill dimensions needed to model the item scores from a sample of individuals for a set of test tasks is dependent upon both the number of skill dimensions and level on those dimensions exhibited by the examinees, and the number of cognitive dimensions to which the test tasks are sensitive." Unidimensionality almost certainly does not hold for data from most achievement tests. Fortunately, IRT models are robust to a certain degree against assumption violations, which means that, although sometimes the model does not fit perfectly, the estimation based on it is still accurate enough for making educational decisions. A combination of several elements determines the degree to which such violations are tolerable, and the degree of tolerance also depends on the research design. For example, in common-item nonequivalent-groups equating using the IRT model, issues affecting the quality of equating may include the reliability of each test form, the quality of the test items, the selection of anchor items, and so on. The quality of test equating can be seriously undermined by a combination of inadequate equating design and unsatisfied assumptions (Jodoin, 2003; Goldstein and Wood, 1989; Klein and Jarjoura, 1985; Beguin et al., 2000; Sykes et al., 2002).

A number of IRT models have been developed so that different models can be applied according to the features of the data and the needs of the analysis. In large-scale test equating designs, unidimensional IRT models are most often used because the practice is simpler and more economical than multidimensional IRT, even though the unidimensionality assumption is sometimes not satisfied. In this study, in order to test the model's tolerance of multidimensionality, unidimensional IRT equating is used even though the evidence indicates that the data are multidimensional. The model applied here is the three-parameter logistic (3-PL) model, presented in equation (2.1), which is widely used in large-scale multiple-choice testing. The symbols in the equation are defined as follows: P(X_ij = 1) is the probability that person j with ability level θ_j answers item i correctly; θ is examinee ability; b is item difficulty; a is item discrimination; and c is the lower asymptote or guessing parameter.

P(X_{ij} = 1 \mid \theta_j, b_i, a_i, c_i) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]}    (2.1)

B. IRT in Vertical Scaling

In theory, item parameters calibrated with IRT models are independent of the examinees' ability level. The a_i, b_i and c_i in equation (2.1) are invariant parameters (Lord, 1980, p. 34). The difficulty parameter b_i is on the same scale as the examinees' ability levels, and the distribution of examinees' ability is usually set as a standardized normal distribution. Another feature of item parameter calibration with the IRT model is called indeterminacy, which means that "the choice of origin for the ability scale is purely arbitrary" (Lord, 1980, p. 36). Thus the IRT parameters calibrated from the two forms require adjustment in order to be put on the same scale, because these parameters were calibrated on different examinee groups.
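As a minimal computational sketch of equation (2.1), the MATLAB function below (the function name and the example values are my own, not taken from the dissertation or its MATLAB appendices) evaluates the 3-PL response probability; the example in the help comment also illustrates the indeterminacy just mentioned, since adding the same constant to θ and b leaves the probability unchanged.

function p = irt3pl(theta, a, b, c)
%IRT3PL  Probability of a correct response under the 3-PL model of equation (2.1):
%        p = c + (1 - c) * exp(a.*(theta - b)) ./ (1 + exp(a.*(theta - b))).
%
%   Example (scale indeterminacy: shifting theta and b by the same constant
%   leaves the probability unchanged):
%       p1 = irt3pl(0.3,       1.2, -0.5,       0.2);
%       p2 = irt3pl(0.3 + 1.5, 1.2, -0.5 + 1.5, 0.2);   % p2 equals p1
    z = a .* (theta - b);              % logit of the item response function
    p = c + (1 - c) ./ (1 + exp(-z));
end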
The quality of IRT equating is decided not only by the quality of the test items and by whether the data collected from the examinees fit the IRT model, but also by the appropriateness of the equating design. The best equating results can be obtained when the two forms satisfy the requirements of equating mentioned in Chapter One. However, this study focuses on vertical scaling: equating between two forms that differ in difficulty level and, more often than not, also in test domains. The process therefore requires tolerance of both the lack of fit of the IRT model (multidimensionality) and the unsatisfied requirements of equating. According to Kolen and Brennan (2004), vertical scaling refers to the "process used for associating performance on each test level to a single score scale, and the resulting scale is a developmental score scale." Because tests of different levels, and quite inevitably different constructs, are involved in vertical scaling, issues such as the domains measured, the definition of growth, multidimensionality, and others need to be considered. Vertical scaling is much more sophisticated than equating, and it involves more design decisions. A further challenge is that, in testing practice, large-scale assessment often requires the scaling procedure to be simple and to involve as little computation as possible. Most of the studies that approach the issues in vertical scaling use a common-item equating design. Considering the challenges of vertical scaling, this study proposes two designs that are seldom mentioned in the psychometrics literature. These designs may serve as better alternatives to the well-recognized common-item design.

a. Common-item vertical scaling

As indicated by its name, in common-item equating the two forms have some items in common. According to the item-invariance feature of the IRT model (Lord, 1980), the common items are supposed to function identically even when the examinees are different. Thus, based on the parameters of the common items, the two forms are linked. As for how many common items are enough for adequate linking accuracy, no theoretical conclusion can be drawn from solid research. Conventionally, the linking items should make up at least 20% of the total items in a form (Kolen and Brennan, 2004). In this study, three levels of test length were applied to investigate the effect of test length. When the total number of unique items in the two forms was 120, 20 items were used as linking items; when the total number of items was 96, 16 items were used for linking; and when the total was 72, 12 items were used for linking. Beyond this percentage requirement, the linking items are supposed to have high discrimination values and stable functioning across different samples. What is more, the linking items should represent all the domains of the test forms. Due to practical considerations such as test security, linking items sometimes cannot satisfy these requirements.

Compared to concurrent IRT equating, two-step IRT equating is more often seen in the literature, especially in studies from the early stage of IRT equating. In two-step equating, the first step is to calibrate the item parameters and examinee abilities of the two forms separately.
Then, based on the two sets of parameters calibrated for the linking items from the two test forms, a linear or non-linear function is developed so that the two sets of parameters can be transformed to be equivalent to each other. The parameters of all the other items are then transformed to the same scale by the same mathematical function. Because the IRT ability estimates are on the same scale as the item difficulty parameters, the examinees' ability estimates for the two groups can also be transformed to the same scale by this mathematical function. Before BILOG-MG was available, BILOG was often used for IRT calibration and equating. BILOG can be used for concurrent calibration of the item parameters when the two groups taking the two forms are randomly equivalent, but it is not strictly appropriate to use BILOG to concurrently estimate groups with different latent ability distributions. With BILOG-MG, common-item non-equivalent-groups vertical scaling using an IRT model becomes very convenient. BILOG-MG accomplishes multiple-group, common-item IRT equating concurrently for all the groups, with all item parameters calibrated concurrently. Research indicates that concurrent BILOG-MG equating using the marginal maximum likelihood (MML) estimator is comparable or even superior to two-step equating methods (Hanson and Beguin, 1999). However, DeMars (2002) shows that if group ability level is not taken into consideration, item parameter estimation with MML is biased.

b. Common-group vertical scaling

The common-group equating design studied in this dissertation refers to the following: two test forms (with no items in common) are given to three groups of people; Group 1 takes Form 1 only, Group 2 takes Form 2 only, and Group 3 takes the items on both forms. This method is different from the single-group design described in Kolen and Brennan (2004), in which one group of examinees answers both test forms. In practice, however, it is often expensive to have a sufficient number of examinees take both tests. The common-group design does not require Group 1 and Group 2 to be equivalent, and Group 3 can be different from the other two groups. In this study, data are analyzed using BILOG-MG, and equating is performed concurrently using MML estimation. Compared to common-item equating, studies on common-group equating are rare (Hambleton & Swaminathan, 1985, p. 205; Hambleton et al., 1991, p. 128; Noguchi, 1986; Noguchi, 1990; Toyoda, 1986; Ogasawara, 2001). This kind of equating links two test forms based on the same group of examinees taking both forms. Regarding the number or percentage of examinees that should be included in common-group equating, no theory is available for reference. In this study, the common-group equating is designed to be compared with the common-item equating, so the strength of linking should be comparable between the designs. For example, in common-item equating, when the total number of items is 120, the number of common items is 20 and the number of examinees is 5,000 (2,500 for each group), a total of 20 * 5,000 = 100,000 cells link the two forms together. To have the same number of cells linking the two forms in common-group equating, the number of common examinees should be 100,000/120, or about 830. This principle was applied in the common-group equating of this study.
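For concreteness, the following is a minimal MATLAB sketch of the mean/sigma version of the two-step transformation described above, assuming that linking-item estimates from two separate calibrations are already available. The variable names and numerical values are invented for illustration; the dissertation's own analyses instead use concurrent MML calibration in BILOG-MG.

% Mean/sigma linking of two separately calibrated forms via their common items.
% bX, aX: linking-item difficulty and discrimination estimates from the Form X run;
% bY:     difficulty estimates of the same items from the Form Y run.
bX = [-0.8 -0.2  0.1  0.6  1.3];    % made-up values for illustration
aX = [ 1.1  0.9  1.4  1.0  0.8];
bY = [-0.3  0.4  0.6  1.2  1.8];

A = std(bY) / std(bX);              % slope of the linear transformation
B = mean(bY) - A * mean(bX);        % intercept

bX_onY = A * bX + B;                % Form X difficulties placed on the Form Y scale
aX_onY = aX / A;                    % discriminations rescaled; c-parameters unchanged
% Ability estimates from the Form X calibration transform the same way:
% thetaX_onY = A * thetaX + B.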
When linking items are unavailable, or when the statistical assumptions or requirements of the common-item design cannot be fulfilled, common-group equating can be considered. Although issues such as fatigue exist in this design, common-group equating still serves as a possible alternative to the common-item equating design. In vertical scaling, where the assumptions of IRT common-item equating are not fulfilled, common-group equating can possibly be a better choice. Harris (1991) compared a spiraling design and a single-group design in vertical scaling, and found that across different examinee populations the single-group design exhibits more stability. The result of that study may or may not apply to the common-group design here, for in the single-group design the two forms were equated by one group of people who were administered both forms. The common-group design described here has seldom been studied.

c. Common-item/common-group combined equating

This type of equating design combines the characteristics of the two equating methods introduced above: the two test forms share some common items, and there is also a group of examinees that takes all the items from both test forms. However, the number of common items is only half as many as in the common-item design, and the number of common examinees is also only half as many as in the common-group design. This equating design has been used in large-scale testing practice but is not documented in publications (Y. M. Thum, personal communication, Nov. 18, 2005). This method is studied here because it may serve as an alternative when the number of common items and common examinees cannot satisfy the requirements of the common-item or common-group methods. Because this method combines the features of common-item and common-group equating, it contributes to the theory of equating design, especially to the comparison between common-item and common-group equating.

II. Issues of Multidimensionality in IRT Equating

As stated above, appropriate use of IRT models requires the tenability of the assumptions of unidimensionality and local independence. In test equating, each of the individual test forms should satisfy these assumptions, and the testing data of each form should adequately fit the IRT model. In testing practice, however, these assumptions can be very stringent and impractical. In the past, a number of studies have been published on the effect of multidimensionality on IRT equating. Jodoin (2003) used simulated data to investigate the impact of violating unidimensionality for individual test forms and of inconsistency between the dimensional structures of the reference and focal forms. His conclusion was that low levels of dimensional inconsistency between the forms are reasonably well tolerated, but multidimensionality in either test form is not. Jodoin (2003) used IRT ability scores; he did not discuss the effect when anchor items do not represent all the domains of the form(s). Multiple studies have used IRT true-score equating functions to analyze the effect of test multidimensionality (Bogan & Yen, 1983; Bolt, 1999; Camilli, Wang & Fesq, 1992; Cook & Douglass, 1982; Cook, Dorans, Eignor, & Petersen, 1985; Dorans & Kingston, 1985; Kolen & Whitney, 1982; Snieckus & Camilli, 1993; Stocking & Eignor, 1986; Wang, 1985; Yen, 1984). Their results disagree with those of Jodoin (2003).
The majority of these studies concluded that although multidimensionality of the latent ability space did affect the quality of IRT true-score equating, the impact often appeared to be minimal and of little practical significance, especially when correlations among the dimensions are high. Goldstein and Wood (1989) stated that the impact of multidimensionality on the quality of IRT equating is likely to be negligible as long as the same linear composite of latent traits, or reference composite (Wang, 1985), underlies the item responses on both tests. However, these studies did not clearly state whether the effect of the linking-item design was considered. Although no explicit "requirements" are stated for linking-item design, accepted practice calls for the set of common items to be proportionally representative of the total test forms in content and statistical characteristics (Kolen & Brennan, 2004). Previous research indicates that in linear equating, inadequate common-item content representation can impact test scores when the examinee groups taking the alternate forms differ considerably in achievement level (Klein and Jarjoura, 1985). A later study by Beguin et al. (2000) using simulated data noted a large effect of multidimensionality on IRT equating for nonequivalent groups. According to Sykes et al.'s (2002) research on a mixed-format math examination, equating with anchors containing items that loaded more heavily on the first or the second dimension resulted in different standard errors.

The testing program used in this research is designed to measure three different types of English language ability: grammar, vocabulary and reading. Previous studies suggest that the data exhibit a multidimensional pattern (Yamashiro and Yu, 2005a; Yamashiro and Yu, 2005b). Further, the reading items are in the form of testlets, with a set of items focusing on the same reading passage, whereas the grammar and vocabulary items are individual items. Due to security considerations, the anchor test for the ECPE grammar/vocabulary/reading section contains only grammar and vocabulary items, no reading items. Based on the results from previous research, such anchor-test designs are subject to systematic error (Sykes et al., 2002). By comparing the equating results based on the common-item, common-group and common-common designs, the study will estimate how much the anchor test's lack of representativeness may affect the common-item equating, and whether common-group equating can circumvent the problem.

In this study, exploratory MIRT and exploratory factor analysis with oblique rotation on three factors/dimensions were applied to investigate the test's dimensionality. Tate (2003) comprehensively summarized and compared empirical methods of assessing the structure of tests with dichotomous items. About ten methods from the exploratory and confirmatory families were included in that study, and the methods were also categorized as parametric vs. non-parametric based on conditional item covariances. The results of that study indicated that, for the most part, all methods performed reasonably well over a relatively wide range of conditions; exceptions occurred only when the test data departed appreciably from the assumptions or a method had an inherent limitation. Compared with nonparametric methods, parametric modeling provides a parsimonious description of data structure. Factor analytic and MIRT methods were listed as parametric methods in Tate (2003).
The MIRT method used to assess test dimensionality is the Normal-Ogive Harmonic Analysis Robust Method (NOHARM) developed by McDonald (2000) and programmed by Fraser and McDonald (1988).

III. Issues in Vertical Scaling

When different achievement tests are administered to different grades to assess growth in achievement, vertical scaling becomes inevitable. In vertical scaling, two test forms composed of items with different difficulty levels are taken by two groups of examinees differing in ability. The results of vertical scaling can be unstable for multiple reasons, such as the equating design, test dimensionality, test characteristics, and DIF across groups. These issues are discussed in the following sections, and possible solutions are also introduced.

A. Issues of structure or dimension shift

When two test forms are composed of items from the same battery but differ in difficulty level, there is a tendency for easier items to reflect different constructs than more difficult items, even though they are designed to test the same constructs. A substantial amount of research suggests that when the same achievement battery measures achievement at different levels, the content, complexity and difficulty of the assessment tasks also change (Linn, 1993; Mislevy, 1992; Yen, 1985, 1986). Even in a single form, differences in scores at the lower end of the scale may represent a different construct from differences in scores at the higher end (Reckase, 1989). Even when two forms are carefully constructed to be parallel, different constructs can be empirically identified between the forms (Reckase, 1998). Dorans (1990) emphasized that forms to be equated should measure the same mix of content so that construct invariance can be achieved. In vertical scaling, however, this requirement is purposely violated. When the two forms cannot be considered to have the same construct, all the issues concerning test multidimensionality in equating affect the vertical scaling results.

B. Issues of DIF

Among the issues that arise with vertical scaling, differential item functioning (DIF) should be considered seriously, especially when IRT models are applied. In vertical scaling, items included in a battery should function identically between examinee populations of different ability levels (Kolen and Brennan, 2004). In common-item equating, only the linking items are administered to both examinee populations, and thus most of the items cannot be tested for DIF. As psychometricians strive to improve the accountability of vertical scaling, different equating designs should be compared and evaluated. In the common-group equating design, all the items are administered to part of the examinees from both groups, so it is possible to estimate the effect of DIF on the equating results. Harris (1991) compared the results of Angoff's Design I (spiraling design) and Design II (single-group design) in vertical scaling, and concluded that the single-group design exhibits more stability across different samples. However, the increase in stability comes at the cost of administering more items to more examinees. In this study, vertical scaling results will be compared using either common items or a common group as the linking design. This study can therefore serve as a reference for a seldom-approached vertical scaling design that has great potential.

C. Difficulty difference between the forms

Vertical equating is used to equate forms that differ in difficulty level.
A major question is how much difference is reasonable between two adjacent forms. No research has been found that directly explores this issue. Most studies on vertical equating use exams whose structures and designs are decided by factors like test specifications, curricula, and policies rather than by the requirements of vertical scaling. Item difficulty differences between the two forms, or ability differences between the two groups, are seldom reported. Table 2.1 lists the item difficulty differences or group ability differences from several vertical scaling studies. The numbers provided there set a reference for how much growth (or difference) one may expect between the two groups in vertical scaling. Pomplun et al. (2004) and Kolen and Brennan (2004) reported the averaged item difficulty parameters after the forms were equated. In Pomplun's (2004) study, the differences in averaged item difficulty between two adjacent grades are around 0.5 to 1.5 SDs, while the differences in item difficulty reported by Kolen and Brennan (2004) are more likely to be around 0.5 SD, even though both examined math achievement tests over a similar grade range. Russell (2000) focused on ability growth between two grades, with data reported for three subject areas (math, reading and language). Ability growth ranged from about 0.3 to about 1.5 SD, with larger growth expected between lower grade levels. The ability difference levels reported by Jodoin (2003) are not comparable with those reported by the other studies, because the forms were administered to students from the same grade, with different students tested each year. The differences in averaged ability between years are very small (less than 0.1), indicating that, for the same grade, little change was observed in students' ability from year to year. Based on the literature about item difficulty differences in vertical scaling, this dissertation sets the forms to differ by 0.5, 1.0 and 1.5 SDs in averaged item difficulty.

D. The effect of test length

It is well known from the literature that longer tests usually have higher reliability, as long as the items all target the same trait or ability. To obtain sound accuracy in equating, each test form should have good reliability (above 0.85 in most high-stakes exams) and stable estimates of the IRT parameters. The existing literature offers some guidance on the test length needed to obtain reasonable estimates of IRT parameters. Lord (1980) clearly stated that test length and sample size, in combination, affect the quality of parameter estimates. Swaminathan and Gifford (1983) reported that multiple-choice tests of fewer than 15 items gave poor parameter estimates, and that the inadequacy in item number could not be compensated by increasing sample size. Hambleton and Cook (1983) recommended a minimum of 200 examinees and 20 items to obtain stable testing results. Few studies have been found targeting the effects of test length on equating. Fitzpatrick and Yen (2001) suggested that a test should have at least eight 6-point items or at least twelve 4-point items; their study investigated constructed-response tests. The situation becomes more complicated when multidimensional tests are equated. Sometimes tests with only 20 items are seen in multidimensional IRT equating (Kim, 2001). It is assumed that both forms should have enough items to meet the requirement of reliability. Moreover, additional items may be needed for accurate equating results.
While longer tests are favored when reliability is considered, shorter tests are preferred when cost is considered. This dissertation study explored the effects of test length using forms that contain 36, 48 and 60 items. It intends to address the question of whether shorter tests perform as well as longer tests in vertical scaling when reliability is adequate.

IV. Evaluating the errors of equating

A. Comparing the parameters' true values and those obtained through equating

Many equating studies use generated data, for which the true item parameter values are known, to evaluate a particular equating method. Typically in these studies, the standard errors of the item parameters are calculated from the squared differences between the parameters obtained by equating and the true parameters from which the data were generated; an example is the study by Hanson and Beguin (2002). Some studies compare the quality of different equating methods by directly comparing the parameters obtained through these methods using scatter plots or correlation coefficients (Li, Griffith and Tam, 1997). When real data are used and the true parameters are unknown, error estimation can be challenging. In this dissertation study, real data are used and the true parameters are unknown. The parameters obtained using the original data (about 30,000 examinees' responses to 130 items) are considered the "real parameters". Sub-samples of about 5,000 to 6,000 examinees were drawn from the original data, and the items were split into two forms for each equating design. Item parameters obtained through each equating design were compared with the "real parameters" to reflect the quality of equating.

B. Standard error of equating

The standard error of equating usually refers to errors due to sampling, rather than systematic errors. The two methods most commonly used in estimating standard errors of equating are the bootstrap and delta methods. The delta method is a set of "procedures [that] result in an equation that can be used to estimate the standard errors using sample statistics" (Kolen and Brennan, 2004, p. 234). This analytic method usually involves a time-consuming development of the equations, and it often results in very complicated equations. In this dissertation, standard errors of equating are estimated by a method similar to the bootstrap method. The bootstrap method calculates the standard deviation of equated scores over hypothetical replications of an equating procedure in samples from a population. In one hypothetical replication, a specified number of examinees would be randomly sampled. Then the Form Y equivalents of Form X scores would be estimated at various score levels using a particular equating method. The standard error of equating at each score level is the standard deviation, over replications, of the Form Y equivalent at that score level on Form X. Standard errors typically differ across score levels. Bootstrapping is computationally very intensive: many samples are drawn from the data at hand, and the equating functions are estimated for each sample. In this dissertation study, a method similar to bootstrapping but less computationally intensive was used. This method does not repeatedly resample subjects, which makes the standard error estimate more reliable than bootstrapping. The large sample size of the testing data (about 30,000 examinees for both forms) allows the examinees to be randomly divided into groups. For each test, the two forms were randomly paired to form ten paired samples.
IRT equating under the different designs was applied to each paired sample to obtain the ability scores. The standard deviation across the ten resulting values was then calculated to obtain the standard error at each score level. This method of standard error calculation will be introduced in more detail in the following chapter.
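A minimal MATLAB sketch of this calculation follows, assuming the per-replication equating has already been carried out (e.g., with BILOG-MG) and its results collected into a matrix; all variable names and the placeholder numbers are mine, not the dissertation's.

% Replication-based standard error of equating, as described above.
% "equivalents" is a hypothetical R-by-K matrix: row r holds the equated
% abilities produced by paired sample r at K fixed score levels.
R = 10;                                   % number of paired samples
scoreLevels = -3:0.1:3;                   % score levels at which SE is reported
K = numel(scoreLevels);
equivalents = repmat(scoreLevels, R, 1) + 0.05 * randn(R, K);   % placeholder data

se = std(equivalents, 0, 1);              % SD across the R replications, per level

% A summary of the kind reported in Table 4.17: average SE between -1.0 and 1.0.
inRange = (scoreLevels >= -1.0) & (scoreLevels <= 1.0);
fprintf('average standard error on [-1, 1]: %.4f\n', mean(se(inRange)));

% Comparison against "real" values in the spirit of Section IV.A (cf. Table 4.16):
realScores = scoreLevels;                 % placeholder for the full-data estimates
rmsd = sqrt(mean((mean(equivalents, 1) - realScores).^2));
fprintf('square root of the average squared difference: %.4f\n', rmsd);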
[Table 2.1. References on Difficulty Difference between Forms: the tabulated values are not legible in the source scan.]

Chapter 4. Results

I. Parameters, Dimensions and Model Fit Analysis

[Table 4.1. IRT Parameter Estimates for All Items (a, b and c for each item): the tabulated values are not legible in the source scan.]

[Table 4.2. MIRT Parameter Estimates for All Items: the tabulated values are not legible in the source scan.]

[Table 4.3. Dichotomous Factor Analysis of All Items with Oblique Rotation: the tabulated loadings are not legible in the source scan.]
In the MIRT analysis, 98% of the grammar items have the highest a-coefficient on a1; 60% of the vocabulary items have the highest a-coefficient on a2; and 76.7% of the reading items have the highest a-coefficient on a3. In the factor analysis with oblique rotation, 94% of the grammar items and the same percentage of the vocabulary items have the highest loading on the first and third factor respectively, and 76.7% of the reading items have the highest loading on the second factor. About 16.0% of the total variance is extracted by the first factor.

Table 4.4. Percentage of the Highest Loading Items on One Dimension/Factor

Dimension (%)    1      2      3
Grammar          98     --     --
Vocabulary       --     --     60
Reading          --     76.7   --

Factor (%)       1      2      3
Grammar          94     --     --
Vocabulary       --     --     94
Reading          --     76.7   --

As noted in the previous chapter, a portion of the 130 items in the original test was selected according to each equating design. The item/common-item numbers for each design are shown in Figures 3.1-3.3 in the previous chapter. Three levels of test length were selected for the equating designs.
At each level of test length (120 items, 96 items, and 72 items), the same set of items was used for all the equating designs that differ in difficulty difference or method of equating; however, the items were grouped differently for different designs. Item model fit was estimated at each level of test length according to the methods presented in the previous chapter, and the results, exhibited in Table 4.5, are discussed in the next chapter.

Table 4.5. Percentage of Items Fitting the Model

Test length    Chi-square ≤ 60 (3*df)    P > 0.01
120 items      87%                       75%
96 items       90%                       70%
72 items       86%                       67%
*For all the items, the degrees of freedom (df) for the chi-square estimate is 20.

II. Item selection for each design

The difference in the average difficulty between the two forms of each design is shown in Table 4.6. As the original items were actually developed for one test and were not intended to be dramatically different in difficulty, it was challenging to select items and separate them into two forms whose average difficulty levels differed by 1.5 units on the θ-scale. Thus, for the designs targeted to have a difficulty level difference of 1.5, the targets are not met; the achieved differences between the two forms fall particularly short in the common-item designs and when more items are included in each form (such as in the data sets that have 120 total items). This may affect the results of the study. The item numbers in each section (G, V and R) maintain the same ratio as in the original test; the item/common-item numbers for each section are shown in Table 4.7. The chi-squares of item fit for the linking items are displayed in Tables 4.8-4.10. For designs with different difficulty differences between the test forms, the linking items in the common-item and common-common designs also differ by a few items. The linking items used in each of the common-item designs are shown in Tables 4.8-4.10; each table is for a different test length. The linking items in the common-common designs were included among those of the corresponding common-item designs. The items that do not fit the 3-PL IRT model according to the rule of thumb mentioned before (an item fits when chi-square < 3*df) are marked; a short sketch of this flagging rule follows Table 4.6. Most of the linking items (90% on average) fit the 3-PL IRT model.

Table 4.6. Difference between the Averaged Item Difficulty

                   NItem=120              NItem=96               NItem=72
Target difference  0.50   1.00   1.50     0.50   1.00   1.50     0.50   1.00   1.50
Common-item        0.48   0.95   1.14     0.52   0.92   1.29     0.52   0.99   1.34
Common-common      0.50   0.95   1.21     0.48   0.97   1.40     0.48   1.03   1.46
Common-group       0.51   0.99   1.28     0.51   0.99   1.45     0.51   1.07   1.51
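As referenced above, the flagging of misfitting linking items in Tables 4.8-4.10 follows the rule of thumb that an item is considered to fit the 3-PL model when its chi-square is below three times its degrees of freedom (here 3 × 20 = 60). A minimal Python sketch of this rule, using hypothetical item labels and chi-square values rather than the actual table entries:

# Rule of thumb for large-scale data: an item fits the 3-PL IRT model
# when its chi-square statistic is less than 3 * df (df = 20 for every item here).
DF = 20
THRESHOLD = 3 * DF  # = 60

# Hypothetical (item, chi-square) pairs for illustration only.
chi_square = {"G51": 22.6, "G73": 100.0, "V127": 76.1, "V139": 28.5}

for item, value in chi_square.items():
    status = "misfit (*)" if value > THRESHOLD else "fits"
    print(f"{item}: chi-square = {value:.1f} -> {status}")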
Table 4.7. Item/Common-Item Numbers for Each Design

Total items = 120
Design        Common-group          Common-item           Common-common
              Number    Items in    Number    Items in    Number    Items in
              of items  anchor      of items  anchor      of items  anchor
Grammar       23        0           28        10          27/25*    6
Vocabulary    24        0           29        10          26/28*    6
Reading       13        0           13        0           13        0
Total         60        0           70        20          66        12

Total items = 96
Design        Common-group          Common-item           Common-common
              Number    Items in    Number    Items in    Number    Items in
              of items  anchor      of items  anchor      of items  anchor
Grammar       19        0           23        8           21        4
Vocabulary    19        0           23        8           21        4
Reading       10        0           10        0           10        0
Total         48        0           56        16          52        8

Total items = 72
Design        Common-group          Common-item           Common-common
              Number    Items in    Number    Items in    Number    Items in
              of items  anchor      of items  anchor      of items  anchor
Grammar       15        0           18        6           17/16*    3
Vocabulary    15        0           18        6           16/17*    3
Reading       6         0           6         0           6         0
Total         36        0           42        12          39        6

Table 4.8. Linking Item Fit for the Equating Design of 120-Item Tests (df = 20)
Table 4.9. Linking Item Fit for the Equating Design of 96-Item Tests (df = 20)
Table 4.10. Linking Item Fit for the Equating Design of 72-Item Tests (df = 20)
[Each of Tables 4.8-4.10 lists the chi-square fit statistic of every linking item and marks, for the 0.5, 1.0 and 1.5 difficulty-difference conditions, the designs in which the item serves as a linking item; items with chi-square > 3*df, flagged with an asterisk, do not fit the 3-PL IRT model. The per-item entries are not reliably legible in the scanned source.]

III. Regressions between the two sets of item parameters

Item parameters were estimated using the original data and using the data sets for the different equating designs. The parameters estimated using the original data are regarded as the "real" values, and those estimated using the data samples designed for each equating method are called the equated parameters. Figures 4.1-4.9 illustrate the scatter plots between the "real" values and the equated values of the parameters. The graphs indicate that difficulty (b) parameter estimation is the most stable across the different designs. In general, the values estimated through the common-group designs tend to be high and those estimated by the common-item designs tend to be low. The values estimated by the common-common designs fall in the middle.
Compared with the scatter plots of the b parameters, the scatter plots of the discrimination (a) parameters show more variance in their estimates, and the lower asymptote (c) parameters show even bigger variance. The same trend is also indicated in the correlation coefficients. Table 4.11 presents the correlation coefficients for the a, b and c (slope, difficulty and asymptote) parameters between their "real" and equated values. The correlation coefficients are highest for the difficulty parameters (around 0.97-0.99), lower for the slope parameters (around 0.90-0.93) and lowest for the lower asymptote parameters (0.6-0.8). Table 4.12 lists the significance of the ANOVA analysis between the means at each level. None of the parameters has a significant difference in correlation coefficients between the different designs. The "a" and "b" parameters are significantly different when test lengths are different; the "c" parameters are significantly different when the item difficulty difference between the forms is different. Tables 4.13 and 4.14 list the slope and intercept of the regressions respectively.

[Table 4.11 (correlations between the two sets of item parameters) and Table 4.12 (ANOVA significance of the correlation coefficients by design, test length and difficulty difference) are not legible in the scanned source.]

Table 4.13. Slope of the Regression Function

Difficulty      Common-item           Common-common         Common-group
difference      a      b      c       a      b      c       a      b      c
120 items
0.5             1.13   0.86   0.71    1.13   0.82   0.64    1.26   0.81   0.66
1.0             1.10   0.81   0.69    1.04   0.83   0.68    1.18   0.82   0.59
1.5             1.08   0.84   0.71    1.00   0.89   0.62    1.13   0.89   0.67
96 items
0.5             1.06   0.90   0.70    1.08   0.83   0.61    1.26   0.75   0.67
1.0             1.05   0.89   0.67    1.09   0.83   0.63    1.07   0.83   0.59
1.5             1.09   0.86   0.69    1.09   0.83   0.66    1.13   0.89   0.63
72 items
0.5             0.95   0.89   0.70    0.97   0.86   0.67    1.13   0.80   0.61
1.0             1.10   0.84   0.73    1.25   0.85   0.73    1.18   0.86   0.82
1.5             1.03   0.83   0.80    1.12   0.82   0.70    1.28   0.87   0.79

Table 4.14. Intercept of the Regression Function

Difficulty      Common-item           Common-common         Common-group
difference      a      b      c       a      b      c       a      b      c
120 items
0.5             0.05   0.32   0.09    0.10   0.40   0.16    -0.00  0.51   0.10
1.0             0.14   0.52   0.09    0.18   0.62   0.10    0.03   0.85   0.13
1.5             0.16   0.90   0.14    0.14   0.98   0.11    0.04   1.30   0.12
96 items
0.5             0.05   0.44   0.15    0.12   0.45   0.17    0.09   0.54   0.11
1.0             0.11   0.65   0.16    0.08   0.60   0.09    0.10   0.82   0.11
1.5             0.12   0.73   0.08    0.17   0.90   0.09    0.06   1.26   0.12
72 items
0.5             0.24   0.34   0.07    0.30   0.33   0.14    0.12   0.47   0.10
1.0             0.10   0.47   0.06    -0.10  0.58   0.06    0.02   0.83   0.06
1.5             0.28   0.78   0.08    0.17   0.91   0.09    -0.10  1.26   0.09

Figures 4.1-4.9. Scatter plots of the equated vs. "real" (original-data) discrimination (a), difficulty (b) and lower asymptote (c) parameters for each combination of test length (120, 96 and 72 items) and difficulty difference (0.5, 1.0 and 1.5), with the common-item, common-common and common-group designs plotted together. [The plots themselves are not reproducible from the scanned source.]
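The slopes and intercepts reported in Tables 4.13 and 4.14, and the correlations in Table 4.11, come from regressing the equated parameter estimates on the "real" (original-data) estimates. A minimal Python sketch of that computation for one parameter type is given below; the parameter vectors are simulated stand-ins, and the regression direction (equated on "real") is an assumption rather than a detail stated explicitly in the text.

import numpy as np

# Simulated stand-ins for the "real" (original-data) and equated difficulty estimates.
rng = np.random.default_rng(1)
b_real = rng.normal(0.0, 1.0, size=60)
b_equated = 0.85 * b_real + 0.40 + rng.normal(0.0, 0.1, size=60)

# Least-squares regression of the equated values on the "real" values,
# plus the Pearson correlation between the two sets of estimates.
slope, intercept = np.polyfit(b_real, b_equated, deg=1)
correlation = np.corrcoef(b_real, b_equated)[0, 1]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {correlation:.3f}")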
IV. Regression between the "real scores" and equated scores

As noted in Chapter 3, sub-samples were selected from the original data set according to each equating design (as shown in Figures 3.1-3.3). Some of the responses in each of the sub-samples were deleted according to the designed data matrices. The data were then processed through BILOG-MG3 as vertical equating data. BILOG-MG3 calibrates item parameters and ability scores concurrently with marginal maximum likelihood (MML) estimation. The "real scores" in each data set were divided into intervals of 0.1 standard deviation, and the mean of the equated scores for the examinees whose "real scores" fall in the corresponding interval was calculated. The scatter plots between the mean of the equated scores and the interval of the "real scores" for each design are shown in Figures 4.10-4.12. Figure 4.10 is for tests with 120 items in total, Figure 4.11 is for 96 items and Figure 4.12 is for 72 items. In each figure, the first column is for the common-item design, the second column for the common-common design and the third for the common-group design. The first row is for the designs in which the two forms differ in averaged difficulty by 0.5 SD, the second row for a difference of 1.0 SD and the third for a difference of 1.5 SD. BILOG-MG3 assigned the lower ability group as the control group, thus the means of the thetas in the lower ability groups are zero; the means of the thetas for the higher ability groups are higher by about 0.5, 1.0 or 1.5 standard deviations according to the design of the equating. The correlations between equated and "real" scores are calculated and listed in Table 4.15.
The results show no obvious difference across the different designs and different test lengths. However, the correlation coefficients are significantly (p<0.01) different when the item difficulty differences are different (Table 4.15).

Figures 4.10-4.12. "Real" scores vs. the mean of the equated scores for test lengths of 120, 96 and 72 items respectively. Each figure contains panels for the common-item, common-common and common-group equating designs at difficulty differences of 0.5, 1.0 and 1.5. [The plots themselves are not reproducible from the scanned source.]
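A minimal Python sketch of the interval averaging used for Figures 4.10-4.12: the "real" scores are binned into 0.1-standard-deviation intervals and the mean equated score is computed within each bin. The paired scores below are simulated placeholders, not the study's data.

import numpy as np

# Simulated placeholder scores for one equating design.
rng = np.random.default_rng(2)
real = rng.normal(0.0, 1.0, size=5000)
equated = real + rng.normal(0.0, 0.3, size=5000)

# Bin the "real" scores into 0.1-SD intervals and average the equated scores per bin.
edges = np.arange(-4.0, 4.01, 0.1)
bin_index = np.digitize(real, edges)

for k in sorted(set(bin_index)):
    members = equated[bin_index == k]
    left = edges[k - 1] if k >= 1 else float("-inf")
    print(f"real-score interval starting at {left:+.2f}: mean equated = {members.mean():+.3f}")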
The correlation coefficients are listed in Table 4.15. In the common-group designs, the higher-level group and the lower-level group share more common examinees than in the common-common and common-item designs; the total number of examinees in each common-group design is 5,000. The common-common designs have 5,500 and the common-item designs have 6,000. The square of the difference between the "real score" and the adjusted equated score for each examinee was calculated, and the averaged value of the squared differences for each equating design is listed in Table 4.16. The adjusted equated score equals the equated score reduced by 0.25, 0.5 or 0.75 for the designs whose difficulty level differences are 0.5, 1.0 or 1.5 respectively. The reason for using the adjusted equated score instead of the equated score will be discussed in Chapter 5. The values indicate that the average squared differences are smaller for longer tests, and that when the difficulty difference between the two forms increases, the average squared difference increases. However, the differences between test lengths or between form difficulty levels are not significant. Equating with the common-group designs has higher average squared differences than with the common-common designs, and the common-item designs have the lowest average squared differences. The difference in average squared differences between the equating designs is statistically significant.

Table 4.15. Correlation Coefficients between "Real Score" vs. Equated Score

Difficulty      Common-    Common-    Common-    Averaged by     Averaged by
difference      item       common     group      test length     difficulty difference*
120 items                                        0.976
0.5             0.977      0.960      0.975                      0.966
1.0             0.980      0.980      0.975                      0.974
1.5             0.971      0.985      0.980                      0.978
96 items                                         0.970
0.5             0.964      0.958      0.961
1.0             0.966      0.972      0.976
1.5             0.982      0.978      0.975
72 items                                         0.972
0.5             0.972      0.961      0.966
1.0             0.976      0.974      0.967
1.5             0.978      0.978      0.974
Average         0.974      0.971      0.972
*The averaged values between the levels are significantly different (P_ANOVA < 0.001).

Table 4.16. Adjusted Averaged Squared Difference

Difficulty      Common-    Common-    Common-    Averaged by     Averaged by
difference      item       common     group      test length     difficulty difference
120 items                                        0.108
0.5             0.058      0.094      0.114                      0.104
1.0             0.061      0.068      0.155                      0.112
1.5             0.097      0.080      0.246                      0.173
96 items                                         0.128
0.5             0.105      0.105      0.158
1.0             0.106      0.089      0.168
1.5             0.070      0.098      0.253
72 items                                         0.153
0.5             0.020      0.117      0.165
1.0             0.081      0.099      0.181
1.5             0.099      0.109      0.507
Average**       0.077      0.096      0.216
**The averaged values at each level are significantly different (P_ANOVA < 0.001).

V. Standard Errors of Equating

The testing data for each design was randomly divided into ten parts for standard error calculation according to the description in Chapter 3, part III. The plots of the standard errors for the different equating designs are exhibited in Figures 4.13-4.21. Three obvious trends are evident in the results.
First, the plots indicate that in designs where the difficulty difference between the two forms is lower (difference = 0.5), the SE level tends to be lower and the range of low SE is wider. Second, when test length and the level of difficulty difference are kept the same, common-item equating tends to have lower SE between ability levels -1 and 1, and common-group equating tends to have higher SE in that range, although the difference in SE between the two designs is not large. Third, the SE level tends to be lower when the forms are longer. The averaged SE values between ability scores of -1 and +1 are listed in Table 4.17. On average, shorter test forms tend to have a higher standard error of equating. What is more, the common-item designs have a lower standard error than the common-common designs, which in turn is lower than that of the common-group designs. However, none of these differences is statistically significant. A larger difference in difficulty between the two forms also contributes to an increased standard error of equating, and this trend is statistically significant (p<0.01). The trend agrees with what is shown by the average squared differences between the "real" and equated scores (Table 4.16).

Table 4.17. Averaged Standard Error between Scores -1.0 and 1.0

Difficulty      Common-    Common-    Common-    Averaged by     Averaged by
difference      item       common     group      test length     difficulty difference*
120 items                                        0.196
0.5             0.170      0.170      0.184                      0.199
1.0             0.192      0.204      0.210                      0.229
1.5             0.193      0.211      0.227                      0.253
96 items                                         0.224
0.5             0.183      0.198      0.212
1.0             0.205      0.236      0.227
1.5             0.232      0.249      0.273
72 items                                         0.261
0.5             0.215      0.221      0.237
1.0             0.247      0.261      0.278
1.5             0.267      0.303      0.320
Average         0.212      0.228      0.241
*The averaged values at each level are significantly different (P_ANOVA < 0.001).

Figures 4.13-4.21. Standard error of equating as a function of the un-equated score for each combination of test length (120, 96 and 72 items) and difficulty difference (0.5, 1.0 and 1.5); each figure compares the common-item, common-common and common-group designs. [The plots themselves are not reproducible from the scanned source.]
Chapter 5. Conclusions and Discussions

I. Dimensionality and Model Fit

A. Unidimensional or Multidimensional Structure

One major purpose of this dissertation study is to test the robustness of unidimensional IRT vertical equating when the testing data are multidimensional. Thus the first step of the data analysis is the dimensionality analysis. The direct reason that the data are considered to be multidimensional comes from the structure of the exam. The exam has three sections that are designed to probe different dimensions of the language ability of ESL learners: grammar, vocabulary and reading. As IRT is a quantitative model used to interpret testing data, the dimensionality of the data should be investigated in order to decide whether the IRT model is appropriate. Table 4.2 presents the NOHARM analysis results and Table 4.3 presents the dichotomous factor analysis results. The discrimination parameters (the a's) on the different dimensions in Table 4.2 mostly agree with the factor loadings in Table 4.3. The summary in Table 4.4 indicates that the multidimensional/factor pattern agrees with the test design (the Grammar, Vocabulary and Reading sections). Although the same test dimensionality is reflected in the factor analysis and MIRT, the percentage of items that load highest disagrees between the two methods. Factor analysis methods assume that the variables are normally distributed and do not allow guessing in the model; thus MIRT is more suitable for dichotomous testing data where guessing probably exists. Exploratory factor analysis and MIRT were applied in this dissertation, both with oblique rotation on three factors/dimensions, because the test was developed to measure three types of English language ability and previous results indicated distinct dimensions for the Grammar, Vocabulary and Reading sections.

As already mentioned in Chapter 2, the effect of violating unidimensionality might not be substantial in IRT equating. Among the studies exploring the robustness of the IRT unidimensionality assumption, Reckase, Ackerman and Carlson (1988) provide substantial evidence, theoretically and empirically, that the IRT unidimensionality assumption is robust. The study concluded that even though more than one dimension of ability is manifested in examinees' test performance, a set of items measuring the same weighted composite of abilities should be able to meet the assumptions of the unidimensional IRT model. The studies by Yen (1984) and Dorans (1990) further support this argument. In this study, the effect of multidimensionality is analyzed in combination with the effect of vertical equating.

B. IRT Goodness of Fit

Although the assumption of unidimensionality is robust for IRT equating, model fit of the data is essential for stable results in parameter calibration and equating. According to van der Linden and Hambleton (1997), well-established statistical tests for two- or three-parameter IRT models do not exist. They further state that "even if they did, questions about the utility of statistical test in assessing model fit can be raised, especially with large samples." McDonald (1989) even concluded that when the sample size is big enough, an IRT model will be rejected by statistical tests. Currently, the model assessment methods most often used are judging item fit, judging person fit and comparing the fit of different models (Embretson and Reise, 2000).
The item model-fit tool provided by BILOG-MG3 is a chi-square index. As described in Chapters 3 and 4, the test data for this thesis were analyzed for item model fit for all the items at each level of test length. According to the p-value of the chi-square analysis (Table 4.5), 65 to 75 percent of the items fit the model. Because chi-square analysis is sensitive to sample size, for an exam with about 30,000 examinees, as studied here, fit analysis based on chi-square can be misleading. The rule of thumb for large-scale exams was therefore applied, so that chi-square values less than 3 times the degrees of freedom are considered non-significant (M. D. Reckase, personal communication). When this method is applied, 85 to 90 percent of the items fit the IRT model. The results indicate less item fit for shorter tests: the percentages of item fit are smallest for the 72-item tests. According to Stone and Zhang (2003), this can be a result of increased Type I error. Their results (Zhang, 2003) indicate that the Type I error of the traditional item-fit method is large for short tests (fewer than 40 items), especially for large sample sizes.

As no well-established, well-recognized method is available for testing two- and three-parameter IRT model fit, checks of model fit from different perspectives are often recommended. This includes checks on the unidimensionality assumption. However, as indicated before, IRT equating can still be valid even when unidimensionality is not fulfilled. Other checks on the data include item biserial correlations, test format and difficulty analysis, and test speededness analysis. The biserial correlations of the original test are high (more than 95% of the items have biserial correlations higher than 0.30); however, no data are available to check the test speededness. Van der Linden and Hambleton (1997) suggest that if the model fit is acceptable, examinee ability estimates ought to be the same from different samples of items within the test. The results that will be discussed later in this chapter show the ability estimates based on a portion of the total items (120, 96 and 72 items). The estimates have high correlations with those based on the original data. The results support the goodness of fit of the IRT model (Figures 4.10-4.12). On the other hand, van der Linden and Hambleton (1997) also suggest that item parameter estimates ought to be about the same from different samples of examinees from the population for whom the test is intended. The results in the following section also support the model fit from this perspective: the correlations are high between item parameters obtained from different equating designs and different sub-samples.

In summary, the chi-square test of item fit, the biserial correlations and the ability estimates based on part of the items, or part of the samples, all support that the data-model fit of this study is satisfactory. MIRT and factor analysis results provide evidence that the data of this study are multidimensional; however, the item model fit analysis and other results indicate that the data fit the IRT model relatively well. Because the IRT model requires unidimensionality of the data, when the dimensions are very distinct the data would not fit the IRT model. In this study, the dimensions are correlated with each other with moderately high correlation coefficients (around 0.6-0.7, results not shown). This can be the reason that the items still meet the unidimensionality assumption.
If the dimensions were more distinct, the IRT model fit might not be satisfied. When the data do not fit the IRT model, the calibration results obtained through unidimensional IRT would not be stable. In such cases, multidimensional IRT is suggested for item calibration and ability scoring.

II. Correlation between "real" and equated item parameters

Figures 4.1-4.9 present the scatter plots between the item parameters calibrated using the original data (130 items by 30,000 examinees) and those calibrated using each equating design. Due to errors in IRT scoring and in equating, the "real" and equated values are not perfectly correlated, although the correlation is high. What is more, due to scale indeterminacy, we do not expect the regression between the "real" parameters and the equated parameters to cross the origin with a slope equal to one. The correlations are presented in Table 4.11, and the slope and intercept of each regression are presented in Tables 4.13 and 4.14. In the following sections, the results are discussed from the perspectives of item parameter indeterminacy, error in the parameter estimates, and how to evaluate the errors in the parameter estimates.

A. Scale indeterminacy in equating

The BILOG-MG3 3-PL calibration used in this study defines the θ-scale as having a mean of 0 and a standard deviation of 1 for the set of data being analyzed. The IRT parameters are estimated on this scale. In nonequivalent-group equating, when the two groups have samples that differ in the distribution of ability (θ) levels, a scale transformation has to be done so that the item parameters and ability levels can be interpreted on the same scale. The linear relationship between the two scales can be expressed through a set of equations:

θ1i = A*θ2i + B    (5.1)
a1j = a2j / A    (5.2)
b1j = A*b2j + B    (5.3)

A and B are called the equating coefficients. θ1i represents the ability level of person i estimated on scale 1; θ2i represents the ability level of the same person estimated on scale 2. a1j is the slope parameter of item j estimated on scale 1, and a2j is the slope of the same item estimated on scale 2; b1j is the difficulty parameter of item j estimated on scale 1, and b2j is the difficulty parameter of this item estimated on scale 2. If the data fit the IRT model perfectly, the same A and B would apply to all the examinees on all the items. In practice, in common-item equating, A and B are calculated using the averaged values of the slope parameters and difficulty parameters across the common items.
Among the equating designs, common-group equating has the smallest variance in ability since it has the biggest percentage of overlapped examinees shared by the two groups (20%). Common-common design has 10% of examinees shared by the two groups and common-item design has no overlapped examinee. Each of the examinee group in these designs has a standard deviation of 1. When the difference keeps constant between the means of ability for the two groups, the more common examinees are shared by the groups, the less variance exists in the sample. The effect of the ability variance in each design is reflected in the slopes shown in Table 4.13: most of the slopes of “a” parameters for common-item equating are smaller (closer to 1) than the correspondent slope for common-group equating, since a slope closer to 1 indicates less difference between the variance of ability in the sample and that in the original data. The slope of the “b” parameter is the reverse to that of the corresponding “a” parameter (according to equations 5.2 and 5.3), again, the slopes of common-item equating b-parameters are most close to “1” among the three designs. The reason lies in that common-item design has the biggest variance in examinee ability levels. 9| On the other hand, the intercept of the b-parameter regression (coefficient B’s) is related to the difference in the means of ability estimates (Equations 5.1 and 5.2). When coefficient A is close to l, the intercept almost equals the difference between the means of the two groups’ ability estimates. The results in the figures for-the b-parameters reflect this rule in that for designs with bigger difficulty difference, the regression lines lie further from the origin. Because when the item difficulty difference increase, the ability difference between the two groups increases accordingly to the data sampling. During equating, one of the groups (the lower ability group in this study) was assigned as the reference group and its scale was kept unchanged. The averaged ability estimate of this group is 0.25, 0.5 or 0.75 unit lower than that of the examinees in the original data. This difference is reflected in the intercepts of the b-parameter regression lines. B. Errors in parameter estimate of [R T equating Even when coefficients A and B are determined, the relationship between “real” parameter and the equated parameter still cannot be expressed through an equation. The reason lies in that error exists in both IRT parameter estimates using the original data and in the parameter estimate during IRT equating. The error in the parameter estimates can be defined as the amount of variance around the true parameter value. In IRT equating, we look for designs that has smaller errors in parameter estimate. A variety of factors can cause the error in parameter estimate of IRT equating. The first kind of factors come from IRT calibration process itself. Among these, four factors are often highlighted in the literature. First, because IRT make strong assumptions in modeling item functions, parameter estimate error is incurred when the assumptions are not met (Ackerrnan, 1992); second, estimation methods such as marginal maximum likelihood estimation (MMLE) or joint maximum likelihood estimation (JMLE) may not convert to the true values. Increased sample size and number of items may affect the accuracy of J MLE and 92 incorrect prior ability distribution specification may affect the result of MMLE (Baker, 1992; Seong, 1990). 
Third, model misfit can surely cause unstable item parameter estimation and fourth, practical limitations, such as small sample size or lack of variance in examinees’ ability may cause increased error in parameter estimate of too hard or too easy items (Stocking, 1990). Other than the errors that result from inaccuracies in the estimation of the parameters of the IRT model, the equating process does not perfectly transform item parameters to a common scale. Almost all the aspects of equating design can affect the translation of parameter estimates to a common scale such as the method of equating (single group or common-item, equivalent or non-equivalent group equating), the characteristics of the anchor test (in common-item equating), the characteristics of the two groups, the features of the two forms etc. Evaluating the errors in parameter estimate translation or ability estimation using the translated parameters is very important in evaluating the quality of certain equating design. C. Evaluating errors in parameter estimation Table 4.11 presents the correlation coefficients between the “real” and equated parameters: higher correlation coefficients indicate less discrepancy between the “real” and equated parameters. All the “real” parameters used here were obtained through the original data (130 items by 30,000 examinees). The first trend we can see in Table 4.11 is that the correlations of c-parameter estimates are obviously smaller than those of the correspondent a- and b-parameters’. c-parameters are not well estimated when the sample does not have enough low-ability cases. The second conclusion we can draw from the results presented in this table is that tests with fewer items tend to have lower correlations, and this trend is statistically significant for the “a” and “b” parameters. Since test reliability declines as the number of items decreases, error of estimate gets higher when the number of items gets 93 smaller. No obvious difference is seen between different equating designs or different form difficulty differences. III. IRT ability estimate In IRT equating, two sources of error exist in estimating ability. One is from the process of equating, the other comes from the process of IRT ability estimation itself. It is hard to separate the errors from the two resources. This study analyzes the error in ability estimate from two perspectives: first, comparing the ability scores obtained through the original data and those obtained through equating between the samples; second, computing the standard error of equating. A. Scatter plots of [R T score estimate In each equating design, the data set has about 5000-6000 examinees, scatter plots between equated scores and “real” scores of each examinee show strong relationship in the middle part of the ability range; while the dots are more scattered and the scores less related at the extreme values of ability (plots not shown). In the plots shown in Figures 4.10-12, “real score” is divided into intervals of 0.1 standard deviation, the corresponding equated scores for each interval are plotted. Compare with the scatter plots of the “real" vs. equated scores, Figures 4.10-12 provides a clearer relationship between the two score. Figures 4.10-12 shows that for ability levels from -2 to 2, a strong linear relationship is demonstrated between the “real” and equated scores. However, the two scores are not linearly correlated at extreme values. 
The reason is very likely because when an examinee has very high or very low ability, the exam does not have enough items to provide an accurate estimate at the examinee’s ability level. Thus the ability estimate of such examinees is not consistent between the results obtained using the original data and those obtained through the equating. It is also noticeable that for the common-item designs, the ranges of “real” scores are usually broader than those of the common-common designs. The ranges are the narrowest for the 94 common-group designs. The fact that the samples for different designs are differ in their ability score variance has been discussed in the previous session (chapter 5, II. A). The correlation coefficients summarized in Table 4.15 reflect no trend of the effect by test length or equating design (common-item, common-common or common-group). Longer tests are supposed to have higher reliability, and are thus expected to have higher correlation between the equated and “real” scores. However, in this case, all three lengths may have adequate reliability and the difference may not be large enough to be explicit in the scatter plots and the correlation coefficients. In Table 4.15, the averaged correlation coefficients by difficulty difference show that equated scores between forms that have bigger difficulty difference are more highly correlated with the “real” scores. And the difference is statistically significant. The reason probably lies in the sampling of the equating design for forms with bigger difficulty difference, more examinees with extreme ability levels are included in the sample, thus the ability at extreme levels are more accurately estimated. Another factor that contributes to the higher correlation is the bigger variance of examinees’ ability, for we know when two variables are correlated, the bigger the variance of each variable, the higher the correlation coefficient. No obvious difference in scatter plots or correlation coefficient is seen between different equating designs (common-item, common-common or common-group). B. Square Root of the Average squared dtflerence between the ”Real ” and Equated Scores In studies that using generated data to evaluate the quality of equating designs, squared difference between the true parameter and the parameters obtained from equating are calculated as a criteria for the evaluation (Hansen and Beguin, 2002). This study uses real data and the true values of examinees’ ability or item parameters are not known, however, the parameters obtained from the original data can be considered as close to their true values. For each design, the average squared difference between the examinees’ “real scores” and their adjusted equated scores is calculated. For equating designs with difficulty level differences of 95 0.5, 1.0 or 1.5, a value of 0.25, 0.5 or 0.75 were deducted from the equated scores to obtain the adjusted equated scores respectively. The reason of the deduction is because the scale indeterminacy introduced in the previous session (Chapter 5, II, A). By equating, a common scale is introduced, in which the lower ability group is assigned as reference group and its averaged ability is arbitrarily set as zero. However, while sampling the data, the averaged score of the lower ability group was set at -0.25, -0.5 and -0.75. Thus the equated scores were adjusted so that they are on the same scale with the ability scores obtained by the original data. 
Results of the average squared difference of each design are listed in table 4.16. Unlike the correlation coefficients, the average squared differences show a trend that shorter exams have higher differences between the original and the equated scores, although the trend is not statistically significant. Combined with the plots of the “real” scores vs. the means of equated scores, the discrepancy between the “real” and the equated scores is mostly caused by unstable estimates of the ability with extreme values. It is likely that shorter tests has less items targeting examinees of very high or very low abilities, and thus are less reliable in measuring extreme abilities than longer tests. However, the overall reliability of shorter tests are big enough and thus the overall correlation between the “real” and equated scores show no difference across test lengths. On average, the common-item design has smaller average squared differences than the common-common design, and the common-common design’s average squared difference is lower than that of the common-group design. And the difference here is statistically significant. One of the explanations can be because the common-group design has the smallest variance in examinees’ ability. Since fewer examinees score at the two extremities, the common-group equating cannot measure the abilities of these ranges as accurately as the other equating designs. Another possible explanation is that the common-group design itself is not as reliable as the common-item design. As very few studies on common-group 96 equating are available, no reference here can be quoted regarding the comparison between the two equating designs. Another obvious trend seen in Table 4.16 indicates that designs for bigger difficulty difference between the test forms have higher average squared difference, especially for tests with fewer items. When the difficulty difference between the two forms is bigger, it is likely that IRT model fit becomes more difficult, and thus the score estimate through equating is less stable and accurate. C. Standard Error of Equating The standard error curves in Figure 4.13-4.21 reflect the effects of test length, difficulty difference and equating design from several perspectives. The averaged standard errors of equating between scores of -l .0 and 1.0 are listed in Table 4.17. Several conclusions can be drawn from the analysis of the standard errors. First, the standard error is lower (statistically not significant), and stays low for a wider range, when the test length is longer. When the two forms have a total of 120 items, most of the standard errors between ability level of -l and +1 are lower than 0.2 of SD; however, when the forms have a total of 72 items, the standard error is almost never lower than 0.2 of SD. The second obvious trend is that when test length and difficulty difference between forms are kept the same, most of the time the standard error of common-item equating is the smallest, common-common is bigger and that of the common-group is the biggest among the three; however, the difference between the standard errors is very small, and sometimes the trend is not clear. Third, keeping test length the same, when the difference in item difficulty gets bigger, the standard error tends to be significantly higher. The values of averaged standard errors in Table 4.17 agree with the average squared differences listed in Table 4.16. 
Both tables show the same pattern: first, as test length increases, the standard error decreases; second, the standard error increases with the difficulty difference; and third, common-group equating has the biggest errors while common-item equating has the smallest. Possible explanations of this pattern are provided in the previous section (Chapter 5, III, B); the results in Table 4.14 support the discussion of the results in Table 4.16.

The standard error calculated here is the random error of equating. The standard error curves agree with the scatter plots shown in Figures 4.10 to 4.12: both reflect less error variance in the middle range of ability. However, the subtle differences between equating designs, form difficulty differences and test lengths are reflected in the standard error curves but not in the scatter plots. Part of the reason is that the sampling methods of the two analyses are different: in the scatter plot analysis, the two groups have different ability levels; in the standard error analysis, the two groups have similar ability levels (both were randomly drawn from the total sample).

IV. The effects of equating design, test length and difficulty difference

This dissertation study compares the effects on vertical equating of different equating designs, different test lengths and differences in averaged item difficulty between forms. The results of the analysis are summarized in the following sections.

A. The effects of equating designs

In the analysis of item parameter estimates and examinee ability estimates, the samples were selected so that the difference between the two groups' abilities matches the difference between the two forms' difficulty levels. From the correlation between the "real" ability scores and the equated scores, no obvious difference between the common-item, common-common and common-group designs can be seen. However, the average squared differences between the two scores indicate that equating with the common-group design may be less accurate than with the common-item and common-common designs. In the standard error of equating analysis, the samples were randomly selected from the original data and are considered equivalent; that is, the test forms differ in difficulty level while the groups are equivalent in ability. The results for the standard error agree with those for the average squared difference: common-item equating gives less error than common-group equating, although the difference in error is small and not significant.

B. The effects of test length

In general, a longer test has higher reliability; moreover, longer tests usually contain more items targeting different ability levels. Three test lengths were chosen for this study so that in each equating the numbers of unique items are 120, 96 and 72. The reliability of all the test forms is 0.85 or higher (results not shown), which satisfies the requirement of most achievement or ability tests in education; however, the subtle differences in reliability may still affect the equating results. Likely because longer tests have more items that accurately measure very high or very low ability, the average squared differences between the "real" and equated scores for longer tests are lower than those for the shorter tests. However, no obvious difference across test lengths is seen in the correlation between the "real" and equated ability score estimates. For most examinees, the different test lengths would not affect their ability estimation.
The standard error is lower when the test length is longer (not statistically significant), which suggests that the standard error of vertical equating may be sensitive to test reliability.

C. The effects of form difficulty difference

The analysis results indicate that the difference between the difficulty levels of two forms does not affect item parameter estimation; however, in examinee ability estimation, the average squared difference is smaller for equatings with a smaller difficulty difference between the forms. On the other hand, a bigger difference in difficulty results in a higher correlation between the "real" and equated scores; the reason for this may be the bigger ability variance rather than more accurate score estimation. Like the average squared difference, the standard error of equating is higher when the item difficulty difference is bigger.

D. Future directions

This dissertation studies equating designs (common-group and common-common) that are seldom explored in the practice of educational measurement. Although issues such as fatigue and speededness may arise when using common-group or common-common designs, these designs can still serve as alternatives to the well-practiced common-item design, especially when common items are not available because of security or other concerns. In vertical equating, especially when the testing data show evidence of multidimensionality, common-item equating can be challenging. First, in vertical equating the common items may be too advanced for examinees at the lower level and too elementary for examinees at the higher level; higher level examinees may be careless in answering the questions, and lower level examinees may not be able to use their time efficiently (Kolen and Brennan, 2004). When only a few such items are administered to both groups, the items may not function effectively as an anchor test. The common-group or common-common design may overcome this disadvantage. Second, when the test is multidimensional, it is ideal to design common items that represent all the dimensions of the test; however, this may be impractical for some tests. The results of this study indicate that common-group and common-common designs, although they may not be superior to the common-item design, are comparable in quality with it for the vertical equating of multidimensional data.

When common-group or common-common equating is applied in testing practice, it is suggested that the examinees who take both forms be recruited from different ability levels. To minimize the effect of speededness or fatigue, the sequence of the two forms should be arranged so that half of the examinees in the common group take form 1 first and form 2 second, while the other half take form 2 first and form 1 second. If the sample size allows, the testing data obtained from the common group can be calibrated separately, the data from either form can also be calibrated separately, and then the data from all the examinees can be calibrated using concurrent methods. The results can be compared in terms of test structure, examinees' ability estimates, and so on; such comparisons provide evidence of the validity of the scoring method. On the other hand, if the test is administered annually, as is the case for many achievement exams, data from different administrations can be analyzed to check whether the new equating design provides stable scoring over the years. Such analysis of longitudinal data can provide evidence for validity from a different perspective.
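A brief sketch of the counterbalanced administration suggested above is given below. The group size N_common is an assumed value, and the sketch only illustrates the random assignment of form order; it is not part of the analyses reported in this study.

%minimal sketch with an assumed group size: randomly split the common group
%so that half take form 1 first and half take form 2 first
N_common = 1000;                        %assumed number of common-group examinees
order = randperm(N_common);             %random ordering of examinee indices
form1_first = order(1:N_common/2);      %these examinees take form 1, then form 2
form2_first = order(N_common/2+1:end);  %these examinees take form 2, then form 1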
Because the study on this topic is still preliminary, many directions can be explored. Based on the results obtained in this dissertation, the following suggestions are made for future research.

First, for the convenience of sampling, the ability levels of the common group used in this study are centered around the middle range. For example, in the common-group design where the difficulty difference between the two test forms is 1.0, a sample of 1000 normally distributed cases with mean=0 and SD=0.5 was first selected from the original data of 30,000 cases. These 1000 cases were used as the common group, who were administered all the items. Then a sample of 2000 cases was selected from the rest of the data (now about 29,000 cases) so that these cases, together with the 1000 cases selected previously, form a normal distribution with mean=-0.5 and SD=1; only half of the items are administered to these cases. In the next step, another sample of 2000 cases was selected from the rest of the data (now about 27,000 cases) so that these cases, together with the 1000 cases selected before, form a 3000-case normal distribution with mean=0.5 and SD=1; again, half of the items are administered to these cases. When common-item equating is applied, items from different difficulty levels are usually selected; thus the common-group design may give better equating results when the common group represents cases from different ability levels.

Second, the analyses presented here are based on data collected from a real test. Although the sample is very large and normally distributed and the item model fit is good, the results may not perfectly reflect the effects of the equating designs from a purely theoretical perspective. To rule out the impact of unexpected elements, it is suggested that generated data be used to further explore the common-common and common-group designs.

Appendix 1.
MATLAB code (1) - for selecting a normally distributed group from the original data

clear all; clc
%reset seeds for data generation%
rand('state',sum(100*clock));
%set the mean, SD and the number of subjects to be selected and
%the number of bins
u_demand=0
N_demand=1000
SD_demand=0.5
N_bin=40
%Start simulation
%data_30K.dat is the available data with 29935 theta values
load data_30K.dat
data_30K=sort(data_30K);
[N_30K,X_30K] = hist(data_30K,N_bin);
N_30K=N_30K';
X_30K=X_30K';
%generate 1K normally distributed random numbers to define the target bins
for i=1:N_demand
    data_1K(i,1) = randn*SD_demand + u_demand;
end
%check the mean and SD of the created data
mean(data_1K)
std(data_1K)
%set the center of each bin for the 1K data equal to the center of the
%corresponding bin for the 30K data
N_1K=hist(data_1K,X_30K);
%compare the histogram of the original data and the target histogram
figure
hist(data_30K,N_bin)
title('Mother Data Set')
figure
hist(data_1K,N_bin)
title('Son Data Set')
%select the ability scores to fill the bins
for i=1:N_bin
    i
    N_large=N_30K(i)
    N_small=N_1K(i)
    if N_small~=0
        X_large=data_30K((sum(N_30K(1:(i-1)))+1):sum(N_30K(1:i)));
        %see the attached "N_select_n.m"
        [X_small]=N_select_n(N_large,X_large,N_small);
        data_new_1K((sum(N_1K(1:(i-1)))+1):sum(N_1K(1:i)))=X_small;
    end
end
data_new_1K=data_new_1K';
%display the histogram of the selected cases
figure
hist(data_new_1K,N_bin)
title('Final Data Set')
%check the number, the mean and the SD of the selected cases
N_demand
size(data_new_1K)
mean(data_new_1K)
std(data_new_1K)
%the resulting data set contains 1K cases
save data_new_1K.dat data_new_1K -ascii
%label the 1K selected cases in the original 30K data
data_30K(:,2)=0;
for i=1:1000
    i
    for j=1:size(data_30K,1)
        if data_30K(j,2)==0
            if data_new_1K(i,1)==data_30K(j,1)
                data_30K(j,2)=1;
            end
        end
    end
end
%the resulting data set contains 29935 cases with the 1000 selected cases
%labeled; the labeled cases are screened out, and the rest can be used to
%select group 1 or group 3
save new_30K.dat data_30K -ascii

%'N_select_n.m'
%To randomly draw a small data set from a large data set without repetition
function [X_small]=N_select_n(N_large,X_large,N_small)
for i=1:N_small
    index(i)=round(rand*N_large+0.5);
    %compare with the indices already selected for the small data set
    for j=1:(i-1)
        %if two indices are the same, select again until no two indices are identical
        while index(i) == index(j)
            index(i)=round(rand*N_large+0.5);
        end
    end
    X_small(i)=X_large(index(i));
end

Appendix 2.
MATLAB code (2) - for selecting 2000 cases for Group 1 in common-group equating

clear all; clc
%reset seeds for data generation%
rand('state',sum(100*clock));
%first generate a distribution that has 2000 cases, in which 1000 cases
%N{0, 0.5} are deleted from 3000 cases N{0.5, 1}
%Input
u_demand=0.5
N_bin=40
%generate 3K normally distributed random numbers to define the bins
for i=1:3000
    data_3K(i,1) = randn + u_demand;
end
%generate 1K normally distributed random numbers with mean=0 and SD=0.5
for i=1:1000
    data_1K(i,1) = randn*0.5;
end
mean(data_3K)
std(data_3K)
mean(data_1K)
std(data_1K)
data_3K=sort(data_3K);
[N_3K,X_3K] = hist(data_3K,N_bin);
N_3K=N_3K';
X_3K=X_3K';
%set the center of each bin for the 1K data equal to the center of the
%corresponding bin for the 3K data
N_1K=hist(data_1K,X_3K);
%compare the histogram of the original data and the target histogram
figure
hist(data_3K,N_bin)
title('Mother Data Set')
figure
hist(data_1K,N_bin)
title('Son Data Set')
for i=1:N_bin
    i
    N_large=N_3K(i)
    N_small=N_1K(i)
    if N_small~=0
        X_large=data_3K((sum(N_3K(1:(i-1)))+1):sum(N_3K(1:i)));
        %see the attached "N_select_n.m"
        [X_small]=N_select_n(N_large,X_large,N_small);
        data_new_1K((sum(N_1K(1:(i-1)))+1):sum(N_1K(1:i)))=X_small;
    end
end
data_new_1K=data_new_1K';
figure
hist(data_new_1K,N_bin)
title('Final Data Set')
size(data_new_1K)
mean(data_new_1K)
std(data_new_1K)
save data_new_1K.dat data_new_1K -ascii
%in the 3000-case group, label the 1000 selected cases
data_3K(:,2)=0;
for i=1:1000
    i
    for j=1:3000
        if data_3K(j,2)==0
            if data_new_1K(i,1)==data_3K(j,1)
                data_3K(j,2)=1;
            end
        end
    end
end
save data_3K.dat data_3K -ascii
%the 3000-case group was reorganized in an Excel file and the 1000 labeled
%cases were deleted from it, ending up with 2000 cases that have the target
%distribution; this data set is named gen_plus_2K.dat. data_29K is the data
%set that has 1000 cases N{0, 0.5} deleted from the original 30K data

clear all; clc
%Input
N_bin=40
%load data
load data_29K.dat
load gen_plus_2K.dat
data_29K=sort(data_29K);
[N_29K,X_29K] = hist(data_29K,N_bin);
N_29K=N_29K';
X_29K=X_29K';
N_2K=hist(gen_plus_2K,X_29K);
figure
hist(data_29K,N_bin)
title('Mother Data Set')
figure
hist(gen_plus_2K,N_bin)
title('Son Data Set')
for i=1:N_bin
    i
    N_large=N_29K(i)
    N_small=N_2K(i)
    if N_small~=0
        X_large=data_29K((sum(N_29K(1:(i-1)))+1):sum(N_29K(1:i)));
        [X_small]=N_select_n(N_large,X_large,N_small);
        data_new_2K((sum(N_2K(1:(i-1)))+1):sum(N_2K(1:i)))=X_small;
    end
end
data_new_2K=data_new_2K';
figure
hist(data_new_2K,N_bin)
title('Final Data Set')
size(data_new_2K)
mean(data_new_2K)
std(data_new_2K)
save data_new_2K.dat data_new_2K -ascii

References

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67-91.
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.
Beguin, A. A., Hanson, B. A., & Glas, C. A. (2000). Effect of multidimensionality on separate and concurrent estimation in IRT equating. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Birnbaum, A. (1968). Some latent trait models. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bock, R. D., Gibbons, R., Schilling, S. G., Muraki, E., Wilson, D. T., & Wood, R. (2003).
TESTFACT 4.0 [Computer software and manual]. Lincolnwood, IL: Scientific Software International.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bogan, E. D., & Yen, W. M. (1983). Detecting multidimensionality and examining its effects on vertical equating with the three-parameter logistic model. Monterey, CA: CTB/McGraw-Hill. (ERIC Document Reproduction Service No. ED229450)
Bolt, D. M. (1999). Evaluating the effects of multidimensionality on IRT true-score equating. Applied Measurement in Education, 12(4), 383-407.
Camilli, G., Wang, M. M., & Fesq, J. (1992). The effects of dimensionality on true score conversion tables for the Law School Admission Test. LSAC Research Report Series. Newtown, PA.
Camilli, G., Wang, M. M., & Fesq, J. (1995). The effect of dimensionality on equating the Law School Admission Test. Journal of Educational Measurement, 32, 79-96.
Cook, L. L., & Douglass, J. B. (1982). Analysis of fit and vertical equating with the three-parameter model. Paper presented at the annual meeting of the American Educational Research Association, New York, NY.
Cook, L. L., & Eignor, D. R. (1991). NCME instructional module: IRT equating methods. Educational Measurement: Issues and Practice, 10, 37-45.
Cook, L. L., Dorans, N. J., Eignor, D. R., & Petersen, N. S. (1985). An assessment of the relationship between the assumption of unidimensionality and the quality of IRT true-score equating (Research Rep. No. RR-85-30). Princeton, NJ: Educational Testing Service.
DeMars, C. (2002). Incomplete data and item parameter estimates under JMLE and MML estimation. Applied Measurement in Education, 15(1), 15-31.
Donoghue, J. R., & Hombo, C. M. (2001). The distribution of an item-fit measure for polytomous items. Paper presented at the annual meeting of the NCME, Seattle, WA.
Dorans, N. J., & Kingston, N. M. (1985). The effects of violations of unidimensionality on the estimation of item and ability parameters and on item response theory equating of the GRE verbal scale. Journal of Educational Measurement, 22, 249-262.
Dorans, N. J. (1990). Equating methods and sampling designs. Applied Measurement in Education, 3(1), 3-17.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Fraser, C. (1986). NOHARM: A computer program for fitting both unidimensional and multidimensional normal ogive models of latent trait theory [Computer program]. Armidale, New South Wales, Australia: Center for Behavior Studies, The University of New England.
Fraser, C., & McDonald, R. P. (1988). NOHARM: Least squares item factor analysis. Multivariate Behavioral Research, 23, 267-269.
Goldstein, H., & Wood, R. (1989). Five decades of item response modeling. British Journal of Mathematical and Statistical Psychology, 42, 139-167.
Hambleton, R. K., & Cook, L. L. (1983). Robustness of item response models and effects of test length and sample size on the precision of ability estimates. In D. J. Weiss (Ed.), New horizons in testing (pp. 31-49). New York: Academic Press.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hanson, B. A., & Beguin, A. A. (1999). Separate versus concurrent estimation of IRT item parameters in the common item equating design. Iowa City, IA: American College
Testing Program (ACT-RR-99-8).
Hanson, B. A., & Beguin, A. A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26, 3-24.
Harris, D. J. (1991). A comparison of Angoff's Design I and Design II for vertical equating using traditional and IRT methodology. Journal of Educational Measurement, 28(3), 221-235.
Jodoin, M. G., & Davey, T. (2003). A multidimensional simulation approach to investigate the robustness of IRT common item equating. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.
Jodoin, M. G., Keller, L. A., & Swaminathan, H. (2003). A comparison of linear, fixed common item, and concurrent parameter estimation equating procedures in capturing academic growth. Journal of Experimental Education, 71(3), 229-250.
Johnson, J. S., Yamashiro, A. D., & Yu, J. (2004). The role of cloze in a model of foreign language proficiency. Annual Conference of the Language Testing Research Colloquium, Temecula, CA.
Kim, J. P. (2001). Proximity measures and cluster analysis in multidimensional response theory. Unpublished doctoral dissertation, Michigan State University, East Lansing, MI.
Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with nonrandom groups. Journal of Educational Measurement, 22, 197-206.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking (2nd ed.). New York, NY: Springer.
Kolen, M. J., & Whitney, D. R. (1982). Comparison of four procedures for equating the tests of general educational development. Journal of Educational Measurement, 19, 279-293.
Li, Y. H., Griffith, W. D., & Tam, H. P. (1997). Equating multiple tests via an IRT linking design: Utilizing a single set of anchor items with fixed common item parameters during the calibration process. Paper presented at the annual meeting of the Psychometric Society, Knoxville, TN.
Linn, R. L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6(1), 83-102.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
MATLAB [Computer program]. Natick, MA: The MathWorks.
McDonald, R. P. (1989). Future directions for item response theory. International Journal of Educational Research, 13, 205-220.
McDonald, R. P. (2000). A basis for multidimensional item response theory. Applied Psychological Measurement, 24(2), 99-114.
Mislevy, R. J. (1992). Linking educational assessments: Concepts, issues, methods, and practice. Princeton, NJ: Educational Testing Service Policy Information Center.
Noguchi, H. (1986). An equating method for latent trait scales using common subjects' item response patterns. Japanese Journal of Educational Psychology, 34, 315-323. (In Japanese with English abstract)
Noguchi, H. (1990). Marginal maximum likelihood estimation of the equating coefficients for two IRT scales using common subjects' design. Bulletin of the Faculty of Education, Nagoya University (Educational Psychology), 37, 191-198. (In Japanese)
Ogasawara, H. (2001). Marginal maximum likelihood estimation of item response theory (IRT) equating coefficients for the common-examinee design. Japanese Psychological Research, 43, 72-82.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50-64.
Pomplun, M., Omar, M. H.,
& Custer, M. (2004). Educational and Psychological Measurement, 64, 600-616.
Reckase, M. D. (1989). Controlling the psychometric snake: Or, how I learned to love multidimensionality. Paper presented at the annual meeting of the American Psychological Association, New Orleans, LA.
Reckase, M. D. (1997). A linear logistic multidimensional model for dichotomous item response data. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 271-286). New York: Springer-Verlag.
Reckase, M. D. (1998). Investigating assessment instrument parallelism in a high dimensional latent space. Paper presented at the annual meeting of the Society of Multivariate Experimental Psychology, Woodcliff Lake, NJ.
Reckase, M. D., Ackerman, T. A., & Carlson, J. E. (1988). Building a unidimensional test using multidimensional items. Journal of Educational Measurement, 25, 193-203.
Russell, M. (2000). Using expected growth size estimates to summarize test score changes. ERIC/AE Digest Series EDO-TM-00-04. College Park, MD: University of Maryland.
Seong, T. (1990). Sensitivity of marginal maximum likelihood estimation of item and ability parameters to the characteristics of the prior ability distributions. Applied Psychological Measurement, 14, 11-20.
Sykes, R. C., Hou, L., Hanson, B., & Wang, Z. (2002). Multidimensionality and the equating of a mixed-format math examination. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Snieckus, A. H., & Camilli, G. (1993). Equated score scale stability in the presence of a two-dimensional test structure. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta, GA.
Stocking, M. L., & Eignor, D. R. (1986). The impact of different ability distributions on IRT pre-equating (Research Rep. No. 86-49). Princeton, NJ: Educational Testing Service.
Stocking, M. L. (1990). Specifying optimum examinees for item parameter estimation in item response theory. Psychometrika, 55(3), 461-475.
Stone, C. A. (2000). Monte-Carlo based null distribution for an alternative fit statistic. Journal of Educational Measurement, 37, 58-75.
Swaminathan, H., & Gifford, J. A. (1983). Estimation of parameters in the three-parameter latent trait model. In D. J. Weiss (Ed.), New horizons in testing (pp. 13-30). New York: Academic Press.
Tate, R. (2003). A comparison of selected empirical methods for assessing the structure of responses to test items. Applied Psychological Measurement, 27, 159-203.
Toyota, H. (1986). An equating method of two latent ability scales by using subjects' estimated scale values and test information. Japanese Journal of Educational Psychology, 34, 163-167. (In Japanese with English abstract)
van der Linden, W. J., & Hambleton, R. K. (1997). Item response theory: Brief history, common models, and extensions. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 1-28). New York: Springer-Verlag.
Wang, M. M. (1985). Fitting a unidimensional model to multidimensional item response data: The effects of latent space misspecification on the application of IRT. Unpublished doctoral dissertation, University of Iowa, Iowa City.
Yamashiro, A. D., & Yu, J. (2005). The ECCE three-year technical report: 2001-2003. Ann Arbor, MI: English Language Institute, University of Michigan.
Yamashiro, A. D., & Yu, J. (2005). The ECPE three-year technical report: 2002-2004. Ann Arbor, MI: English Language Institute, University of Michigan.
Yen, W. M. (1981).
Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three parameter logistic model. Applied Psychological Measurement, 8, 125-145.
Yen, W. M. (1985). Increasing item complexity: A possible cause of scale shrinkage for unidimensional item response theory. Psychometrika, 50(4), 399-410.
Yen, W. M. (1986). The choice of scale for educational measurement: An IRT perspective. Journal of Educational Measurement, 23(4), 299-325.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG [Computer program]. Chicago, IL: Scientific Software International.