A COMPARISON OF RATER CALIBRATION METHODS

BY

George Stephen Denny

A DISSERTATION

Submitted to Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

1990

ABSTRACT

A COMPARISON OF RATER CALIBRATION METHODS

BY

George Stephen Denny

When the quality of any performance is measured with human judgment, the score assigned depends both on the quality of the product and on how raters use the rating scale. As much as possible, idiosyncratic rater differences should be minimized. This study investigates a variety of methods of statistically adjusting raters' scores based on how their scoring compares to the scoring patterns of others. The ten methods compared were no equating (NO), mean equating (MN), truncated mean equating (TMN), linear equating (LI), truncated linear equating (TLI), equipercentile equating (EQP), ordinary least squares (OLS), truncated least squares (TLS), Rasch extension (RAS), and partial credit model (PCM).

Data were from a suburban school district's writing assessment and from a simulation based on the PCM. Simulated data varied in the number of raters per paper, the number of rating scale points, the total number of papers scored, and the distribution of paper quality. Simulated raters varied in the stringency and spread of their scores. With the real data, methods were compared based on the relative proximity of their adjusted scores and with respect to their effects on passing rates relative to a given cut-score. With the simulated data, adjusted scores were compared to the true scores expected from the model given the generating parameters. Differences among methods were measured by root mean squared error, correlation, and maximum score difference statistics.

In the real data sets, the methods produced adjusted scores that differed substantially from one another and from the raw scores. However, no judgment of which method worked best was possible because true scores for the papers were unknown. In the simulated data sets, the simpler methods (TMN and TLI) reproduced true scores well under all scoring conditions. EQP did well in data sets with more rating scale points. PCM did well in data sets with many papers and raters and few scale points. OLS, TLS, and RAS generally did worse than no equating. In assessment settings where raters differ in stringency and papers are randomly assigned, TMN and TLI are recommended for statistical adjustment of scores.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
CHAPTER ONE
  Controlling for Rater Differences
    Training
    Multiple Raters
    Statistical Adjustment
    When Adjustment is Not Recommended
  Methods of Score Adjustment
    No Equating (NO)
    Mean Equating (MN)
    Linear Equating (LI)
    Equipercentile Equating (EQP)
    Ordinary Least Squares (OLS)
    Rasch Extension (RAS)
    Partial Credit Model (PCM)
    Other Adjustment Methods
      Truncated mean equating (TMN)
      Truncated linear equating (TLI)
      Equipercentile equating with smoothing
      The Rasch Rating Scale Model
      Rater Response Theory
    Summary of Adjustment Methods
CHAPTER TWO: A REVIEW OF PREVIOUS STUDIES
  Contexts Requiring Performance Assessment
    Job Performance
    Large-scale Testing
  The Problem of Unreliability
  Research Studies on Rater Effects
    Early Studies
    Paul
    de Gruijter
    Cason and Cason
    Braun
    Wilson
    Houston, Raymond, Svec, and Webb
    Lunz, Linacre, and Wright
    Denny
  Summary
CHAPTER THREE: METHOD
  Methods Compared
    No equating
    Mean equating
    Linear equating
    Equipercentile equating
    OLS
    Rasch extension
    PCM
  Data Sets
    Real Data
    Simulated Data
  Criteria for Comparing the Methods
CHAPTER FOUR: RESULTS
  Writing Assessment Data
    RMSDs Compared
    Passing Rates
  Simulated Data
    Root Mean Square Errors (RMSEs)
      Overall
      Number of Raters Per Paper
      Number of Papers Scored
      Number of Rating Scale Points
      Paper Quality Distribution
      Rater Type
    Correlations
    Maximum Difference
    One-Rater Data Sets
CHAPTER FIVE: DISCUSSION
  Summary by Method
    No equating (NO)
    Mean Equating (MN) and Truncated Mean Equating (TMN)
    Linear Equating (LI) and Truncated Linear Equating (TLI)
    Equipercentile equating (EQP)
    OLS, TLS, and Rasch Extension
    Supplemental Analysis
    Partial Credit Model
  Recommendations
    Better Models of Rater Scoring
    Better True Score Estimates for Real Data
    Implications for Practice
EPILOGUE
References
APPENDIX A: PROGRAMS FOR GENERATING AND ANALYZING SIMULATED DATA SETS
  RATER.BAS
  PARGEN.BAS
  TRUEGEN.BAS
  OBSGEN.BAS
  EZEQ.BAS
  EQPEQ.BAS
  OLS.BAS
  RASCHEXT.BAS
  PCM.BAS
APPENDIX B: DISTRICT WRITING ASSESSMENT DATA: WRITING ASSIGNMENT
APPENDIX C: DISTRICT WRITING ASSESSMENT DATA: MODIFIED HOLISTIC SCORING CRITERIA

LIST OF TABLES

Table 1: Comparison of Adjustment Methods
Table 2: Frequencies of Scores Assigned by Raters in the Real Data Set
Table 3: Stringency Parameters for Each Rater in the Simulated Data Sets
Table 4A: Method Comparisons for the Writing Assessment Data, GRADE 5
Table 4B: Method Comparisons for the Writing Assessment Data, GRADE 8
Table 4C: Method Comparisons for the Writing Assessment Data, GRADE 11
Table 5: Pass/Fail Decisions for Adjusted Scores Relative to Unadjusted Scores for the Writing Data
Table 6: Scoring Frequencies for One Simulated Data Set (2+551)
Table 7: Overall RMSEs for the Simulated Data
Table 8: Overall RMSEs Averaged by Facet (1-Rater Data Sets Omitted)
Table 9: RMSEs by Rater Type by Facet
Table 10: RMSEs by Rater Spread by Facet
Table 11: Correlations for the Simulated Data Sets
Table 12: Maximum Score Difference Observed - True for All Simulated Data Sets and Each Method
Table 13: Maximum Differences Averaged Across Methods (1-Rater Data Sets Omitted)
Table 14: Average RMSEs for the One-Rater Simulated Data Sets
Table 15: Supplemental Data Set Simulated From a Linear Model

LIST OF FIGURES

Figure 1: A Graphical Demonstration of Mean Equating
Figure 2: A Graphical Demonstration of Equipercentile Equating
Figure 3: Expected Score Functions with the Rasch Extension
Figure 4: Probability Curves for the Partial Credit Model
Figure 5: Expected Score Function for the Partial Credit Model
Figure 6: Graphical Representation of a Rater Characteristic Curve
Figure 7: Graph of Average RMSDs for the Grade 5 Data
Figure 8: Graph of Average RMSDs for the Grade 8 Data
Figure 9: Graph of Average RMSDs for the Grade 11 Data
Figure 10: Graph of Average RMSDs for the Three Grades Combined

LIST OF ABBREVIATIONS

EQP: Equipercentile equating
EXP: Exponential function, base e = 2.718...
GLS: Generalized least squares
IMPUTE: Imputation of scores for missing rater/paper combinations
LI: Linear equating
LN: Natural logarithm function, base e = 2.718...
MN: Mean equating
NO: No equating
NP: Number of papers
NR: Number of raters
NS: Number of scale points
OLS: Ordinary least squares
PCM: Partial Credit Model
RAS: Rasch extension
RCC: Rater characteristic curve
RMSD: Root mean squared difference
RMSE: Root mean squared error
RRS: Rasch rating scale
RRT: Rater response theory
SD: Standard deviation
TLI: Truncated linear equating
TMN: Truncated mean equating
WLS: Weighted least squares

INTRODUCTION

May 9

Ivan and Anita are both seniors in high school. Each has completed all graduation requirements except one--passing the writing component of the competency test. Today is their final opportunity to pass the test in time to graduate with their class. To pass the test, Ivan and Anita must write essays that receive a total of 7 points from two raters scoring on a 5-point scale. Both are somewhat nervous and have never done especially well on writing assignments. Both write essays of borderline quality.

May 11

A team of English teachers trained in holistic scoring rate the essays. Mrs. Redpen and Mr. Markov read Ivan's paper and each teacher rates it a 3. Ivan's total is 6; he fails the competency requirement and does not receive a high school diploma. Miss Dove and Mr. Laxer score Anita's essay, and each teacher rates it a 4. Anita's total is 8; she passes the writing requirement and graduates with her classmates.

July 3

A subsequent analysis of the essay scoring reveals that raters differed significantly in the level of scores they assigned. In fact, the two most stringent raters, Redpen and Markov, assigned average scores one point lower than the average score over the entire team of raters. The two most lenient raters, Dove and Laxer, assigned scores that averaged one point higher than the overall average.

August 12

Ivan's parents become aware of the results of this study and file suit to force the district to award Ivan a diploma. Their attorney makes the case that Ivan was failed only because of the luck of the draw in who rated his paper. The attorney argues that "if Ivan's paper had been scored by lenient or even average difficulty raters, he would not have been denied a diploma. The other graduation tests in the district are constructed so that various forms of the test are of equal difficulty. In the same way, scoring on the writing test should be made as equitable as possible." The school district superintendent claims the scoring system was satisfactory, because using statistics to adjust essay test scores is not standard educational practice. Besides, she argues, "if statistics are used to adjust some scores higher, then they should also be used to make other scores lower. Another student whose paper was scored by lenient raters would not have been allowed to graduate if scores were adjusted. Therefore, the current system is fair."

If you were the judge hearing the case, how would you rule?

CHAPTER ONE

Many forms of assessment used in education require human judgment. Of the three R's, reading and arithmetic achievement are measured extensively with multiple-choice tests while writing achievement is more often assessed by ratings from trained judges. In other contexts, such as artistic or musical performance, evaluations are totally dependent on human judgment. When an instructional objective requires that a student produce original work, the quality of that work is measured by the ratings of evaluators. A problem with human judgment is lack of consistency--the same performance can receive different scores depending on who assigned the ratings.
For scores to be meaningful and fair, differences among raters should be minimized. The focus throughout this study is on writing assessment and the terminology reflects that focus. Performances are referred to as "papers" which vary in their degree of "quality". The principle of minimizing individual differences among raters in how they use a rating scale can be generalized to many other contexts.

Controlling for Rater Differences

Coffman (1971) listed three ways in which raters differ:
(a) in the level of scores they assign: some raters are more lenient or more stringent than others,
(b) in the spread of scores they assign: some raters use the extreme values on the scale more than do others, and
(c) in idiosyncratic ways: raters differ in how they weight various aspects of a piece of writing as they assign an overall score to it.

Training

Training represents one attempt to minimize rater differences. A typical training session consists of two parts. First, all raters are presented with clear descriptions of what papers are like at each point on the rating scale. Second, raters practice scoring papers of known quality as determined by experienced raters (anchor papers), and discuss as a group why they assigned particular ratings to specific papers. A rating supervisor can follow up this training with activities designed to monitor or maintain rating practice, such as periodically assigning anchor papers for scoring, or watching for raters who consistently assign scores that are lower or higher than other raters, and then giving additional training as needed.

Multiple Raters

Another method that lessens rater differences involves using multiple raters. When more than one rater scores a paper and then the ratings are averaged, the effect of any one unusual rating on the total score is diminished. A common practice is to randomly assign two raters to a paper. If their scores differ by more than one scale unit, a third rater also scores the paper and the most discrepant score of the three is omitted. Increasing the number of raters will increase the reliability of the score assigned to any paper. Ideally, every rater would read every paper, but the excessive cost and the law of diminishing returns make such scoring prohibitively inefficient. Using two raters for all papers, and a third only if scores are discrepant, is generally viewed as a cost-effective compromise for increasing reliability.

Doubling the number of raters will double the cost of scoring, but will not double the reliability of scores. The effect on reliability of doubling the number of raters can be estimated directly from the Spearman-Brown Prophecy formula. If a measure with reliability rxx is replaced with an equivalent measure having K times as many observations, the new reliability Rxx is given by

Rxx = K·rxx / (1 + (K - 1)·rxx).

For example, if with two raters per paper scores had reliability .70, then with four raters (K = 2) reliability would be Rxx = 2(.70) / (1 + (2 - 1)(.70)) = .82. Going from two raters per paper to four increased reliability only .12 points. Going from four raters per paper to eight would increase reliability even less.

Statistical Adjustment

Even when raters are trained and monitored, and even when discrepant scores are omitted, raters still differ in how they assign scores to papers. Another method to reduce rater differences is to statistically adjust each rater's scores to compensate for these idiosyncratic differences.
This type of statistical adjustment of scores parallels the theory of test equating across multiple forms of the same test. Tests are equivalent if an examinee is expected to receive the same score regardless of which test form is used. To make tests equivalent, scores on each form of a test are adjusted so that scores on any one form have the same meaning as scores on other forms of the test. 6 There are various designs for equating test scores across different test forms. One equating design is based on equivalent groups. In this type of design, a large group of examinees is randomly divided into groups and each group is given a different form of the test. With random assignment into large groups, the average ability levels of the groups are nearly equal. Any group differences in test performance are assumed to be due to differences in the difficulty of the test forms. To compensate for these differences, scores on the more difficult forms are raised and scores on easier forms are lowered to make the scores across different forms equivalent. Scoring with rating scales presents a parallel situation. When adjusting for different raters, the raters are viewed as different forms of a test. If a large number of papers are randomly assigned to raters for scoring, then the scoring pattern for each rater should be about the same. When evidence suggests that raters vary in how they assign scores, some type of statistical adjustment may be appropriate. A given paper should have the same expected score irrespective of which raters assign the scores. When Adjustment is Not Recommended Before discussing various types of statistical adjustment, it is important to note the following situations in which it would not be appropriate or necessary to adjust raters' scores: 1. If the differences between raters' scores are not statistically significant, no adjustment is necessary. To test for mean score differences, use an F—test in an analysis of variance. A homogeneity of variance test (available as an option to the analysis of variance procedure in some statistical software) tests whether the raters differ significantly in how spread out their scores are. A chi— square test can determine differences in rater scoring patterns at any point on the rating scale. A Kolmogorov-Smirnov test does the same but with slightly more power because it takes advantage of the ordinal nature of the rating scale. When raters read only a few papers, sampling error can be confounded with rater stringency. For example, if someone only rated 5 papers and 2 of those were exceptionally poor, the rater could appear to be unusually stringent because of the small sample size. By only adjusting scores when differences among raters achieve statistical significance, spurious rater differences resulting from small sample sizes can be avoided. In large—scale assessments employing multiple raters, there is typically no problem with small sample sizes. 2. If papers are not randomly assigned to raters, one would not expect that raters would assign scores the same way. For example, if rater A graded a set of papers from an honors class and rater B graded a set from a regular class, rater A should not be considered lenient despite assigning higher scores than rater B, because the honors class papers are likely better than those of the regular class. The simpler equating methods ignore group differences, whether they are due to sampling error or to non—random assignment. 
The more complex methods require two or more raters per paper and take into account not only the scores a rater assigned to a set of papers, but also the scores that other raters assigned to those papers. These more sophisticated methods control for sampling error and non-random assignment.

3. If all raters score all papers, no adjustment is necessary. Any rater effects should affect all scores equally. When only two raters read each paper, those papers that happen to be scored by the two most stringent raters will be at a disadvantage compared to those scored by the two most lenient raters. Although scores in general will be fairly accurate with two raters, some individuals will receive scores that fail to reflect the true quality of their work. Rater adjustment would alleviate this problem.

4. If the uses made of the scores are of little importance, then rater calibration is unnecessary. For example, if a district wanted to know whether writing was improving in the district and only aggregate scores were used, then scores at the individual level need not be adjusted for rater effects. On the other hand, if a minimal writing score has to be obtained before a high school diploma is granted, then scores at the individual level must be as fair and equitable as possible, and some type of adjustment is appropriate.

5. In some political climates, adjustment of scores may not be acceptable. However, most people understand that some raters are tougher than others and would accept the need for some type of adjustment. Of course, individuals who have their own scores lowered might be less accepting of score adjustment.

Methods of Score Adjustment

Many methods, varying in their complexity and accuracy, have been suggested for adjusting rater scores. Some of these methods are discussed briefly here, then are described more fully in Chapter 3.

No Equating (NO)

This is the simplest method, where all scores are accepted at face value. When multiple raters score a paper, their scores are averaged or totalled to get the paper's score. This method is widely used, typically because one or more of the conditions listed above (e.g. random assignment, importance) fail to hold. When raters differ in how they assign scores, all rater scoring differences affect the scores papers actually receive.

Mean Equating (MN)

In this method, each rater's scores across all papers are averaged. Any rater who assigns a mean score lower than the overall mean score is considered stringent, and any rater who assigns a mean score higher than the overall mean score is considered lenient. A stringent rater's scores are all shifted upward and a lenient rater's scores are all shifted downward, so that all raters have the same mean after adjustment. This method compensates for rater differences in score level, but not for differences in score spread or score distribution shape. It also assumes that all between-rater differences are due to rater stringency differences and not to mean differences in paper quality across raters.

Figure 1 illustrates mean equating on a simple data set consisting of only two raters. The first rater is stringent, assigning a mean score of 4; the second rater is lenient, assigning a mean score of 6 on a 9-point scale. Overall, the average score assigned is a 5, so the stringent rater's scores are increased by 1 and the lenient rater's scores are decreased by 1.
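The calibration programs used in this study were written in BASIC (Appendix A). Purely as an illustration of the mean-equating adjustment just described, a minimal Python sketch might look like the following; the rater labels and score lists are hypothetical, chosen so the two raters have means of 4 and 6 as in the Figure 1 example.

```python
# Illustrative sketch of mean equating (MN); rater names and scores are hypothetical.
from statistics import mean

# scores[rater] = list of scores that rater assigned (9-point scale, as in Figure 1)
scores = {
    "stringent": [2, 3, 4, 4, 5, 3, 4, 5, 4, 6],   # mean 4.0
    "lenient":   [5, 6, 6, 7, 5, 6, 7, 6, 5, 7],   # mean 6.0
}

overall_mean = mean(s for rater_scores in scores.values() for s in rater_scores)

adjusted = {}
for rater, vals in scores.items():
    shift = overall_mean - mean(vals)      # stringent raters shifted up, lenient down
    adjusted[rater] = [s + shift for s in vals]

# Truncated mean equating (TMN, described later) would additionally clip each
# adjusted score to the range of the rating scale, e.g. max(1, min(9, s)).
for rater, vals in adjusted.items():
    print(rater, round(mean(vals), 2), [round(v, 1) for v in vals])
```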
The distributions of adjusted scores have the same shape as the distributions of assigned scores, but have the overall mean.

[Figure 1: A Graphical Demonstration of Mean Equating. Histograms of the scores assigned by the stringent rater (mean 4), the lenient rater (mean 6), and both raters combined (mean 5), followed by the adjusted score distributions, each with mean 5.]

Linear Equating (LI)

This method considers both the mean and standard deviation (SD) of each rater's scores. Scores are adjusted linearly so that all raters have the overall mean and SD after adjustment. If the scores rater i assigns have a mean of mi and a SD of si, and over all papers and raters the mean is m and the SD is s, then a score xi is adjusted to yi where

yi = [(xi - mi)/si]·s + m.

Linear equating compensates for rater differences both in score level and score spread, but it also assumes that there are no differences in the distribution of paper qualities across raters.

Each of these three methods assumes that a one-unit difference in score has the same meaning throughout the rating scale. In reality, raters may have different standards for the level of performance required to achieve a particular score. For example, a rater might give many 4s but few 3s or 5s, compared to other raters. This scoring pattern suggests that for this rater, the quality of paper necessary to get a 4 is considerably less than the quality of paper required to merit a 5. This type of scoring pattern is local in scope, and only indirectly affects means and SDs, which are global parameters.

Equipercentile Equating (EQP)

This method takes into account differences in rater scoring at each level of the rating scale, and thus is slightly more complex than the methods already described. Overall, ratings have a cumulative frequency distribution. For example, on a 9-point scale, 2 percent of the scores may be 0, 7 percent are 0 or 1, 13 percent are 0 through 2, and so on, until 100 percent of the scores are in the range 0 through 9. Graphically, these points can be connected with line segments to describe an increasing function from the point (0,2) to the point (9,100). One such cumulative frequency distribution is illustrated by the lower line in Figure 2.

In equipercentile equating, a rater's scores are adjusted to the overall score with the same percentile rank, using linear interpolation as necessary. For example, if 23 percent of the scores rater A assigned are 3 or less, while overall only 19 percent are 3 or less and 27 percent are 4 or less, then a 3 from rater A would be adjusted to a 3.5, which represents the 23rd percentile over all scores. In terms of the graph in Figure 2, the upper line represents the cumulative frequency distribution for rater A. To transform a score from rater A, locate the score on the horizontal axis, move vertically to the line of rater A, then horizontally to the overall line, then back down to the horizontal axis to the point that represents the adjusted score.

[Figure 2: A Graphical Demonstration of Equipercentile Equating. Cumulative percentage is plotted against rating scale points (0 to 8), with the upper curve for rater A and the lower curve for all raters combined.]

Ordinary Least Squares (OLS)

Linear models, such as ordinary least squares (OLS) or weighted least squares (Wilson, 1988; Houston, Raymond, & Svec, 1990; de Gruijter, 1984), model rater effects by additive constants. The usual additive model is

yij = αi + δj + eij

where yij is the score given to paper i by rater j, αi is the true score of paper i, δj is the scoring bias for rater j, and eij is random error.
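As an illustration only (not the OLS.BAS program of Appendix A), the additive model can be fit by least squares using a design matrix of paper and rater dummy variables. The small data set below is hypothetical, and the sum-to-zero constraint on the rater effects is one common way to make the model identifiable; it is an assumption of this sketch, not a detail taken from the study.

```python
# Illustrative sketch: fit the additive model y_ij = alpha_i + delta_j + e_ij
# by ordinary least squares. Each entry is a (paper, rater, score) triple.
import numpy as np

ratings = [
    (0, 0, 3), (0, 1, 4),
    (1, 1, 2), (1, 2, 3),
    (2, 0, 5), (2, 2, 5),
    (3, 2, 1), (3, 0, 2),
]
n_papers, n_raters = 4, 3

# Design matrix: one dummy column per paper (alpha) and per rater (delta).
X = np.zeros((len(ratings), n_papers + n_raters))
y = np.zeros(len(ratings))
for row, (p, r, s) in enumerate(ratings):
    X[row, p] = 1.0
    X[row, n_papers + r] = 1.0
    y[row] = s

# The model is over-parameterized (adding a constant to every alpha and
# subtracting it from every delta changes nothing), so impose sum(delta) = 0
# by appending that equation as a pseudo-observation.
constraint = np.zeros(n_papers + n_raters)
constraint[n_papers:] = 1.0
X = np.vstack([X, constraint])
y = np.append(y, 0.0)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, delta = beta[:n_papers], beta[n_papers:]
print("estimated paper scores :", np.round(alpha, 2))
print("estimated rater biases :", np.round(delta, 2))
# A paper's adjusted score is its alpha, i.e. its score with rater bias removed.
```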
To estimate the parameters of this model, each paper must be scored by at least two raters. By estimating the rater effects parameters simultaneously, the assumption that all raters are scoring equivalent distributions of paper quality is unnecessary. Computationally, the parameter estimates are not iterative as are the non-linear Rasch models, so they require less computing time. But the computations involve matrix algebra and so they must be performed with a computer rather than a calculator. The use of a linear model results in distortions at the high and low ends of the rating scale, so for example, a paper which receives a perfect score from stringent raters is adjusted even higher than a perfect score for lenient raters. Weighted least squares differs from OLS by weighting the scores of consistent raters more than the scores of inconsistent raters in estimating parameters, whereas OLS weights all raters' scores the same.

Rasch Extension (RAS)

This method, described by de Gruijter (1984), models a curvilinear relationship between a paper's underlying quality and the paper's expected score from a rater. Originally proposed by Choppin (1982), the model states that the expected score Rij of a paper with quality level βi when rated by a judge with stringency parameter δj on a scale ranging from 0 to M is given by

Rij = M·exp(βi - δj) / (1 + exp(βi - δj)).

If M = 1 the formula looks like the Rasch 1-parameter item response model, but that model gives probabilities of correct answers whereas the Rasch extension yields expected scores. The function is graphed in Figure 3 for raters with stringency parameters δ1 = -1.0, δ2 = 0.0, and δ3 = 1.0, scoring on a 5-point scale.

[Figure 3: Expected Score Functions with the Rasch Extension. Expected score (0 to 5) is plotted against paper quality (-3 to 3) for the three raters.]

A paper of quality level 1.0 has an expected score of 4.4 from rater 1 (lenient), 3.7 from rater 2 (average), and 2.5 from rater 3 (stringent). To estimate the parameters of the model, the curve is transformed into a linear model, and the matrix solution of the OLS method is applied. Then the rater effect parameters are transformed back into the non-linear form of the model to get the adjusted scores for each paper. The model assumes scoring is continuous, so with discrete scoring categories the Rasch extension may not adequately fit the data.

Partial Credit Model (PCM)

The PCM (Masters, 1982; Wright & Masters, 1982) is another model for rater response based on item response theory. This model assumes separate stringency parameters at each level of scoring for all raters. Originally, the model was intended for test items that contained a finite number of discrete steps, and a partial credit score represented the number of steps (points) that an examinee got correct (received). In particular, a response must earn a k before it can be considered for a k+1. For any item, the steps involved in a solution can vary in difficulty. For example, an item which asks examinees to simplify the expression (4 + 5)^(1/2) - 8 requires three steps to get to a final answer. The first step, adding 4 and 5, is fairly easy so the step from a 0 to a 1 on the item has a low difficulty parameter. This step must be done correctly to continue scoring the item.
The second step, which requires knowledge of fractional exponents, is more difficult and so the difficulty parameter for that step should be higher than those of the other steps. The third step, subtracting 8 from 3, is more difficult than the first but less difficult than the second and so its difficulty parameter should numerically lie between the other two. But to reach the third step, the examinee must do the second step correctly. Thus, relatively few examinees get a partial credit score of 2, because most of those who can get from a 1 to a 2 (by evaluating a fractional exponent) can also get from a 2 to a 3 (by subtracting integers).

A similar phenomenon can occur in a rater's scoring pattern. A popular rating format is

1: demonstrates incompetence
2: suggests incompetence
3: suggests competence
4: demonstrates competence.

A particular rater might have stringent standards to move from a 2 to a 3, because of stringent standards as to what constitutes "competence", but at the same time have lenient standards as to the difference between "suggests" and "demonstrates". This rater would likely give relatively few 3s and relatively more 2s and 4s. Another rater with different standards and understanding of the terms used in the scale might assign many 3s and relatively fewer 2s and 4s. These differences in stringency are not global, but apply at specific points on the scale. The PCM accounts for such rater differences at each level of the rating scale.

The PCM treats each scoring step as a dichotomous item. The probability φnik that the nth paper is judged a k rather than a k-1 by rater i is given by

φnik = exp(βn - δik) / (1 + exp(βn - δik))

where δik is the difficulty parameter for rater i to give the kth level score and βn is the quality parameter of the nth paper. Because getting credit for any step is contingent on having already gotten credit for all previous steps, the steps can be combined into an overall probabilistic model. The probability πnix that the nth paper is judged an x by rater i is given by the formula

πnix = exp[ Σ(j=0 to x) (βn - δij) ] / Σ(k=0 to M) exp[ Σ(j=0 to k) (βn - δij) ]

where x ranges from 0 to M and where the quantity in the numerator is 1 when x = 0.

Figure 4 illustrates probability curves for a 5-point rating scale item. The vertical axis represents probability, and the horizontal represents underlying paper quality. For each level of quality, there are five probabilities corresponding to the five score levels possible from a particular rater. In the graph, these probabilities are represented by the numerals 1 through 5 above each quality level. The probability curves are for a rater with step parameters -4, -1.5, 1.5, and 4. These parameters are the quality levels at which the rater is equally likely to assign a score or the next higher score.

[Figure 4: Probability Curves for the Partial Credit Model. Probability (0 to 1) is plotted against paper quality, with one curve for each of the five score categories.]

As the underlying paper quality moves from very low to very high, the most probable score moves from 1 to 5. Where two curves intersect, the paper is of a quality such that the two scores are equally likely to be assigned. This situation models the real-life indecision of a rater scoring a "borderline" paper.
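The study's PCM estimation was carried out with the PCM.BAS program listed in Appendix A. The sketch below does not estimate anything; it only evaluates the model's category probabilities and the resulting expected score for given parameters, using the step values of Figures 4 and 5 (-4, -1.5, 1.5, 4). The function names and example quality levels are illustrative.

```python
# Illustrative sketch of Partial Credit Model category probabilities and
# expected score for a single rater, for given (not estimated) parameters.
import math

def pcm_probabilities(beta, steps):
    """Return P(score category x), x = 0..M, for paper quality beta and
    rater step difficulties steps[0..M-1] (delta_i1..delta_iM)."""
    # Cumulative sums of (beta - delta_ij); the x = 0 numerator is exp(0) = 1.
    exponents = [0.0]
    for delta in steps:
        exponents.append(exponents[-1] + (beta - delta))
    numerators = [math.exp(e) for e in exponents]
    total = sum(numerators)
    return [n / total for n in numerators]

def pcm_expected_score(beta, steps, lowest_category=1):
    """Expected reported score (Figures 4 and 5 use categories 1..5)."""
    probs = pcm_probabilities(beta, steps)
    return sum((lowest_category + x) * p for x, p in enumerate(probs))

steps = [-4.0, -1.5, 1.5, 4.0]
for quality in (-3.0, 0.0, 1.0, 3.0):
    probs = pcm_probabilities(quality, steps)
    print(quality, [round(p, 2) for p in probs],
          round(pcm_expected_score(quality, steps), 2))
```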
Figure 5 combines the probabilities with the score values to give the expected score of a paper as a function of its quality. The horizontal axis represents paper quality, and the vertical axis represents the expected score for the paper from a rater with step parameters -4, -1.5, 1.5, 4 as in Figure 4.

[Figure 5: Expected Score Function for the Partial Credit Model. Expected score is plotted against paper quality (-6 to 6).]

This curve is similar to the Rasch extension curves in Figure 3, except the shape of the PCM expected score curve varies depending on the values of the difficulty parameters at each step of the rating scale.

To use the PCM for rater score adjustment, first estimate the difficulty parameters for each judge and quality parameters for each paper that best fit the data. Then substitute the parameters back into the equations of the model to get the expected score of each paper for each rater, including the ones who did not actually rate the paper. The average over all raters is the adjusted score of the paper. The PCM offers a great deal of flexibility, by modeling differences in rater stringency at each point of the rating scale and by estimating scores even for those raters who did not rate the paper. One problem with the model is the large number of parameters which must be estimated, often with little data. Parameter estimates may be unstable or inaccurate, particularly for small data sets.

Other Adjustment Methods

A variety of other adjustment methods also address the problem of inconsistency across raters. Most are variants of the methods described above.

1. Truncated mean equating (TMN) is mean equating, but with a ceiling and floor imposed, so that no score is adjusted above the highest score on the rating scale and no score is adjusted below the lowest score on the scale.

2. Truncated linear equating (TLI) is linear equating, with scores truncated to the range of scores possible on the original scale.

3. Equipercentile equating with smoothing replaces the segmented cumulative frequency curves described earlier with smooth curves in an attempt to reduce the effect of having discrete score categories. Instead of abruptly changing the slope of the curve at the category endpoints and using linear interpolation, the slope of the curve changes more gradually and some type of curvilinear interpolation is used.

4. The Rasch Rating Scale Model (Andrich, 1978; Wright & Masters, 1982) is another in a family of polychotomous response models. These models apply in situations where a response is scored with more than two categories of quality, such as essay scoring and rating scales of all kinds. In complexity, the rating scale model is intermediate to the Rasch extension and the PCM. The parameters in this model represent rater differences in overall stringency, and different sizes of steps between the various score categories, but it assumes that the different step sizes are the same for all raters. For example, the rating scale model assumes that the difference between a 2 and a 3 for any particular rater equals the difference between a 2 and a 3 for all other raters. This simplifying assumption reduces the number of distinct parameters which are estimated compared to the PCM, yet is still a more flexible model than the Rasch extension. The equations for this model are identical to the PCM, except the parameter δik, which represents the difficulty parameter for rater i's kth step, is replaced by δi + tk, where δi is the overall stringency parameter for rater i and tk is the difficulty for the kth step across all raters.
The number of rater parameters to be estimated is thus reduced from the product i·k in the PCM to the sum i + k in the Rating Scale Model.

5. Rater Response Theory, developed by Cason and Cason (1984), is similar to the Rasch extension described above, but instead of using the logistic function, the model is based on the normal ogive. The two mathematical functions are virtually identical, graphed as S-shaped curves with upper and lower asymptotes to the right and left, respectively. Because the logistic function is easier to work with computationally than the normal ogive, it has become the more widely used of the two functions in item response theory.

Summary of Adjustment Methods

A variety of calibration methods statistically adjust for differences among raters in how they assign scores. Rater scoring patterns differ in their overall level of scores, in their overall spread of scores, and in their proportion of scores at each score level. All adjustment methods account for differences in rater means, but only some of the methods account for the other types of differences in scoring patterns. The simpler methods ignore sampling error and assume equivalent quality distributions of the papers each rater scores. The methods vary considerably in computational complexity, ranging from mean equating which can be done with a hand-held calculator or spreadsheet program, to the PCM which requires hours of computer time to perform the computations necessary to estimate parameters. The characteristics of each method are summarized in Table 1.

The focus of this study is on the accuracy of several of these adjustment methods. The study attempts to determine the extent to which each method improves the quality of scoring when only a subset of raters scores each paper. The methods are compared over several data sets which vary in the number of papers scored, the number of rating scale points, and the way paper qualities are distributed. Within a data set, raters vary in their level and spread of scores. By understanding how effective these adjustment methods are in a controlled study, better decisions can be made about which adjustment method to apply in real scoring situations.

Table 1
Comparison of Adjustment Methods

                                          NO  MN  LI  EQP  OLS  WLS  RRT  RAS  RRS  PCM
Adjusts for differences
  in overall level?                        N   Y   Y   Y    Y    Y    Y    Y    Y    Y
  in overall spread?                       N   N   Y   Y    N    Y    Y    N    N    Y
  at each point of the scale?              N   N   N   Y    N    N    N    N    N1   Y
Recognizes sampling error?                 Y   N   N   N    Y    Y    Y    Y    Y    Y
Possible with only one rater per paper?    Y   Y   Y   Y    N    N    N    N    N    N
Computational complexity? (0=Lo, 5=Hi)     0   1   1   2    3    3    4    4    5    5
Can adjust scores off the scale?           N   Y   Y   N    Y    Y    N    N    N    N
Number of rater parameters estimated?      0   R   2R  PR   R    2R   4R   R    P+R  PR

KEY
R: Number of raters             P: Number of points on rating scale
NO  = No equating               WLS = Weighted least squares
MN  = Mean equating             RRT = Rater Response Theory
LI  = Linear equating           RAS = Rasch Extension
EQP = Equipercentile equating   RRS = Rasch Rating Scale
OLS = Ordinary least squares    PCM = Partial Credit Model

1. The Rasch Rating Scale model allows for varying difficulties at each step of the rating scale, but assumes the same step size for each rater.

CHAPTER TWO
A REVIEW OF PREVIOUS STUDIES

Research in rater calibration methods is fairly recent. This fact is somewhat surprising because extensive research has been done on equating across forms of objective tests (Angoff, 1971; Petersen, Kolen, & Hoover, 1989), yet subjective measures such as essay tests have a much longer history in education.
Objective tests cost less to score and their scores are more reliable than are scores from essay tests; thus they continue to dominate large-scale testing. In recent years, however, performance assessment has played an increasing role in educational testing. Essays are used for measuring general writing ability as well as for evaluating student achievement in content areas. The emphasis in performance assessment is on what students can do rather than on what they know; on active production rather than on passive response; on recall rather than on recognition. But rating a performance is more subjective than scoring an objective test because rater biases can be confounded with the quality of the performance. The increased interest in performance assessment has led to research that addresses how best to separate the quality of a performance from the effects of the particular raters scoring that performance. Contexts Requiring Performance Assessment Some abilities can only be adequately measured with human judgment. Organizing thoughts and expressing them in writing are 26 27 important in nearly every subject area, and assessing the quality of this organization and expression requires human judgment. In problem solving, the processes used to arrive at an answer can be important and assignment of partial credit for correct steps leading to a final answer is typically based on subjective ratings. Ratings are also necessary to measure the quality of any creative product, such as a painting, a term paper, or a science fair project. A growing trend in educational measurement is toward performance assessment. By simulating real—life situations an examiner can evaluate a wide range of behaviors in realistic contexts and thus obtain more valid measures of an examinee's ability to perform certain tasks. These measures can be either objectively or subjectively scored, but generally involve a qualitative assessment of the quality of an educational product, as determined by rater judgment. Job Performance Performance assessment has a rich literature in industrial psychology, where rating scales are the primary means of measuring job performance. Wherry (1950) and Landy and Farr (1980) provided extensive reviews of performance rating research. Wherry's theory of rating (Wherry, 1952; Landy & Farr, 1983), which involved partitioning variance, foreshadowed later developments in generalizability theory. In 1980, Landy and Farr reviewed research studies in performance rating and suggested that further attempts to improve rating quality by adjusting the formats of rating scales were likely to prove unfruitful. Instead, they recommended more research into statistical control of common rating errors. They optimistically cautioned that "although 28 this is a mechanical solution that implies no increase in the understanding of the rating process, it offers the possibility of simultaneously providing the practitioner with better numbers and the researcher with hypotheses" (p. 101). Large—scale Testing Large—scale testing programs have increasingly begun to use subjective measures of achievement and ability. Godshalk, Swineford, and Coffman (1967) found that the predictive validity of a test of writing skills composed of multiple-choice items significantly increased when an item requiring a writing sample was added to the test. 
Performance measures requiring human judgment can test a broad range of skills in a variety of stimulus conditions, but when they are subjectively scored the score depends both on the person producing the performance and on the person rating the performance. The Problem of Unreliability The major problem with performance measures is lack of reliability in scoring. As Breland (1983) notes, "reliability has always been the Achilles heel of essay assessment" (p. 23). Clearly, the score on a single essay is an imperfect and unreliable indicator of a larger construct such as writing ability. To get a reliable measure of an individual's performance on essay questions, one would need several essays on different topics, written on different days and read by different raters. Generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Brennan, 1983) considers the question of reliability broadly. Writing tasks consist of several distinct facets, any of which can 29 systematically introduce variance into scoring. A person's writing ability varies depending on the type of writing, such as narrative, persuasive, or expository. The specific prompt (writing task) assigned makes a difference in the quality of essays. Score variation can also result from the assignment of raters to papers. Even when all these sources of variation are accounted for, error variance remains. Error variance measures differences in essay ratings that are not explained by any of the previously enumerated factors. Considering these sources of variance helps to answer the question of how well an essay score measures an examinee's ability. A narrower question is how well assigned ratings reflect the quality of any particular paper. The term inter—rater reliability refers to this specific type of reliability; it measures the level of agreement among raters. (For a more complete discussion of inter—rater reliability, see Weare, Moore, and Woodall, 1987.) Breland (1983) approximated reliabilities of writing ability assessments under various scoring conditions. Typically, a single essay with one rater has reliability of only .42. Adding a second rater increases reliability to .53. The use of two topics scored by different raters increases the reliability to about .57. Three essays with three different modes of discourse and three different raters result in estimates of writing ability with reliability .79. To get an ability estimate having reliability of .85 with only one reading per essay would require nine essays, and three different modes. With two readings per essay, six essays and two different modes would result in a reliability of .86. 30 As an example of how low the agreement among raters can be, Breland (1983) recounted a study by the Educational Testing Services where 300 essays written by college freshmen were rated on a 9-point scale by 53 raters from several different fields. The dispersion of ratings was large for each paper rated. None of the essays received fewer than five different ratings out of the nine possible. In fact, 23 percent of the essays received seven different ratings, 37 percent received eight different ratings, and 34 percent of the essays received all nine possible ratings! Despite the recognition that rater effects were problematic, little research was done on controlling for these effects, probably for two reasons: 1. Large scale assessment involving rater scoring was not prevalent. With small data sets, typically all raters score all papers and rater effects cancel out. 
When a subset of raters score each paper, the problem of sampling error exists. But separating rater leniency from paper quality is more difficult with small data sets. 2. More sophisticated scoring models that allow for separation of rater stringency and paper quality had not been developed, and the computing resources necessary to estimate parameters were not readily available. Consequently, most research into statistical control of rater effects occurred after 1980. Research Studies on Rater Effects Most research has investigated the viability of using statistical techniques to control for differences in overall rater stringency. 31 Most studies consist of models for rater effects being applied to data, either real or simulated, to determine how well the models reduce rater effects. All studies suggest that some form of rater calibration is desirable. Early Studies Ebel (1951) was one of the first to consider the problem of estimating reliability in a context where only a subset of raters rates each of a group of students. Ebel based his reliability estimates on an analysis of variance, applying the intraclass correlation formula to rater judgments. The between-raters variance component could either be included or excluded from the formula, depending on whether rater effects were retained or removed from the final scores. Guilford (1954) recommended an analysis of variance approach to control for rater differences, largely addressing the situation where all raters score all subjects on several traits. With reference to an incomplete matrix of ratings, Guilford recognized the potential for unfairness to subjects because of the particular raters scoring the paper. Guilford observed: "There is no simple, generally applicable solution to this problem. To the extent that any two or more raters have ratings in common sufficient to make the kind of study of ratings that was described above [ANOVA], something can be done to make adjustments. Linear transformations taking care of differences in means as well as differences in standard deviations would become important in this kind of situation. If one is willing to make assumptions concerning comparability of subgroups of ratees, one extends the possibility of making inferences about the amounts of errors of different kinds" (p. 289). Stanley (1961) addressed rater bias in the context of a three-way analysis of variance: ratees by raters by traits. He developed computational formulas, and recommended controlling not only for rater 32 main effects, but also for rater—ratee interactions (e.g. halo——the tendency for raters to rate an individual highly on all traits) and rater-trait interactions (e.g. the tendency for raters to rate one trait more stringently than they rate other traits.) Stanley did not address the common rating situation where only a subset of the raters rate each performance. Consequently, he points out that "the adjusted trait sums (over raters) and adjusted total scores (over both raters and traits) cannot be better for any purpose——predictive or otherwise-- than the unadjusted ratings are" (p. 214). But by removing rater main effect and interaction terms, internal consistency estimates of reliability are higher. In the data set Stanley studied, with 3 raters rating each of 7 individuals on 5 traits, the coefficient of equivalence increased from .84 for unadjusted ratings to .89 for adjusted ratings. 
While this increase is small, a common standard is that measures used to make decisions about individuals should have a reliability of at least .85; the adjusted ratings met this standard while the unadjusted ratings did not. Pa_ul Paul (1976, 1979, 1981) compared an additive model, which considered only differences in raters' mean stringency level, to a linear relationship model which modeled differences among raters both in their level of scores and in their spread of scores. Paul used both real data, where 85 raters scored each of 10 papers, and simulated data, again with 85 raters but scoring 20 papers each. Paul found little difference between the additive model and the linear relationship model, and recommended using the simpler additive model. 33 In one of his studies, Paul (1981) used Bayesian methods to estimate the models' parameters. In contrast to the estimates from simple mean equating and simple linear equating described earlier, Paul claims the Bayesian estimates are generally less susceptible to sampling fluctuations and should yield estimates closer to the true values. For the data he studied, the Bayesian estimates were indeed more accurate. But Bayesian estimates include additional subjective judgments and are therefore open to allegations of bias in predetermining results. de Gruijter De Gruijter (1984) outlined two models for rater effects: the additive model and the Rasch extension. In the additive model, raters are only assumed to differ in their mean level of stringency. De Gruijter used the method of ordinary least squares to estimate the parameters of the model. This method is equivalent to using simple linear regression to estimate the paper qualities and rater stringencies which best predict the data. De Gruijter claimed the additive model is ”computationally simple and straightforward, but unfortunately overly simplistic" (p. 215). He mentioned the possibility of using a more general linear model which allows for differences in error and true score variance across raters, but concluded that because of problems at the lower and upper bounds of the rating scale, only a nonlinear model is satisfactory. The Rasch extension, described earlier, is nearly linear for values near the center of the rating scale, but is curvilinear near the extremes of the scale (Figure 3). This models the reality that after a 34 certain level of quality, increases in quality improve the expected score of the paper only marginally. Similarly, if a poor paper is already likely to receive a zero, a much poorer paper is only slightly more likely to receive a zero. De Gruijter (1984) applied both the additive model and the Rasch extension to a data set consisting of 949 essays, each scored on a ten— point rating scale by two out of eight raters. The two models agreed very closely for values near the mean rating. De Gruijter argued that for more extreme values the results for the two models must diverge. De Gruijter suggested that the model may be too simple to adequately represent the effects of all raters. Cason and Cason In a series of studies (e.g. Cason & Cason, 1984; Cason & Cason, 1989), Cason and Cason present a non—linear model similar to the Rasch extension. As noted earlier, their model (Rater Response Theory, or RRT) was based on the normal ogive rather than the logistic function used in the Rasch extension. The RRT model consists of rater characteristic curves (RCCs). Figure 6 is a graph of one RCC. These RCCs are characterized by four parameters: 1. 
Resolving power refers to the extent to which ratings change as the quality of performance changes. Graphically, resolving power corresponds to the slope of the RCC, and is analogous to item discrimination in item response theory. Differences in resolving power affect the spread of scores across raters. A closely related term is rater sensitivity, which is the maximum value of rater resolving power, or the maximum slope of the curve (e.g., at point M in Figure 6).

[Figure 6: Graphical Representation of a Rater Characteristic Curve. Expected score is plotted against paper quality; the curve rises from the rater's effective rating floor to its effective rating ceiling, with point L marking its horizontal location and point M its maximum slope.]

2. Rater stringency is the tendency to require higher or lower paper quality to assign any given rating. Graphically, it corresponds to the horizontal location of the RCC. In Figure 6, the value of point L is the stringency parameter for the rater. As with the Rasch extension, one parameter is used for stringency over the entire rating scale.

3. The effective rating ceiling represents the highest level of scores which a rater actually gives. This may be less than the highest value on the rating scale.

4. The effective rating floor represents the lowest level of scores which a rater assigns. Again, this may be higher than the lowest value on the rating scale.

In their 1984 study, Cason and Cason imposed the simplifying assumptions that all raters have equal sensitivity and that the effective rating floor and ceiling for each rater were the lowest and highest values on the rating scale. Data were collected over a 3-year period from a medical school clerkship. In any year, each of 30 students was rated by 5 of 35 raters on a 34-item inventory with ratings ranging from 1 to 5.

The data were modeled four different ways. Model T (theory) had separate parameters for each student's ability and for each rater's stringency. Model A (ability) assumed all raters had equal stringency, and rating differences were based only on student ability. Model S (stringency) assumed students were of equal ability, and rater stringency parameters were allowed to vary. Model 0 (null hypothesis) assumed no systematic differences in either student ability or rater stringency, but that all rating differences were due to chance variation. Over all three years of data, Model T fit the data significantly better than any of the other three models. Variation in rater stringency explained about 35 percent of the variance in ratings, and variation in student ability explained an additional 40 percent of the variance in ratings.

In a later study (Cason & Cason, 1989), the one-parameter RRT model (stringency only) was compared to the methods of no equating and mean equating. These models differ in how they partition rating variance. No equating assumes no differences in rater stringency, and all scoring variance is assumed due to either ability differences or error. Mean equating assumes all between-rater variance is due to rater stringency differences and none is due to ability differences. These assumptions are unreasonable if the number of subjects per rater is small because of sampling error. Because the RRT model simultaneously estimates stringency and ability parameters, Cason and Cason argued that it provides a more accurate partitioning of variance, especially for small data sets.

In this follow-up study, two data sets were used from the medical school clerkship. One data set consisted of 42 raters rating 24 students; there were approximately 3 ratings per rater and 5 ratings per student.
The second data set had 93 raters rating 163 students, with approximately 8 ratings per rater and 5 ratings per student. In the smaller data set, no equating accounted for 24 percent of score variance, mean equating accounted for 66 percent, and RRT accounted for 72 percent. The estimated reliabilities based on the mean over 5 raters were .63 for no equating, .73 for mean equating, and .84 for RRT. In the larger data set, the difference between mean equating and RRT was not as great. No equating accounted for 43 percent of score variance, mean equating accounted for 64 percent, and RRT accounted for 65 percent. Reliability estimates over 5 raters after adjustments were .77 for no equating, .80 for mean equating, and .83 for RRT. Cason and Cason (1989) recommended the RRT model, especially for small data sets.

Braun

In studies with the Educational Testing Service, Braun (1986, 1988) investigated the increases in reliability obtained by calibrating ratings across four facets of the rating process. Braun (1988) separately considered (a) stringency of the raters, (b) the rating team the assigned raters were a part of, (c) which of four days the paper was rated, and (d) the time of day the paper was rated, for three separate questions from an Advanced Placement exam in English Literature and Composition. The raters were calibrated according to an additive model, using a partially balanced incomplete block design. This design allowed estimation of the effects of each of the four facets of the ratings, while greatly reducing the number of readings required compared to a complete factorial design. The study involved 12 raters in 2 groups of six, each reading 32 essays over 4 days, half in the morning and half in the afternoon, with each rater scoring 8 papers per day and each paper scored by 3 raters per day. Rater calibration estimates were determined in an experimental context and then applied in an operational setting.

Braun (1988) found that the estimated variance component associated with the raters was about 15 to 20 percent of the estimated error variance component. By adjusting for rater effects, single-reading reliability estimates based on variance components increased for each of the three essay questions graded. For the three-question total, the reliability estimate was .68 before adjusting scores, and .74 after the adjustment. A cross-validation resulted in some shrinkage, back to .72 after adjustment. In contrast, going from a single reading to a double reading would increase the reliability to .81.²

Braun (1988) recommended rater calibration (with only some papers being read twice) as a cost-effective alternative to a full double reading of all papers. In this study, calibration increased the scoring load by 5 to 7 percent, while a full double reading clearly increases the amount of scoring by 100 percent over a single reading. The increase in reliability from using calibration was 31 percent of the increase produced by using two scorers. In contexts where single-reading reliabilities are lower, the relative benefits of calibration are even higher.

² This estimate follows directly from the Spearman-Brown formula, which describes how reliability changes as a function of the number of observations. When a test with reliability rxx is doubled in length, the new reliability Rxx is given by the formula Rxx = 2rxx/(1 + rxx). Using twice as many raters doubles the number of observations.
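As a quick check of the footnote's arithmetic, the Spearman-Brown projection can be computed directly; this is only a minimal sketch in Python, using the .68 and .81 values reported above.

    # Spearman-Brown projection: reliability of a measure whose length
    # (here, the number of independent readings) is multiplied by k.
    def spearman_brown(r_xx: float, k: float = 2.0) -> float:
        return k * r_xx / (1.0 + (k - 1.0) * r_xx)

    # A single-reading reliability of .68 projects to about .81 for two readings,
    # matching the double-reading figure quoted in the text.
    print(round(spearman_brown(0.68, 2), 2))   # 0.81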
Wilson

Wilson (1988) compared ordinary least squares (OLS) with generalized least squares (GLS). Both methods assume the additive model, that rater bias can be expressed as a single additive constant. Whereas OLS assumes that all raters have equal error variance, GLS estimates an error variance for each rater and weights the scores of accurate raters more than those of less accurate raters in estimating true scores. In a simulated data set where two of eight raters gave considerably more inaccurate ratings than the other six, GLS better reproduced the true scores. The GLS estimates of the true scores had a mean squared error less than half that of the OLS estimates.

Houston, Raymond, Svec, and Webb

In a series of studies conducted at the American College Testing Program (Raymond & Houston, 1990; Houston, Raymond, & Svec, 1990; Webb, Raymond, & Houston, 1990), the researchers compared several models of adjusting for rater effects. Raymond and Houston compared (a) simple averaging of raw scores (NO ADJUSTMENT), (b) OLS, (c) WLS (identical to the GLS method of Wilson, 1988), (d) the Rasch rating scale model (Rasch), and (e) imputation of scores for missing paper/rater combinations (IMPUTE) on simulated data. Houston, Raymond, and Svec compared NO ADJUSTMENT, OLS, WLS, and IMPUTE, manipulating (a) the number of raters per paper, (b) the level of rater bias, and (c) the number of examinees in simulated data sets. Webb, Raymond, and Houston applied OLS, the Rasch extension described earlier, and WLS to a set of certification examination data, focusing on how rater adjustments affected the pass/fail decisions.

Raymond and Houston (1990) simulated data for 25 individuals rated by 6 raters on a 1 to 5 scale. In the simulation, the raters varied in their degree of bias and reliability and were generated from a multivariate normal distribution. The true score for any paper was its mean rating over all six raters, but only two ratings for each paper were used in estimating the adjustment parameters. Of the five methods they compared, the four correction methods, with an average error of .40 SDs, were all better than uncorrected data, which had an average error of .56 SDs. The four correction methods differed little from each other, with mean errors ranging from .39 SDs for Rasch to .46 SDs for WLS. The correlations between adjusted scores and true scores ranged from .86 for WLS to .88 for OLS.

Houston, Raymond, and Svec (1990) simulated data based on a general linear model. Scores ranged from 1 to 7, scored by 8 raters who varied in their level of bias and in their scoring reliability. In all, 120 data sets were generated, with 30 replications of a 2 x 2 design: level of rater bias (high or low) and number of examinees (50 or 100). Each data set was analyzed by four methods (OLS, WLS, IMPUTE, and NO ADJUSTMENT), with either 50 percent or 25 percent of the raters scoring each paper. The methods were compared based on how close the adjusted scores were to the true scores, using correlations and root mean squared errors (RMSE) to measure extent of agreement.

By both measures of agreement, the three adjustment methods were all better than NO ADJUSTMENT. All three methods adjusted well for both high and low levels of bias, but IMPUTE had lower RMSEs than OLS and WLS, especially in the cases with fewer raters per paper. However, correlations were slightly higher for OLS and WLS than for IMPUTE.
The researchers suggested that because IMPUTE assumes normally distributed data, and because the data were generated from a normal distribution, the method may not do as well if scores have some other distribution.

Webb, Raymond, and Houston (1990) compared adjustment methods on actual data from a health profession oral certification exam. For each of three years, approximately 120 candidates were examined by 4 of 40 raters and were assigned scores with a possible range of 3 to 36. Scores were adjusted, and the three adjustment methods (OLS, Rasch extension, and WLS) differed little; subsequent analyses used OLS because of its relative simplicity. Pass/fail decisions using unadjusted data were compared to the decisions using OLS-adjusted data. Of 129 decisions, 122 (95%) were the same whether scores were adjusted or not.

Lunz, Linacre, and Wright

In his dissertation, Linacre (1987a) developed a generalization of the Rasch model which applies in situations where several facets contribute to scoring. His model is especially well suited to contexts where raters score several items with varying difficulty. Linacre (1987b) applied the model to the essay test data used by Braun (1988). The model confirmed many of Braun's findings, such as rater stringency differences, but also identified particular papers and raters that fit the model poorly. Linacre recommended that these papers be regraded, and that the raters receive further training.

Lunz, Linacre, and Wright (1988) applied the model to practical examinations from an American Society of Clinical Pathologists test administration. A team of 12 judges scored 15 items (microscopic slides) submitted by 226 examinees. Each slide was graded on three tasks (microtomy, quality, and processing) on either a 0-1 scale (microtomy and processing) or a 0-3 scale (quality). The model in this situation was

ln(Pnmijk / Pnmij(k-1)) = Bn - Am - Di - Cj - Fmk,

where Pnmijk is the probability of person n being scored a k by judge j on task m of item i, Bn is the ability of person n, Am is the difficulty of task m, Di is the difficulty of item i, Cj is the severity of judge j, and Fmk is the height of grading step k on task m. The PCM described earlier is a special case of this model, except that in this model the difficulty of moving from a k-1 to a k was assumed to be a property of the item, and not a property of the judge.

Lunz, Linacre, and Wright (1988) found that despite common training, judges differed in their overall level of stringency. The model obtained separate estimates of the difficulty of each item, of each task and the associated steps of difficulty on the scale, of each ability, and of each rater's stringency. The researchers cited examples where raw score differences were misleading, and recommended making decisions based on the ability estimates because rater effects would then be eliminated. In practice, Rasch ability estimates are more difficult to interpret and explain to users than scores presented in rating scale units.

Denny

In an earlier paper (Denny, 1989) this author described a pilot study investigating the feasibility of applying the PCM to a rater calibration study. One simulated data set with 10 raters and 800 papers generated from the PCM, and a real-life writing assessment data set with 11 raters and 391 papers, were compared using four adjustment methods: (a) no equating, (b) mean equating, (c) linear equating, and (d) PCM.
In the simulated data set, differences among raters were negligible, so adjustments made little difference, though PCM was marginally better than the other methods. In the real data set, where the true scores were unknown, the mean and linear equating methods both yielded adjusted scores closer to the PCM-adjusted scores than to the raw scores.

Summary

Several general conclusions can be drawn from the studies discussed above:

1. Rater calibration, or adjustment for rater effects, can be applied in many contexts ranging from English Literature and Composition essays (Braun, 1986) to Clinical Pathology practical examinations (Lunz, Linacre, & Wright, 1988).

2. In every study involving some type of true scores, adjusted raw scores were closer to the true scores than were unadjusted scores. In studies examining reliability estimates, adjusted scores had higher internal consistency reliability estimates than did unadjusted scores. In general, more sophisticated models (those involving the estimation of more parameters) did better than simpler models (those with fewer estimated parameters). But these models require more computer resources and need more data to produce stable parameter estimates.

3. PCM and equipercentile equating, the two methods that adjust for rater differences at each level of the rating scale, have not been adequately studied. Equipercentile equating and equating with item response theory models such as PCM have been studied extensively in the context of equating parallel forms of objective tests, but not in the context of equating raters.

4. All of these studies represent attempts to make more precise measurements, and thus to enable better decisions. But unlike the methods of retraining raters and using more raters per paper, statistical adjustment entails minimal additional cost.

This study examines a wide range of methods that adjust scores for rater stringency. It is the first study to use equipercentile equating and the PCM to equate raters. This study compares rater calibration methods while varying several facets of the rating situation, including the number of papers scored, the number of points on the rating scale, the shape of the paper quality distribution, and the number of raters per paper. The results of this study should help test administrators to decide what method of rater adjustment, if any, would be most appropriate in a particular performance rating context.

CHAPTER THREE

METHOD

This study compared adjustment methods by applying them to both real and simulated data with a variety of scoring conditions and over several different rater types. In the simulated data sets, raters were modeled to vary in their level of stringency and spread. The rating task also varied in the number of raters per paper and in the number of scale points. The simulated sets varied in size (the number of papers) and in paper quality distribution. Adjusted scores were compared based on (a) the overall accuracy of scores as measured by the root mean squared error (RMSE), (b) the comparative rank order of scores as measured by the Pearson product-moment correlation, and (c) the worst case as measured by the most discrepant adjusted score for each method. Each of these aspects of the comparison is discussed in greater detail below.

Methods Compared

Seven adjustment methods were compared in this study.
Three of the methods (mean equating, linear equating, and OLS) were compared both in their standard formulation and using truncation to keep adjusted scores within the range of the rating scale. The methods are listed below in order of complexity, with more detail than the general descriptions provided in Chapter One.

1. No equating. The adjusted score for any rater was the score the rater actually gave. The adjusted score for any paper was the mean score given by all raters who scored the paper.

2. Mean equating. The scores assigned by each rater were adjusted by a fixed amount so that each rater had the same mean score after adjustment. For example, if a rater consistently assigned scores that were .5 points higher than the mean across all raters, each of that rater's scores was lowered by .5 points to get the adjusted score. If xij was the raw score assigned to paper i by rater j, mj was the mean score given by rater j, and m was the mean score over all raters, then the adjusted score yij for paper i from rater j was given by yij = xij + (m - mj). For truncated mean equating, any scores that exceeded the maximum score of the scale were reduced to the maximum score, and any scores that were adjusted below zero were assigned a zero. The adjusted score for a paper was the mean of the adjusted scores of the raters who scored the paper. After any necessary truncation, scores from the raters who scored the paper were averaged to get the adjusted score for the paper.

3. Linear equating. This method considers both the mean and standard deviation of each rater's scores. The scores were adjusted linearly so that all raters had a mean score equal to the overall mean and a standard deviation equal to the overall standard deviation. If the scores rater j assigned had a mean of mj and a standard deviation of sj, and over all papers and raters the mean was m and the standard deviation was s, then a score xij was adjusted to yij, where yij = [(xij - mj)/sj]·s + m. The adjusted score for a paper was the mean of the adjusted scores of the raters who scored the paper. Truncated linear equating is linear equating with adjusted scores being truncated to lie within the scale and then averaged across raters to get the adjusted score for the paper.

4. Equipercentile equating. This method considers the frequency of scores assigned at each level of the rating scale. For each rater, and for all raters combined, a cumulative frequency distribution was determined for each point on the rating scale. Each of a rater's scores was transformed to the point on the combined frequency distribution with the same percentile rank, using linear interpolation as needed. If pjk is the proportion of scores rater j assigned that are k or less, and if Pt and Pt+1 are the proportions of scores overall that are t or less and t+1 or less, and if Pt ≤ pjk ≤ Pt+1, then a k from rater j is adjusted to the value t + (pjk - Pt)/(Pt+1 - Pt). The example illustrated graphically in Figure 2 can also be formulated algebraically: k = 3, t = 3, t+1 = 4, pA3 = .23, P3 = .19, and P4 = .27, so a 3 from rater A adjusts to 3 + (.23 - .19)/(.27 - .19) = 3 + (.04/.08) = 3.5.

Two special cases do not fit the formula. First, if pjk < P0, then a k from rater j is adjusted to a 0. For example, if a rater assigned 0s or 1s only 3 percent of the time while overall raters assigned 0s 7 percent of the time, then 1s from this rater are adjusted to 0. Second, if Pt = Pt+1 (making the denominator of the fraction zero, and corresponding to a case where no scores of t+1 were assigned by any rater) and if pjk = Pt, then a k from rater j is adjusted to t, not to t+1. For example, if no rater assigned any paper a 9, and if rater j assigned no 8s or 9s, then a 7 from rater j (the rater's highest score) adjusts to an 8 (the highest score given) and not to a 9 (the highest score possible).

There is more than one way to define the percentile of a score. A simple definition is the percentage of scores at or below the given score. This definition assumes that if a paper was assigned a 3, for example, it is of higher quality than all other papers that were assigned a 3. A better definition is the percentage of scores below the score plus half of the percentage of scores at that score level. This definition assumes that if a paper was assigned a 3, then it is only better than half of the papers that were assigned 3s. This study used the simpler definition, both in determining percentiles for each rater's scores and for the overall scores. Using the other definition likely would have given slightly more accurate adjusted scores. In either case, though, the method accounts for rater differences in scoring patterns at each point on the rating scale.
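To make the bookkeeping in method 4 concrete, the following is a minimal sketch of the equipercentile transformation using the simpler "at or below" percentile definition described above. The function name and the small cumulative distributions are hypothetical illustrations (chosen to reproduce the worked example in which a 3 from rater A adjusts to 3.5), not the programs from Appendix A.

    def equipercentile_adjust(k, rater_cum, overall_cum):
        # rater_cum[t] and overall_cum[t] are cumulative proportions of scores <= t
        # for one rater and for all raters combined.
        p = rater_cum[k]
        if p < overall_cum[0]:            # special case 1: adjust to 0
            return 0.0
        for t in range(len(overall_cum) - 1):
            lo, hi = overall_cum[t], overall_cum[t + 1]
            if lo <= p <= hi:
                if hi == lo:              # special case 2: no scores of t+1 assigned by anyone
                    return float(t)
                return t + (p - lo) / (hi - lo)
        return float(len(overall_cum) - 1)

    # Hypothetical cumulative distributions with P3 = .19 and P4 = .27 overall,
    # and rater A's proportion of scores at or below 3 equal to .23:
    overall_cum = [.01, .05, .10, .19, .27, .40, .60, .80, .95, 1.00]
    rater_a_cum = [.01, .04, .12, .23, .30, .45, .65, .85, .96, 1.00]
    print(round(equipercentile_adjust(3, rater_a_cum, overall_cum), 2))   # 3.5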
5. OLS. This method assumes the additive model, where the score xij given to a paper with true score ai by rater j with bias δj and random error eij is given by xij = ai + δj + eij. In contrast to WLS, the error variances for all raters are assumed equal, and linear regression is used to estimate the terms ai and δj. The matrix equations used by Wilson (1988) and by other researchers (Raymond & Houston, 1990; Houston, Raymond, & Svec, 1990; Webb, Raymond, & Houston, 1990) are not well suited for large-scale testing programs. If 500 papers were each scored by 2 of 10 raters, the recommended matrix equations would contain a matrix with 1000 rows and 509 columns, or 509,000 elements. Performing algebraic operations on a matrix of that size requires much computer time and capacity.

The additive model as presented by de Gruijter (1984) reduces the size of the matrices considerably. Instead of modeling raw scores directly, he modeled the average difference between ratings of pairs of raters to obtain estimates of the rater effects. If djk is the average difference between the ratings of rater j with bias δj and rater k with bias δk on the papers they both graded, then djk = δj - δk + tjk, where tjk is a residual error term. To get a unique solution, the sum of the bias terms (relative stringency or leniency) is assumed to be 0. The last rater effect can then be expressed in terms of the other rater effects: δn = -Σδi, where i = 1, 2, ..., n-1. In matrix terms, the equation becomes d = A'δ + e, where d is the observed vector of average rater differences, A is a design matrix which designates which pair of raters is involved, δ is a vector of rater effects, and e is the residual error to be minimized in estimating rater effects. With 500 papers each scored by 2 of 10 raters, assuming each pair of raters rates at least 1 paper in common so every combination is represented, the design matrix has 45 columns and 9 rows, or only 405 elements. The OLS estimate of the rater effects is given by the matrix equation δ̂ = (A N A')⁻¹ A N d, where N is a diagonal matrix containing the number of papers graded by each pair of raters.
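The closed-form solution above takes only a few lines of matrix code. The sketch below is an illustration under the stated assumptions (average pairwise differences, a sum-to-zero constraint on the rater effects, and weights equal to the number of commonly graded papers); the function name and the three-rater example are hypothetical, and this is not the program from Appendix A.

    import numpy as np

    def pairwise_rater_effects(d, counts, n_raters):
        # d[(j, k)]      : average difference between rater j's and rater k's ratings
        #                  on the papers both graded (zero-based indices, j < k)
        # counts[(j, k)] : number of papers the pair graded in common (the weights N)
        # The last rater's effect is minus the sum of the others, so n_raters - 1 are free.
        pairs = sorted(d.keys())
        A = np.zeros((len(pairs), n_raters - 1))
        for row, (j, k) in enumerate(pairs):
            if j < n_raters - 1:
                A[row, j] += 1.0
            else:
                A[row, :] -= 1.0
            if k < n_raters - 1:
                A[row, k] -= 1.0
            else:
                A[row, :] += 1.0
        y = np.array([d[p] for p in pairs])
        N = np.diag([counts[p] for p in pairs])
        free = np.linalg.solve(A.T @ N @ A, A.T @ N @ y)   # weighted least squares
        return np.append(free, -free.sum())                # recover the last rater's effect

    # Example with three raters: rater 0 scores about .5 higher than rater 1
    # and 1.0 higher than rater 2 on the papers they share.
    effects = pairwise_rater_effects(
        {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 0.5},
        {(0, 1): 40, (0, 2): 35, (1, 2): 45},
        n_raters=3)
    print(np.round(effects, 2))   # approximately [ 0.5  0.  -0.5]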
Unlike nonlinear models which require iterative solutions, this solution is straightforward and can be performed with any statistical package capable of multiple regression. A special program was written (see Appendix A) to do the matrix computations on the data used in the study. Once the rater effects were estimated, these estimates were used to adjust raw scores. As with mean and linear equating, scores were adjusted both with and without truncating.

6. Rasch extension. The Rasch extension method is similar to OLS, except that instead of assuming a linear relationship between paper qualities and ratings, the model assumes a curvilinear relationship. As stated earlier, the expected score Rij of a paper with quality level Bi when rated by a judge with stringency parameter δj on a scale ranging from 0 to M is given by Rij = M·exp(Bi - δj)/(1 + exp(Bi - δj)). Choppin (1982) derived a formula for an estimate of the difference between two rater effects djk = δj - δk:

djk = ln{[Σ xik(M - xij)]/[Σ xij(M - xik)]},

where the summations are over all observed score pairs xij and xik that raters j and k have in common. Notice that this transformation eliminates the quality parameter Bi. Now the OLS method of de Gruijter (1984) described in the previous section can be applied to the transformed djk values to get estimates of the rater effects δj. These parameter estimates δj were used in the following formula, which transforms observed scores xij into adjusted scores xi with the rater effects removed:

xi = exp(δj)·xij/[1 - xij(1 - exp(δj))/M].

The adjusted scores for each rater were then averaged to get the overall adjusted score for each paper.

7. PCM. To adjust scores using the PCM, the stringency parameters were estimated by considering the pattern of scores over pairs of raters. Next, those values were used to estimate the quality parameters. Finally, the estimated parameters were substituted back into the model to get expected scores for each rater, and these expected scores were averaged to get the adjusted score for the paper. This method estimated true scores not only for the raters who scored the paper, but also for raters who did not score the paper, by using their stringency parameter estimates. The PAIR algorithm for estimating the stringency step parameters δij in the PCM is detailed by Wright and Masters (1982, pp. 82-85). PAIR is the estimation procedure they recommend when a data set has many missing cases. In this context, a case is considered missing any time a rater does not score a paper. A BASIC program written to perform the iterative estimations is included in Appendix A. The iterative procedure terminated either when the maximum parameter shift was less than .02 or after 50 iterations.
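As an illustration of method 6, the sketch below applies Choppin's pairwise estimate and the score-adjustment formula given above to two hypothetical raters on a 0-to-5 scale. With only two raters, the sum-to-zero constraint makes δj equal to half the pairwise difference; the data and function names are illustrative only, not the Appendix A program.

    import math

    def choppin_difference(scores_j, scores_k, M):
        # Estimate delta_j - delta_k from the papers raters j and k both scored.
        num = sum(xk * (M - xj) for xj, xk in zip(scores_j, scores_k))
        den = sum(xj * (M - xk) for xj, xk in zip(scores_j, scores_k))
        return math.log(num / den)

    def rasch_adjust(x, delta_j, M):
        # Remove rater j's effect from an observed score x on a 0-to-M scale.
        return math.exp(delta_j) * x / (1.0 - x * (1.0 - math.exp(delta_j)) / M)

    # Hypothetical 0-5 scores on papers both raters graded; rater j scores higher overall.
    scores_j = [4, 5, 3, 4, 2, 5]
    scores_k = [3, 4, 2, 3, 1, 4]
    d_jk = choppin_difference(scores_j, scores_k, M=5)
    print(round(d_jk, 2))                                     # about -1.2: rater j is the more lenient
    print(round(rasch_adjust(4, delta_j=d_jk / 2, M=5), 2))   # a 4 from rater j adjusts down, to about 3.4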
Data Sets

Two types of data were used to compare the methods: simulated and real. Simulated data have the advantage that true scores are known, and that various aspects of the scoring procedure can be manipulated. Real data have the advantage of not being based on the assumptions of any particular model, but true scores are unknown.

Real Data

The real data were from a district-wide writing assessment conducted in a suburban school district in Michigan. All students in grades 5, 8, and 11 wrote essays arguing for a change they would like to see in their school. The instructions given to the students are in Appendix B. The essays (124 for grade 5, 141 for grade 8, and 122 for grade 11) were scored in grade level order by a team of 10 raters on a 5-point scale.

Training for the raters consisted of a brief presentation of the rating criteria listed in Appendix C, followed by scoring of sample papers. These samples were of each of score levels 1 through 5 as determined by the rating supervisors, who were experienced holistic scorers. The raters scored each of the five papers and then as a group discussed why they assigned the scores they did, with particular focus on raters who gave lower or higher scores than the rest of the rating team. This training procedure (scoring and discussing sample papers) preceded the scoring of actual papers at each of the three grade levels. Even though the scoring criteria in Appendix C do not refer to grade level, the standards for each score level were higher at the higher grade levels. Thus the scoring was criterion-referenced, but across grade levels the criteria for each score shifted.

After the training, papers were shuffled and two raters read each paper and assigned scores independently. Instead of writing numerical scores on the papers, the raters wrote a letter code corresponding to their score so other raters would not be influenced by the previous scores. If the two ratings on a paper differed by more than a point, a third rater scored the paper. The most discrepant rating was omitted, or if the third rating was midway between the other two, then the lower rating was omitted. In the fifth grade data set, 8.2 percent of the papers were rescored; 8.1 percent of the eighth grade papers required rescoring; and only 3.1 percent of the grade eleven papers were discrepant and needed scoring by a third rater.

Table 2 lists the distribution of scores assigned by each rater after elimination of discrepant scores. Although a score of 0 was possible, in practice no 0s were assigned and only a few 1s were assigned. Overall, the mean scores assigned were almost identical at each grade level (3.48 for grade 5, 3.45 for grade 8, and 3.48 for grade 11). The standard deviation of scores decreased at higher grade levels (.93 for grade 5, .86 for grade 8, and .78 for grade 11). One explanation of this trend toward less score variance is that paper qualities are more homogeneous in higher grade levels than in lower. Alternatively, because papers were scored in grade level order, it could be that over time raters used extreme scores less often. Constable and Andrich (1984) detailed a study which suggested that when raters were encouraged to agree on scores, over the course of a grading session raters tended to give more moderate scores.

Raters differed across grade levels in how they assigned scores. For example, rater 10 was the most lenient rater on grade 5 and grade 8 essays, but was at the overall mean on grade 11 essays. Rater 6 assigned the lowest mean scores of any rater on the grade 5 and grade 8 sets, but on grade 11 papers this rater was the second most lenient. On the grade 5 papers, mean ratings by rater ranged from a low of 3.13 to a high of 3.89, and the standard deviations ranged from .68 to 1.10.
On grade 8 papers, the most stringent rater assigned a mean rating of 55 Table 2 Frequencies of Scores Assigned by Raters in the Real Data Set GRADE FIVE Frequency of Each Score Summary RATER 1 2 3 g E N Mean SE 1 0 2 6 6 2 16 3.50 0.87 3 0 2 10 8 4 24 3.58 0.86 4 0 7 11 6 4 28 3.25 0.99 5 l 4 7 9 6 27 3.56 1.10 6 0 5 4 7 0 16 3.13 0.86 7 0 1 10 6 1 18 3.39 0.68 8 1 2 16 3 4 26 3.27 0.94 9 0 3 6 12 5 26 3.73 0.90 10 0 1 13 12 11 37 3.89 0.86 11 0 6 13 10 1 30 3.20 0.79 Total 2 33 96 79 38 248 3.48 0.93 GRADE EIGHT Frequency of Each Score Summary RATER 1 2 3 4 5 N Mean SD 1 1 5 8 9 2 25 3.24 0.99 3 0 2 16 10 2 30 3.40 0.71 4 0 3 16 13 l 33 3.36 0.69 5 0 3 8 13 5 29 3.69 0.88 6 1 4 11 5 0 21 2.95 0.79 7 0 2 8 8 2 20 3.50 0.81 8 0 1 8 12 3 24 3.71 0.73 9 0 7 16 10 4 37 3.30 0.90 10 0 0 14 9 9 32 3.84 0.83 11 0 5 12 10 4 31 3.42 0.91 Total 2 32 117 99 32 282 3.45 0.86 g 56 Table 2 (continued) GRADE 11 Frequency of Each Score Summary RATER 1 2 2 4 E N Mean SD 1 0 8 8 4 0 20 2.80 0.75 3 0 0 14 13 1 28 3.54 0.57 4 0 3 10 13 2 28 3.50 0.78 5 0 1 7 11 8 27 3.96 0.84 6 0 0 9 8 3 20 3.70 0.71 7 0 1 9 4 0 14 3.21 0.56 8 0 2 6 8 3 19 3.63 0.87 9 0 3 18 8 2 31 3.29 0.73 10 0 1 15 10 2 28 3.46 0.68 11 0 1 15 11 2 29 3.48 0.68 Total 0 20 111 90 23 244 3.48 0.78 2.95 while the most lenient rater assigned a mean rating of 3.84; standard deviations ranged from .69 to .99. Grade 11 papers showed the greatest range of means, from 2.80 for the most stringent rater to 3.96 for the most lenient. Standard deviations of raters' scores for the grade 11 papers ranged from .56 to .87. Because raters varied in their stringency depending on the grade level, three separate analyses were performed, one at each grade level. Thus, a rater's parameter estimates for grade 5 papers are independent of the estimates for grades 8 or 11. Because so few Os and ls were assigned, before the analysis the scores were transformed from a 0—5 scale to a 0-3 scale by subtracting 2 from each score. The 1 ratings (of which there were only four) were reduced to O. The computer programs (Appendix A) were written in terms of a scale beginning with 0, and it was easier to transform the data than to rewrite the programs . 57 Simulated Data The simulated data were generated from the PCM. The PCM has separate parameters for each step of the rating scale for each rater, so is able to model rater differences throughout the scale. Five facets of the rating situation were varied across the simulated data sets: 1. Scoring ranges from 0 to 5 or from 0 to 9. 2. Data sets of 100 or 500 papers. 3. Papers scored by 1, 2, or 3 raters, or by 2 raters with rescoring by a third rater in case of score discrepancy (as in the real data set). Note that when only one rater scores each paper, some adjustment methods are not appropriate. 4. Scoring by a team of 9 raters. In terms of stringency level, raters 1, 2, and 3 are lenient; 4, 5, and 6 are average; and 7, 8, and 9 are stringent. Raters 1, 4, and 7 assign widely spread scores; 2, 5, and 8 assign scores of average spread; and 3, 6, and 9 assign scores with a narrow spread. Table 3 presents the stringency step parameters for each rater, for both a 5— and 9-point scale. These parameters were based on the findings of the earlier pilot study (Denny, 1989) and were selected to be similar to the real data in means, standard deviations and in level of rater agreement. 
Note that stringency step parameters have values opposite to the scores they produce: high values of step parameters lead to low scores, and step parameters with low variance result in scoring with high variance. Raters are nested within the simulated data sets, so each set is scored by the same nine raters.

[Table 3: Stringency step parameters for each rater in the simulated data sets, for both the 5-point and the 9-point scales.]

5. Quality parameters Bi simulated from a distribution that is either normally distributed, positively skewed, or negatively skewed. To get a random number p from a normal distribution with mean 0 and SD 1, substitute random numbers r and s from a uniform distribution on the interval [0,1] into the formula p = sqrt(-2·ln(r))·sin(2π·s). For the normal distribution, the quality parameters were generated by 3p + 2, resulting in a normal distribution with mean = 2 and SD = 3. For a positively skewed distribution, the quality parameters were generated by 5|p| - 3, a distribution with mean = 1 and SD = 3 and a minimum value of -3. In a negatively skewed distribution, the quality parameters are generated by -5|p| + 7, a distribution with mean = 3 and SD = 3, with a maximum value of 7.

To reduce the number of separate analyses, the data sets with 100 papers were generated with each of the three distribution shapes, but the sets with 500 papers used only normally distributed quality parameters. In all, there were 24 data sets with 100 papers from a 2 x 4 x 3 design: scale points (5 or 9) by number of raters (1, 2, 3, or 2 with rescoring) by quality distribution (normal, positively skewed, or negatively skewed). There were 8 data sets with 500 papers from a 2 x 4 design: scale points by number of raters.
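The generating recipe just described (a Box-Muller standard normal deviate, rescaled or folded to produce the three quality distributions) is short enough to sketch directly. This is only an illustration of the formulas above, with hypothetical function names, not the simulation program from Appendix A.

    import math
    import random

    def standard_normal(rng=random):
        # Box-Muller: p = sqrt(-2 ln r) * sin(2*pi*s), with r and s uniform on (0, 1).
        r, s = rng.random() or 1e-12, rng.random()
        return math.sqrt(-2.0 * math.log(r)) * math.sin(2.0 * math.pi * s)

    def quality_parameter(shape, rng=random):
        # Quality distributions used for the simulated papers:
        #   normal:            3p + 2     (mean 2, SD 3)
        #   positively skewed: 5|p| - 3   (mean 1, SD 3, minimum -3)
        #   negatively skewed: -5|p| + 7  (mean 3, SD 3, maximum 7)
        p = standard_normal(rng)
        if shape == "normal":
            return 3.0 * p + 2.0
        if shape == "positive":
            return 5.0 * abs(p) - 3.0
        if shape == "negative":
            return -5.0 * abs(p) + 7.0
        raise ValueError(shape)

    qualities = [quality_parameter("negative") for _ in range(500)]
    print(max(qualities) <= 7.0)   # True: the negatively skewed distribution is capped at 7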
Criteria for Comparing the Methods

Three criteria were used to compare the methods:

1. The Pearson product-moment correlation coefficient measured the degree of linear relationship between two variables. Computationally, the formula is Σ(xi - x̄)(yi - ȳ)/(sx·sy·n), where x̄ is the mean of the xi's, ȳ is the mean of the yi's, sx and sy are the SDs of the xi's and the yi's, n is the number of papers, and the summation is from i = 1 to i = n. Because rank orders of scores change little by using the adjustment methods, all correlations were high. To average or compare correlations, a Fisher's Z transformation (Glass & Hopkins, 1984, p. 305) was used to reduce the ceiling effect on high correlations.

2. For the real data, the root mean squared difference (RMSD) measured the differences in adjusted scores for pairs of methods. RMSD is given by the formula sqrt(Σ(xi - yi)²/n). This statistic is in raw score units, so an RMSD of .5 indicates that on average, two sets of adjusted scores are .5 units apart.

3. The maximum score discrepancy was the greatest score difference between the two sets of scores for any paper. This represented the worst case for a method. Particularly when adjusted scores are used to make decisions about individuals, it is important to know how much individual cases can be affected.

For the real data set, true scores were unknown. Thus, there was no absolute criterion for comparison. The adjustment methods were compared with each other, to determine how much they differed. Pass rates using a cut-score of 3.5 were compared with the different adjustment methods, to determine the proportion of decisions that would be affected by using rater adjustment.

For the simulated data, true scores are defined by the average expected score over all raters based on the parameters of the model. Adjusted scores were compared with true scores for each adjustment method. Correlations, root mean squared errors (RMSEs), and maximum differences were all based on comparisons with true scores. Better methods have high correlations, low RMSEs, and low maximum score discrepancies. In addition, each of the three comparison statistics was averaged across data sets to determine which methods did better than others overall, and under what scoring conditions particular methods adjusted scores more accurately.

CHAPTER FOUR

RESULTS

This chapter consists of two sections. First, the adjustment methods are compared for the writing assessment data. Second, the adjustment methods are compared for the simulated data, focusing on how the methods interact with each facet of the simulated data. In both data sets the amount of data collected was overwhelming, so of necessity data have been summarized and combined for ease of analysis and reporting.

Writing Assessment Data

The ten adjustment methods were applied to the writing assessment data for each of grades 5, 8, and 11. The RMSD, maximum difference, and correlation between each pair of methods are reported for each grade level in Table 4. Examining the data revealed these facts:

1. The truncated methods differed little from their non-truncated versions for mean equating, linear equating, and OLS.

2. RMSDs and correlations were inversely related: higher RMSDs were associated with lower correlations. A secondary analysis showed that RMSDs and correlations after a Fisher Z-transformation had a correlation of -.92 across the three grades. Thus, the differences between adjustment methods based on RMSDs have nearly the same rank order as the differences between methods based on correlations.
62 GRADE 5 N0 N0 .0000 MN .1566 TMN .1581 LI .1684 TLI .1644 EQP .2482 OLS .1788 TLS .1736 RAS .2141 PCM .1814 NO NO .0000 MN .3275 TMN .3275 LI .3605 TLI .3605 EQP .5393 OLS .3031 TLS .3031 RAS .4189 PCM .4465 N0 N0 1.000 MN .9827 TMN .9830 LI .9807 TLI .9809 EQP .9688 OLS .9803 TLS .9803 RAS .9696 PCM .9769 Method MN .1566 .0000 .0419 .0696 .0647 .1864 .2064 .1968 .2863 .1539 MN .3275 .0000 .2040 .3159 .1937 .4752 .4052 .4052 .6864 .5142 MN .9827 1.000 .9992 .9974 .9969 .9877 .9729 .9733 .9417 .9832 63 Table 4A Comparisons for the Writing Assessment Data Root Mean Squared Difference TMN .1581 .0419 .0000 .0888 .0549 .1880 .2027 .1889 .2856 .1464 TMN .3275 .2040 .0000 .3633 .1407 .4752 .4052 .4052 .6420 .4261 TMN .9830 .9992 1.000 .9969 .9982 .9898 .9747 .9753 .9403 .9855 LI .1684 .0696 .0888 .0000 .0621 .1781 .2182 .2158 .2995 .1475 TLI .1644 .0647 .0549 .0621 .0000 .1756 .2098 .1996 .2952 .1276 EQP .2482 .1864 .1880 .1781 .1756 .0000 .2903 .2840 .3540 .2227 OLS .1788 .2064 .2027 .2182 .2098 .2903 .0000 .0482 .3762 .2475 Maximum Difference LI TLI EQP .3605 .3605 .5393 .3159 .1937 .4752 .3633 .1407 .4752 .0000 .3296 .4217 .3296 .0000 .4217 .4217 .4217 .0000 .4410 .4410 .6696 .4410 .4410 .6696 .7640 .6569 .9189 .5349 .3992 .6043 Correlation LI TLI EQP .9807 .9809 .9688 .9974 .9969 .9877 .9969 .9982 .9898 1.000 .9982 .9880 .9982 1.000 .9907 .9880 .9907 1.000 .9701 .9718 .9645 .9697 .9724 .9652 .9398 .9380 .9200 .9853 .9885 .9780 OLS .3031 .4052 .4052 .4410 .4410 .6696 .0000 .2526 .6914 .6373 OLS .9803 .9729 .9747 .9701 .9718 .9645 1.000 .9990 .9092 .9594 TLS .1736 .1968 .1889 .2158 .1996 .2840 .0482 .0000 .3709 .2410 TLS .3031 .4052 .4052 .4410 .4410 .6696 .2526 .0000 .6914 .6373 TLS .9803 .9733 .9753 .9697 .9724 .9652 .9990 1.000 .9063 .9596 RAS .2141 .2863 .2856 .2995 .2952 .3540 .3762 .3709 .0000 .2829 RAS .4189 .6864 .6420 .7640 .6569 .9189 .6914 .6914 .0000 .7027 RAS .9696 .9417 .9403 .9398 .9380 .9200 .9092 .9063 1.000 .9453 PCM .1814 .1539 .1464 .1475 .1276 .2227 .2475 .2410 .2829 .0000 .4465 .5142 .4261 .5349 .3992 .6043 .6373 .6373 .7027 .0000 .9769 .9832 .9855 .9853 .9885 .9780 .9594 .9596 .9453 1.000 GRADE 8 NO NO .0000 MN .1629 TMN .1604 LI .1700 TLI .1635 EQP .2151 OLS .1206 TLS .1224 RAS .1211 PCM .1438 NO NO .0000 MN .3264 TMN .3264 LI .4446 TLI .3768 EQP .5184 OLS .2403 TLS .2403 RAS .2800 PCM .3945 NO NO 1.000 MN .9772 TMN .9789 LI .9748 TLI .9770 EQP .9686 OLS .9874 TLS .9873 RAS .9877 PCM .9823 64 Table 48 Method Comparisons for the Writing Assessment Data Root Mean Squared Difference MN TMN .1629 .1604 .0000 .0297 .0297 .0000 .0590 .0718 .0555 .0503 .1555 .1558 .1689 .1663 .1681 .1612 .2217 .2209 .1734 .1729 MN TMN .3264 .3264 .0000 .1264 .1264 .0000 .1621 .2532 .1621 .1621 .4321 .4321 .3909 .3909 .3909 .3909 .4729 .4729 '.5011 .5011 MN TMN .9772 .9789 1.000 .9994 .9994 1.000 .9975 .9970 .9970 .9978 .9891 .9908 .9752 .9768 .9733 .9754 .9585 .9596 .9745 .9756 LI TLI EQP OLS .1700 .1635 .2151 .1206 .0590 .0555 .1555 .1689 .0718 .0503 .1558 .1663 .0000 .0483 .1418 .1746 .0483 .0000 .1341 .1704 .1418 .1341 .0000 .2237 .1746 .1704 .2237 .0000 .1779 .1664 .2269 .0432 .2282 .2229 .2638 .2351 .1779 .1704 .2268 .1812 Maximum Difference LI TLI EQP OLS .4446 .3768 .5184 .2403 .1621 .1621 .4321 .3909 .2532 .1621 .4321 .3909 .0000 .2486 .3855 .3997 .2486 .0000 .3855 .3997 .3855 .3855 .0000 .5702 .3997 .3997 .5702 .0000 .3997 .3997 .5702 .1636 .5651 .4508 .5982 .5179 .5347 .5347 .6763 .4555 Correlation LI TLI EQP OLS .9748 .9770 .9686 .9874 .9975 .9970 .9891 .9752 .9970 .9978 .9908 
.9768 1.000 .9985 .9911 .9733 .9985 1.000 .9941 .9747 .9911 .9941 1.000 .9656 .9733 .9747 .9656 1.000 .9714 .9739 .9648 .9990 .9556 .9579 .9491 .9531 .9727 .9754 .9663 .9718 TLS .1224 .1681 .1612 .1779 .1664 .2269 .0432 .0000 .2377 .1803 TLS .2403 .3909 .3909 .3997 .3997 .5702 .1636 .0000 .5179 .4963 TLS .9873 .9733 .9754 .9714 .9739 .9648 .9990 1.000 .9517 .9719 RAS .1211 .2217 .2209 .2282 .2229 .2638 .2351 .2377 .0000 .1858 .2800 .4729 .4729 .5651 .4508 .5982 .5179 .5179 .0000 .5150 .9877 .9585 .9596 .9556 .9579 .9491 .9531 .9517 1.000 .9711 PCM .1438 .1734 .1729 .1779 .1704 .2268 .1812 .1803 .1858 .0000 PCM .3945 .5011 .5011 .5347 .5347 .6763 .4555 .4963 .5150 .0000 PCM .9823 .9745 .9756 .9727 .9754 .9663 .9718 .9719 .9711 1.000 65 Table 4C Method Comparisons for the Writing Assessment Data GRADE 11 Root Mean Squared Difference NO MN TMN LI TLI EQP OLS TLS RAS PCM NO .0000 .2159 .2142 .2222 .2192 .2917 .1542 .1388 .1353 .1510 MN .2159 .0000 .0272 .0790 .0703 .2136 .3044 .2689 .2200 .1823 TMN .2142 .0272 .0000 .0846 .0671 .2112 .3034 .2673 .2182 .1825 LI .2222 .0790 .0846 .0000 .0452 .1918 .3070 .2757 .2303 .1866 TLI .2192 .0703 .0671 .0452 .0000 .1898 .3029 .2701 .2283 .1827 EQP .2917 .2136 .2112 .1918 .1898 .0000 .3681 .3375 .3013 .2909 OLS .1542 .3044 .3034 .3070 .3029 .3681 .0000 .0853 .2738 .2399 TLS .1388 .2689 .2673 .2757 .2701 .3375 .0853 .0000 .2689 .2157 RAS .1353 .2200 .2182 .2303 .2283 .3013 .2738 .2689 .0000 .1837 PCM .1510 .1823 .1825 .1866 .1827 .2909 .2399 .2157 .1837 .0000 Maximum Difference NO MN TMN LI TLI EQP OLS TLS RAS PCM NO .0000 .4683 .4683 .4729 .4577 .8106 .3882 .3339 .3969 .3858 MN .4683 .0000 .2438 .2729 .2396 .7101 .7254 .7254 .7097 .4839 TMN .4683 .2438 .0000 .2729 .2396 .7101 .7254 .7254 .7097 .4839 LI .4729 .2729 .2729 .0000 .2630 .5369 .8068 .8068 .6531 .4424 TLI .4577 .2396 .2396 .2630 .0000 .5369 .7529 .7529 .6531 .4424 EQP .8106 .7101 .7101 .5369 .5369 .0000 .9550 .9550 1.094 .7851 OLS .3882 .7254 .7254 .8068 .7529 .9550 .0000 .3201 .7308 .7189 TLS .3339 .7254 .7254 .8068 .7529 .9550 .3201 .0000 .7308 .7189 RAS .3969 .7097 .7097 .6531 .6531 1.094 .7308 .7308 .0000 .5506 PCM .3858 .4839 .4839 .4424 .4424 .7851 .7189 .7189 .5506 .0000 Correlation NO MN TMN LI TLI EQP OLS TLS RAS PCM NO 1.000 .9506 .9516 .9495 .9494 .9391 .9775 .9799 .9820 .9778 MN .9506 1.000 .9992 .9957 .9951 .9816 .9069 .9192 .9521 .9665 TMN .9516 .9992 1.000 .9958 .9962 .9838 .9078 .9197 .9536 .9669 LI .9495 .9957 .9958 1.000 .9985 .9850 .9078 .9209 .9473 .9658 TLI .9494 .9951 .9962 .9985 1.000 .9890 .9083 .9207 .9476 .9659 EQP .9391 .9816 .9838 .9850 .9890 1.000 .8993 .9132 .9331 .9510 OLS .9775 .9069 .9078 .9078 .9083 .8993 1.000 .9947 .9279 .9433 TLS .9799 .9192 .9197 .9209 .9207 .9132 .9947 1.000 .9268 .9516 RAS .9820 .9521 .9536 .9473 .9476 .9331 .9279 .9268 1.000 .9681 PCM .9778 .9665 .9669 .9658 .9659 .9510 .9433 .9516 .9681 1.000 66 3. The maximum distance measure was erratic, reflecting properties of individual papers and not general properties of the adjustment methods. 4. Looking across the three grade levels, the relative proximity of adjusted scores by the various methods were consistent. For example, compared to NO, EQP had the highest RMSD at all three grade levels. At each grade level, the closest method to TLS (other than OLS) was NO. Based on these observations, steps were taken to reduce the volume of data. First, only the truncated versions of the methods were used. Second, correlations were omitted as being redundant with RMSDs. 
Third, maximum difference as a measure was omitted as unreliable. Fourth, RMSDs for the three grades were examined separately and then averaged, with the three grades treated as three replications of the study.

RMSDs Compared

The resulting differences between the methods based on the RMSD measure are graphed in Figures 7, 8, 9, and 10. The graphs are two-dimensional representations of relative distances (RMSDs) which are multi-dimensional, so the distances in the graph are a distortion of the actual RMSDs. The graphs were produced by the SPSS-X procedure ALSCAL (Young, Takane, & Lewyckyj, 1988). In those cases where a method does not appear in the graph, the method is coincident with the truncated version of the method. In Figure 7, for example, OLS (which would have been graphed as "7") is coincident with TLS ("8").

The relative positions of the methods were consistent across the three grade levels and overall. Three methods, EQP (depicted as "6"), TLS ("8"), and RAS ("9"), are graphed as the vertices of a triangle which includes the other methods. EQP, TLS, and RAS consistently had large RMSDs among themselves. In each graph NO ("1") was midway between TLS ("8") and RAS ("9"). The simple linear methods ("2" through "5") were about two-thirds of the way from TLS ("8") to EQP ("6"). PCM ("0") was generally within the triangle, closest to RAS ("9").

[Figure 7: Graph of average RMSDs for the grade 5 data. Figure 8: Graph of average RMSDs for the grade 8 data. Figure 9: Graph of average RMSDs for the grade 11 data. Figure 10: Graph of average RMSDs for the three grades combined. In each plot the methods are labeled 1 = NO, 2 = MN, 3 = TMN, 4 = LI, 5 = TLI, 6 = EQP, 7 = OLS, 8 = TLS, 9 = RAS, and 0 = PCM.]

These graphs demonstrate that the adjustment methods differed in how they adjust scores, but in a systematic way. The relative differences among the methods followed a consistent pattern across the three grade levels of the real-life data set.
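The scaling step behind Figures 7 through 10 can be reproduced approximately from any symmetric matrix of RMSDs. The sketch below uses scikit-learn's MDS in place of the SPSS-X ALSCAL procedure named above, and the tiny three-method distance matrix is hypothetical.

    import numpy as np
    from sklearn.manifold import MDS

    # Hypothetical symmetric RMSD matrix for three methods (e.g., NO, TLS, EQP).
    rmsd = np.array([
        [0.00, 0.17, 0.25],
        [0.17, 0.00, 0.28],
        [0.25, 0.28, 0.00],
    ])

    # Metric MDS on the precomputed dissimilarities yields 2-D coordinates whose
    # pairwise distances approximate the RMSDs, much as in the ALSCAL plots.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(rmsd)
    print(np.round(coords, 2))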
Unfortunately, knowing how the adjustment methods differ gives no additional information about the papers' true scores, nor does it provide a basis for choosing one method over another. Passing Rates Although the actual writing assessment did not involve pass/fail decisions, a similar competency exam in the district requires an average rating of 3.5 for a passing score. The data were analyzed to see what difference the various adjustment methods would have on the number of students who pass or fail using a 3.5 standard. Because of rounding, any student with an average adjusted score of 3.25 or greater is considered to have passed the exam. Table 5 lists the number of papers passing and failing at each grade level using the 3.25 criterion. The table also indicates the number of papers which were "helped" or "hurt” by score adjustment compared to the pass/fail decision with unadjusted scores. A paper is "helped" if it passes when scores are adjusted, but fails when scores 72 Table 5 Pass/Fail Decisions for Adjusted Scores Relative to Unadjusted Scores for the Writing Data Grade 5 Passing Failing Helped Hurt NO 72 52 0 0 MN 71 53 3 4 TMN 71 53 3 4 LI 71 53 3 4 TLI 71 53 3 4 EQP 76 48 8 4 OLS 67 57 0 5 TLS 67 57 0 5 RAS 84 40 13 1 PCM 67 57 0 5 Grade 8 Passing Failing Helped Hurt NO 85 56 0 0 MN 83 58 5 7 TMN 83 58 5 7 LI 82 59 3 6 TLI 82 59 3 6 EQP 87 54 5 3 OLS 85 56 0 0 TLS 85 56 0 0 RAS 81 60 0 4 PCM 85 56 0 0 Grade 11 Passing Failing Helped Hurt NO 71 51 0 0 MN 66 56 5 10 TMN 66 56 5 10 LI 68 54 5 8 TLI 68 54 5 8 EQP 70 52 6 7 OLS 70 52 1 2 TLS 70 52 1 2 RAS 72 50 3 2 PCM 71 51 0 0 73 are unadjusted. A paper is "hurt" if it fails when scores are adjusted, but passes when scores are unadjusted. Mean equating changed 34 decisions, with more papers "hurt" (21) than "helped" (13). Equipercentile equating "helped" 19 papers and "hurt" 14. Linear equating "helped" 11 papers and "hurt" 18 across the three grades. The Rasch extension method "helped" 16 papers and "hurt" 7 overall, but adjustment "helped" more papers in fifth grade (13-1), and "hurt" more in eighth grade (0—4). The least-squares methods "helped" only 1 paper and "hurt" 7. PCM adjustment changed only 5 pass/fail decisions, all of which "hurt" fifth grade papers. Not surprisingly, whether truncation was used made no difference on any decision, because truncation only affects extreme scores and not those in the middle of the rating scale near the cutoff point. This analysis demonstrated that the choice of adjustment methods affects the dichotomous pass-fail decisions made from ratings. However, the analysis does not provide a basis for preferring one method over another. Preferring one method over another requires a measure of the correctness of the decisions made from the adjusted scores. (The terms "helped" and "hurt" referred only to the direction a decision changed, and not to the correctness of the decision.) Determining the correctness of any decision requires an external criterion of success. If the school district had another measure of whether students were "competent" writers, then the classifications based on essay scores could be compared to those of the other measure. Without such an external criterion, only relative comparisons of the adjustment methods are possible. 74 Simulated Data The simulated data sets allowed a comparison of adjustment methods in a context where true scores were known. In all, 32 data sets were simulated and analyzed. 
As an example, Table 6 lists the scores that raters assigned for one of the 32 sets. The patterns of rater scoring were as expected from the parameters in the model. Raters 1, 2, and 3 were lenient; raters 4, 5, and 6 were of average stringency; raters 7, 8, and 9 assigned scores averaging less than the overall mean. Raters l, 4, and 7 assigned scores with a large variance; raters 2, 5, and 8 gave scores with a variance similar to the overall score variance; raters 3, 6, and 9 assigned scores mostly to the middle of the scale and had a low score variance. Root Mean Squared Errors (RMSEs) For each simulated data set, the accuracy of the adjustment methods was measured with the RMSE by comparing adjusted scores with the true scores expected from the parameters of the model. Table 7 lists the overall RMSEs for all data sets and for each of the ten adjustment methods. Because only six of the methods were used for the data sets with only one rater per paper, and because those data sets had higher RMSEs than the other data sets, they were analyzed separately. The RMSEs for the remaining 24 data sets were averaged, both overall and by each of the facets which varied across the simulated data sets. The results of this analysis are reported in Table 8. 75 Table 6 Scoring Frequencies for One Simulated Data Set (2+551)* 0 1 2 3 4 5 N Mean SD Overall 48 72 171 252 218 239 1000 3.2370 1.4173 Rater 1 8 4 10 9 14 63 108 3.9074 1.5959 2 2 10 11 20 26 36 105 3.5810 1.3924 3 0 4 15 49 45 13 126 3.3810 0.9331 4 10 7 19 17 15 48 116 3.4138 1.6716 5 5 11 19 32 33 28 128 3.2578 1.3821 6 l 6 18 50 30 2 107 3.0093 0.9120 7 12 8 17 11 10 26 84 2.9167 1.7942 8 10 14 20 27 20 21 112 2.8571 1.5461 9 0 8 42 37 25 2 114 2.7456 0.9348 * This set had 500 papers, a 5~point scale, normally distributed paper quality parameters, and 2 raters per paper with rescoring by a third rater if scores were discrepant. 
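The accuracy summaries that follow rest on the three statistics defined in Chapter Three: the RMSE of the adjusted scores against the model's expected true scores, the Pearson correlation, and the maximum discrepancy, with Fisher's Z used before averaging correlations. A minimal sketch of those computations, with hypothetical scores, is given below; it is illustrative only and not the analysis programs from Appendix A.

    import math

    def rmse(adjusted, true):
        return math.sqrt(sum((a - t) ** 2 for a, t in zip(adjusted, true)) / len(true))

    def max_discrepancy(adjusted, true):
        return max(abs(a - t) for a, t in zip(adjusted, true))

    def pearson_r(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n)
        sy = math.sqrt(sum((v - my) ** 2 for v in y) / n)
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy * n)

    def fisher_z(r):
        # Reduces the ceiling effect near 1.0 before correlations are averaged.
        return 0.5 * math.log((1 + r) / (1 - r))

    # Hypothetical adjusted and true scores for a handful of papers.
    adjusted = [3.2, 4.1, 2.8, 3.9, 4.6]
    true = [3.0, 4.0, 3.0, 4.0, 5.0]
    print(round(rmse(adjusted, true), 2),
          round(max_discrepancy(adjusted, true), 2),
          round(pearson_r(adjusted, true), 3),
          round(fisher_z(pearson_r(adjusted, true)), 2))   # 0.23 0.4 0.964 2.0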
Overall RMSEs for the 76 Table 7 Simulated Data Data Set* NO MN TMN LI TLI EQP OLS TLS RAS PCM 1151 0.97 0.83 0.77 0.89 0.79 0.87 1152 0.94 0.86 0.83 0.93 0.85 1.08 1153 0.85 0.85 0.82 0.94 0.90 0.91 1191 1.74 1.51 1.40 1.63 1.51 1.39 1192 1.69 1.57 1.50 1.61 1.49 1.51 1193 1.62 1.42 1.39 1.61 1.57 1.37 1551 0.91 0.80 0.76 0.82 0.77 0.85 1591 1.85 1.66 1.61 1.60 1.53 1.53 2151 0.62 0.58 0.55 0.55 0.51 0.63 0.64 0.61 0.69 0.56 2152 0.74 0.69 0.66 0.73 0.67 0.73 0.87 0.78 0.76 0.69 2153 0.62 0.58 0.57 0.66 0.62 0.64 0.65 0.61 0.66 0.64 2191 1.40 1.40 1.34 1.39 1.33 1.22 1.53 1.50 1.47 1.34 2192 1.36 1.34 1.29 1.35 1.27 1.18 1.53 1.46 1.43 1.38 2193 1.10 1.04 1.03 1.04 0.95 1.00 1.06 1.04 1.19 1.21 2551 0.72 0.65 0.63 0.65 0.61 0.64 0.77 0.74 0.71 0.61 2591 1.28 1.20 1.16 1.17 1.12 1.04 1.30 1.28 1.28 1.11 2+151 0.60 0.55 0.54 0.54 0.53 0.62 0.60 0.59 0.60 0.58 2+152 0.82 0.67 0.64 0.66 0.64 0.72 0.87 0.84 0.78 0.81 2+153 0.63 0.54 0.52 0.53 0.50 0.57 0.67 0.64 0.62 0.62 2+191 1.37 1.27 1.24 1.22 1.18 1.10 1.41 1.39 1.34 1.28 2+192 1.51 1.17 1.14 1.23 1.15 1.05 1.53 1.51 1.52 1.39 2+193 1.46 1.25 1.22 1.21 1.18 1.21 1.51 1.48 1.43 1.37 2+551 0.74 0.64 0.62 0.64 0.61 0.65 0.74 0.73 0.73 0.67 2+591 1.40 1.23 1.21 1.21 1.17 1.13 1.44 1.42 1.36 1.26 3151 0.51 0.42 0.40 0.49 0.43 0.55 0.57 0.51 0.52 0.41 3152 0.55 0.49 0.49 0.54 0.52 0.54 0.58 0.57 0.54 0.50 3153 0.50 0.43 0.42 0.46 0.42 0.48 0.53 0.49 0.52 0.43 3191 1.01 0.95 0.93 0.97 0.93 0.84 1.03 1.03 1.05 0.97 3192 1.22 1.14 1.10 1.17 1.13 0.94 1.22 1.21 1.20 1.11 3193 1.13 1.04 1.02 1.08 1.02 0.89 1.10 1.09 1.14 0.90 3551 0.53 0.48 0.46 0.49 0.46 0.50 0.56 0.54 0.53 0.44 3591 1.13 1.04 1.02 1.06 1.02 0.93 1.16 1.15 1.09 0.94 * Key to Digits in Data Set Number First Digit: Number of raters per paper; 2+ means two raters with rescoring if scores were discrepant. Second Digit: Third Digit: Fourth Digit: Number of hundred papers scored Skew of paper quality distribution; 1 is normal, positively skewed, and 3 is negatively skewed. (100 or 500). Number of points in the rating scale (0-5 or 0-9). 2 is 77 Table 8 Overall RMSEs Averaged by Facet (l-Rater Data Sets Omitted) Total NO MN 0.96 0.87 TMN ALL 0.84 Number of Raters Per Paper NO MN TMN 2 0.98 0.93 0.90 2+ 1.07 0.91 0.89 3 0.82 0.75 0.73 Number of Papers NO MN TMN 100 0.95 0.86 0.84 500 0.97 0.87 0.85 Number of RatingiScale Points NO MN TMN 0-5 0.63 0.56 0.54 0—9 1.28 1.17 1.14 Paper Quality Distribution NO MN TMN Normal 0.94 0.87 0.84 Skew + 1.03 0.92 0.89 Skew - 0.91 0.82 0.80 LI TLI EQP OLS .88 0.83 0.82 0.99 LI TLI EQP OLS 0.94 0.89 0.88 1.04 0.91 0.87 0.88 1.10 0.78 0.74 0.71 0.84 LI TLI EQP OLS 0.88 0.83 0.83 0.99 0.87 0.83 0.81 0.99 LI TLI EQP OLS .58 0.54 0.61 0.67 .17 1.12 1.04 1.32 LI TLI EQP OLS 0.86 0.82 0.82 0.98 0.95 0.90 0.86 1.10 0.83 0.78 0.80 0.92 TLS 0.97 TLS 1.00 1.07 0.82 TLS 0.96 0.98 TLS 0.64 1.29 TLS 0.96 1.06 0.89 0.96 1.02 1.05 0.82 0.97 0.95 0.64 1.29 0.95 1.04 0.93 PCM 0.88 PCM 0.94 1.00 0.71 PCM 0.90 0.84 PCM 0.58 1.19 PCM 0.85 0.98 0.86 78 Overall EQP was the method with the lowest overall average RMSE (.82). TLI and TMN had only slightly higher average RMSEs (TLI-—.83; TMN--.84). MN (.87), LI (.88), and PCM (.88) still represented improvements over NO (.96). The three matrix methods (RAS--.96, TLS~-.97, and OLS——.99) did not do as well as no equating (NO). Overall averages are somewhat misleading because different methods worked better under different scoring conditions. Each facet of the simulated data was considered separately. 
Number of Raters Per Paper

Not surprisingly, scoring with three raters per paper resulted in lower RMSEs than scoring with two raters per paper. More surprising was that scoring with two raters without rescoring discrepant cases had lower RMSEs than when discrepant scores were rescored by a third rater. This result contradicts the traditional view that rescoring in discrepant cases produces more accurate scores. With three raters per paper, PCM and EQP had the lowest RMSEs (both .71), while with two raters per paper, EQP and TLI had the lowest RMSEs (both roughly .88). PCM did notably worse with only two raters per paper than with three. OLS, TLS, and RAS continued to perform worse than NO. MN, TMN, and LI had only slightly higher RMSEs than TLI.

Number of Papers Scored

Most methods had the same average RMSE whether averaging over 100-paper sets or 500-paper sets. The one exception was PCM, which had an average RMSE of .90 on 100-paper sets and .84 on 500-paper sets, indicating that PCM was more accurate with large data sets than with smaller ones. All other methods performed about as well on either size of data set as they did overall.

Number of Rating Scale Points

RMSEs were averaged over both 0-5 rating scales and 0-9 rating scales. As would be expected, the RMSEs for the 0-9 scales were almost twice as great as those for 0-5 rating scales. EQP performed much better on the data sets with more scale points. On 0-9 scales, EQP had the lowest RMSE by far (1.04, compared to TLI's 1.12 and NO's 1.28), while on 0-5 scales EQP did only slightly better than NO (.61 vs. .63), and TLI did the best (.54). Thus the overall advantage of EQP is due primarily to its advantage on 0-9 data sets. Note also that because the 0-9 data sets had greater variance in RMSE, they were weighted more heavily when RMSEs were averaged across sets. A supplemental analysis indicated that if the RMSEs were converted to T-scores before averaging (so all data sets were weighted equally), then TLI would have the lowest average RMSE overall.

Paper Quality Distribution

Simulated data sets varied in their distribution of paper quality parameters. Positively skewed data sets were generated to have a greater number of low quality papers; negatively skewed data sets were generated to have a greater number of high quality papers. The relative effectiveness of the ten methods with each shape of distribution was generally the same as their relative effectiveness overall. The negatively skewed data sets had the lowest RMSEs and the positively skewed data sets had the highest RMSEs across the methods. This trend favoring negatively skewed sets is likely due to ceiling effects. Papers with extreme (either high or low) true scores are more accurately rated because scoring errors can only occur in one direction. The data sets were simulated with more high scores than low scores (like the real data sets), so the extreme scores occur mostly at the high end. In this simulation the negatively skewed data sets had relatively more extreme scores than did the positively skewed data sets and were scored more accurately.

Rater Type

Besides the overall RMSE computed for each data set and method with average adjusted scores, RMSEs were also computed for each of the nine raters for the 32 data sets and 10 methods. Rather than consider each rater separately, raters are grouped into sets of three and rater types are analyzed for differences in RMSE.
One analysis is based on the level of the ratings (lenient, average, or stringent) and the other is based on the spread of the ratings assigned (wide, medium, or narrow). Table 9 lists the RMSEs for lenient, average, and stringent raters, both overall and for each of the scoring facets--number of raters (NR), number of papers (NP), number of scale points (NS), and skewness of the score distribution.

Table 9
RMSEs by Rater Type by Facet

Raters 1,2,3 -- Lenient
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.10  1.12  1.08  1.21  1.12  1.22  1.25  1.15  1.13  0.95
NR=2      1.10  1.18  1.14  1.31  1.19  1.28  1.32  1.16  1.16  0.94
NR=2+     1.03  1.05  1.01  1.07  1.03  1.12  1.08  1.04  1.04  0.97
NR=3      1.17  1.13  1.08  1.24  1.13  1.26  1.34  1.24  1.19  0.93
NP=100    1.10  1.14  1.09  1.23  1.13  1.23  1.26  1.15  1.14  0.95
NP=500    1.10  1.07  1.03  1.15  1.08  1.19  1.20  1.13  1.10  0.93
NS=5      0.86  0.76  0.73  0.85  0.77  0.92  1.00  0.90  0.85  0.74
NS=9      1.34  1.48  1.43  1.57  1.46  1.52  1.50  1.39  1.40  1.15
Normal    1.09  1.11  1.07  1.16  1.09  1.23  1.26  1.16  1.12  0.94
Skew +    1.20  1.25  1.19  1.35  1.28  1.36  1.40  1.28  1.23  1.03
Skew -    1.01  1.01  0.97  1.16  1.00  1.07  1.07  0.99  1.04  0.88

Raters 4,5,6 -- Average
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.17  1.16  1.13  1.16  1.11  1.13  1.25  1.23  1.17  1.02
NR=2      1.22  1.18  1.14  1.19  1.13  1.16  1.38  1.35  1.23  1.08
NR=2+     1.11  1.10  1.07  1.13  1.08  1.12  1.12  1.11  1.11  1.06
NR=3      1.18  1.19  1.17  1.16  1.11  1.12  1.26  1.24  1.16  0.92
NP=100    1.17  1.15  1.12  1.16  1.11  1.13  1.27  1.24  1.18  1.03
NP=500    1.16  1.17  1.15  1.17  1.12  1.14  1.22  1.21  1.12  1.00
NS=5      0.80  0.79  0.77  0.83  0.80  0.84  0.84  0.82  0.80  0.66
NS=9      1.54  1.52  1.48  1.49  1.42  1.42  1.67  1.65  1.54  1.37
Normal    1.14  1.14  1.12  1.15  1.10  1.12  1.23  1.21  1.13  0.99
Skew +    1.26  1.24  1.21  1.19  1.15  1.13  1.40  1.37  1.31  1.11
Skew -    1.13  1.10  1.07  1.15  1.09  1.15  1.16  1.14  1.09  0.99

Raters 7,8,9 -- Stringent
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.54  1.19  1.12  1.14  1.07  1.07  1.55  1.52  1.54  1.43
NR=2      1.58  1.27  1.20  1.22  1.15  1.16  1.61  1.58  1.65  1.45
NR=2+     1.34  1.01  0.95  0.98  0.90  0.98  1.38  1.36  1.30  1.33
NR=3      1.68  1.27  1.20  1.23  1.15  1.08  1.65  1.63  1.67  1.52
NP=100    1.53  1.17  1.11  1.13  1.06  1.08  1.54  1.52  1.55  1.44
NP=500    1.55  1.22  1.15  1.17  1.10  1.06  1.55  1.54  1.53  1.42
NS=5      0.99  0.81  0.76  0.77  0.72  0.76  0.99  0.97  1.03  0.90
NS=9      2.08  1.56  1.48  1.51  1.41  1.38  2.10  2.07  2.05  1.97
Normal    1.53  1.20  1.13  1.16  1.09  1.07  1.54  1.52  1.52  1.41
Skew +    1.59  1.12  1.05  1.16  1.06  1.10  1.66  1.63  1.51  1.53
Skew -    1.49  1.22  1.15  1.09  1.04  1.05  1.44  1.40  1.61  1.38

Overall, five methods (NO, OLS, TLS, RAS, and PCM) had greater RMSEs with the scores of stringent raters than with those of lenient raters. The other five methods (MN, TMN, LI, TLI, and EQP) did roughly the same or slightly better with stringent raters' scores than with lenient raters' scores. In particular, EQP had the least RMSE of the ten methods with stringent raters but ranked ninth with lenient raters. In contrast, PCM was the best method with lenient raters and average raters but ranked only sixth with stringent raters.

The number of raters who scored each paper made little difference in the RMSEs of individual raters, except that the raters' RMSEs were slightly less in those data sets where discrepant raters were removed. The individual raters involved in the sets with rescoring were more accurate than in sets without rescoring. But recall that the average adjusted scores on papers were slightly more accurate without rescoring by a third rater. In general, none of the facets of scoring appears to interact with the relative leniency of the raters.
One minor exception is that stringent raters' scores after adjustment by MN, TMN, LI, TLI, or EQP have slightly lower RMSEs for positively skewed data sets than for negatively skewed data sets. In all other cases the negatively skewed RMSE was lower because of ceiling effects, but here the situation is reversed due to floor effects when stringent raters score low-quality papers.

Table 10 analyzes RMSEs by rater score variance. Generally the RMSEs are greater for raters with high score variance than for raters with low score variance. Three methods--LI, TLI, and EQP--have much lower RMSEs for high variance raters than the other methods and slightly higher RMSEs for low variance raters than the other seven methods. PCM has the least RMSE of the ten methods for low variance raters and medium variance raters but ranks sixth for high variance raters.

Table 10
RMSEs by Rater Spread by Facet

Raters 1,4,7 -- High Score Variance
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.56  1.40  1.30  1.14  1.09  1.17  1.66  1.55  1.58  1.41
NR=2      1.63  1.49  1.38  1.22  1.16  1.22  1.77  1.63  1.66  1.48
NR=2+     1.35  1.20  1.11  0.97  0.93  1.05  1.45  1.38  1.32  1.32
NR=3      1.70  1.51  1.40  1.24  1.17  1.24  1.75  1.65  1.77  1.44
NP=100    1.54  1.38  1.26  1.13  1.07  1.16  1.66  1.54  1.56  1.41
NP=500    1.61  1.47  1.39  1.16  1.15  1.20  1.65  1.59  1.66  1.42
NS=5      1.07  0.96  0.88  0.78  0.75  0.87  1.19  1.09  1.08  0.94
NS=9      2.05  1.84  1.71  1.50  1.42  1.47  2.12  2.02  2.08  1.88
Normal    1.56  1.41  1.31  1.12  1.09  1.17  1.63  1.54  1.60  1.39
Skew +    1.80  1.49  1.35  1.22  1.15  1.22  1.98  1.83  1.81  1.65
Skew -    1.31  1.30  1.21  1.12  1.01  1.13  1.40  1.30  1.33  1.22

Raters 2,5,8 -- Medium Score Variance
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.20  1.08  1.05  1.14  1.09  1.13  1.28  1.24  1.22  1.02
NR=2      1.21  1.14  1.11  1.25  1.19  1.20  1.36  1.29  1.31  1.03
NR=2+     1.13  1.00  0.97  1.04  1.00  1.06  1.14  1.12  1.14  1.06
NR=3      1.25  1.09  1.05  1.14  1.09  1.11  1.34  1.30  1.23  0.99
NP=100    1.21  1.09  1.06  1.16  1.11  1.14  1.30  1.25  1.26  1.04
NP=500    1.15  1.04  1.00  1.10  1.05  1.09  1.23  1.20  1.10  0.98
NS=5      0.83  0.73  0.70  0.77  0.72  0.77  0.86  0.83  0.83  0.70
NS=9      1.57  1.43  1.39  1.52  1.46  1.48  1.70  1.64  1.61  1.35
Normal    1.18  1.08  1.05  1.13  1.08  1.12  1.31  1.25  1.17  1.02
Skew +    1.24  1.07  1.04  1.17  1.14  1.16  1.38  1.35  1.21  1.09
Skew -    1.19  1.09  1.05  1.14  1.08  1.11  1.13  1.10  1.33  0.97

Raters 3,6,9 -- Low Score Variance
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.05  0.98  0.98  1.22  1.11  1.13  1.11  1.11  1.03  0.96
NR=2      1.06  0.99  0.99  1.27  1.13  1.17  1.17  1.17  1.08  0.96
NR=2+     1.01  0.96  0.96  1.17  1.08  1.11  1.00  1.00  0.99  0.98
NR=3      1.07  0.99  0.99  1.24  1.13  1.10  1.16  1.16  1.02  0.94
NP=100    1.05  0.99  0.99  1.23  1.11  1.14  1.12  1.12  1.05  0.97
NP=500    1.04  0.95  0.95  1.22  1.11  1.09  1.09  1.09  0.99  0.94
NS=5      0.75  0.68  0.68  0.90  0.82  0.87  0.77  0.77  0.77  0.66
NS=9      1.34  1.29  1.29  1.55  1.41  1.38  1.45  1.44  1.30  1.26
Normal    1.03  0.96  0.96  1.22  1.11  1.13  1.10  1.10  1.01  0.94
Skew +    1.01  1.06  1.06  1.31  1.20  1.22  1.11  1.11  1.03  0.92
Skew -    1.12  0.94  0.94  1.14  1.03  1.02  1.13  1.13  1.08  1.05

Comparing the facets of scoring, the number of raters per paper did not interact with rater score variance. The number of papers scored made a slight difference, with high variance raters more accurate with 100-paper sets than with 500-paper sets and low variance raters more accurate with the larger data sets. This result is likely artifactual, because the 500-paper sets were all normally distributed while the 100-paper sets were either normally distributed, positively skewed, or negatively skewed. The number of rating scale points did not interact with rater score variance.
The skewness of the distributions made some difference, with high score variance raters doing much worse on positively skewed data sets than on negatively skewed data sets, and with low score variance raters doing about the same on all distribution shapes.

Correlations

Another measure of the relative effectiveness of the ten adjustment methods was the Pearson product-moment correlation. As with RMSE, the data sets with only one rater per paper are analyzed separately, because four of the methods were not used on one-rater data sets. The correlation between average adjusted scores and true scores was computed for 24 data sets and for each of the ten adjustment methods. Before averaging across data sets, the correlations were converted to Z-scores using the Fisher's Z transformation: Z = .5 * ln[(1 + r)/(1 - r)]. Because correlations cannot exceed 1.0, this transformation makes the distribution of correlations more nearly normal and reduces the ceiling effect. The Z-values are averaged overall and by each facet, then converted back to correlations (r = [exp(2Z) - 1]/[exp(2Z) + 1]). Table 11 reports these average correlations and their associated Z-scores.

Table 11
Correlations and Z-Transformed Correlations for the Simulated Data Sets
(Average correlations between adjusted scores and true scores for each method--NO, MN, TMN, LI, TLI, EQP, OLS, TLS, RAS, and PCM--reported overall and by facet: NR=2, NR=2+, NR=3, NP=100, NP=500, NS=5, NS=9, SK=1, SK=2, SK=3.)

Of all the methods, TLI had the highest correlation with true scores, r = .92 (Z = 1.59). In rank order, LI (.92), PCM (.91), EQP (.91), TMN (.91), and MN (.90) all had higher correlations than NO (.89). RAS, TLS, and OLS all had correlations of .88. This order agrees closely with the overall order from the RMSE measure. Recall that EQP had a lower RMSE than TLI, but largely because of its superior performance on 9-point scales, which were weighted more heavily than 5-point scales. With correlations all data sets are weighted equally, and TLI outperformed EQP with equal weighting. In general, though, correlation and RMSE agreed closely, so an analysis of all the facets of the scoring situation with correlations revealed nothing beyond the RMSE analysis. Consequently, correlations for individual raters were not analyzed.

Maximum Difference

Another way to compare the relative effectiveness of the ten adjustment methods is with the maximum difference measure, which is the greatest difference between average adjusted score and true score over all the papers in a data set; it measures the magnitude of error in the worst case in any data set. Table 12 lists the maximum differences separately for all of the 32 data sets and for each of the 10 methods. Table 13 reports averages of these maximum errors across data sets, both overall and for each of the scoring facets.
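In the notation used earlier (x-bar_i the average adjusted score and T_i the true score of paper i, with n papers in the set), the measure is simply

\[
\mathrm{MaxDiff} \;=\; \max_{1 \le i \le n}\,\bigl|\bar{x}_i - T_i\bigr|.
\]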
Table 12
Maximum Score Difference (Observed - True) for All Simulated Data Sets and Each Method

Data Set*   NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
1151        2.39  2.22  2.00  2.28  2.28  2.99
1152        2.59  2.26  2.26  2.93  2.93  3.50
1153        2.06  2.32  2.32  2.80  2.80  2.77
1191        5.02  4.29  4.29  5.52  4.98  3.83
1192        4.19  4.23  3.86  4.99  4.10  4.10
1193        6.18  4.09  4.09  5.82  5.75  3.96
1551        3.41  2.93  2.89  3.13  2.97  3.20
1591        7.74  6.94  6.94  6.60  6.60  6.88
2151        2.13  1.65  1.65  1.34  1.34  1.71  2.10  2.00  2.13  1.99
2152        2.18  2.15  1.83  2.25  1.67  2.04  2.55  2.39  1.85  1.99
2153        2.00  2.25  2.25  2.20  2.20  2.37  2.15  2.15  2.19  1.69
2191        3.80  3.51  3.51  3.62  3.62  3.91  4.11  4.11  3.96  3.59
2192        4.89  6.31  5.40  6.55  5.89  5.85  4.29  4.07  5.36  5.25
2193        3.78  3.46  3.46  4.00  3.04  2.64  4.10  4.10  3.78  4.20
2551        3.65  3.43  3.40  2.90  2.90  2.56  3.80  3.65  3.65  3.46
2591        5.04  4.49  4.49  3.89  3.89  3.70  5.19  5.01  5.13  3.97
2+151       1.50  1.28  1.28  1.19  1.19  1.50  1.53  1.50  1.50  1.41
2+152       1.86  1.95  1.57  1.83  1.71  1.86  1.96  1.91  1.88  1.91
2+153       2.09  2.38  2.38  2.41  2.41  2.24  2.09  1.95  2.22  2.43
2+191       3.57  3.41  3.41  3.05  3.05  2.97  3.66  3.66  3.44  3.52
2+192       3.90  2.98  2.98  4.20  3.08  3.03  4.06  3.89  3.89  3.59
2+193       4.25  3.97  3.97  4.34  4.34  4.69  4.76  4.76  3.96  5.31
2+551       2.45  2.04  2.04  2.14  2.14  2.09  2.43  2.43  2.44  2.23
2+591       6.10  5.80  5.80  4.69  4.69  4.07  6.32  6.16  6.10  5.41
3151        1.73  1.26  1.17  1.52  1.33  1.89  1.91  1.71  1.74  1.24
3152        2.01  1.51  1.51  1.47  1.35  1.51  2.01  2.01  1.97  1.40
3153        1.41  1.26  1.26  1.55  1.55  1.49  1.51  1.51  1.62  1.36
3191        2.86  2.70  2.70  2.66  2.66  2.37  3.04  3.04  3.39  2.57
3192        3.20  2.48  2.48  2.56  2.56  2.57  3.41  3.41  2.85  2.67
3193        3.15  3.10  3.08  3.45  3.19  3.10  3.20  3.19  3.12  3.42
3551        2.70  2.31  2.31  1.98  1.98  1.99  2.70  2.70  2.70  1.92
3591        4.09  4.03  3.66  3.73  3.53  3.35  3.98  3.98  4.14  3.14

* Key to Digits in Data Set Number
  First Digit:  Number of raters per paper; 2+ means two raters with rescoring if scores were discrepant.
  Second Digit: Number of hundred papers scored (100 or 500).
  Third Digit:  Number of points in the rating scale (0-5 or 0-9).
  Fourth Digit: Skew of paper quality distribution; 1 is normal, 2 is positively skewed, and 3 is negatively skewed.

Table 13
Maximum Differences Averaged Across Data Sets (1-Rater Data Sets Omitted)

          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       3.10  2.90  2.82  2.90  2.72  2.73  3.20  3.14  3.13  2.90
NR=2      3.43  3.41  3.25  3.34  3.07  3.10  3.53  3.43  3.51  3.27
NR=2+     3.21  2.98  2.93  2.98  2.83  2.81  3.35  3.28  3.18  3.23
NR=3      2.64  2.33  2.27  2.37  2.27  2.28  2.72  2.69  2.69  2.21
NP=100    2.79  2.64  2.55  2.79  2.57  2.65  2.91  2.85  2.82  2.75
NP=500    4.00  3.68  3.62  3.22  3.19  2.96  4.07  3.99  4.03  3.35
NS=5      2.14  1.96  1.89  1.90  1.81  1.94  2.23  2.16  2.16  1.92
NS=9      4.05  3.85  3.74  3.90  3.63  3.52  4.18  4.12  4.09  3.89
SK=1      3.30  2.99  2.95  2.73  2.69  2.68  3.40  3.33  3.36  2.87
SK=2      3.01  2.90  2.63  3.14  2.71  2.81  3.05  2.95  2.97  2.80
SK=3      2.78  2.74  2.73  2.99  2.79  2.75  2.97  2.94  2.81  3.07

Overall, TLI had the least average maximum difference (2.72), followed closely by EQP (2.73). Next were TMN (2.82) and PCM, MN, and LI (all at 2.90). The matrix methods--RAS (3.13), TLS (3.14), and OLS (3.20)--continued to do slightly worse than no equating (3.10). Note that the relative order of the methods based on the maximum difference measure is almost identical to the order from the other two measures. As with the RMSE measure, EQP's slight edge on 9-point data sets was weighted too heavily in the overall average. Comparing the number of raters, Table 13 indicates that two raters with rescoring was more effective in the worst cases than two raters without rescoring.
Of course, the average score of three raters tends to be closer to the true score than the average score of two raters, and the average maximum difference with three raters was less than with two for each of the ten methods. With two raters, TLI had the least average maximum difference of any method (3.07). With two raters and rescoring, EQP had the least average maximum difference (2.81). With three raters PCM did the best, with an average maximum difference of 2.21.

The maximum error was greater with 500-paper data sets than with 100-paper sets. The worst case of a 500-paper set tends to be worse than the worst case of a 100-paper set, even though the average case is expected to be identical. Based on the maximum difference measure, TMN (2.55) and TLI (2.57) performed the best of any method on the 100-paper sets, and EQP (2.96) did best on the 500-paper sets. The magnitude of these differences suggests that even with adjustment of scores, an occasional paper can receive an average observed score that is up to 3 points different from its true score on a 9-point scale.

The maximum error was greater with 9-point scales than with 5-point scales, clearly because of the greater possible range of scores. TLI had the least average maximum difference of any method for 5-point scales (1.81), and EQP was the best method with 9-point scales (3.52).

In terms of skewness of the paper distribution, no clear trend emerged with regard to the maximum difference measure. NO, MN, OLS, TLS, and RAS had their lowest maximum differences with negatively skewed distributions, TMN and PCM did best with positively skewed distributions, and LI, TLI, and EQP did best with normal distributions. Comparing the methods, TMN had the least average maximum difference for both positively and negatively skewed data sets (2.63 and 2.73), and EQP did the best of any method with normally distributed data sets (2.68).

One-Rater Data Sets

The one-rater data sets were analyzed with only six of the methods (NO, MN, TMN, LI, TLI, and EQP), and the measures were generally worse than with two or three raters. The results from those sets were analyzed separately. RMSEs for the one-rater sets are contained in Table 7 and averaged by facet in Table 14.

Table 14
Average RMSEs for the One-Rater Simulated Data Sets

                    NO     MN     TMN    LI     TLI    EQP
All                 1.320  1.187  1.132  1.251  1.175  1.189

Number of Papers
100                 1.301  1.174  1.116  1.264  1.185  1.189
500                 1.377  1.227  1.181  1.212  1.146  1.188

Number of Rating Scale Points
0-5                 0.916  0.836  0.794  0.888  0.826  0.928
0-9                 1.724  1.539  1.471  1.613  1.524  1.449

Paper Quality Distribution
Normal              1.364  1.198  1.133  1.230  1.149  1.159
Skew +              1.313  1.216  1.162  1.269  1.170  1.295
Skew -              1.239  1.136  1.102  1.274  1.233  1.142

For all one-rater sets, TMN had the least RMSE of the six adjustment methods (1.13). Next were TLI (1.18), MN (1.19), and EQP (1.19). LI (1.25) did only slightly better than NO (1.32). On the 100-paper sets with one rater, TMN performed best (1.12), while TLI did best on the sets with 500 papers (1.15). EQP continued its dominance on 9-point rating scales (1.45), while TMN did best on 5-point scales (.79). The skewness of the paper quality distribution made little difference, as TMN outperformed the other methods on all three distribution shapes. Correlations and maximum differences were not analyzed for the one-rater data sets, because the earlier analyses suggested that those measures provided little new information beyond the RMSEs.
CHAPTER FIVE

DISCUSSION

Comparing the adjustment methods on the writing assessment data was inconclusive, because there was no external criterion for deciding which method was superior. Comparing the adjustment methods on the simulated data was also inconclusive, because there is no certainty that the PCM used in simulating the data adequately models what happens in real-life rating situations, and this may have affected the comparative results. Despite these two limitations, this study answered some questions and raised others. The following sections summarize the results by method; recommendations for future study and current practice are then given.

Summary by Method

No method worked best under all scoring conditions. This section describes the situations under which each of the methods was relatively more or less effective than the others.

No equating (NO)

In the simulated data sets, using the scores that raters actually assigned, without adjustment, did not reproduce true scores as well as most of the other equating methods. For the simulated data sets, the average difference in RMSE between the best method (TLI) and no equating was only .09 points for 5-point data sets. For the real data, each of the other equating methods adjusted scores by at least .14 points compared to no equating. It could be argued that the increment in accuracy from using adjustment methods is not great enough to warrant their use. But when that small average difference is multiplied many times, as it is applied to the hundreds of decisions made about individuals based on ratings, the case for some type of equating seems more persuasive. Although no equating is the most common approach used in education and it is fairly accurate, it may not be accurate enough for the high-stakes individual decisions that are often made in large-scale assessment programs. While the present study focused only on writing assessment, the results are equally applicable to other forms of large-scale performance assessment being considered by some states.

Mean Equating (MN) and Truncated Mean Equating (TMN)

Simply adjusting scores up or down to compensate for the stringency of the particular raters who score a paper makes sense and is relatively easy to do. Truncating scores is also desirable to maintain the outer limits of a rating scale. TMN always matched true scores better than MN. TMN often did better than any other equating method, especially on one-rater data sets. (Scoring with only one rater is not a recommended practice, and the RMSEs were much larger with one-rater sets than with two-rater sets even after adjustment.) While TMN does not take full advantage of all available information in estimating true scores, it does well even with large data sets. However, with small data sets, or if papers are not randomly assigned, rater stringency can be confounded with real differences in paper quality. The TMN method has the advantages of easy computation, political defensibility, and a fairly high level of accuracy.

Linear Equating (LI) and Truncated Linear Equating (TLI)

As with mean equating, truncation always represented an improvement over no truncation. In the simulated data, TLI was likely the most accurate method overall. By compensating for both the level and the spread of rater scoring, TLI does a better job in theory than TMN. On small data sets, where only a few scores are available for some raters, the estimates of rater scoring variance are unstable and linear equating may not work as well as mean equating.
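As a point of reference, both adjustments reduce to a few lines of code. The sketch below is not one of the appendix programs, but it restates the formulas used in EZEQ.BAS (Appendix A): it adjusts a single rating sc assigned by rater j, assuming the rater means MN(), the rater standard deviations SD(), the overall values in element 0, and the scale maximum MP have already been computed from randomly assigned papers.

' Minimal sketch of the TMN and TLI adjustments for one rating.
' MN(j) and SD(j) are rater j's mean and SD of assigned scores;
' MN(0) and SD(0) are the overall mean and SD; MP is the scale maximum.
DECLARE FUNCTION TruncMean (sc, j)
DECLARE FUNCTION TruncLinear (sc, j)
DIM SHARED MN(9), SD(9), MP

FUNCTION TruncMean (sc, j)
    a = sc - MN(j) + MN(0)                          ' shift rater j to the overall mean
    IF a > MP THEN a = MP ELSE IF a < 0 THEN a = 0  ' truncate to the scale limits
    TruncMean = a
END FUNCTION

FUNCTION TruncLinear (sc, j)
    a = (sc - MN(j)) / SD(j) * SD(0) + MN(0)        ' match both level and spread
    IF a > MP THEN a = MP ELSE IF a < 0 THEN a = 0  ' truncate to the scale limits
    TruncLinear = a
END FUNCTION

With only one or two scores from a rater, SD(j) in the second form is poorly estimated, which is one reason linear equating needs more papers per rater than mean equating.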
In data sets where each rater scored at least 20 papers, linear equating did well. A disadvantage of TLI and TMN is the assumption of equally spaced intervals on the rating scale. Methods such as EQP and PCM, which assume only an ordinal scale rather than an interval scale, should work better when raters do not assign scores as if the units were equally spaced. In general, though, TLI is a good method because it is easy to compute, it is not too hard to explain, and it proved accurate on the simulated data in this study.

Equipercentile equating (EQP)

In the simulated data sets, the advantage of EQP was in those sets with 0-9 rating scales. The difference in EQP's accuracy on 5-point scales compared to 9-point scales suggests that the adjustments EQP gave with the 0-3 scales in the real data sets were inaccurate. It may not be possible to make nine distinct categories for evaluating writing, but if raters were encouraged to use half-points in those cases when they have trouble deciding between two scores, a 0-5 scale would become a 0-9 scale and EQP would work better. Alternatively, smoothing techniques (connecting the points with curves rather than with line segments) could make EQP a more accurate method for scales with fewer points. Unlike most methods, EQP has the theoretical advantage of assuming only that scale points are ordered categories, without the stronger assumption of equally spaced units.

OLS, TLS, and Rasch Extension

These three methods, the matrix methods, did not do well in the simulated data sets. It was disturbing that these methods did not appear as effective in reproducing true scores as no equating under any of the scoring conditions. A series of diagnostic steps was performed to better understand why these methods did not perform well in this study.

First, the computer program for these methods was checked for "bugs", but the algorithms for matrix multiplication and inversion worked on sample data, and the estimation procedure for rater effects consisted only of a series of those matrix operations. There were no programming errors that would account for the poor performance of those methods.

Second, the de Gruijter (1984) formulas were checked for errors. There was a minor error in one formula in his paper, which was easily corrected, but his general method of estimating rater differences from assigned score differences seemed sound. It was a different estimation method from that used in other studies that showed matrix estimation methods to be effective (e.g., Cason & Cason, 1984; Raymond & Houston, 1990). As was mentioned earlier, though, those methods worked well with small data sets but would have been unwieldy with the large data sets in this study.

Third, it was possible that the PCM used in the simulation produced data that did not fit the linear model assumed by OLS. To test this, an additional data set was simulated based on the linear model. The linear model regards observed scores as true scores plus rater effects plus a random error term: Xij = Ti + Rj + eij, where Xij is the score rater j assigns to paper i, Ti is the paper's true score, Rj is the rater effect, and eij is random error.

Supplemental Analysis

In this data set, paper true scores were normally distributed with mean 3.5 and variance 1. The raters were from a 3 x 3 design in which one dimension was leniency: three lenient raters added .5 points per paper, three raters were average with a rater effect of 0 points, and three raters gave scores .5 points lower than true scores. Each observed score also included an additional error term.
Three raters (one of each leniency level) had random error distributed normally with mean 0 and SD 1.0, three raters had error terms with mean 0 and SD 0.7, and three raters had error terms with mean 0 and SD 0.4. The data set consisted of 100 papers scored by two raters each on a 0-5 scale. Table 15 lists the frequencies of scores assigned by each rater and overall, as well as the RMSE, correlation, and maximum difference comparisons with true scores for each of the ten methods.

Table 15
Supplemental Data Set Simulated From a Linear Model

Score Frequencies
RATER     0    1    2    3    4    5     N     MN     SD
TOTAL     7   14   38   45   57   39   200   3.240  1.338
1         0    2    7    6    4    3    22   2.954  1.186
2         2    2    5    6    8    0    23   2.695  1.266
3         3    5    9    4    3    1    25   2.080  1.293
4         0    1    3    4    5    4    17   3.470  1.193
5         1    0    4    4    6    2    17   3.176  1.247
6         0    2    5    7    8    2    24   3.125  1.092
7         1    0    1    2    6   11    21   4.142  1.245
8         0    2    2    6    7   11    28   3.821  1.226
9         0    0    2    6   10    5    23   3.782   .883

Comparisons of Adjusted Scores with True Scores
          RMSE   CORR   MAXDIF
NO        .580   .879   1.365
MN        .532   .882   1.391
TMN       .519   .886   1.391
LI        .569   .886   1.399
TLI       .542   .889   1.399
EQP       .539   .895   1.515
OLS       .647   .863   1.378
TLS       .634   .861   1.378
RAS       .545   .886   1.476
PCM       .598   .867   1.553

In this data set TMN did the best, followed by MN, EQP, and TLI. RAS had the next lowest RMSE, then LI. All of those methods were more accurate than NO, but PCM, TLS, and OLS did worse than not equating at all. Of greatest interest was the improved performance of RAS, though OLS and TLS continued to perform unacceptably. PCM did not do as well in this data set as in the simulated data sets generated from the PCM.

After this analysis, it was hypothesized that the matrix methods were not effective because of the nature of the rating scale itself. The linear model assumes continuous linear scales, without rounding or ceiling effects. The observed data, though, must be integers from 0 to 5. The rounding and ceiling effects inherent in the categorical scale are assumed to be part of the error term. The simple linear methods (MN, TMN, LI, and TLI) make the same assumptions of a linear and continuous scale, but apparently they are more robust and not as easily affected by violations of the model assumptions.

RAS did better than OLS and TLS on the supplemental data set. RAS was more accurate because it controls for non-linearity at the extremes of the rating scale before estimating rater effects, whereas TLS controls for ceiling effects after doing the estimation, and OLS does not control for these effects at all.

Nothing in this study recommended the matrix methods over the simpler mean or linear equating methods for score adjustment. The matrix methods may be more accurate in small data sets, where differences in rater means are due to sampling error or to real differences in paper quality. The least-squares methods may also reproduce true scores better when scoring is more continuous and less affected by floor or ceiling effects.

Partial Credit Model

Despite being the model used to generate the simulated data, PCM rarely performed better than the simpler models. The PCM had more parameters to be estimated than any other method and by far took the most computer time to make score adjustments. While the other methods took at most three or four minutes, PCM took four hours or more to iteratively estimate parameters and adjust scores. To estimate its parameters accurately, a great deal of data is needed. PCM did the best in situations with many papers (the 500-paper data sets), more raters (three per paper), and few scale points (the 0-5 scales).
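For reference, the model whose parameters must be estimated is Masters' (1982) partial credit model, the same model used to generate true scores in TRUEGEN.BAS (Appendix A). In the notation used here (an assumption of this summary, not the dissertation's own symbols), a paper of quality theta receives score x from rater j, whose step parameters on an m-point scale are delta_j1, ..., delta_jm, with probability

\[
P(X = x \mid \theta) \;=\;
\frac{\exp\!\Bigl(\sum_{k=1}^{x} (\theta - \delta_{jk})\Bigr)}
     {\sum_{h=0}^{m} \exp\!\Bigl(\sum_{k=1}^{h} (\theta - \delta_{jk})\Bigr)},
\qquad x = 0, 1, \ldots, m,
\]

where the empty sum for x = 0 is taken to be zero. Each rater thus contributes one step parameter per scale point, which is one reason stable estimation requires so many ratings from every rater.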
The scoring situations under which PCM did the best are not realistic. In the 3-rater, 500-paper data sets, each simulated rater scored over 150 papers. Rarely in real-life settings do raters score that many papers. Also, real-life data do not fit the PCM as well as the data simulated from the PCM did. In the one set generated from a different model, PCM reproduced true scores worse than no equating. Although PCM proved to be a flexible model for generating the data, as an equating method it required more scores from each rater than are typically available.

Recommendations

This study would have been more conclusive if it could have determined which adjustment methods best reproduce true scores for real-life data sets. This could be done in one of two ways. First, if it could be shown that the model used in the simulation adequately represents what happens when real people score real papers, then the results of the simulated data study would apply to real settings. Second, if a good estimate of true scores for the real-life data were available, then the adjustment methods could be compared in an absolute sense and not just relative to one another. These two improvements suggest directions for future research.

Better Models of Rater Scoring

The simulation in this study was not adequately linked to real-life scoring. If the assumptions of the model used in the simulated data sets are consistent with real scoring, then the method comparisons for the simulated data are valid. But the PCM is only one model for categorical scoring, and it may not be the best model for the context of essay scoring. More study is needed to determine which mathematical model best represents how real-life raters score real-life papers. The PCM assumes that raters assign an initial score of 0 to a paper and then increment its score until they decide (based on their stringency and the paper's quality) that the score should not be any higher. The linear model assumes that the true score of a paper is adjusted both by an overall rater effect and by an error term peculiar to that particular rater-paper combination; it does not model the categorical nature of the rating scale.

One direction of study for developing a better model would be to attempt to understand the cognitive processes raters actually go through when they assign scores. A way to do this is to have raters "think out loud" as they read papers and assign scores. By analyzing rater thinking, one could better understand how raters decide which score category to assign a paper. Such a study might also clarify how raters differ in the extent to which they value various aspects of paper quality. Holistic scoring provides a one-dimensional measure of paper quality, which is clearly a multi-dimensional phenomenon. An adequate model might entail more than one parameter for paper quality.

Better True Score Estimates for Real Data

Another direction for future research is to get better measures of real-life true scores. In this study, applying the adjustment methods to the writing assessment data showed only how the methods compared to one another; adjusted scores could not be compared to true scores. One way to get true score estimates would be to have a set of papers scored by a team of expert raters. Their average score would be an estimate of a paper's true score.
Then the papers would be rescored by two or three raters from a different team, and the adjustment methods could be compared on how closely they adjusted the second team's scores toward the average score of the expert team.

One surprising finding in the study was that two raters with rescoring by a third rater when scores were discrepant proved less accurate than scoring with two raters without resolving discrepant scores. If better real-life true score estimates were available, that finding could be investigated further. It may be that the average of two discrepant scores is a better measure of a paper's quality than the average of two scores only one of which was discrepant. But it may also be that rescoring eliminates an occasional paper with a large scoring error, as the standard rationale for rescoring would suggest. By examining the maximum error statistics, research could be focused on discrepant cases to better understand why unusual scores occur.

Implications for Practice

Other refinements of the adjustment methods used in this study could be developed and investigated. When raters' scores on a paper are combined, instead of using the simple mean as in this study, any adjustment method could use a precision-weighted mean so that accurate raters' scores are weighted more heavily than those of less accurate raters. Equipercentile equating worked well with 9-point scales, and with smoothing techniques it ought to give better results with fewer scale points as well. Equipercentile equating needs further refinement before it can be recommended over the simpler linear methods in general, but it has the theoretical advantage of assuming that scales are ordinal rather than interval measures. More study is needed on the matrix methods to determine why they did not work well in this study but did in other studies. PCM was effective in the larger simulated data sets, but the evidence did not suggest that the method would work well with data based on other scoring models. PCM is also not recommended as an adjustment method because of the excessive computing time it requires and the amount of data necessary to get stable parameter estimates.

Based on this study, score adjustment methods are recommended in those high-stakes contexts where decisions are based on the scores assigned by only some of a team of raters. The sophistication of the adjustment method used should depend on the amount of data available. With fewer than 8 papers per rater or with non-random assignment of papers to raters, no equating should be used, because estimates of raters' mean scores are unstable. With 8 to 20 papers per rater and random assignment of papers to raters, truncated mean equating is recommended because of the instability of the rater variance estimates. With more than 20 papers per rater, truncated linear equating should result in more accurate score estimates than the other methods. By using statistical methods to equate ratings, researchers and decision-makers can get better measures of traits that are hard to measure.

EPILOGUE

Although this study did not provide definitive answers, it does provide some insights into Ivan's situation as described in the introduction. If papers were randomly assigned to raters and raters' mean scores differed significantly, some statistical adjustment of scores would be advisable. The simulated data sets in this study indicated that the simpler linear methods did a good job of reproducing true scores.
There was little evidence that the more sophisticated matrix methods or item response theory methods resulted in improved adjusted scores. Because truncated linear equating compensates for rater differences in both the level and the spread of the scores they assign, and keeps scores within the limits of the rating scale, it is recommended over the other linear methods.

In addition to a post hoc procedure such as truncated linear equating, the district could try to improve the training of raters. Training should make raters as homogeneous in their scoring as possible; the more raters agree, the less statistical adjustment will affect scores. In practice, raters vary in how they assign scores even when they are trained extensively. If individual raters are not consistent in how they assign scores, then no amount of statistical adjustment can make invalid scores valid. But if raters assign scores consistently, rater effects can be controlled by statistical adjustment.

LIST OF REFERENCES

References

Andrich, D. (1978). Application of a psychometric rating model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.

Angoff, W. H. (1971). Norms, scales, and equating. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.

Braun, H. I. (1986). Calibration of essay readers (Final Report) (Program Statistics Research Tech. Rep. No. 86-68). Princeton, NJ: Educational Testing Service. (ERIC Document Reproduction Service No. ED 274 673)

Braun, H. I. (1988). Understanding scoring reliability: Experiments in calibrating essay readers. Journal of Educational Statistics, 13, 1-18.

Breland, H. M. (1983). The direct assessment of writing skill: A measurement review (College Board Report No. 83-6). New York: College Entrance Examination Board. (ERIC Document Reproduction Service No. ED 242 756)

Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing Program.

Cason, G. J., & Cason, C. L. (1984). A deterministic theory of clinical performance rating: Promising early results. Evaluation and the Health Professions, 7, 221-247.

Cason, G. J., & Cason, C. L. (1989, April). Rater stringency error in performance rating: A contrast of three models. Paper presented at the annual meeting of the American Educational Research Association, San Francisco. (ERIC Document Reproduction Service No. ED 306 254)

Choppin, B. H. (1982). The use of latent trait models in the measurement of cognitive abilities and skills. In D. Spearritt (Ed.), The improvement of measurement in education and psychology. Melbourne: Australian Council for Educational Research.

Coffman, W. E. (1971). Essay examinations. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 271-302). Washington, DC: American Council on Education.

Constable, E., & Andrich, D. (1984, April). Inter-judge reliability: Is complete agreement among judges the ideal? Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA. (ERIC Document Reproduction Service No. ED 243 962)

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley.

De Gruijter, D. N. M. (1984). Two simple models for rater effects. Applied Psychological Measurement, 8, 213-218.

Denny, G. S. (1989). Calibrating for rater stringency: A comparison of three methods.
Unpublished manuscript, Michigan State University.

Ebel, R. L. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 407-424.

Glass, G. V., & Hopkins, K. D. (1984). Statistical methods in education and psychology (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.

Godshalk, F. I., Swineford, F., & Coffman, W. E. (1966). The measurement of writing ability. New York: College Entrance Examination Board.

Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.

Houston, W., Raymond, M., & Svec, J. (1990, April). Adjustments for rater effects in performance assessment: An empirical investigation. Paper presented at the annual meeting of the American Educational Research Association and the National Council on Measurement in Education, Boston.

Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72-107.

Landy, F. J., & Farr, J. L. (1983). The measurement of work performance: Methods, theory, and applications. New York: Academic Press.

Linacre, J. M. (1987a). An extension of the Rasch model to multi-faceted situations. Chicago: University of Chicago.

Linacre, J. M. (1987b, December). A multi-faceted Rasch measurement model. Paper presented at the Midwest Objective Measurement Seminar, Chicago.

Lunz, M. E., Linacre, J. M., & Wright, B. D. (1988, April). The impact of judge severity on examination scores. Paper presented at the annual convention of the American Educational Research Association, New Orleans.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Paul, S. R. (1976). Models and estimation procedures for the calibration of examiners. PhD thesis, The University College of Wales, Aberystwyth.

Paul, S. R. (1979). Models and estimation procedures for the calibration of examiners. British Journal of Mathematical and Statistical Psychology, 32, 242-251.

Paul, S. R. (1981). Bayesian methods for calibration of examiners. British Journal of Mathematical and Statistical Psychology, 34, 213-223.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York: Macmillan.

Raymond, M. R., & Houston, W. M. (1990, April). Detecting and correcting for rater effects in performance assessment. Paper presented at the annual meeting of the American Educational Research Association and the National Council on Measurement in Education, Boston.

Stanley, J. C. (1961). Analysis of unreplicated three-way classifications, with applications to rater bias and trait independence. Psychometrika, 26, 205-219.

Weare, J., Moore, J., & Woodall, F. (1987). Interrater reliability: A selected and annotated bibliography of articles concerning interrater reliability. (ERIC Document Reproduction Service No. ED 280 898)

Webb, L. C., Raymond, M. R., & Houston, W. M. (1990, April). Rater stringency and consistency in performance assessment. Paper presented at the annual meeting of the American Educational Research Association and the National Council on Measurement in Education, Boston.

Wherry, R. J. (1950). The control of bias in rating: Survey of the literature (Personnel Research Board Report 898). Washington, DC: Department of the Army, Personnel Research Section.

Wherry, R. J. (1952). The control of bias in rating: A theory of rating (Personnel Research Board Report 922). Washington, DC: Department of the Army, Personnel Research Section.

Wilson, H. G. (1988). Parameter estimation for peer grading under incomplete design.
Educational and Psychological Measurement, 38, 69-81. Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press. Young, F. W., Takane, Y., & Lewyckyj, R. J. (1988). ALSCAL [Computing procedure]. In SPSS-X User's Guide (3rd. ed.) [Computer program manual]. Chicago: SPSS Inc. APPENDICES 109 APPENDIX A PROGRAMS FOR GENERATING AND ANALYZING SIMULATED DATA SETS All programs are written using QuickBasic, version 4.5. ' RATER.BAS ' This is the first of a series of programs to compare ' rater calibration methods. ' This program asks for the values of these parameters: ' NUMBER OF RATERS PER PAPER (1, 2, 3, or 2*) ' NUMBER OF PAPERS (100 or 500) ' NUMBER OF SCALE POINTS (5 or 9) ' PAPER QUALITY DISTRIBUTION (Normal, Positive, or Negative Skew) ' It then chains to the second program PARGEN.BAS, ' carrying over the values of the four parameters in F$ COMMON F$ CLS PRINT PRINT "How many raters per paper"; : INPUT MJ F$ = CHR$(48 + MJ) PRINT PRINT "How many HUNDRED papers rated"; : INPUT NP F$ = F$ + CHR$(48 + NP) PRINT PRINT ”How many scale points"; : INPUT MP F$ = F$ + CHR$(48 + MP) PRINT PRINT "Which type of skew (1: normal, 2: positive, 3: negative)"; : INPUT SKEW F$ = F$ + CHR$(48 + SKEW) CHAIN "pargen.bas" END 110 ’ PARGEN.BAS ' PARGEN is the second program in the series. ' PARGEN follows RATER.BAS and precedes TRUEGEN.BAS. ' PARGEN generates paper parameters and stores them in ' C:\THESIS\SIMDATA\F$\F$.PAP ' where F$=wxyz ' w=MJ (Number of raters per paper, O=two + one) ' x=NP / 100 (Number of hundred papers rated) ' y=MP (Number of scale points) ' z=SKEWNESS (1: Normal 2: Positive 3: Negative) COMMON F$ mj = VAL(MID$(F$, 1, 1)) hp = VAL(MID$(F$, 2, 1)) * 100 mp = VAL(MID$(F$, 3, 1)) skew = VAL(MID$(F$, 4, 1)) ' Initialization q = mj * np * mp * skew RANDOMIZE q P12 = ATN(1) * 8 IF F$ = "" THEN F$ = ”NULL" PAPER$ = "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ”.PAP" OPEN PAPER$ FOR OUTPUT AS #1 CLS ' Parameter Generation PRINT "Paper quality parameters (out of"; np; ") :"; SELECT CASE skew CASE 1 FOR I = 1 TO np LOCATE l, 45: PRINT I PRINT #1, SQR(-2 * LOG(RND)) * SIN(PI2 * RND) * 3 + 2 NEXT I CASE 2 FOR I = 1 TO np LOCATE 1, 45: PRINT I PRINT #1, ABS(SQR(—2 * LOG(RND)) * SIN(PI2 * RND)) * 5 - 3 NEXT I CASE 3 FOR I = 1 TO np LOCATE l, 45: PRINT I PRINT #1, -5 * ABS(SQR(—2 * LOG(RND)) * SIN(PIZ * RND)) + 7 NEXT I END SELECT PRINT 111 PRINT "Paper quality parameters loaded into PRINT : PRINT CLOSE #1 CHAIN "TRUEGEN.BAS" END ' TRUEGEN.BAS file ", PAPER$; ' TRUEGEN is the third program in the series. 
' TRUEGEN follows PARGEN.BAS and precedes OBSGEN.BAS ' TRUEGEN generates rater parameters, and combines them with ' the paper parameters to get true scores for each paper, which are ' stored in ' where F$ is the same as in PARGEN.BAS COMMON F$, J() m3 = VAL(MID$(F$, 1, 1)) np = VAL(MID$(F$, 2, 1)) * 100 mp = VAL(MID$(F$, 3, 1)) skew = VAL(MID$(F$, 4, 1)) P$ = "C:\THESIS\SIMDATA\” + F$ + "\" + F$ + T$ = "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + OPEN T$ FOR OUTPUT AS #2 OPEN P$ FOR INPUT AS #3 ' Generate Rater Parameters J(j,i) nj = 9 DIM J(nj, mp) FOR I = —1 TO 1: FOR J = -1 TO 1 "' 3x3 FOR k = 1 TO mp: L = 3 * I + J + 5 J(L, k) = ((—6 + (k - 1) * 12 / (mp - 1)) * NEXT R, J, I ' Compute true scores DIM e(mp), s(nj) BIG = EXP(20) CLS PRINT "True scores generated (out of"; FOR k 1 TO np LOCATE l, 42: INPUT #3, ab ave 0 FOR I = e(O) - T 0 FOR J 1 TO mp 'Compute values in numerator e(J) e(J - 1) * EXP(ab - J(I, J)) “P; PRINT k; 1 TO nj 1 C:\THESIS\SIMDATA\(F$)\(F$).TRU ".PAP" ".TRU" design, NJ=9 2 J + 2 * I) / 1.5 I!) 'If numbers start getting large, divide all by a large constant 112 IF e(J) > BIG THEN FOR L = 0 TO J: e(L) = e(L) / BIG: NEXT L NEXT J 'Get total for denominator FOR L = 0 TO mp: T = T + e(L): NEXT L 'Get the expected score for rater I s(I) = 0 FOR L = 1 TO mp: s(I) = 8(1) + e(L) / T * L: NEXT L 'Get the average score across all nine raters ave = ave + s(I) NEXT I PRINT #2, ave / nj NEXT k PRINT : PRINT CLOSE CHAIN "OBSGEN.BAS" END ' OBSGEN.BAS OBSGEN is the fourth program in the series. ' OBSGEN follows TRUEGEN.BAS and precedes EZEQ.BAS ' OBSGEN selects raters randomly, and probabilistically generates observed scores from each rater and stores them in ' c:\THESIS\SIMDATA\(F$)\(F$).OBs ' in the format rater#;rating;...;rater#;rating COMMON f$, J() MJ = VAL(MID$(f$, l, 1)) np = VAL(MID$(f$, 2, 1)) * 100 mp = VAL(MID$(f$, 3, 1)) skew = VAL(MID$(f$, 4, 1)) nj = 9 OPEN "C:\THESIS\SIMDATA\" + f$ + "\" + f$ + ".085" FOR OUTPUT AS #2 OPEN "C:\THESIS\SIMDATA\” + f$ + "\" + f$ + ".PAP" FOR INPUT AS #1 q = MJ * up + mp * skew RANDOMIZE q DEF fnr (x) = INT(RND * x) + 1 DIM r(3), s(3) CLS PRINT "Number of observed scores generated (out of"; np; ") :"; ' Randomly select MJ raters SELECT CASE MJ CASE 0 raters = 2 FOR I = 1 TO np CASE 1 CASE 2 113 INPUT #1, ab r(l) = fnr(nj): r(2) = r(l) DO UNTIL r(2) <> r(l) r(2) = fnr(nj) LOOP J = 1: GOSUB SCOREGEN J = 2: GOSUB SCOREGEN IF ABS(s(l) - s(2)) > 1 THEN r(3) = r(l) DO UNTIL (r(3) <> r(2) AND r(3) <> r(1)) r(3) = fnr(nj) LOOP J = 3: GOSUB SCOREGEN d1 = ABS(s(3) - s(I)): d2 = ABS(s(3) — s(2)) IF d1 < d2 OR (d1 = d2 AND s(l) > s(2)) THEN PRINT #2, r(l); s(l); r(3); s(3) ELSE PRINT #2, r(2); s(2); r(3); s(3) END IF ELSE PRINT #2, r(l); s(l); r(2); s(2) END IF LOCATE 1, 55: PRINT I NEXT I raters = 1 FOR I = 1 TO np INPUT #1, ab r(l) = fnr(nj) J = l: GOSUB SCOREGEN PRINT #2, r(l), s(l) LOCATE 1, 55: PRINT I NEXT I raters = 2 FOR I = 1 TO np INPUT #1, ab r(l) = fnr(nj): r(2) = r(l) DO UNTIL r(2) <> r(l) r(2) = fnr(nj) LOOP J = 1: GOSUB SCOREGEN J = 2: GOSUB SCOREGEN PRINT #2, r(l); s(l); r(2); s(2) LOCATE l, 55: PRINT I 114 NEXT I CASE 3 END SELECT CLOSE raters = 3 FOR I = 1 TO np INPUT #1, ab r(l) = fnr(nj): r(2) = r(l) DO UNTIL r(2) <> r(l) r(2) = fnr(nj) LOOP r(3) = r(l) DO UNTIL (r(3) <> r(2) AND r(3) <> r(1)) r(3) = fnr(nj) LOOP J = l: GOSUB SCOREGEN J = 2: GOSUB SCOREGEN J = 3: GOSUB SCOREGEN PRINT #2, r(l); s(l); r(2); s(2); r(3); s(3) LOCATE l, 55: PRINT I NEXT I CHAIN "EZEQ.BAS" SCOREGEN: RETURN END ' Subroutine that generates an observed score 
s(j), given ' rater R(j) and ability AB so = O: r = r(J) DO UNTIL RND > 1 / (1 + EXP(J(r, so + 1) — ab)) so = so + 1: IF sc = mp THEN EXIT DO LOOP s(J) = sc 115 ' EZEQ.BAS ' EZEQ is the fifth program in the series. ' EZEQ follows OBSGEN.BAS and precedes EQPEQ.BAS ' EZEQ does no equating —- into ' mean equating —— into ' truncated mean equating —- into ' linear equating —- into ' truncated linear equating —— into ' Files consist of NP lines where each C:\THESIS\SIMDATA\(F$)\(F$).NO C:\THESIS\SIMDATA\(F$)\(F$).MN C:\THESIS\SIMDATA\(F$)\(F$).TMN C:\THESIS\SIMDATA\(F$)\(F$).LI C:\THESIS\SIMDATA\(F$)\(F$).TLI line is of the form ' rater#;rater adj score;...;rater#;rater adj score;overall adj score COMMON F$, j(): X%(), r%() mj = VAL(MID$(F$, 1. 1)) IF mj = 0 THEN mj = 2 np = VAL(MID$(F$, 2, 1)) * 100 mp = VAL(MID$(F$, 3, 1)) nj = 9 DIM X%(npr mj)! r%(npl mj)! n(nj)r s(nj), ss(nj)l mn(nj), Sd(nj) ' x%(i,j) is the score the jth rater gave the ith paper ' r%(i,j) is the number of the jth rater on the ith paper ' n(j) is the number of ratings given by rater j ' s(j) is the sum Of the ratings given by rater j ' ss(j) is the sum of squares of the ratings given by rater j ' mn(j) is the mean rating given by rater j ' sd(j) is the standard deviation of the ratings ' where j=0 represents totals over all raters given by rater j OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".088" FOR INPUT AS #1 CLS PRINT "Loading Observed data (out of"; np; ") :" FOR i = 1 TO np FOR j = 1 TO mj INPUT #1, r, sc r%(i, j) = r: x%(i, j) = sc n(O) = n(O) + 1: 5(0) = 5(0) + sc: ss(O) = ss(O) + sc * sc n(r) = n(r) + l: s(r) = s(r) + 80: ss(r) = ss(r) + sc * sc NEXT j LOCATE 1, 40: PRINT i NEXT i CLOSE #1 PRINT PRINT "Computing means and SDs" FOR j = 0 TO nj mn(j) = S(j) / n(j) sd(j) = SQR((SS(j) - 5(j) 2 / n(3')) / r1(3')) NEXT j 116 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".NO" FOR OUTPUT AS #2 PRINT : PRINT "NO equating —- ("; np; "cases) :" FOR i = 1 TO np adj = 0 FOR j = 1 TO mj PRINT #2, r%(i, j); x%(i, j); adj = adj + x%(i, j) NEXT j PRINT #2, adj / mj LOCATE 5, 40: PRINT i NEXT 1 CLOSE #2 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".MN" FOR OUTPUT AS #2 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".TMN" FOR OUTPUT As #3 PRINT : PRINT "Mean equating -- ("; np; "cases) :" FOR i = 1 TO np adj = O tadj = 0 FOR j = 1 TO mj sc = x%(i, j) — mn(r%(i, j)) + mn(O) tsc = so IF tsc > mp THEN tsc = mp ELSE IF tsc < 0 THEN tsc = O PRINT #2, r%(i, j); sc; PRINT #3, r%(i, j); tsc; adj = adj + sc tadj = tadj + tsc NEXT j PRINT #2, adj / mj PRINT #3, tadj / mj LOCATE 7, 40: PRINT 1 NEXT 1 CLOSE #2 CLOSE #3 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".LI" FOR OUTPUT As #2 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".TLI" FOR OUTPUT AS #3 PRINT : PRINT "Linear equating —— ("; np; "cases) :" FOR 1 = 1 TO np adj = O tadj = 0 FOR j = 1 TO mj aj = r%(i, 1) SC (x%(i, j) - mn(aj)) / sd(aj) * sd(O) + mn(O) tsc = sc IF tsc > mp THEN tsc = mp ELSE IF tsc < 0 THEN tsc = O PRINT #2, aj; sc; PRINT #3, aj; tsc; adj = adj + sc tadj = tadj + tsc NEXT j PRINT #2, adj / mj 117 PRINT #3, tadj / mj LOCATE 9, 40: PRINT i NEXT i CLOSE #2 CLOSE #3 CHAIN "EQPEQ.BAS" ' EQPEQ.BAS EQPEQ is the sixth program in the series. ' EQPEQ follows EZEQ.BAS and precedes OLS.BAS ' EQPEQ does equipercentile equating, storing results in ' C:\THESIS\SIMDATA\(F$)\(F$).EQP in the form rater#;rater adj score;...;rater#;rater adj score;overall adj score COMMON f$, j(): X%(): r%() mj = VAL(MID$(f$. 1. 
1)) IF mj = 0 THEN mj = 2 mp = VAL(MID$(f$, 2, 1)) * 100 mp = VAL(MID$(f$, 3, 1)) nj = 9 OPEN "C:\THESIS\SIMDATA\" + f$ + "\" + f$ + ”.EQP" FOR OUTPUT AS #2 DIM w%(nj, mp) CLS PRINT "Equipercentile equating loading -— (out of"; np; ") :" FOR i = 1 TO np FOR L = 1 TO mj aj = r%(i, L) FOR K = 0 TO mp IF x%(i, L) <= K THEN w%(aj, K) = w%(aj, K) + 1: w%(O,K) = w%(O,K)+1 NEXT K, L LOCATE l, 55: PRINT 1 NEXT 1 bign = w%(0, mp) PRINT : PRINT " equating --" FOR i = 1 TO np sc = 0 FOR j = 1 TO mj aj = 13%(1, j) X = X%(ir j) s w%(aj, x) / w%(aj, mp) t w%(0, 0) / bign IF S <= t THEN adj = O ELSE a = -l WHILE t < s r = t a = a + l 118 t = w%(0, a) / bign WEND adj = a + (s - r) / (t — r) - 1 END IF sc = sc + adj PRINT #2, aj; adj; NEXT j PRINT #2, so / mj LOCATE 3, 55: PRINT i NEXT 1 CLOSE #2 CHAIN "OLS.BAS" ' OLS.BAS OLS is the seventh program in the series. ' OLS follows EQPEQ.BAS and precedes RASCHEXT.BAS OLS does ordinary least squares equating, storing adjusted scores in ' C:\THESIS\SIMDATA\(F$)\(F$).OLS in the usual format ' rater; rater's adjusted score; ...; overall adjusted score ' The program follows the algorithm in de Gruijter (1984). DECLARE SUB inverse (x!(), N!, b!()) DECLARE SUB matmult (a!(), mi, N!, b!(), 0!, pl, c1()) COMMON F$ mj = VAL(MID$(F$, 1, 1)) IF mj = 0 THEN mj = 2 IF mj = 1 THEN CHAIN "MORE.BAS" np VAL(MID$(F$, 2, 1)) * 100 mp = VAL(MID$(F$, 3, 1)) nj 9: njl = nj — 1 OPEN ”C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".038" FOR INPUT As #1 MaxT = nj * (njl) / 2 DIM d(MaxT, 1), t1(MaxT), t2(MaxT), r%(mj), s%(mj), N(MaxT, MaxT) teams = O CLS LOCATE 24, 1: PRINT F$, "OLS" LOCATE 1, 1: PRINT "Loading observed data (out of"; np; ") :"; FOR i = 1 TO np FOR j = 1 TO mj INPUT #1, r%(j), s%(j) NEXT j 'Sort by rater number FOR jl = 1 TO mj - 1 FOR j2 = jl + 1 TO mj IF r%(j1) > r%(j2) THEN SWAP r%(j1), r%(j2): SWAP s%(j1), s%(j2) NEXT j2, jl 119 FOR j1 = 1 TO mj — 1 FOR j2 = jl + 1 TO mj flag = 0 FOR T = 1 TO teams 'see if this is an existing team IF t1(T) = r%(jl) AND t2(T) = r%(j2) THEN d(T, 1) = d(T, 1) + s%(j1) — s%(j2) N(T, T) = N(T, T) + 1 flag = 1 EXIT FOR END IF NEXT T IF flag = 0 THEN 'create a new team teams = teams + 1 T = teams d(T, l) = d(T, 1) + s%(jl) - s%(j2) N(T, T) = N(T, T) + 1 t1(T) = r%(jl) t2(T) = r%(j2) END IF NEXT j2, jl LOCATE 1, 55: PRINT i NEXT 1 CLOSE 1 K = teams DIM a(K, njl), Atrans(njl, K), Theta(nj, l), NA(K, njl) DIM AtNA(njl, njl), AtNAi(nj1, njl), AtND(njl, 1), Nd(K, K) FOR i = 1 TO K IF t1(i) < nj THEN a(i, t1(i)) = a(i, t1(i)) + 1 ELSE FOR j = 1 TO njl a(i, j) = a(ir j) ‘ 1 NEXT j END IF IF t2(i) < nj THEN a(i, t2(i)) = a(i, t2(i)) + l ELSE FOR j = 1 TO njl a(il j) = a(ir j) _ 1 NEXT j END IF d(i, 1) = d(i, 1) / N(i, i) FOR j 1 TO njl Atrans(j, i) = a(ir 1) NEXT j NEXT 1 PRINT : PRINT "Computing Nd"; 120 matmult N(), K, K, d(), K, 1, Nd() PRINT : PRINT "Computing A'Nd"; matmult Atrans(), njl, K, Nd(), K, 1, AtND() PRINT : PRINT "Computing NA"; matmult N(), K, K, a(), K, njl, NA() PRINT : PRINT "Computing A'NA"; matmult Atrans(), njl, K, NA(), K, njl, AtNA() PRINT : PRINT "Computing (A'NA) inverse"; inverse AtNA(), njl, AtNAi() PRINT : PRINT "Computing Theta=(A'NA)inverse(A'Nd)"; matmult AtNAi(), njl, njl, AtND(), njl, 1, Theta() FOR j = 1 TO njl Theta(nj, 1) = Theta(nj, 1) - Theta(j, 1) NEXT j OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".038" FOR INPUT AS #1 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".OLs" FOR OUTPUT AS #2 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".TLs" FOR OUTPUT AS #3 PRINT : PRINT "Creating output files (out of"; np; ") :"; FOR i 
= 1 TO np scl = O: sc2 = 0 FOR j = 1 TO mj INPUT #1, r, 5 adj = s + Theta(r, 1) PRINT #2, r; adj; scl = scl + adj IF adj > mp THEN adj = mp ELSE IF adj < 0 THEN adj = O PRINT #3, r; adj; sc2 = s02 + adj NEXT j PRINT #2, scl / mj PRINT #3, sc2 / mj LOCATE 15, 55: PRINT 1 NEXT 1 CLOSE CHAIN "RASCHEXT.BAS" SUB inverse (x(), N, b()) lin = CSRLIN LOCATE lin, 40: PRINT ”(out Of"; N; ") :”; ' Create b(), identity matrix DIM a(N, N) FOR r = 1 TO N: FOR c IF r = c THEN b(r, c) NEXT 0, r 1 TO N: a(r, c) = x(r, c) 1 ELSE b(r, c) = 0 FOR row = 1 TO N 'Make diagonal element 1 121 d row) IF d 0 THEN 'Switch with another row flag 0: row2 row WHILE flag = O row2 row2 + 1 IF row2 > N THEN PRINT "no inverse exists": IF a(row2, row) <> 0 THEN FOR col 1 TO N SWAP a(row2, col), SWAP b(row2, col), NEXT col flag 1 d a(row, a(row, col) col) a(row, b(row, row) END IF WEND END IF FOR col 1 TO N a(row, col) = a(row, col) / d b(row, col) - b(row, col) / d NEXT col 'Subtract multiples of ROW to get zeros in other positions FOR 1 1 TO N IF i <> row THEN m = a(i, row) FOR j = 1 TO N a(i, j) = a(i, j) - m * a(row, j) b(il j) = b(j—r j) — m * b(row, j) NEXT j END IF NEXT i LOCATE lin, 55: PRINT row NEXT row END SUB SUB matmult (a(), m, N, b(), O, p, c()) BEEP: STOP IF N <> 0 THEN PRINT "Matrices not compatible——multiplication fails": STOP lin CSRLIN LOCATE lin, 40: PRINT "(out of"; m; FOR 1 1 TO m FOR j 1 TO p C(i, j) 0 FOR K = 1 TO N C(ir j) C(i, NEXT K, j LOCATE lin, NEXT i H) = j) + a(i, K) * b(K, j) 55: PRINT i 122 END SUB ' RASCHEXT.BAS ' RASCHEXT is the eighth program in the series. ' RASCHEXT follows OLS.BAS and precedes PCM.BAS ' RASCHEXT does Rasch extension equating, storing adjusted scores in ' C:\THESIS\SIMDATA\(F$)\(F$).RAS in the usual format ' rater; rater's adjusted score; ...; overall adjusted score ' The program follows the algorithm in de Gruijter (1984). 
' RASCHEXT.BAS
' RASCHEXT is the eighth program in the series.
' RASCHEXT follows OLS.BAS and precedes PCM.BAS
' RASCHEXT does Rasch extension equating, storing adjusted scores in
' C:\THESIS\SIMDATA\(F$)\(F$).RAS in the usual format
' rater; rater's adjusted score; ...; overall adjusted score
' The program follows the algorithm in de Gruijter (1984).
DECLARE SUB inverse (x!(), N!, b!())
DECLARE SUB matmult (a!(), m!, N!, b!(), o!, p!, c!())
COMMON F$
mj = VAL(MID$(F$, 1, 1))
IF mj = 0 THEN mj = 2
IF mj = 1 THEN
PRINT "Program terminates--"
PRINT "other methods require at least 2 raters": END
END IF
np = VAL(MID$(F$, 2, 1)) * 100
mp = VAL(MID$(F$, 3, 1))
nj = 9: njl = nj - 1
OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".OBS" FOR INPUT AS #1
MaxT = nj * njl / 2
DIM d(MaxT, 1), t1(MaxT), t2(MaxT), r%(mj), s%(mj), N(MaxT, MaxT), den(MaxT), num(MaxT)
teams = 0
CLS : LOCATE 24, 1: PRINT F$, "RASCH"
LOCATE 1, 1: PRINT "Loading observed data (out of"; np; ") :";
FOR i = 1 TO np
FOR j = 1 TO mj
INPUT #1, r%(j), s%(j)
NEXT j
'Sort by rater number
FOR j1 = 1 TO mj - 1
FOR j2 = j1 + 1 TO mj
IF r%(j1) > r%(j2) THEN SWAP r%(j1), r%(j2): SWAP s%(j1), s%(j2)
NEXT j2, j1
FOR j1 = 1 TO mj - 1
FOR j2 = j1 + 1 TO mj
flag = 0
FOR T = 1 TO teams 'see if this is an existing team
IF t1(T) = r%(j1) AND t2(T) = r%(j2) THEN
den(T) = den(T) + s%(j1) * (mp - s%(j2))
num(T) = num(T) + s%(j2) * (mp - s%(j1))
N(T, T) = N(T, T) + 1
flag = 1
EXIT FOR
END IF
NEXT T
IF flag = 0 THEN 'create a new team
teams = teams + 1
T = teams
den(T) = den(T) + s%(j1) * (mp - s%(j2))
num(T) = num(T) + s%(j2) * (mp - s%(j1))
N(T, T) = N(T, T) + 1
t1(T) = r%(j1)
t2(T) = r%(j2)
END IF
NEXT j2, j1
LOCATE 1, 55: PRINT i
NEXT i
CLOSE 1
K = teams
DIM a(K, njl), Atrans(njl, K), BHat(nj, 1), NA(K, njl)
DIM AtNA(njl, njl), AtNAi(njl, njl), AtND(njl, 1), Nd(K, 1)
FOR i = 1 TO K
IF num(i) = 0 OR den(i) = 0 THEN
d(i, 1) = 0
ELSE
d(i, 1) = LOG(num(i) / den(i))
END IF
IF t1(i) < nj THEN
a(i, t1(i)) = a(i, t1(i)) + 1
ELSE
FOR j = 1 TO njl
a(i, j) = a(i, j) - 1
NEXT j
END IF
IF t2(i) < nj THEN
a(i, t2(i)) = a(i, t2(i)) + 1
ELSE
FOR j = 1 TO njl
a(i, j) = a(i, j) - 1
NEXT j
END IF
FOR j = 1 TO njl
Atrans(j, i) = a(i, j)
NEXT j
NEXT i
PRINT : PRINT "Computing Nd";
matmult N(), K, K, d(), K, 1, Nd()
PRINT : PRINT "Computing A'Nd";
matmult Atrans(), njl, K, Nd(), K, 1, AtND()
PRINT : PRINT "Computing NA";
matmult N(), K, K, a(), K, njl, NA()
PRINT : PRINT "Computing A'NA";
matmult Atrans(), njl, K, NA(), K, njl, AtNA()
PRINT : PRINT "Computing (A'NA) inverse";
inverse AtNA(), njl, AtNAi()
PRINT : PRINT "Computing BHat=(A'NA)inverse(A'd)";
matmult AtNAi(), njl, njl, AtND(), njl, 1, BHat()
FOR j = 1 TO njl
BHat(nj, 1) = BHat(nj, 1) - BHat(j, 1)
NEXT j
OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".OBS" FOR INPUT AS #1
OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".RAS" FOR OUTPUT AS #2
PRINT : PRINT "Creating output files (out of"; np; ") :";
FOR i = 1 TO np
sc = 0
FOR j = 1 TO mj
INPUT #1, r, s
adj = EXP(BHat(r, 1)) * s / (1 - s / mp * (1 - EXP(BHat(r, 1))))
PRINT #2, r; adj;
sc = sc + adj
NEXT j
PRINT #2, sc / mj
LOCATE 15, 55: PRINT i
NEXT i
CLOSE
CHAIN "again.BAS"

SUB inverse (x(), N, b())
lin = CSRLIN
LOCATE lin, 40: PRINT "(out of"; N; ") :";
' Create b(), identity matrix
DIM a(N, N)
FOR r = 1 TO N: FOR c = 1 TO N: a(r, c) = x(r, c)
IF r = c THEN b(r, c) = 1 ELSE b(r, c) = 0
NEXT c, r
FOR row = 1 TO N
'Make diagonal element 1
d = a(row, row)
IF d = 0 THEN 'Switch with another row
flag = 0: row2 = row
WHILE flag = 0
row2 = row2 + 1
IF row2 > N THEN PRINT "no inverse exists": BEEP: STOP
IF a(row2, row) <> 0 THEN
FOR col = 1 TO N
SWAP a(row2, col), a(row, col)
SWAP b(row2, col), b(row, col)
NEXT col
flag = 1
d = a(row, row)
END IF
WEND
END IF
FOR col = 1 TO N
a(row, col) = a(row, col) / d
b(row, col) = b(row, col) / d
NEXT col
'Subtract multiples of ROW to get zeros in other positions
FOR i = 1 TO N
IF i <> row THEN
m = a(i, row)
FOR j = 1 TO N
a(i, j) = a(i, j) - m * a(row, j)
b(i, j) = b(i, j) - m * b(row, j)
NEXT j
END IF
NEXT i
LOCATE lin, 55: PRINT row
NEXT row
END SUB

SUB matmult (a(), m, N, b(), o, p, c())
IF N <> o THEN PRINT "Matrices not compatible--multiplication fails": STOP
lin = CSRLIN
LOCATE lin, 40: PRINT "(out of"; m; ") :"
FOR i = 1 TO m
FOR j = 1 TO p
c(i, j) = 0
FOR K = 1 TO N
c(i, j) = c(i, j) + a(i, K) * b(K, j)
NEXT K, j
LOCATE lin, 55: PRINT i
NEXT i
END SUB
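The adjustment applied in RASCHEXT.BAS, adj = EXP(b) * s / (1 - s / mp * (1 - EXP(b))), multiplies the score odds s / (mp - s) by EXP(b), so raw scores of 0 and mp are left at the endpoints while intermediate scores are pulled up or down according to the sign of b. The short program below is not part of the numbered series; it simply tabulates the transformation for a few hypothetical rater parameters.

' RASDEMO.BAS -- illustrative only, not one of the thesis programs.
' Tabulates the Rasch extension adjustment
'     adj = EXP(b) * s / (1 - s / mp * (1 - EXP(b)))
' for a few hypothetical rater parameters b.  The formula multiplies
' the score odds s/(mp - s) by EXP(b), so 0 and mp map to themselves.
mp = 5
DIM b(3)
b(1) = -.5: b(2) = 0: b(3) = .5   'hypothetical severity parameters
PRINT "raw", "b=-.5", "b=0", "b=.5"
FOR s = 0 TO mp
PRINT s,
FOR K = 1 TO 3
adj = EXP(b(K)) * s / (1 - s / mp * (1 - EXP(b(K))))
PRINT adj,
NEXT K
PRINT
NEXT s

With b = 0 the middle column reproduces the raw scores, which is a convenient check on the formula.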
' PCM.BAS
' PCM is the ninth program in the series.
' PCM follows RASCHEXT.BAS and precedes MORE.BAS.
' PCM does partial credit model adjustment, storing adjusted scores in
' C:\THESIS\SIMDATA\(F$)\(F$).PCM in the usual format
' rater; rater's adjusted score; ...; overall adjusted score
' The program follows the PAIR unconditional maximum likelihood
' estimation algorithm described in Wright and Masters (1982).
COMMON F$
eps = .002: MaxIt = 50: big = EXP(20)
mj = VAL(MID$(F$, 1, 1))
IF mj = 0 THEN mj = 2
IF mj = 1 THEN
PRINT "Program terminates--"
PRINT "other methods require at least 2 raters": END
END IF
np = VAL(MID$(F$, 2, 1)) * 100
mp = VAL(MID$(F$, 3, 1))
nj = 9: njl = nj - 1
OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".OBS" FOR INPUT AS #1
DIM nk%(nj, mp, nj, mp), r%(np, mj), s%(np, mj), d(nj, mp), dt(nj, mp)
CLS : PRINT "PARTIAL CREDIT MODEL"
LOCATE 3, 1: PRINT "Loading observed data (out of"; np; ") :";
FOR k = 1 TO np
FOR j = 1 TO mj
INPUT #1, r%(k, j), s%(k, j)
NEXT j
FOR j1 = 1 TO mj: a = r%(k, j1): b = s%(k, j1)
FOR j2 = 1 TO mj: c = r%(k, j2): d = s%(k, j2)
nk%(a, b, c, d) = nk%(a, b, c, d) + 1
NEXT j2, j1
LOCATE 3, 55: PRINT k
NEXT k
CLOSE 1
LOCATE 11, 1: PRINT "Iterating rater parameters (out of"; nj; ") :"
DO
FOR i = 1 TO nj
FOR x = 1 TO mp: w = x - 1: F = 0: f1 = .01
FOR j = 1 TO nj
FOR z = 1 TO mp: y = z - 1
n1 = nk%(i, w, j, z): n2 = nk%(i, x, j, y): bign = n1 + n2
pi = 1 / (1 + EXP(d(i, x) - d(j, z))): F = F + bign * pi - n2
f1 = f1 + bign * pi * (1 - pi)
NEXT z, j
a = d(i, x) + F / f1
IF a < -15 THEN
dt(i, x) = -15
ELSEIF a > 15 THEN
dt(i, x) = 15
ELSE
dt(i, x) = a
END IF
NEXT x
LOCATE 11, 40: PRINT i
NEXT i
LOCATE 24, 1: PRINT F$;
max = 0: ave = 0
FOR i = 1 TO nj
FOR x = 1 TO mp
dif = ABS(dt(i, x) - d(i, x)): IF dif > max THEN max = dif
ave = ave + dif: d(i, x) = dt(i, x)
NEXT x, i
LOCATE 7, 20: PRINT "Maximum parameter shift was"; max
LOCATE 9, 20: PRINT "Average parameter shift was"; ave / nj / mp
v = v + 1
LOCATE 5, 20: PRINT "Iteration : "; v
LOOP UNTIL max < eps OR v = MaxIt
OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".PCM" FOR OUTPUT AS #2
LOCATE 13, 1: PRINT "Loading adjusted scores (out of"; np; ") :"
FOR xi = 1 TO np
CEN = 0: del = 5: a = 0
DO
FOR i = 1 TO 3
u(i) = CEN + (i - 2) * del
p(i) = 1
FOR m = 1 TO mj
e(0) = 1: T = 0
FOR j = 1 TO mp
e(j) = e(j - 1) * EXP(u(i) - d(r%(xi, m), j))
IF e(j) > big THEN FOR L = 0 TO j: e(L) = e(L) / big: NEXT L
NEXT j: FOR L = 0 TO mp: T = T + e(L): NEXT L
p(i) = p(i) * e(s%(xi, m)) / T
NEXT m, i
IF ABS(CEN) > 14 THEN EXIT DO
IF p(1) < p(2) AND p(2) < p(3) THEN
CEN = u(3): IF a = 0 OR a = 3 THEN a = 3 ELSE a = 2: del = del / 2
ELSEIF p(1) > p(2) AND p(2) > p(3) THEN
CEN = u(1): IF a = 0 OR a = 1 THEN a = 1 ELSE a = 2: del = del / 2
ELSE
a = 2: del = del / 2
END IF
LOOP UNTIL del < eps / 2
' At this point, CEN is the paper quality estimate for the xi-th paper.
' Use CEN to compute the expected score over all raters as well as
' the expected score from the actual raters based on the parameters.
ave = 0
FOR i = 1 TO nj
e(0) = 1: T = 0
FOR j = 1 TO mp
e(j) = e(j - 1) * EXP(CEN - d(i, j))
IF e(j) > big THEN FOR L = 0 TO j: e(L) = e(L) / big: NEXT L
NEXT j
FOR L = 0 TO mp: T = T + e(L): NEXT L
sc(i) = 0: FOR L = 1 TO mp: sc(i) = sc(i) + e(L) / T * L: NEXT L
ave = ave + sc(i)
NEXT i
FOR j = 1 TO mj: r = r%(xi, j)
PRINT #2, r; sc(r);
NEXT j
PRINT #2, ave / nj
LOCATE 13, 55: PRINT xi
NEXT xi
CLOSE
CHAIN "MORE.BAS"
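The closing loop of PCM.BAS turns the estimated paper quality CEN into an expected score for each rater: under the partial credit model, category j receives weight exp(j*CEN - (d(i,1) + ... + d(i,j))), and the expected score is the weight-averaged category. The stand-alone sketch below is not one of the thesis programs; it evaluates that expectation for one hypothetical rater whose step parameters are invented for illustration.

' PCMDEMO.BAS -- illustrative only, not one of the thesis programs.
' Expected score under the partial credit model for one hypothetical
' rater with step parameters d(1..mp), evaluated at several paper
' quality values, mirroring the final loop of PCM.BAS.
mp = 4
DIM d(mp), e(mp)
d(1) = -1.5: d(2) = -.5: d(3) = .5: d(4) = 1.5   'hypothetical steps
FOR CEN = -2 TO 2             'paper quality values to try
e(0) = 1: T = 0
FOR j = 1 TO mp
e(j) = e(j - 1) * EXP(CEN - d(j))   'unnormalized category weight
NEXT j
FOR L = 0 TO mp: T = T + e(L): NEXT L
sc = 0
FOR L = 1 TO mp: sc = sc + e(L) / T * L: NEXT L
PRINT "quality"; CEN; "-> expected score"; sc
NEXT CEN

Raising the quality value moves the expected score toward mp, which is the behavior the adjustment relies on.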
' MORE.BAS is a program that automates the process of moving from one
' simulated data set to the next. It bypasses RATER.BAS
COMMON F$
IF F$ = "1152" THEN F$ = "1153": CHAIN "PARGEN.BAS"
IF F$ = "1153" THEN F$ = "1191": CHAIN "PARGEN.BAS"
IF F$ = "1191" THEN F$ = "1192": CHAIN "PARGEN.BAS"
IF F$ = "1192" THEN F$ = "1193": CHAIN "PARGEN.BAS"
IF F$ = "1193" THEN F$ = "1551": CHAIN "PARGEN.BAS"
IF F$ = "1551" THEN F$ = "1591": CHAIN "PARGEN.BAS"
IF F$ = "1591" THEN F$ = "2191": CHAIN "PARGEN.BAS"
IF F$ = "2191" THEN F$ = "2192": CHAIN "PARGEN.BAS"
IF F$ = "2192" THEN F$ = "2193": CHAIN "PARGEN.BAS"
IF F$ = "2193" THEN F$ = "2551": CHAIN "PARGEN.BAS"
IF F$ = "2551" THEN F$ = "2591": CHAIN "PARGEN.BAS"
IF F$ = "2591" THEN F$ = "3191": CHAIN "PARGEN.BAS"
IF F$ = "3191" THEN F$ = "3192": CHAIN "PARGEN.BAS"
IF F$ = "3192" THEN F$ = "3193": CHAIN "PARGEN.BAS"
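Throughout the series the data set name F$ doubles as a description of the simulation condition: the first digit is the number of raters per paper (a 0 is read as 2), the second digit times 100 is the number of papers, and the third digit is the top point of the score scale; every program assumes a pool of nj = 9 raters. The fourth digit distinguishes data sets within a condition and is not decoded in the listings above. The snippet below is not part of the series; it only makes that decoding explicit for one of the codes chained through by MORE.BAS.

' FCODE.BAS -- illustrative only, not one of the thesis programs.
' Decodes a simulation data set name the way the analysis programs do.
F$ = "2193"                  'one of the codes used in MORE.BAS
mj = VAL(MID$(F$, 1, 1))     'raters per paper
IF mj = 0 THEN mj = 2
np = VAL(MID$(F$, 2, 1)) * 100   'number of papers
mp = VAL(MID$(F$, 3, 1))     'maximum score-scale point
nj = 9                       'raters in the pool (fixed)
PRINT "Data set "; F$
PRINT "  raters per paper:"; mj
PRINT "  papers:"; np
PRINT "  scale points: 0 to"; mp
PRINT "  rater pool size:"; nj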
APPENDIX B

DISTRICT WRITING ASSESSMENT DATA

WRITING ASSIGNMENT

Implementation

All students were given the same instructions, and essentially the same rules applied to students at all grade levels. The instructions provided were as follows:

Write a paper to your principal stating one rule you'd like to have changed, modified or strengthened in your building. State your idea, reasons for the change and how the school would be improved by this change. The purpose of this writing sample is to assess your writing skills. All students in your grade will be writing on this or a similar topic. It is important that you do your best.

General Rules

1. Place your name on the upper right hand corner of your paper. Do not write on this sheet.
2. Dictionaries are available for your use.
3. Remember to include an introduction, body and conclusion which would convince your principal to make a change.
4. Do not write this as a letter.

APPENDIX C

DISTRICT WRITING ASSESSMENT DATA

MODIFIED HOLISTIC SCORING CRITERIA

5 point = Very Good
The paper is creative, well organized with a good command of vocabulary. The paper clearly develops a topic from beginning, to middle, to end. Ideas are supported with details and flow smoothly. Errors in sentence structure and mechanics may be present but they do not detract from the overall impression of the paper.

4 point = Good
The paper shows some creativity, organization is evident, with fairly good command of vocabulary. The paper develops a topic from beginning, to middle, to end. Ideas are generally supported with details for a minimum interruption of flow. Errors in sentence structure and mechanics may be present but they do not substantially detract from the overall impression of the paper.

3 point = Adequate
The paper has deficiencies but demonstrates enough overall strengths in sufficient degree to be judged competent. A deficiency(s) may be found in:
-- organization
-- command of vocabulary
-- supportive details
-- smooth flow from one idea to another
-- development of topic
-- sentence structure
-- mechanics

2 point = Inadequate
This paper is disorganized and has limited vocabulary. The topic is addressed but is poorly developed and may lack a beginning, middle, or end. Ideas are poorly supported with few details. The paper does not flow smoothly. Errors in sentence structure and mechanics are frequent enough to detract from the overall impression of the composition.

1 point = Poor
This paper has an absence of organization and poor vocabulary. The topic is addressed but not developed. Ideas are not supported with detail. The paper does not flow. The errors in sentence structure and mechanics are frequent and serious enough to detract substantially from the overall impression of the composition.

0 point = Not Acceptable
The paper is illegible or totally unrelated to the topic.