A COMPARISON OF RATER CALIBRATION METHODS

BY

George Stephen Denny

A DISSERTATION

Submitted to Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

1990

ABSTRACT

A COMPARISON OF RATER CALIBRATION METHODS

BY

George Stephen Denny

When the quality of any performance is measured with human judgment, the score assigned depends both on the quality of the product and on how raters use the rating scale. As much as possible, idiosyncratic rater differences should be minimized. This study investigates a variety of methods of statistically adjusting raters' scores based on how their scoring compares to the scoring patterns of others. The ten methods compared were no equating (NO), mean equating (MN), truncated mean equating (TMN), linear equating (LI), truncated linear equating (TLI), equipercentile equating (EQP), ordinary least squares (OLS), truncated least squares (TLS), Rasch extension (RAS), and partial credit model (PCM).

Data were from a suburban school district's writing assessment and from a simulation based on the PCM. Simulated data varied in the number of raters per paper, the number of rating scale points, the total number of papers scored, and the distribution of paper quality. Simulated raters varied in the stringency and spread of their scores. With the real data, methods were compared based on the relative proximity of their adjusted scores and with respect to their effects on passing rates relative to a given cut-score. With the simulated data, adjusted scores were compared to the true scores expected from the model given the generating parameters. Differences among methods were measured by root mean squared error, correlation, and maximum score difference statistics.

In the real data sets, the methods produced adjusted scores that differed substantially from one another and from the raw scores. However, no judgment of which method worked best was possible because true scores for the papers were unknown. In the simulated data sets, the simpler methods (TMN and TLI) reproduced true scores well under all scoring conditions. EQP did well in data sets with more rating scale points. PCM did well in data sets with many papers and raters and few scale points. OLS, TLS, and RAS generally did worse than no equating. In assessment settings where raters differ in stringency and papers are randomly assigned, TMN and TLI are recommended for statistical adjustment of scores.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
CHAPTER ONE
  Controlling for Rater Differences
    Training
    Multiple Raters
    Statistical Adjustment
    When Adjustment is Not Recommended
  Methods of Score Adjustment
    No Equating (NO)
    Mean Equating (MN)
    Linear Equating (LI)
    Equipercentile Equating (EQP)
    Ordinary Least Squares (OLS)
    Rasch Extension (RAS)
    Partial Credit Model (PCM)
    Other Adjustment Methods
      Truncated mean equating (TMN)
      Truncated linear equating (TLI)
      Equipercentile equating with smoothing
      The Rasch Rating Scale Model
      Rater Response Theory
    Summary of Adjustment Methods
CHAPTER TWO: A REVIEW OF PREVIOUS STUDIES
  Contexts Requiring Performance Assessment
    Job Performance
    Large-scale Testing
  The Problem of Unreliability
  Research Studies on Rater Effects
    Early Studies
    Paul
    de Gruijter
    Cason and Cason
    Braun
    Wilson
    Houston, Raymond, Svec, and Webb
    Lunz, Linacre, and Wright
    Denny
  Summary
CHAPTER THREE: METHOD
  Methods Compared
    No equating
    Mean equating
    Linear equating
    Equipercentile equating
    OLS
    Rasch extension
    PCM
  Data Sets
    Real Data
    Simulated Data
  Criteria for Comparing the Methods
CHAPTER FOUR: RESULTS
  Writing Assessment Data
    RMSDs Compared
    Passing Rates
  Simulated Data
    Root Mean Square Errors (RMSEs)
      Overall
      Number of Raters Per Paper
      Number of Papers Scored
      Number of Rating Scale Points
      Paper Quality Distribution
      Rater Type
    Correlations
    Maximum Difference
    One-Rater Data Sets
CHAPTER FIVE: DISCUSSION
  Summary by Method
    No equating (NO)
    Mean Equating (MN) and Truncated Mean Equating (TMN)
    Linear Equating (LI) and Truncated Linear Equating (TLI)
    Equipercentile equating (EQP)
    OLS, TLS, and Rasch Extension
    Supplemental Analysis
    Partial Credit Model
  Recommendations
    Better Models of Rater Scoring
    Better True Score Estimates for Real Data
    Implications for Practice
EPILOGUE
References
APPENDIX A: PROGRAMS FOR GENERATING AND ANALYZING SIMULATED DATA SETS
  RATER.BAS
  PARGEN.BAS
  TRUEGEN.BAS
  OBSGEN.BAS
  EZEQ.BAS
  EQPEQ.BAS
  OLS.BAS
  RASCHEXT.BAS
  PCM.BAS
APPENDIX B: DISTRICT WRITING ASSESSMENT DATA: WRITING ASSIGNMENT
APPENDIX C: DISTRICT WRITING ASSESSMENT DATA: MODIFIED HOLISTIC SCORING CRITERIA

LIST OF TABLES

Table 1: Comparison of Adjustment Methods
Table 2: Frequencies of Scores Assigned by Raters in the Real Data Set
Table 3: Stringency Parameters for Each Rater in the Simulated Data Sets
Table 4A: Method Comparisons for the Writing Assessment Data, GRADE 5
Table 4B: Method Comparisons for the Writing Assessment Data, GRADE 8
Table 4C: Method Comparisons for the Writing Assessment Data, GRADE 11
Table 5: Pass/Fail Decisions for Adjusted Scores Relative to Unadjusted Scores for the Writing Data
Table 6: Scoring Frequencies for One Simulated Data Set (2+551)
Table 7: Overall RMSEs for the Simulated Data
Table 8: Overall RMSEs Averaged by Facet (1-Rater Data Sets Omitted)
Table 9: RMSEs by Rater Type by Facet
Table 10: RMSEs by Rater Spread by Facet
Table 11: Correlations for the Simulated Data Sets
Table 12: Maximum Score Difference Observed - True for All Simulated Data Sets and Each Method
Table 13: Maximum Differences Averaged Across Methods (1-Rater Data Sets Omitted)
Table 14: Average RMSEs for the One-Rater Simulated Data Sets
Table 15: Supplemental Data Set Simulated From a Linear Model

LIST OF FIGURES

Figure 1: A Graphical Demonstration of Mean Equating
Figure 2: A Graphical Demonstration of Equipercentile Equating
Figure 3: Expected Score Functions with the Rasch Extension
Figure 4: Probability Curves for the Partial Credit Model
Figure 5: Expected Score Function for the Partial Credit Model
Figure 6: Graphical Representation of a Rater Characteristic Curve
Figure 7: Graph of Average RMSDs for the Grade 5 Data
Figure 8: Graph of Average RMSDs for the Grade 8 Data
Figure 9: Graph of Average RMSDs for the Grade 11 Data
Figure 10: Graph of Average RMSDs for the Three Grades Combined

LIST OF ABBREVIATIONS

EQP: Equipercentile equating
EXP: Exponential function, base e = 2.718...
GLS: Generalized least squares
IMPUTE: Imputation of scores for missing rater/paper combinations
LI: Linear equating
LN: Natural logarithm function, base e = 2.718...
MN: Mean equating
NO: No equating
NP: Number of papers
NR: Number of raters
NS: Number of scale points
OLS: Ordinary least squares
PCM: Partial Credit Model
RAS: Rasch extension
RCC: Rater characteristic curve
RMSD: Root mean squared difference
RMSE: Root mean squared error
RRS: Rasch rating scale
RRT: Rater response theory
SD: Standard deviation
TLI: Truncated linear equating
TMN: Truncated mean equating
WLS: Weighted least squares

INTRODUCTION

May 9

Ivan and Anita are both seniors in high school. Each has completed all graduation requirements except one--passing the writing component of the competency test. Today is their final opportunity to pass the test in time to graduate with their class. To pass the test, Ivan and Anita must write essays that receive a total of 7 points from two raters scoring on a 5-point scale. Both are somewhat nervous and have never done especially well on writing assignments. Both write essays of borderline quality.

May 11

A team of English teachers trained in holistic scoring rate the essays. Mrs. Redpen and Mr. Markov read Ivan's paper and each teacher rates it a 3. Ivan's total is 6; he fails the competency requirement and does not receive a high school diploma. Miss Dove and Mr. Laxer score Anita's essay, and each teacher rates it a 4. Anita's total is 8; she passes the writing requirement and graduates with her classmates.

July 3

A subsequent analysis of the essay scoring reveals that raters differed significantly in the level of scores they assigned. In fact, the two most stringent raters, Redpen and Markov, assigned average scores one point lower than the average score over the entire team of raters. The two most lenient raters, Dove and Laxer, assigned scores that averaged one point higher than the overall average.

August 12

Ivan's parents become aware of the results of this study and file suit to force the district to award Ivan a diploma. Their attorney makes the case that Ivan was failed only because of the luck of the draw in who rated his paper. The attorney argues that "if Ivan's paper had been scored by lenient or even average difficulty raters, he would not have been denied a diploma. The other graduation tests in the district are constructed so that various forms of the test are of equal difficulty. In the same way, scoring on the writing test should be made as equitable as possible." The school district superintendent claims the scoring system was satisfactory, because using statistics to adjust essay test scores is not standard educational practice. Besides, she argues, "if statistics are used to adjust some scores higher, then they should also be used to make other scores lower. Another student whose paper was scored by lenient raters would not have been allowed to graduate if scores were adjusted. Therefore, the current system is fair."

If you were the judge hearing the case, how would you rule?

CHAPTER ONE

Many forms of assessment used in education require human judgment. Of the three R's, reading and arithmetic achievement are measured extensively with multiple-choice tests while writing achievement is more often assessed by ratings from trained judges. In other contexts, such as artistic or musical performance, evaluations are totally dependent on human judgment. When an instructional objective requires that a student produce original work, the quality of that work is measured by the ratings of evaluators. A problem with human judgment is lack of consistency--the same performance can receive different scores depending on who assigned the ratings.
For scores to be meaningful and fair, differences among raters should be minimized. The focus throughout this study is on writing assessment and the terminology reflects that focus. Performances are referred to as "papers" which vary in their degree of "quality". The principle of minimizing individual differences among raters in how they use a rating scale can be generalized to many other contexts.

Controlling for Rater Differences

Coffman (1971) listed three ways in which raters differ:
(a) in the level of scores they assign: some raters are more lenient or more stringent than others,
(b) in the spread of scores they assign: some raters use the extreme values on the scale more than do others, and
(c) in idiosyncratic ways: raters differ in how they weight various aspects of a piece of writing as they assign an overall score to it.

Training

Training represents one attempt to minimize rater differences. A typical training session consists of two parts. First, all raters are presented with clear descriptions of what papers are like at each point on the rating scale. Second, raters practice scoring papers of known quality as determined by experienced raters (anchor papers), and discuss as a group why they assigned particular ratings to specific papers. A rating supervisor can follow up this training with activities designed to monitor or maintain rating practice, such as periodically assigning anchor papers for scoring, or watching for raters who consistently assign scores that are lower or higher than other raters, and then giving additional training as needed.

Multiple Raters

Another method that lessens rater differences involves using multiple raters. When more than one rater scores a paper and then the ratings are averaged, the effect of any one unusual rating on the total score is diminished. A common practice is to randomly assign two raters to a paper. If their scores differ by more than one scale unit, a third rater also scores the paper and the most discrepant score of the three is omitted. Increasing the number of raters will increase the reliability of the score assigned to any paper. Ideally, every rater would read every paper, but the excessive cost and the law of diminishing returns make such scoring prohibitively inefficient. Using two raters for all papers, and a third only if scores are discrepant, is generally viewed as a cost-effective compromise for increasing reliability.

Doubling the number of raters will double the cost of scoring, but will not double the reliability of scores. The effect on reliability of doubling the number of raters can be estimated directly from the Spearman-Brown Prophecy formula. If a measure with reliability rxx is replaced with an equivalent measure having K times as many observations, the new reliability Rxx is given by

Rxx = K·rxx / (1 + (K - 1)·rxx).

For example, if with two raters per paper scores had reliability .70, then with four raters (K = 2) reliability would be Rxx = 2(.70) / (1 + (2 - 1)(.70)) = .82. Going from two raters per paper to four increased reliability only .12 points. Going from four raters per paper to eight would increase reliability even less.

Statistical Adjustment

Even when raters are trained and monitored, and even when discrepant scores are omitted, raters still differ in how they assign scores to papers. Another method to reduce rater differences is to statistically adjust each rater's scores to compensate for these idiosyncratic differences.
This type of statistical adjustment of scores parallels the theory of test equating across multiple forms of the same test. Tests are equivalent if an examinee is expected to receive the same score regardless of which test form is used. To make tests equivalent, scores on each form of a test are adjusted so that scores on any one form have the same meaning as scores on other forms of the test. 6 There are various designs for equating test scores across different test forms. One equating design is based on equivalent groups. In this type of design, a large group of examinees is randomly divided into groups and each group is given a different form of the test. With random assignment into large groups, the average ability levels of the groups are nearly equal. Any group differences in test performance are assumed to be due to differences in the difficulty of the test forms. To compensate for these differences, scores on the more difficult forms are raised and scores on easier forms are lowered to make the scores across different forms equivalent. Scoring with rating scales presents a parallel situation. When adjusting for different raters, the raters are viewed as different forms of a test. If a large number of papers are randomly assigned to raters for scoring, then the scoring pattern for each rater should be about the same. When evidence suggests that raters vary in how they assign scores, some type of statistical adjustment may be appropriate. A given paper should have the same expected score irrespective of which raters assign the scores. When Adjustment is Not Recommended Before discussing various types of statistical adjustment, it is important to note the following situations in which it would not be appropriate or necessary to adjust raters' scores: 1. If the differences between raters' scores are not statistically significant, no adjustment is necessary. To test for mean score differences, use an F—test in an analysis of variance. A homogeneity of variance test (available as an option to the analysis of variance procedure in some statistical software) tests whether the raters differ significantly in how spread out their scores are. A chi— square test can determine differences in rater scoring patterns at any point on the rating scale. A Kolmogorov-Smirnov test does the same but with slightly more power because it takes advantage of the ordinal nature of the rating scale. When raters read only a few papers, sampling error can be confounded with rater stringency. For example, if someone only rated 5 papers and 2 of those were exceptionally poor, the rater could appear to be unusually stringent because of the small sample size. By only adjusting scores when differences among raters achieve statistical significance, spurious rater differences resulting from small sample sizes can be avoided. In large—scale assessments employing multiple raters, there is typically no problem with small sample sizes. 2. If papers are not randomly assigned to raters, one would not expect that raters would assign scores the same way. For example, if rater A graded a set of papers from an honors class and rater B graded a set from a regular class, rater A should not be considered lenient despite assigning higher scores than rater B, because the honors class papers are likely better than those of the regular class. The simpler equating methods ignore group differences, whether they are due to sampling error or to non—random assignment. 
The more complex methods require two or more raters per paper and take into account not only the scores a rater assigned to a set of papers, but also the scores that other raters assigned to those papers. These more sophisticated methods control for sampling error and non-random assignment.

3. If all raters score all papers, no adjustment is necessary. Any rater effects should affect all scores equally. When only two raters read each paper, those papers that happen to be scored by the two most stringent raters will be at a disadvantage compared to those scored by the two most lenient raters. Although scores in general will be fairly accurate with two raters, some individuals will receive scores that fail to reflect the true quality of their work. Rater adjustment would alleviate this problem.

4. If the uses made of the scores are of little importance, then rater calibration is unnecessary. For example, if a district wanted to know whether writing was improving in the district and only aggregate scores were used, then scores at the individual level need not be adjusted for rater effects. On the other hand, if a minimal writing score has to be obtained before a high school diploma is granted, then scores at the individual level must be as fair and equitable as possible, and some type of adjustment is appropriate.

5. In some political climates, adjustment of scores may not be acceptable. However, most people understand that some raters are tougher than others and would accept the need for some type of adjustment. Of course, individuals who have their own scores lowered might be less accepting of score adjustment.

Methods of Score Adjustment

Many methods, varying in their complexity and accuracy, have been suggested for adjusting rater scores. Some of these methods are discussed briefly here, then are described more fully in Chapter 3.

No Equating (NO)

This is the simplest method, where all scores are accepted at face value. When multiple raters score a paper, their scores are averaged or totalled to get the paper's score. This method is widely used, typically because one or more of the conditions listed above (e.g. random assignment, importance) fail to hold. When raters differ in how they assign scores, all rater scoring differences affect the scores papers actually receive.

Mean Equating (MN)

In this method, each rater's scores across all papers are averaged. Any rater who assigns a mean score lower than the overall mean score is considered stringent, and any rater who assigns a mean score higher than the overall mean score is considered lenient. A stringent rater's scores are all shifted upward and a lenient rater's scores are all shifted downward, so that all raters have the same mean after adjustment. This method compensates for rater differences in score level, but not for differences in score spread or score distribution shape. It also assumes that all between-rater differences are due to rater stringency differences and not to mean differences in paper quality across raters.

Figure 1 illustrates mean equating on a simple data set consisting of only two raters. The first rater is stringent, assigning a mean score of 4; the second rater is lenient, assigning a mean score of 6 on a 9-point scale. Overall, the average score assigned is a 5, so the stringent rater's scores are increased by 1 and the lenient rater's scores are decreased by 1.
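The calibration programs used in this study were written in BASIC (Appendix A). Purely as an illustration of the mean-equating adjustment just described, a minimal Python sketch might look like the following; the rater labels and score lists are hypothetical, chosen so the two raters have means of 4 and 6 as in the Figure 1 example.

```python
# Illustrative sketch of mean equating (MN); rater names and scores are hypothetical.
from statistics import mean

# scores[rater] = list of scores that rater assigned (9-point scale, as in Figure 1)
scores = {
    "stringent": [2, 3, 4, 4, 5, 3, 4, 5, 4, 6],   # mean 4.0
    "lenient":   [5, 6, 6, 7, 5, 6, 7, 6, 5, 7],   # mean 6.0
}

overall_mean = mean(s for rater_scores in scores.values() for s in rater_scores)

adjusted = {}
for rater, vals in scores.items():
    shift = overall_mean - mean(vals)      # stringent raters shifted up, lenient down
    adjusted[rater] = [s + shift for s in vals]

# Truncated mean equating (TMN, described later) would additionally clip each
# adjusted score to the range of the rating scale, e.g. max(1, min(9, s)).
for rater, vals in adjusted.items():
    print(rater, round(mean(vals), 2), [round(v, 1) for v in vals])
```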
The distributions of adjusted scores have the same shape as the distributions of assigned scores, but have the overall mean.

[Figure 1: A Graphical Demonstration of Mean Equating. Histograms of the scores assigned by the stringent rater (mean 4), the lenient rater (mean 6), and both raters combined (mean 5), followed by the adjusted score distributions, each with mean 5.]

Linear Equating (LI)

This method considers both the mean and standard deviation (SD) of each rater's scores. Scores are adjusted linearly so that all raters have the overall mean and SD after adjustment. If the scores rater i assigns have a mean of mi and a SD of si, and over all papers and raters the mean is m and the SD is s, then a score xi is adjusted to yi where

yi = [(xi - mi)/si]·s + m.

Linear equating compensates for rater differences both in score level and score spread, but it also assumes that there are no differences in the distribution of paper qualities across raters.

Each of these three methods assumes that a one-unit difference in score has the same meaning throughout the rating scale. In reality, raters may have different standards for the level of performance required to achieve a particular score. For example, a rater might give many 4s but few 3s or 5s, compared to other raters. This scoring pattern suggests that for this rater, the quality of paper necessary to get a 4 is considerably less than the quality of paper required to merit a 5. This type of scoring pattern is local in scope, and only indirectly affects means and SDs, which are global parameters.

Equipercentile Equating (EQP)

This method takes into account differences in rater scoring at each level of the rating scale, and thus is slightly more complex than the methods already described. Overall, ratings have a cumulative frequency distribution. For example, on a 9-point scale, 2 percent of the scores may be 0, 7 percent are 0 or 1, 13 percent are 0 through 2, and so on, until 100 percent of the scores are in the range 0 through 9. Graphically, these points can be connected with line segments to describe an increasing function from the point (0,2) to the point (9,100). One such cumulative frequency distribution is illustrated by the lower line in Figure 2.

In equipercentile equating, a rater's scores are adjusted to the overall score with the same percentile rank, using linear interpolation as necessary. For example, if 23 percent of the scores rater A assigned are 3 or less, while overall only 19 percent are 3 or less and 27 percent are 4 or less, then a 3 from rater A would be adjusted to a 3.5, which represents the 23rd percentile over all scores. In terms of the graph in Figure 2, the upper line represents the cumulative frequency distribution for rater A. To transform a score from rater A, locate the score on the horizontal axis, move vertically to the line of rater A, then horizontally to the overall line, then back down to the horizontal axis to the point that represents the adjusted score.

[Figure 2: A Graphical Demonstration of Equipercentile Equating. Cumulative percentage is plotted against rating scale points (0 to 8), with the upper curve for rater A and the lower curve for all raters combined.]

Ordinary Least Squares (OLS)

Linear models, such as ordinary least squares (OLS) or weighted least squares (Wilson, 1988; Houston, Raymond, & Svec, 1990; de Gruijter, 1984), model rater effects by additive constants. The usual additive model is

yij = αi + δj + eij

where yij is the score given to paper i by rater j, αi is the true score of paper i, δj is the scoring bias for rater j, and eij is random error.
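As an illustration only (not the OLS.BAS program of Appendix A), the additive model can be fit by least squares using a design matrix of paper and rater dummy variables. The small data set below is hypothetical, and the sum-to-zero constraint on the rater effects is one common way to make the model identifiable; it is an assumption of this sketch, not a detail taken from the study.

```python
# Illustrative sketch: fit the additive model y_ij = alpha_i + delta_j + e_ij
# by ordinary least squares. Each entry is a (paper, rater, score) triple.
import numpy as np

ratings = [
    (0, 0, 3), (0, 1, 4),
    (1, 1, 2), (1, 2, 3),
    (2, 0, 5), (2, 2, 5),
    (3, 2, 1), (3, 0, 2),
]
n_papers, n_raters = 4, 3

# Design matrix: one dummy column per paper (alpha) and per rater (delta).
X = np.zeros((len(ratings), n_papers + n_raters))
y = np.zeros(len(ratings))
for row, (p, r, s) in enumerate(ratings):
    X[row, p] = 1.0
    X[row, n_papers + r] = 1.0
    y[row] = s

# The model is over-parameterized (adding a constant to every alpha and
# subtracting it from every delta changes nothing), so impose sum(delta) = 0
# by appending that equation as a pseudo-observation.
constraint = np.zeros(n_papers + n_raters)
constraint[n_papers:] = 1.0
X = np.vstack([X, constraint])
y = np.append(y, 0.0)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, delta = beta[:n_papers], beta[n_papers:]
print("estimated paper scores :", np.round(alpha, 2))
print("estimated rater biases :", np.round(delta, 2))
# A paper's adjusted score is its alpha, i.e. its score with rater bias removed.
```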
To estimate the parameters of this model, each paper must be scored by at least two raters. By estimating the rater effects parameters simultaneously, the assumption that all raters are scoring equivalent distributions of paper quality is unnecessary. Computationally, the parameter estimates are not iterative as are the non-linear Rasch models, so they require less computing time. But the computations involve matrix algebra and so they must be performed with a computer rather than a calculator. The use of a linear model results in distortions at the high and low ends of the rating scale, so for example, a paper which receives a perfect score from stringent raters is adjusted even higher than a perfect score for lenient raters. Weighted least squares differs from OLS by weighting the scores of consistent raters more than the scores of inconsistent raters in estimating parameters, whereas OLS weights all raters' scores the same.

Rasch Extension (RAS)

This method, described by de Gruijter (1984), models a curvilinear relationship between a paper's underlying quality and the paper's expected score from a rater. Originally proposed by Choppin (1982), the model states that the expected score Rij of a paper with quality level βi when rated by a judge with stringency parameter δj on a scale ranging from 0 to M is given by

Rij = M·exp(βi - δj) / (1 + exp(βi - δj)).

If M = 1 the formula looks like the Rasch 1-parameter item response model, but that model gives probabilities of correct answers whereas the Rasch extension yields expected scores. The function is graphed in Figure 3 for raters with stringency parameters δ1 = -1.0, δ2 = 0.0, and δ3 = 1.0, scoring on a 5-point scale.

[Figure 3: Expected Score Functions with the Rasch Extension. Expected score (0 to 5) is plotted against paper quality (-3 to 3) for the three raters.]

A paper of quality level 1.0 has an expected score of 4.4 from rater 1 (lenient), 3.7 from rater 2 (average), and 2.5 from rater 3 (stringent). To estimate the parameters of the model, the curve is transformed into a linear model, and the matrix solution of the OLS method is applied. Then the rater effect parameters are transformed back into the non-linear form of the model to get the adjusted scores for each paper. The model assumes scoring is continuous, so with discrete scoring categories the Rasch extension may not adequately fit the data.

Partial Credit Model (PCM)

The PCM (Masters, 1982; Wright & Masters, 1982) is another model for rater response based on item response theory. This model assumes separate stringency parameters at each level of scoring for all raters. Originally, the model was intended for test items that contained a finite number of discrete steps, and a partial credit score represented the number of steps (points) that an examinee got correct (received). In particular, a response must earn a k before it can be considered for a k+1. For any item, the steps involved in a solution can vary in difficulty. For example, an item which asks examinees to simplify the expression (4 + 5)^(1/2) - 8 requires three steps to get to a final answer. The first step, adding 4 and 5, is fairly easy so the step from a 0 to a 1 on the item has a low difficulty parameter. This step must be done correctly to continue scoring the item.
The second step, which requires knowledge of fractional exponents, is more difficult and so the difficulty parameter for that step should be higher than those of the other steps. The third step, subtracting 8 from 3, is more difficult than the first but less difficult than the second and so its difficulty parameter should numerically lie between the other two. But to reach the third step, the examinee must do the second step correctly. Thus, relatively few examinees get a partial credit score of 2, because most of those who can get from a 1 to a 2 (by evaluating a fractional exponent) can also get from a 2 to a 3 (by subtracting integers).

A similar phenomenon can occur in a rater's scoring pattern. A popular rating format is

1: demonstrates incompetence
2: suggests incompetence
3: suggests competence
4: demonstrates competence.

A particular rater might have stringent standards to move from a 2 to a 3, because of stringent standards as to what constitutes "competence", but at the same time have lenient standards as to the difference between "suggests" and "demonstrates". This rater would likely give relatively few 3s and relatively more 2s and 4s. Another rater with different standards and understanding of the terms used in the scale might assign many 3s and relatively fewer 2s and 4s. These differences in stringency are not global, but apply at specific points on the scale. The PCM accounts for such rater differences at each level of the rating scale.

The PCM treats each scoring step as a dichotomous item. The probability φnik that the nth paper is judged a k rather than a k-1 by rater i is given by

φnik = exp(βn - δik) / (1 + exp(βn - δik))

where δik is the difficulty parameter for rater i to give the kth level score and βn is the quality parameter of the nth paper. Because getting credit for any step is contingent on having already gotten credit for all previous steps, the steps can be combined into an overall probabilistic model. The probability πnix that the nth paper is judged an x by rater i is given by the formula

πnix = exp[ Σ(j=0 to x) (βn - δij) ] / Σ(k=0 to M) exp[ Σ(j=0 to k) (βn - δij) ]

where x ranges from 0 to M and where the quantity in the numerator is 1 when x = 0.

Figure 4 illustrates probability curves for a 5-point rating scale item. The vertical axis represents probability, and the horizontal represents underlying paper quality. For each level of quality, there are five probabilities corresponding to the five score levels possible from a particular rater. In the graph, these probabilities are represented by the numerals 1 through 5 above each quality level. The probability curves are for a rater with step parameters -4, -1.5, 1.5, and 4. These parameters are the quality levels at which the rater is equally likely to assign a score or the next higher score.

[Figure 4: Probability Curves for the Partial Credit Model. Probability (0 to 1) is plotted against paper quality, with one curve for each of the five score categories.]

As the underlying paper quality moves from very low to very high, the most probable score moves from 1 to 5. Where two curves intersect, the paper is of a quality such that the two scores are equally likely to be assigned. This situation models the real-life indecision of a rater scoring a "borderline" paper.
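The study's PCM estimation was carried out with the PCM.BAS program listed in Appendix A. The sketch below does not estimate anything; it only evaluates the model's category probabilities and the resulting expected score for given parameters, using the step values of Figures 4 and 5 (-4, -1.5, 1.5, 4). The function names and example quality levels are illustrative.

```python
# Illustrative sketch of Partial Credit Model category probabilities and
# expected score for a single rater, for given (not estimated) parameters.
import math

def pcm_probabilities(beta, steps):
    """Return P(score category x), x = 0..M, for paper quality beta and
    rater step difficulties steps[0..M-1] (delta_i1..delta_iM)."""
    # Cumulative sums of (beta - delta_ij); the x = 0 numerator is exp(0) = 1.
    exponents = [0.0]
    for delta in steps:
        exponents.append(exponents[-1] + (beta - delta))
    numerators = [math.exp(e) for e in exponents]
    total = sum(numerators)
    return [n / total for n in numerators]

def pcm_expected_score(beta, steps, lowest_category=1):
    """Expected reported score (Figures 4 and 5 use categories 1..5)."""
    probs = pcm_probabilities(beta, steps)
    return sum((lowest_category + x) * p for x, p in enumerate(probs))

steps = [-4.0, -1.5, 1.5, 4.0]
for quality in (-3.0, 0.0, 1.0, 3.0):
    probs = pcm_probabilities(quality, steps)
    print(quality, [round(p, 2) for p in probs],
          round(pcm_expected_score(quality, steps), 2))
```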
Figure 5 combines the probabilities with the score values to give the expected score of a paper as a function of its quality. The horizontal axis represents paper quality, and the vertical axis represents the expected score for the paper from a rater with step parameters -4, -1.5, 1.5, 4 as in Figure 4.

[Figure 5: Expected Score Function for the Partial Credit Model. Expected score is plotted against paper quality (-6 to 6).]

This curve is similar to the Rasch extension curves in Figure 3, except the shape of the PCM expected score curve varies depending on the values of the difficulty parameters at each step of the rating scale.

To use the PCM for rater score adjustment, first estimate the difficulty parameters for each judge and quality parameters for each paper that best fit the data. Then substitute the parameters back into the equations of the model to get the expected score of each paper for each rater, including the ones who did not actually rate the paper. The average over all raters is the adjusted score of the paper. The PCM offers a great deal of flexibility, by modeling differences in rater stringency at each point of the rating scale and by estimating scores even for those raters who did not rate the paper. One problem with the model is the large number of parameters which must be estimated, often with little data. Parameter estimates may be unstable or inaccurate, particularly for small data sets.

Other Adjustment Methods

A variety of other adjustment methods also address the problem of inconsistency across raters. Most are variants of the methods described above.

1. Truncated mean equating (TMN) is mean equating, but with a ceiling and floor imposed, so that no score is adjusted above the highest score on the rating scale and no score is adjusted below the lowest score on the scale.

2. Truncated linear equating (TLI) is linear equating, with scores truncated to the range of scores possible on the original scale.

3. Equipercentile equating with smoothing replaces the segmented cumulative frequency curves described earlier with smooth curves in an attempt to reduce the effect of having discrete score categories. Instead of abruptly changing the slope of the curve at the category endpoints and using linear interpolation, the slope of the curve changes more gradually and some type of curvilinear interpolation is used.

4. The Rasch Rating Scale Model (Andrich, 1978; Wright & Masters, 1982) is another in a family of polychotomous response models. These models apply in situations where a response is scored with more than two categories of quality, such as essay scoring and rating scales of all kinds. In complexity, the rating scale model is intermediate to the Rasch extension and the PCM. The parameters in this model represent rater differences in overall stringency, and different sizes of steps between the various score categories, but it assumes that the different step sizes are the same for all raters. For example, the rating scale model assumes that the difference between a 2 and a 3 for any particular rater equals the difference between a 2 and a 3 for all other raters. This simplifying assumption reduces the number of distinct parameters which are estimated compared to the PCM, yet is still a more flexible model than the Rasch extension. The equations for this model are identical to the PCM, except the parameter δik, which represents the difficulty parameter for rater i's kth step, is replaced by δi + tk, where δi is the overall stringency parameter for rater i and tk is the difficulty for the kth step across all raters.
The number of rater parameters to be estimated is thus reduced from the product i·k in the PCM to the sum i + k in the Rating Scale Model.

5. Rater Response Theory, developed by Cason and Cason (1984), is similar to the Rasch extension described above, but instead of using the logistic function, the model is based on the normal ogive. The two mathematical functions are virtually identical, graphed as S-shaped curves with upper and lower asymptotes to the right and left, respectively. Because the logistic function is easier to work with computationally than the normal ogive, it has become the more widely used of the two functions in item response theory.

Summary of Adjustment Methods

A variety of calibration methods statistically adjust for differences among raters in how they assign scores. Rater scoring patterns differ in their overall level of scores, in their overall spread of scores, and in their proportion of scores at each score level. All adjustment methods account for differences in rater means, but only some of the methods account for the other types of differences in scoring patterns. The simpler methods ignore sampling error and assume equivalent quality distributions of the papers each rater scores. The methods vary considerably in computational complexity, ranging from mean equating which can be done with a hand-held calculator or spreadsheet program, to the PCM which requires hours of computer time to perform the computations necessary to estimate parameters. The characteristics of each method are summarized in Table 1.

The focus of this study is on the accuracy of several of these adjustment methods. The study attempts to determine the extent to which each method improves the quality of scoring when only a subset of raters scores each paper. The methods are compared over several data sets which vary in the number of papers scored, the number of rating scale points, and the way paper qualities are distributed. Within a data set, raters vary in their level and spread of scores. By understanding how effective these adjustment methods are in a controlled study, better decisions can be made about which adjustment method to apply in real scoring situations.

Table 1
Comparison of Adjustment Methods

                                          NO  MN  LI  EQP  OLS  WLS  RRT  RAS  RRS  PCM
Adjusts for differences
  in overall level?                        N   Y   Y   Y    Y    Y    Y    Y    Y    Y
  in overall spread?                       N   N   Y   Y    N    Y    Y    N    N    Y
  at each point of the scale?              N   N   N   Y    N    N    N    N    N1   Y
Recognizes sampling error?                 Y   N   N   N    Y    Y    Y    Y    Y    Y
Possible with only one rater per paper?    Y   Y   Y   Y    N    N    N    N    N    N
Computational complexity? (0=Lo, 5=Hi)     0   1   1   2    3    3    4    4    5    5
Can adjust scores off the scale?           N   Y   Y   N    Y    Y    N    N    N    N
Number of rater parameters estimated?      0   R   2R  PR   R    2R   4R   R    P+R  PR

KEY
R: Number of raters             P: Number of points on rating scale
NO  = No equating               WLS = Weighted least squares
MN  = Mean equating             RRT = Rater Response Theory
LI  = Linear equating           RAS = Rasch Extension
EQP = Equipercentile equating   RRS = Rasch Rating Scale
OLS = Ordinary least squares    PCM = Partial Credit Model

1. The Rasch Rating Scale model allows for varying difficulties at each step of the rating scale, but assumes the same step size for each rater.

CHAPTER TWO
A REVIEW OF PREVIOUS STUDIES

Research in rater calibration methods is fairly recent. This fact is somewhat surprising because extensive research has been done on equating across forms of objective tests (Angoff, 1971; Petersen, Kolen, & Hoover, 1989), yet subjective measures such as essay tests have a much longer history in education.
Objective tests cost less to score and their scores are more reliable than are scores from essay tests; thus they continue to dominate large-scale testing. In recent years, however, performance assessment has played an increasing role in educational testing. Essays are used for measuring general writing ability as well as for evaluating student achievement in content areas. The emphasis in performance assessment is on what students can do rather than on what they know; on active production rather than on passive response; on recall rather than on recognition. But rating a performance is more subjective than scoring an objective test because rater biases can be confounded with the quality of the performance. The increased interest in performance assessment has led to research that addresses how best to separate the quality of a performance from the effects of the particular raters scoring that performance. Contexts Requiring Performance Assessment Some abilities can only be adequately measured with human judgment. Organizing thoughts and expressing them in writing are 26 27 important in nearly every subject area, and assessing the quality of this organization and expression requires human judgment. In problem solving, the processes used to arrive at an answer can be important and assignment of partial credit for correct steps leading to a final answer is typically based on subjective ratings. Ratings are also necessary to measure the quality of any creative product, such as a painting, a term paper, or a science fair project. A growing trend in educational measurement is toward performance assessment. By simulating real—life situations an examiner can evaluate a wide range of behaviors in realistic contexts and thus obtain more valid measures of an examinee's ability to perform certain tasks. These measures can be either objectively or subjectively scored, but generally involve a qualitative assessment of the quality of an educational product, as determined by rater judgment. Job Performance Performance assessment has a rich literature in industrial psychology, where rating scales are the primary means of measuring job performance. Wherry (1950) and Landy and Farr (1980) provided extensive reviews of performance rating research. Wherry's theory of rating (Wherry, 1952; Landy & Farr, 1983), which involved partitioning variance, foreshadowed later developments in generalizability theory. In 1980, Landy and Farr reviewed research studies in performance rating and suggested that further attempts to improve rating quality by adjusting the formats of rating scales were likely to prove unfruitful. Instead, they recommended more research into statistical control of common rating errors. They optimistically cautioned that "although 28 this is a mechanical solution that implies no increase in the understanding of the rating process, it offers the possibility of simultaneously providing the practitioner with better numbers and the researcher with hypotheses" (p. 101). Large—scale Testing Large—scale testing programs have increasingly begun to use subjective measures of achievement and ability. Godshalk, Swineford, and Coffman (1967) found that the predictive validity of a test of writing skills composed of multiple-choice items significantly increased when an item requiring a writing sample was added to the test. 
Performance measures requiring human judgment can test a broad range of skills in a variety of stimulus conditions, but when they are subjectively scored the score depends both on the person producing the performance and on the person rating the performance. The Problem of Unreliability The major problem with performance measures is lack of reliability in scoring. As Breland (1983) notes, "reliability has always been the Achilles heel of essay assessment" (p. 23). Clearly, the score on a single essay is an imperfect and unreliable indicator of a larger construct such as writing ability. To get a reliable measure of an individual's performance on essay questions, one would need several essays on different topics, written on different days and read by different raters. Generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Brennan, 1983) considers the question of reliability broadly. Writing tasks consist of several distinct facets, any of which can 29 systematically introduce variance into scoring. A person's writing ability varies depending on the type of writing, such as narrative, persuasive, or expository. The specific prompt (writing task) assigned makes a difference in the quality of essays. Score variation can also result from the assignment of raters to papers. Even when all these sources of variation are accounted for, error variance remains. Error variance measures differences in essay ratings that are not explained by any of the previously enumerated factors. Considering these sources of variance helps to answer the question of how well an essay score measures an examinee's ability. A narrower question is how well assigned ratings reflect the quality of any particular paper. The term inter—rater reliability refers to this specific type of reliability; it measures the level of agreement among raters. (For a more complete discussion of inter—rater reliability, see Weare, Moore, and Woodall, 1987.) Breland (1983) approximated reliabilities of writing ability assessments under various scoring conditions. Typically, a single essay with one rater has reliability of only .42. Adding a second rater increases reliability to .53. The use of two topics scored by different raters increases the reliability to about .57. Three essays with three different modes of discourse and three different raters result in estimates of writing ability with reliability .79. To get an ability estimate having reliability of .85 with only one reading per essay would require nine essays, and three different modes. With two readings per essay, six essays and two different modes would result in a reliability of .86. 30 As an example of how low the agreement among raters can be, Breland (1983) recounted a study by the Educational Testing Services where 300 essays written by college freshmen were rated on a 9-point scale by 53 raters from several different fields. The dispersion of ratings was large for each paper rated. None of the essays received fewer than five different ratings out of the nine possible. In fact, 23 percent of the essays received seven different ratings, 37 percent received eight different ratings, and 34 percent of the essays received all nine possible ratings! Despite the recognition that rater effects were problematic, little research was done on controlling for these effects, probably for two reasons: 1. Large scale assessment involving rater scoring was not prevalent. With small data sets, typically all raters score all papers and rater effects cancel out. 
When a subset of raters score each paper, the problem of sampling error exists. But separating rater leniency from paper quality is more difficult with small data sets. 2. More sophisticated scoring models that allow for separation of rater stringency and paper quality had not been developed, and the computing resources necessary to estimate parameters were not readily available. Consequently, most research into statistical control of rater effects occurred after 1980. Research Studies on Rater Effects Most research has investigated the viability of using statistical techniques to control for differences in overall rater stringency. 31 Most studies consist of models for rater effects being applied to data, either real or simulated, to determine how well the models reduce rater effects. All studies suggest that some form of rater calibration is desirable. Early Studies Ebel (1951) was one of the first to consider the problem of estimating reliability in a context where only a subset of raters rates each of a group of students. Ebel based his reliability estimates on an analysis of variance, applying the intraclass correlation formula to rater judgments. The between-raters variance component could either be included or excluded from the formula, depending on whether rater effects were retained or removed from the final scores. Guilford (1954) recommended an analysis of variance approach to control for rater differences, largely addressing the situation where all raters score all subjects on several traits. With reference to an incomplete matrix of ratings, Guilford recognized the potential for unfairness to subjects because of the particular raters scoring the paper. Guilford observed: "There is no simple, generally applicable solution to this problem. To the extent that any two or more raters have ratings in common sufficient to make the kind of study of ratings that was described above [ANOVA], something can be done to make adjustments. Linear transformations taking care of differences in means as well as differences in standard deviations would become important in this kind of situation. If one is willing to make assumptions concerning comparability of subgroups of ratees, one extends the possibility of making inferences about the amounts of errors of different kinds" (p. 289). Stanley (1961) addressed rater bias in the context of a three-way analysis of variance: ratees by raters by traits. He developed computational formulas, and recommended controlling not only for rater 32 main effects, but also for rater—ratee interactions (e.g. halo——the tendency for raters to rate an individual highly on all traits) and rater-trait interactions (e.g. the tendency for raters to rate one trait more stringently than they rate other traits.) Stanley did not address the common rating situation where only a subset of the raters rate each performance. Consequently, he points out that "the adjusted trait sums (over raters) and adjusted total scores (over both raters and traits) cannot be better for any purpose——predictive or otherwise-- than the unadjusted ratings are" (p. 214). But by removing rater main effect and interaction terms, internal consistency estimates of reliability are higher. In the data set Stanley studied, with 3 raters rating each of 7 individuals on 5 traits, the coefficient of equivalence increased from .84 for unadjusted ratings to .89 for adjusted ratings. 
While this increase is small, a common standard is that measures used to make decisions about individuals should have a reliability of at least .85; the adjusted ratings met this standard while the unadjusted ratings did not. Pa_ul Paul (1976, 1979, 1981) compared an additive model, which considered only differences in raters' mean stringency level, to a linear relationship model which modeled differences among raters both in their level of scores and in their spread of scores. Paul used both real data, where 85 raters scored each of 10 papers, and simulated data, again with 85 raters but scoring 20 papers each. Paul found little difference between the additive model and the linear relationship model, and recommended using the simpler additive model. 33 In one of his studies, Paul (1981) used Bayesian methods to estimate the models' parameters. In contrast to the estimates from simple mean equating and simple linear equating described earlier, Paul claims the Bayesian estimates are generally less susceptible to sampling fluctuations and should yield estimates closer to the true values. For the data he studied, the Bayesian estimates were indeed more accurate. But Bayesian estimates include additional subjective judgments and are therefore open to allegations of bias in predetermining results. de Gruijter De Gruijter (1984) outlined two models for rater effects: the additive model and the Rasch extension. In the additive model, raters are only assumed to differ in their mean level of stringency. De Gruijter used the method of ordinary least squares to estimate the parameters of the model. This method is equivalent to using simple linear regression to estimate the paper qualities and rater stringencies which best predict the data. De Gruijter claimed the additive model is ”computationally simple and straightforward, but unfortunately overly simplistic" (p. 215). He mentioned the possibility of using a more general linear model which allows for differences in error and true score variance across raters, but concluded that because of problems at the lower and upper bounds of the rating scale, only a nonlinear model is satisfactory. The Rasch extension, described earlier, is nearly linear for values near the center of the rating scale, but is curvilinear near the extremes of the scale (Figure 3). This models the reality that after a 34 certain level of quality, increases in quality improve the expected score of the paper only marginally. Similarly, if a poor paper is already likely to receive a zero, a much poorer paper is only slightly more likely to receive a zero. De Gruijter (1984) applied both the additive model and the Rasch extension to a data set consisting of 949 essays, each scored on a ten— point rating scale by two out of eight raters. The two models agreed very closely for values near the mean rating. De Gruijter argued that for more extreme values the results for the two models must diverge. De Gruijter suggested that the model may be too simple to adequately represent the effects of all raters. Cason and Cason In a series of studies (e.g. Cason & Cason, 1984; Cason & Cason, 1989), Cason and Cason present a non—linear model similar to the Rasch extension. As noted earlier, their model (Rater Response Theory, or RRT) was based on the normal ogive rather than the logistic function used in the Rasch extension. The RRT model consists of rater characteristic curves (RCCs). Figure 6 is a graph of one RCC. These RCCs are characterized by four parameters: 1. 
Resolving power refers to the extent to which ratings change as the quality of performance changes. Graphically, resolving power corresponds to the slope of the RCC, and is analogous to item discrimination in item response theory. Differences in resolving power affect the spread of scores across raters. A closely related term is rater sensitivity, which is the maximum value of rater resolving power, or the maximum slope of the curve (e.g., at point M in Figure 6).

[Figure 6: Graphical Representation of a Rater Characteristic Curve. Expected score is plotted against paper quality; the curve rises from the rater's effective rating floor to its effective rating ceiling, with point L marking its horizontal location and point M its maximum slope.]

2. Rater stringency is the tendency to require higher or lower paper quality to assign any given rating. Graphically, it corresponds to the horizontal location of the RCC. In Figure 6, the value of point L is the stringency parameter for the rater. As with the Rasch extension, one parameter is used for stringency over the entire rating scale.

3. The effective rating ceiling represents the highest level of scores which a rater actually gives. This may be less than the highest value on the rating scale.

4. The effective rating floor represents the lowest level of scores which a rater assigns. Again, this may be higher than the lowest value on the rating scale.

In their 1984 study, Cason and Cason imposed the simplifying assumptions that all raters have equal sensitivity and that the effective rating floor and ceiling for each rater were the lowest and highest values on the rating scale. Data were collected over a 3-year period from a medical school clerkship. In any year, each of 30 students was rated by 5 of 35 raters on a 34-item inventory with ratings ranging from 1 to 5.

The data were modeled four different ways. Model T (theory) had separate parameters for each student's ability and for each rater's stringency. Model A (ability) assumed all raters had equal stringency, and rating differences were based only on student ability. Model S (stringency) assumed students were of equal ability, and rater stringency parameters were allowed to vary. Model 0 (null hypothesis) assumed no systematic differences in either student ability or rater stringency, but that all rating differences were due to chance variation. Over all three years of data, Model T fit the data significantly better than any of the other three models. Variation in rater stringency explained about 35 percent of the variance in ratings, and variation in student ability explained an additional 40 percent of the variance in ratings.

In a later study (Cason & Cason, 1989), the one-parameter RRT model (stringency only) was compared to the methods of no equating and mean equating. These models differ in how they partition rating variance. No equating assumes no differences in rater stringency, and all scoring variance is assumed due to either ability differences or error. Mean equating assumes all between-rater variance is due to rater stringency differences and none is due to ability differences. These assumptions are unreasonable if the number of subjects per rater is small because of sampling error. Because the RRT model simultaneously estimates stringency and ability parameters, Cason and Cason argued that it provides a more accurate partitioning of variance, especially for small data sets.

In this follow-up study, two data sets were used from the medical school clerkship. One data set consisted of 42 raters rating 24 students; there were approximately 3 ratings per rater and 5 ratings per student.
The second data set had 93 raters rating 163 students, with approximately 8 ratings per rater and 5 ratings per student. In the smaller data set, no equating accounted for 24 percent of score variance, mean equating accounted for 66 percent, and RRT accounted for 72 percent. The estimated reliabilities based on the mean over 5 raters were .63 for no equating, .73 for mean equating, and .84 for RRT. In the larger data set, the difference between mean equating and RRT was not as great. No equating accounted for 43 percent of score variance, mean equating accounted for 64 percent, and RRT accounted for 65 percent. Reliability estimates over 5 raters after adjustments were .77 for no equating, .80 for mean equating, and .83 for RRT. Cason and Cason (1989) recommended the RRT model, especially for small data sets.

Braun

In studies with the Educational Testing Service, Braun (1986, 1988) investigated the increases in reliability obtained by calibrating ratings across four facets of the rating process. Braun (1988) separately considered (a) stringency of the raters, (b) the rating team the assigned raters were a part of, (c) which of four days the paper was rated, and (d) the time of day the paper was rated, for three separate questions from an Advanced Placement exam in English Literature and Composition. The raters were calibrated according to an additive model, using a partially balanced incomplete block design. This design allowed estimation of the effects of each of the four facets of the ratings, while greatly reducing the number of readings required compared to a complete factorial design. The study involved 12 raters in 2 groups of six, each reading 32 essays over 4 days, half in the morning and half in the afternoon, with each rater scoring 8 papers per day and each paper scored by 3 raters per day. Rater calibration estimates were determined in an experimental context and then applied in an operational setting.

Braun (1988) found that the estimated variance component associated with the raters was about 15 to 20 percent of the estimated error variance component. By adjusting for rater effects, single-reading reliability estimates based on variance components increased for each of the three essay questions graded. For the three-question total, the reliability estimate was .68 before adjusting scores, and .74 after the adjustment. A cross-validation resulted in some shrinkage, back to .72 after adjustment. In contrast, going from a single reading to a double reading would increase the reliability to .81.²

Braun (1988) recommended rater calibration (with only some papers being read twice) as a cost-effective alternative to a full double reading of all papers. In this study, calibration increased the scoring load by 5 to 7 percent, while a full double reading clearly increases the amount of scoring by 100 percent over a single reading. The increase in reliability from using calibration was 31 percent of the increase produced by using two scorers. In contexts where single-reading reliabilities are lower, the relative benefits of calibration are even higher.

² This estimate follows directly from the Spearman-Brown formula, which describes how reliability changes as a function of the number of observations. When a test with reliability rxx is doubled in length, the new reliability Rxx is given by the formula Rxx = 2rxx/(1 + rxx). Using twice as many raters doubles the number of observations.
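As a quick check of the footnote's arithmetic, the Spearman-Brown projection can be computed directly; this is only a minimal sketch in Python, using the .68 and .81 values reported above.

    # Spearman-Brown projection: reliability of a measure whose length
    # (here, the number of independent readings) is multiplied by k.
    def spearman_brown(r_xx: float, k: float = 2.0) -> float:
        return k * r_xx / (1.0 + (k - 1.0) * r_xx)

    # A single-reading reliability of .68 projects to about .81 for two readings,
    # matching the double-reading figure quoted in the text.
    print(round(spearman_brown(0.68, 2), 2))   # 0.81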
Wilson

Wilson (1988) compared ordinary least squares (OLS) with generalized least squares (GLS). Both methods assume the additive model, that rater bias can be expressed as a single additive constant. Whereas OLS assumes that all raters have equal error variance, GLS estimates an error variance for each rater and weights the scores of accurate raters more than those of less accurate raters in estimating true scores. In a simulated data set where two of eight raters gave considerably more inaccurate ratings than the other six, GLS better reproduced the true scores. The GLS estimates of the true scores had a mean squared error less than half that of the OLS estimates.

Houston, Raymond, Svec, and Webb

In a series of studies conducted at the American College Testing Program (Raymond & Houston, 1990; Houston, Raymond, & Svec, 1990; Webb, Raymond, & Houston, 1990), the researchers compared several models of adjusting for rater effects. Raymond and Houston compared (a) simple averaging of raw scores (NO ADJUSTMENT), (b) OLS, (c) WLS (identical to the GLS method of Wilson, 1988), (d) the Rasch rating scale model (Rasch), and (e) imputation of scores for missing paper/rater combinations (IMPUTE) on simulated data. Houston, Raymond, and Svec compared NO ADJUSTMENT, OLS, WLS, and IMPUTE, manipulating (a) the number of raters per paper, (b) the level of rater bias, and (c) the number of examinees in simulated data sets. Webb, Raymond, and Houston applied OLS, the Rasch extension described earlier, and WLS to a set of certification examination data, focusing on how rater adjustments affected the pass/fail decisions.

Raymond and Houston (1990) simulated data for 25 individuals rated by 6 raters on a 1 to 5 scale. In the simulation, the raters varied in their degree of bias and reliability and were generated from a multivariate normal distribution. The true score for any paper was its mean rating over all six raters, but only two ratings for each paper were used in estimating the adjustment parameters. Of the five methods they compared, the four correction methods, with an average error of .40 SDs, were all better than uncorrected data, which had an average error of .56 SDs. The four correction methods differed little from each other, with mean errors ranging from .39 SDs for Rasch to .46 SDs for WLS. The correlations between adjusted scores and true scores ranged from .86 for WLS to .88 for OLS.

Houston, Raymond, and Svec (1990) simulated data based on a general linear model. Scores ranged from 1 to 7, scored by 8 raters who varied in their level of bias and in their scoring reliability. In all, 120 data sets were generated, with 30 replications of a 2 x 2 design: level of rater bias (high or low) and number of examinees (50 or 100). Each data set was analyzed by four methods (OLS, WLS, IMPUTE, and NO ADJUSTMENT), with either 50 percent or 25 percent of the raters scoring each paper. The methods were compared based on how close the adjusted scores were to the true scores, using correlations and root mean squared errors (RMSE) to measure extent of agreement.

By both measures of agreement, the three adjustment methods were all better than NO ADJUSTMENT. All three methods adjusted well for both high and low levels of bias, but IMPUTE had lower RMSEs than OLS and WLS, especially in the cases with fewer raters per paper. However, correlations were slightly higher for OLS and WLS than for IMPUTE.
The researchers suggested that because IMPUTE assumes normally distributed data, and because the data were generated from a normal distribution, the method may not do as well if scores have some other distribution.

Webb, Raymond, and Houston (1990) compared adjustment methods on actual data from a health profession oral certification exam. For each of three years, approximately 120 candidates were examined by 4 of 40 raters and were assigned scores with a possible range of 3 to 36. Scores were adjusted, and the three adjustment methods (OLS, Rasch extension, and WLS) differed little; subsequent analyses used OLS because of its relative simplicity. Pass/fail decisions using unadjusted data were compared to the decisions using OLS-adjusted data. Of 129 decisions, 122 (95%) were the same whether scores were adjusted or not.

Lunz, Linacre, and Wright

In his dissertation, Linacre (1987a) developed a generalization of the Rasch model which applies in situations where several facets contribute to scoring. His model is especially well suited to contexts where raters score several items with varying difficulty. Linacre (1987b) applied the model to the essay test data used by Braun (1988). The model confirmed many of Braun's findings, such as rater stringency differences, but also identified particular papers and raters that fit the model poorly. Linacre recommended that these papers be regraded, and that the raters receive further training.

Lunz, Linacre, and Wright (1988) applied the model to practical examinations from an American Society of Clinical Pathologists test administration. A team of 12 judges scored 15 items (microscopic slides) submitted by 226 examinees. Each slide was graded on three tasks (microtomy, quality, and processing) on either a 0-1 scale (microtomy and processing) or a 0-3 scale (quality). The model in this situation was

ln(Pnmijk / Pnmij(k-1)) = Bn - Am - Di - Cj - Fmk,

where Pnmijk is the probability of person n being scored a k by judge j on task m of item i, Bn is the ability of person n, Am is the difficulty of task m, Di is the difficulty of item i, Cj is the severity of judge j, and Fmk is the height of grading step k on task m. The PCM described earlier is a special case of this model, except that in this model the difficulty of moving from a k-1 to a k was assumed to be a property of the item, and not a property of the judge.

Lunz, Linacre, and Wright (1988) found that despite common training, judges differed in their overall level of stringency. The model obtained separate estimates of the difficulty of each item, of each task and the associated steps of difficulty on the scale, of each ability, and of each rater's stringency. The researchers cited examples where raw score differences were misleading, and recommended making decisions based on the ability estimates because rater effects would then be eliminated. In practice, Rasch ability estimates are more difficult to interpret and explain to users than scores presented in rating scale units.

Denny

In an earlier paper (Denny, 1989) this author described a pilot study investigating the feasibility of applying the PCM to a rater calibration study. One simulated data set with 10 raters and 800 papers generated from the PCM, and a real-life writing assessment data set with 11 raters and 391 papers, were compared using four adjustment methods: (a) no equating, (b) mean equating, (c) linear equating, and (d) PCM.
In the simulated data set, differences among raters were negligible, so adjustments made little difference, though PCM was marginally better than the other methods. In the real data set, where the true scores were unknown, the mean and linear equating methods both yielded adjusted scores closer to the PCM-adjusted scores than to the raw scores.

Summary

Several general conclusions can be drawn from the studies discussed above:

1. Rater calibration, or adjustment for rater effects, can be applied in many contexts ranging from English Literature and Composition essays (Braun, 1986) to Clinical Pathology practical examinations (Lunz, Linacre, & Wright, 1988).

2. In every study involving some type of true scores, adjusted raw scores were closer to the true scores than were unadjusted scores. In studies examining reliability estimates, adjusted scores had higher internal consistency reliability estimates than did unadjusted scores. In general, more sophisticated models (those involving the estimation of more parameters) did better than simpler models (those with fewer estimated parameters). But these models require more computer resources and need more data to produce stable parameter estimates.

3. PCM and equipercentile equating, the two methods that adjust for rater differences at each level of the rating scale, have not been adequately studied. Equipercentile equating and equating with item response theory models such as PCM have been studied extensively in the context of equating parallel forms of objective tests, but not in the context of equating raters.

4. All of these studies represent attempts to make more precise measurements, and thus to enable better decisions. But unlike the methods of retraining raters and using more raters per paper, statistical adjustment entails minimal additional cost.

This study examines a wide range of methods that adjust scores for rater stringency. It is the first study to use equipercentile equating and the PCM to equate raters. This study compares rater calibration methods while varying several facets of the rating situation, including the number of papers scored, the number of points on the rating scale, the shape of the paper quality distribution, and the number of raters per paper. The results of this study should help test administrators to decide what method of rater adjustment, if any, would be most appropriate in a particular performance rating context.

CHAPTER THREE

METHOD

This study compared adjustment methods by applying them to both real and simulated data with a variety of scoring conditions and over several different rater types. In the simulated data sets, raters were modeled to vary in their level of stringency and spread. The rating task also varied in the number of raters per paper and in the number of scale points. The simulated sets varied in size (the number of papers) and in paper quality distribution. Adjusted scores were compared based on (a) the overall accuracy of scores as measured by the root mean squared error (RMSE), (b) the comparative rank order of scores as measured by the Pearson product-moment correlation, and (c) the worst case as measured by the most discrepant adjusted score for each method. Each of these aspects of the comparison is discussed in greater detail below.

Methods Compared

Seven adjustment methods were compared in this study.
Three of the methods (mean equating, linear equating, and OLS) were compared both in their standard formulation and using truncation to keep adjusted scores within the range of the rating scale. The methods are listed below in order of complexity, with more detail than the general descriptions provided in Chapter One.

1. No equating. The adjusted score for any rater was the score the rater actually gave. The adjusted score for any paper was the mean score given by all raters who scored the paper.

2. Mean equating. The scores assigned by each rater were adjusted by a fixed amount so that each rater had the same mean score after adjustment. For example, if a rater consistently assigned scores that were .5 points higher than the mean across all raters, each of that rater's scores was lowered by .5 points to get the adjusted score. If xij was the raw score assigned to paper i by rater j, mj was the mean score given by rater j, and m was the mean score over all raters, then the adjusted score yij for paper i from rater j was given by yij = xij + (m - mj). For truncated mean equating, any scores that exceeded the maximum score of the scale were reduced to the maximum score, and any scores that were adjusted below zero were assigned a zero. The adjusted score for a paper was the mean of the adjusted scores of the raters who scored the paper. After any necessary truncation, scores from the raters who scored the paper were averaged to get the adjusted score for the paper.

3. Linear equating. This method considers both the mean and standard deviation of each rater's scores. The scores were adjusted linearly so that all raters had a mean score equal to the overall mean and a standard deviation equal to the overall standard deviation. If the scores rater j assigned had a mean of mj and a standard deviation of sj, and over all papers and raters the mean was m and the standard deviation was s, then a score xij was adjusted to yij, where yij = [(xij - mj)/sj]·s + m. The adjusted score for a paper was the mean of the adjusted scores of the raters who scored the paper. Truncated linear equating is linear equating with adjusted scores being truncated to lie within the scale and then averaged across raters to get the adjusted score for the paper.

4. Equipercentile equating. This method considers the frequency of scores assigned at each level of the rating scale. For each rater, and for all raters combined, a cumulative frequency distribution was determined for each point on the rating scale. Each of a rater's scores was transformed to the point on the combined frequency distribution with the same percentile rank, using linear interpolation as needed. If pjk is the proportion of scores rater j assigned that are k or less, and if Pt and Pt+1 are the proportions of scores overall that are t or less and t+1 or less, and if Pt ≤ pjk ≤ Pt+1, then a k from rater j is adjusted to the value t + (pjk - Pt)/(Pt+1 - Pt). The example illustrated graphically in Figure 2 can also be formulated algebraically: k = 3, t = 3, t+1 = 4, pA3 = .23, P3 = .19, and P4 = .27, so a 3 from rater A adjusts to 3 + (.23 - .19)/(.27 - .19) = 3 + (.04/.08) = 3.5.

Two special cases do not fit the formula. First, if pjk < P0, then a k from rater j is adjusted to a 0. For example, if a rater assigned 0s or 1s only 3 percent of the time while overall raters assigned 0s 7 percent of the time, then 1s from this rater are adjusted to 0. Second, if Pt = Pt+1 (making the denominator of the fraction zero, and corresponding to a case where no scores of t+1 were assigned by any rater) and if pjk = Pt, then a k from rater j is adjusted to t, not to t+1. For example, if no rater assigned any paper a 9, and if rater j assigned no 8s or 9s, then a 7 from rater j (the rater's highest score) adjusts to an 8 (the highest score given) and not to a 9 (the highest score possible).

There is more than one way to define the percentile of a score. A simple definition is the percentage of scores at or below the given score. This definition assumes that if a paper was assigned a 3, for example, it is of higher quality than all other papers that were assigned a 3. A better definition is the percentage of scores below the score plus half of the percentage of scores at that score level. This definition assumes that if a paper was assigned a 3, then it is only better than half of the papers that were assigned 3s. This study used the simpler definition, both in determining percentiles for each rater's scores and for the overall scores. Using the other definition likely would have given slightly more accurate adjusted scores. In either case, though, the method accounts for rater differences in scoring patterns at each point on the rating scale.
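To make the bookkeeping in method 4 concrete, the following is a minimal sketch of the equipercentile transformation using the simpler "at or below" percentile definition described above. The function name and the small cumulative distributions are hypothetical illustrations (chosen to reproduce the worked example in which a 3 from rater A adjusts to 3.5), not the programs from Appendix A.

    def equipercentile_adjust(k, rater_cum, overall_cum):
        # rater_cum[t] and overall_cum[t] are cumulative proportions of scores <= t
        # for one rater and for all raters combined.
        p = rater_cum[k]
        if p < overall_cum[0]:            # special case 1: adjust to 0
            return 0.0
        for t in range(len(overall_cum) - 1):
            lo, hi = overall_cum[t], overall_cum[t + 1]
            if lo <= p <= hi:
                if hi == lo:              # special case 2: no scores of t+1 assigned by anyone
                    return float(t)
                return t + (p - lo) / (hi - lo)
        return float(len(overall_cum) - 1)

    # Hypothetical cumulative distributions with P3 = .19 and P4 = .27 overall,
    # and rater A's proportion of scores at or below 3 equal to .23:
    overall_cum = [.01, .05, .10, .19, .27, .40, .60, .80, .95, 1.00]
    rater_a_cum = [.01, .04, .12, .23, .30, .45, .65, .85, .96, 1.00]
    print(round(equipercentile_adjust(3, rater_a_cum, overall_cum), 2))   # 3.5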
5. OLS. This method assumes the additive model, where the score xij given to a paper with true score ai by rater j with bias δj and random error eij is given by xij = ai + δj + eij. In contrast to WLS, the error variances for all raters are assumed equal, and linear regression is used to estimate the terms ai and δj. The matrix equations used by Wilson (1988) and by other researchers (Raymond & Houston, 1990; Houston, Raymond, & Svec, 1990; Webb, Raymond, & Houston, 1990) are not well suited for large-scale testing programs. If 500 papers were each scored by 2 of 10 raters, the recommended matrix equations would contain a matrix with 1000 rows and 509 columns, or 509,000 elements. Performing algebraic operations on a matrix of that size requires much computer time and capacity.

The additive model as presented by de Gruijter (1984) reduces the size of the matrices considerably. Instead of modeling raw scores directly, he modeled the average difference between ratings of pairs of raters to obtain estimates of the rater effects. If djk is the average difference between the ratings of rater j with bias δj and rater k with bias δk on the papers they both graded, then djk = δj - δk + tjk, where tjk is a residual error term. To get a unique solution, the sum of the bias terms (relative stringency or leniency) is assumed to be 0. The last rater effect can then be expressed in terms of the other rater effects: δn = -Σδi, where i = 1, 2, ..., n-1. In matrix terms, the equation becomes d = A'δ + e, where d is the observed vector of average rater differences, A is a design matrix which designates which pair of raters is involved, δ is a vector of rater effects, and e is the residual error to be minimized in estimating rater effects. With 500 papers each scored by 2 of 10 raters, assuming each pair of raters rates at least 1 paper in common so every combination is represented, the design matrix has 45 columns and 9 rows, or only 405 elements. The OLS estimate of the rater effects is given by the matrix equation δ̂ = (A N A')⁻¹ A N d, where N is a diagonal matrix containing the number of papers graded by each pair of raters.
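The closed-form solution above takes only a few lines of matrix code. The sketch below is an illustration under the stated assumptions (average pairwise differences, a sum-to-zero constraint on the rater effects, and weights equal to the number of commonly graded papers); the function name and the three-rater example are hypothetical, and this is not the program from Appendix A.

    import numpy as np

    def pairwise_rater_effects(d, counts, n_raters):
        # d[(j, k)]      : average difference between rater j's and rater k's ratings
        #                  on the papers both graded (zero-based indices, j < k)
        # counts[(j, k)] : number of papers the pair graded in common (the weights N)
        # The last rater's effect is minus the sum of the others, so n_raters - 1 are free.
        pairs = sorted(d.keys())
        A = np.zeros((len(pairs), n_raters - 1))
        for row, (j, k) in enumerate(pairs):
            if j < n_raters - 1:
                A[row, j] += 1.0
            else:
                A[row, :] -= 1.0
            if k < n_raters - 1:
                A[row, k] -= 1.0
            else:
                A[row, :] += 1.0
        y = np.array([d[p] for p in pairs])
        N = np.diag([counts[p] for p in pairs])
        free = np.linalg.solve(A.T @ N @ A, A.T @ N @ y)   # weighted least squares
        return np.append(free, -free.sum())                # recover the last rater's effect

    # Example with three raters: rater 0 scores about .5 higher than rater 1
    # and 1.0 higher than rater 2 on the papers they share.
    effects = pairwise_rater_effects(
        {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 0.5},
        {(0, 1): 40, (0, 2): 35, (1, 2): 45},
        n_raters=3)
    print(np.round(effects, 2))   # approximately [ 0.5  0.  -0.5]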
Unlike nonlinear models which require iterative solutions, this solution is straightforward and can be performed with any statistical package capable of multiple regression. A special program was written (see Appendix A) to do the matrix computations on the data used in the study. Once the rater effects were estimated, these estimates were used to adjust raw scores. As with mean and linear equating, scores were adjusted both with and without truncating.

6. Rasch extension. The Rasch extension method is similar to OLS, except that instead of assuming a linear relationship between paper qualities and ratings, the model assumes a curvilinear relationship. As stated earlier, the expected score Rij of a paper with quality level Bi when rated by a judge with stringency parameter δj on a scale ranging from 0 to M is given by Rij = M·exp(Bi - δj)/(1 + exp(Bi - δj)). Choppin (1982) derived a formula for an estimate of the difference between two rater effects djk = δj - δk:

djk = ln{[Σ xik(M - xij)]/[Σ xij(M - xik)]},

where the summations are over all observed score pairs xij and xik that raters j and k have in common. Notice that this transformation eliminates the quality parameter Bi. Now the OLS method of de Gruijter (1984) described in the previous section can be applied to the transformed djk values to get estimates of the rater effects δj. These parameter estimates δj were used in the following formula, which transforms observed scores xij into adjusted scores xi with the rater effects removed:

xi = exp(δj)·xij/[1 - xij(1 - exp(δj))/M].

The adjusted scores for each rater were then averaged to get the overall adjusted score for each paper.

7. PCM. To adjust scores using the PCM, the stringency parameters were estimated by considering the pattern of scores over pairs of raters. Next, those values were used to estimate the quality parameters. Finally, the estimated parameters were substituted back into the model to get expected scores for each rater, and these expected scores were averaged to get the adjusted score for the paper. This method estimated true scores not only for the raters who scored the paper, but also for raters who did not score the paper, by using their stringency parameter estimates. The PAIR algorithm for estimating the stringency step parameters δij in the PCM is detailed by Wright and Masters (1982, pp. 82-85). PAIR is the estimation procedure they recommend when a data set has many missing cases. In this context, a case is considered missing any time a rater does not score a paper. A BASIC program written to perform the iterative estimations is included in Appendix A. The iterative procedure terminated either when the maximum parameter shift was less than .02 or after 50 iterations.
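As an illustration of method 6, the sketch below applies Choppin's pairwise estimate and the score-adjustment formula given above to two hypothetical raters on a 0-to-5 scale. With only two raters, the sum-to-zero constraint makes δj equal to half the pairwise difference; the data and function names are illustrative only, not the Appendix A program.

    import math

    def choppin_difference(scores_j, scores_k, M):
        # Estimate delta_j - delta_k from the papers raters j and k both scored.
        num = sum(xk * (M - xj) for xj, xk in zip(scores_j, scores_k))
        den = sum(xj * (M - xk) for xj, xk in zip(scores_j, scores_k))
        return math.log(num / den)

    def rasch_adjust(x, delta_j, M):
        # Remove rater j's effect from an observed score x on a 0-to-M scale.
        return math.exp(delta_j) * x / (1.0 - x * (1.0 - math.exp(delta_j)) / M)

    # Hypothetical 0-5 scores on papers both raters graded; rater j scores higher overall.
    scores_j = [4, 5, 3, 4, 2, 5]
    scores_k = [3, 4, 2, 3, 1, 4]
    d_jk = choppin_difference(scores_j, scores_k, M=5)
    print(round(d_jk, 2))                                     # about -1.2: rater j is the more lenient
    print(round(rasch_adjust(4, delta_j=d_jk / 2, M=5), 2))   # a 4 from rater j adjusts down, to about 3.4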
Data Sets

Two types of data were used to compare the methods: simulated and real. Simulated data have the advantage that true scores are known, and that various aspects of the scoring procedure can be manipulated. Real data have the advantage of not being based on the assumptions of any particular model, but true scores are unknown.

Real Data

The real data were from a district-wide writing assessment conducted in a suburban school district in Michigan. All students in grades 5, 8, and 11 wrote essays arguing for a change they would like to see in their school. The instructions given to the students are in Appendix B. The essays (124 for grade 5, 141 for grade 8, and 122 for grade 11) were scored in grade level order by a team of 10 raters on a 5-point scale.

Training for the raters consisted of a brief presentation of the rating criteria listed in Appendix C, followed by scoring of sample papers. These samples were of each of score levels 1 through 5 as determined by the rating supervisors, who were experienced holistic scorers. The raters scored each of the five papers and then as a group discussed why they assigned the scores they did, with particular focus on raters who gave lower or higher scores than the rest of the rating team. This training procedure (scoring and discussing sample papers) preceded the scoring of actual papers at each of the three grade levels. Even though the scoring criteria in Appendix C do not refer to grade level, the standards for each score level were higher at the higher grade levels. Thus the scoring was criterion-referenced, but across grade levels the criteria for each score shifted.

After the training, papers were shuffled and two raters read each paper and assigned scores independently. Instead of writing numerical scores on the papers, the raters wrote a letter code corresponding to their score so other raters would not be influenced by the previous scores. If the two ratings on a paper differed by more than a point, a third rater scored the paper. The most discrepant rating was omitted, or if the third rating was midway between the other two, then the lower rating was omitted. In the fifth grade data set, 8.2 percent of the papers were rescored; 8.1 percent of the eighth grade papers required rescoring; and only 3.1 percent of the grade eleven papers were discrepant and needed scoring by a third rater.

Table 2 lists the distribution of scores assigned by each rater after elimination of discrepant scores. Although a score of 0 was possible, in practice no 0s were assigned and only a few 1s were assigned. Overall, the mean scores assigned were almost identical at each grade level (3.48 for grade 5, 3.45 for grade 8, and 3.48 for grade 11). The standard deviation of scores decreased at higher grade levels (.93 for grade 5, .86 for grade 8, and .78 for grade 11). One explanation of this trend toward less score variance is that paper qualities are more homogeneous in higher grade levels than in lower. Alternatively, because papers were scored in grade level order, it could be that over time raters used extreme scores less often. Constable and Andrich (1984) detailed a study which suggested that when raters were encouraged to agree on scores, over the course of a grading session raters tended to give more moderate scores.

Raters differed across grade levels in how they assigned scores. For example, rater 10 was the most lenient rater on grade 5 and grade 8 essays, but was at the overall mean on grade 11 essays. Rater 6 assigned the lowest mean scores of any rater on the grade 5 and grade 8 sets, but on grade 11 papers this rater was the second most lenient. On the grade 5 papers, mean ratings by rater ranged from a low of 3.13 to a high of 3.89, and the standard deviations ranged from .68 to 1.10.
On grade 8 papers, the most stringent rater assigned a mean rating of 55 Table 2 Frequencies of Scores Assigned by Raters in the Real Data Set GRADE FIVE Frequency of Each Score Summary RATER 1 2 3 g E N Mean SE 1 0 2 6 6 2 16 3.50 0.87 3 0 2 10 8 4 24 3.58 0.86 4 0 7 11 6 4 28 3.25 0.99 5 l 4 7 9 6 27 3.56 1.10 6 0 5 4 7 0 16 3.13 0.86 7 0 1 10 6 1 18 3.39 0.68 8 1 2 16 3 4 26 3.27 0.94 9 0 3 6 12 5 26 3.73 0.90 10 0 1 13 12 11 37 3.89 0.86 11 0 6 13 10 1 30 3.20 0.79 Total 2 33 96 79 38 248 3.48 0.93 GRADE EIGHT Frequency of Each Score Summary RATER 1 2 3 4 5 N Mean SD 1 1 5 8 9 2 25 3.24 0.99 3 0 2 16 10 2 30 3.40 0.71 4 0 3 16 13 l 33 3.36 0.69 5 0 3 8 13 5 29 3.69 0.88 6 1 4 11 5 0 21 2.95 0.79 7 0 2 8 8 2 20 3.50 0.81 8 0 1 8 12 3 24 3.71 0.73 9 0 7 16 10 4 37 3.30 0.90 10 0 0 14 9 9 32 3.84 0.83 11 0 5 12 10 4 31 3.42 0.91 Total 2 32 117 99 32 282 3.45 0.86 g 56 Table 2 (continued) GRADE 11 Frequency of Each Score Summary RATER 1 2 2 4 E N Mean SD 1 0 8 8 4 0 20 2.80 0.75 3 0 0 14 13 1 28 3.54 0.57 4 0 3 10 13 2 28 3.50 0.78 5 0 1 7 11 8 27 3.96 0.84 6 0 0 9 8 3 20 3.70 0.71 7 0 1 9 4 0 14 3.21 0.56 8 0 2 6 8 3 19 3.63 0.87 9 0 3 18 8 2 31 3.29 0.73 10 0 1 15 10 2 28 3.46 0.68 11 0 1 15 11 2 29 3.48 0.68 Total 0 20 111 90 23 244 3.48 0.78 2.95 while the most lenient rater assigned a mean rating of 3.84; standard deviations ranged from .69 to .99. Grade 11 papers showed the greatest range of means, from 2.80 for the most stringent rater to 3.96 for the most lenient. Standard deviations of raters' scores for the grade 11 papers ranged from .56 to .87. Because raters varied in their stringency depending on the grade level, three separate analyses were performed, one at each grade level. Thus, a rater's parameter estimates for grade 5 papers are independent of the estimates for grades 8 or 11. Because so few Os and ls were assigned, before the analysis the scores were transformed from a 0—5 scale to a 0-3 scale by subtracting 2 from each score. The 1 ratings (of which there were only four) were reduced to O. The computer programs (Appendix A) were written in terms of a scale beginning with 0, and it was easier to transform the data than to rewrite the programs . 57 Simulated Data The simulated data were generated from the PCM. The PCM has separate parameters for each step of the rating scale for each rater, so is able to model rater differences throughout the scale. Five facets of the rating situation were varied across the simulated data sets: 1. Scoring ranges from 0 to 5 or from 0 to 9. 2. Data sets of 100 or 500 papers. 3. Papers scored by 1, 2, or 3 raters, or by 2 raters with rescoring by a third rater in case of score discrepancy (as in the real data set). Note that when only one rater scores each paper, some adjustment methods are not appropriate. 4. Scoring by a team of 9 raters. In terms of stringency level, raters 1, 2, and 3 are lenient; 4, 5, and 6 are average; and 7, 8, and 9 are stringent. Raters 1, 4, and 7 assign widely spread scores; 2, 5, and 8 assign scores of average spread; and 3, 6, and 9 assign scores with a narrow spread. Table 3 presents the stringency step parameters for each rater, for both a 5— and 9-point scale. These parameters were based on the findings of the earlier pilot study (Denny, 1989) and were selected to be similar to the real data in means, standard deviations and in level of rater agreement. 
Note that stringency step parameters have values opposite to the scores they produce: high values of step parameters lead to low scores, and step parameters with low variance result in scoring with high variance. Raters are nested within the simulated data sets, so each set is scored by the same nine raters.

[Table 3: Stringency step parameters for each rater in the simulated data sets, for both the 5-point and the 9-point scales.]

5. Quality parameters Bi simulated from a distribution that is either normally distributed, positively skewed, or negatively skewed. To get a random number p from a normal distribution with mean 0 and SD 1, substitute random numbers r and s from a uniform distribution on the interval [0,1] into the formula p = sqrt(-2·ln(r))·sin(2π·s). For the normal distribution, the quality parameters were generated by 3p + 2, resulting in a normal distribution with mean = 2 and SD = 3. For a positively skewed distribution, the quality parameters were generated by 5|p| - 3, a distribution with mean = 1 and SD = 3 and a minimum value of -3. In a negatively skewed distribution, the quality parameters are generated by -5|p| + 7, a distribution with mean = 3 and SD = 3, with a maximum value of 7.

To reduce the number of separate analyses, the data sets with 100 papers were generated with each of the three distribution shapes, but the sets with 500 papers used only normally distributed quality parameters. In all, there were 24 data sets with 100 papers from a 2 x 4 x 3 design: scale points (5 or 9) by number of raters (1, 2, 3, or 2 with rescoring) by quality distribution (normal, positively skewed, or negatively skewed). There were 8 data sets with 500 papers from a 2 x 4 design: scale points by number of raters.
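The generating recipe just described (a Box-Muller standard normal deviate, rescaled or folded to produce the three quality distributions) is short enough to sketch directly. This is only an illustration of the formulas above, with hypothetical function names, not the simulation program from Appendix A.

    import math
    import random

    def standard_normal(rng=random):
        # Box-Muller: p = sqrt(-2 ln r) * sin(2*pi*s), with r and s uniform on (0, 1).
        r, s = rng.random() or 1e-12, rng.random()
        return math.sqrt(-2.0 * math.log(r)) * math.sin(2.0 * math.pi * s)

    def quality_parameter(shape, rng=random):
        # Quality distributions used for the simulated papers:
        #   normal:            3p + 2     (mean 2, SD 3)
        #   positively skewed: 5|p| - 3   (mean 1, SD 3, minimum -3)
        #   negatively skewed: -5|p| + 7  (mean 3, SD 3, maximum 7)
        p = standard_normal(rng)
        if shape == "normal":
            return 3.0 * p + 2.0
        if shape == "positive":
            return 5.0 * abs(p) - 3.0
        if shape == "negative":
            return -5.0 * abs(p) + 7.0
        raise ValueError(shape)

    qualities = [quality_parameter("negative") for _ in range(500)]
    print(max(qualities) <= 7.0)   # True: the negatively skewed distribution is capped at 7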
Criteria for Comparing the Methods

Three criteria were used to compare the methods:

1. The Pearson product-moment correlation coefficient measured the degree of linear relationship between two variables. Computationally, the formula is Σ(xi - x̄)(yi - ȳ)/(sx·sy·n), where x̄ is the mean of the xi's, ȳ is the mean of the yi's, sx and sy are the SDs of the xi's and the yi's, n is the number of papers, and the summation is from i = 1 to i = n. Because rank orders of scores change little by using the adjustment methods, all correlations were high. To average or compare correlations, a Fisher's Z transformation (Glass & Hopkins, 1984, p. 305) was used to reduce the ceiling effect on high correlations.

2. For the real data, the root mean squared difference (RMSD) measured the differences in adjusted scores for pairs of methods. RMSD is given by the formula sqrt(Σ(xi - yi)²/n). This statistic is in raw score units, so an RMSD of .5 indicates that on average, two sets of adjusted scores are .5 units apart.

3. The maximum score discrepancy was the greatest score difference between the two sets of scores for any paper. This represented the worst case for a method. Particularly when adjusted scores are used to make decisions about individuals, it is important to know how much individual cases can be affected.

For the real data set, true scores were unknown. Thus, there was no absolute criterion for comparison. The adjustment methods were compared with each other, to determine how much they differed. Pass rates using a cut-score of 3.5 were compared with the different adjustment methods, to determine the proportion of decisions that would be affected by using rater adjustment.

For the simulated data, true scores are defined by the average expected score over all raters based on the parameters of the model. Adjusted scores were compared with true scores for each adjustment method. Correlations, root mean squared errors (RMSEs), and maximum differences were all based on comparisons with true scores. Better methods have high correlations, low RMSEs, and low maximum score discrepancies. In addition, each of the three comparison statistics was averaged across data sets to determine which methods did better than others overall, and under what scoring conditions particular methods adjusted scores more accurately.

CHAPTER FOUR

RESULTS

This chapter consists of two sections. First, the adjustment methods are compared for the writing assessment data. Second, the adjustment methods are compared for the simulated data, focusing on how the methods interact with each facet of the simulated data. In both data sets the amount of data collected was overwhelming, so of necessity data have been summarized and combined for ease of analysis and reporting.

Writing Assessment Data

The ten adjustment methods were applied to the writing assessment data for each of grades 5, 8, and 11. The RMSD, maximum difference, and correlation between each pair of methods are reported for each grade level in Table 4. Examining the data revealed these facts:

1. The truncated methods differed little from their non-truncated versions for mean equating, linear equating, and OLS.

2. RMSDs and correlations were inversely related: higher RMSDs were associated with lower correlations. A secondary analysis showed that RMSDs and correlations after a Fisher Z-transformation had a correlation of -.92 across the three grades. Thus, the differences between adjustment methods based on RMSDs have nearly the same rank order as the differences between methods based on correlations.
62 GRADE 5 N0 N0 .0000 MN .1566 TMN .1581 LI .1684 TLI .1644 EQP .2482 OLS .1788 TLS .1736 RAS .2141 PCM .1814 NO NO .0000 MN .3275 TMN .3275 LI .3605 TLI .3605 EQP .5393 OLS .3031 TLS .3031 RAS .4189 PCM .4465 N0 N0 1.000 MN .9827 TMN .9830 LI .9807 TLI .9809 EQP .9688 OLS .9803 TLS .9803 RAS .9696 PCM .9769 Method MN .1566 .0000 .0419 .0696 .0647 .1864 .2064 .1968 .2863 .1539 MN .3275 .0000 .2040 .3159 .1937 .4752 .4052 .4052 .6864 .5142 MN .9827 1.000 .9992 .9974 .9969 .9877 .9729 .9733 .9417 .9832 63 Table 4A Comparisons for the Writing Assessment Data Root Mean Squared Difference TMN .1581 .0419 .0000 .0888 .0549 .1880 .2027 .1889 .2856 .1464 TMN .3275 .2040 .0000 .3633 .1407 .4752 .4052 .4052 .6420 .4261 TMN .9830 .9992 1.000 .9969 .9982 .9898 .9747 .9753 .9403 .9855 LI .1684 .0696 .0888 .0000 .0621 .1781 .2182 .2158 .2995 .1475 TLI .1644 .0647 .0549 .0621 .0000 .1756 .2098 .1996 .2952 .1276 EQP .2482 .1864 .1880 .1781 .1756 .0000 .2903 .2840 .3540 .2227 OLS .1788 .2064 .2027 .2182 .2098 .2903 .0000 .0482 .3762 .2475 Maximum Difference LI TLI EQP .3605 .3605 .5393 .3159 .1937 .4752 .3633 .1407 .4752 .0000 .3296 .4217 .3296 .0000 .4217 .4217 .4217 .0000 .4410 .4410 .6696 .4410 .4410 .6696 .7640 .6569 .9189 .5349 .3992 .6043 Correlation LI TLI EQP .9807 .9809 .9688 .9974 .9969 .9877 .9969 .9982 .9898 1.000 .9982 .9880 .9982 1.000 .9907 .9880 .9907 1.000 .9701 .9718 .9645 .9697 .9724 .9652 .9398 .9380 .9200 .9853 .9885 .9780 OLS .3031 .4052 .4052 .4410 .4410 .6696 .0000 .2526 .6914 .6373 OLS .9803 .9729 .9747 .9701 .9718 .9645 1.000 .9990 .9092 .9594 TLS .1736 .1968 .1889 .2158 .1996 .2840 .0482 .0000 .3709 .2410 TLS .3031 .4052 .4052 .4410 .4410 .6696 .2526 .0000 .6914 .6373 TLS .9803 .9733 .9753 .9697 .9724 .9652 .9990 1.000 .9063 .9596 RAS .2141 .2863 .2856 .2995 .2952 .3540 .3762 .3709 .0000 .2829 RAS .4189 .6864 .6420 .7640 .6569 .9189 .6914 .6914 .0000 .7027 RAS .9696 .9417 .9403 .9398 .9380 .9200 .9092 .9063 1.000 .9453 PCM .1814 .1539 .1464 .1475 .1276 .2227 .2475 .2410 .2829 .0000 .4465 .5142 .4261 .5349 .3992 .6043 .6373 .6373 .7027 .0000 .9769 .9832 .9855 .9853 .9885 .9780 .9594 .9596 .9453 1.000 GRADE 8 NO NO .0000 MN .1629 TMN .1604 LI .1700 TLI .1635 EQP .2151 OLS .1206 TLS .1224 RAS .1211 PCM .1438 NO NO .0000 MN .3264 TMN .3264 LI .4446 TLI .3768 EQP .5184 OLS .2403 TLS .2403 RAS .2800 PCM .3945 NO NO 1.000 MN .9772 TMN .9789 LI .9748 TLI .9770 EQP .9686 OLS .9874 TLS .9873 RAS .9877 PCM .9823 64 Table 48 Method Comparisons for the Writing Assessment Data Root Mean Squared Difference MN TMN .1629 .1604 .0000 .0297 .0297 .0000 .0590 .0718 .0555 .0503 .1555 .1558 .1689 .1663 .1681 .1612 .2217 .2209 .1734 .1729 MN TMN .3264 .3264 .0000 .1264 .1264 .0000 .1621 .2532 .1621 .1621 .4321 .4321 .3909 .3909 .3909 .3909 .4729 .4729 '.5011 .5011 MN TMN .9772 .9789 1.000 .9994 .9994 1.000 .9975 .9970 .9970 .9978 .9891 .9908 .9752 .9768 .9733 .9754 .9585 .9596 .9745 .9756 LI TLI EQP OLS .1700 .1635 .2151 .1206 .0590 .0555 .1555 .1689 .0718 .0503 .1558 .1663 .0000 .0483 .1418 .1746 .0483 .0000 .1341 .1704 .1418 .1341 .0000 .2237 .1746 .1704 .2237 .0000 .1779 .1664 .2269 .0432 .2282 .2229 .2638 .2351 .1779 .1704 .2268 .1812 Maximum Difference LI TLI EQP OLS .4446 .3768 .5184 .2403 .1621 .1621 .4321 .3909 .2532 .1621 .4321 .3909 .0000 .2486 .3855 .3997 .2486 .0000 .3855 .3997 .3855 .3855 .0000 .5702 .3997 .3997 .5702 .0000 .3997 .3997 .5702 .1636 .5651 .4508 .5982 .5179 .5347 .5347 .6763 .4555 Correlation LI TLI EQP OLS .9748 .9770 .9686 .9874 .9975 .9970 .9891 .9752 .9970 .9978 .9908 
.9768 1.000 .9985 .9911 .9733 .9985 1.000 .9941 .9747 .9911 .9941 1.000 .9656 .9733 .9747 .9656 1.000 .9714 .9739 .9648 .9990 .9556 .9579 .9491 .9531 .9727 .9754 .9663 .9718 TLS .1224 .1681 .1612 .1779 .1664 .2269 .0432 .0000 .2377 .1803 TLS .2403 .3909 .3909 .3997 .3997 .5702 .1636 .0000 .5179 .4963 TLS .9873 .9733 .9754 .9714 .9739 .9648 .9990 1.000 .9517 .9719 RAS .1211 .2217 .2209 .2282 .2229 .2638 .2351 .2377 .0000 .1858 .2800 .4729 .4729 .5651 .4508 .5982 .5179 .5179 .0000 .5150 .9877 .9585 .9596 .9556 .9579 .9491 .9531 .9517 1.000 .9711 PCM .1438 .1734 .1729 .1779 .1704 .2268 .1812 .1803 .1858 .0000 PCM .3945 .5011 .5011 .5347 .5347 .6763 .4555 .4963 .5150 .0000 PCM .9823 .9745 .9756 .9727 .9754 .9663 .9718 .9719 .9711 1.000 65 Table 4C Method Comparisons for the Writing Assessment Data GRADE 11 Root Mean Squared Difference NO MN TMN LI TLI EQP OLS TLS RAS PCM NO .0000 .2159 .2142 .2222 .2192 .2917 .1542 .1388 .1353 .1510 MN .2159 .0000 .0272 .0790 .0703 .2136 .3044 .2689 .2200 .1823 TMN .2142 .0272 .0000 .0846 .0671 .2112 .3034 .2673 .2182 .1825 LI .2222 .0790 .0846 .0000 .0452 .1918 .3070 .2757 .2303 .1866 TLI .2192 .0703 .0671 .0452 .0000 .1898 .3029 .2701 .2283 .1827 EQP .2917 .2136 .2112 .1918 .1898 .0000 .3681 .3375 .3013 .2909 OLS .1542 .3044 .3034 .3070 .3029 .3681 .0000 .0853 .2738 .2399 TLS .1388 .2689 .2673 .2757 .2701 .3375 .0853 .0000 .2689 .2157 RAS .1353 .2200 .2182 .2303 .2283 .3013 .2738 .2689 .0000 .1837 PCM .1510 .1823 .1825 .1866 .1827 .2909 .2399 .2157 .1837 .0000 Maximum Difference NO MN TMN LI TLI EQP OLS TLS RAS PCM NO .0000 .4683 .4683 .4729 .4577 .8106 .3882 .3339 .3969 .3858 MN .4683 .0000 .2438 .2729 .2396 .7101 .7254 .7254 .7097 .4839 TMN .4683 .2438 .0000 .2729 .2396 .7101 .7254 .7254 .7097 .4839 LI .4729 .2729 .2729 .0000 .2630 .5369 .8068 .8068 .6531 .4424 TLI .4577 .2396 .2396 .2630 .0000 .5369 .7529 .7529 .6531 .4424 EQP .8106 .7101 .7101 .5369 .5369 .0000 .9550 .9550 1.094 .7851 OLS .3882 .7254 .7254 .8068 .7529 .9550 .0000 .3201 .7308 .7189 TLS .3339 .7254 .7254 .8068 .7529 .9550 .3201 .0000 .7308 .7189 RAS .3969 .7097 .7097 .6531 .6531 1.094 .7308 .7308 .0000 .5506 PCM .3858 .4839 .4839 .4424 .4424 .7851 .7189 .7189 .5506 .0000 Correlation NO MN TMN LI TLI EQP OLS TLS RAS PCM NO 1.000 .9506 .9516 .9495 .9494 .9391 .9775 .9799 .9820 .9778 MN .9506 1.000 .9992 .9957 .9951 .9816 .9069 .9192 .9521 .9665 TMN .9516 .9992 1.000 .9958 .9962 .9838 .9078 .9197 .9536 .9669 LI .9495 .9957 .9958 1.000 .9985 .9850 .9078 .9209 .9473 .9658 TLI .9494 .9951 .9962 .9985 1.000 .9890 .9083 .9207 .9476 .9659 EQP .9391 .9816 .9838 .9850 .9890 1.000 .8993 .9132 .9331 .9510 OLS .9775 .9069 .9078 .9078 .9083 .8993 1.000 .9947 .9279 .9433 TLS .9799 .9192 .9197 .9209 .9207 .9132 .9947 1.000 .9268 .9516 RAS .9820 .9521 .9536 .9473 .9476 .9331 .9279 .9268 1.000 .9681 PCM .9778 .9665 .9669 .9658 .9659 .9510 .9433 .9516 .9681 1.000 66 3. The maximum distance measure was erratic, reflecting properties of individual papers and not general properties of the adjustment methods. 4. Looking across the three grade levels, the relative proximity of adjusted scores by the various methods were consistent. For example, compared to NO, EQP had the highest RMSD at all three grade levels. At each grade level, the closest method to TLS (other than OLS) was NO. Based on these observations, steps were taken to reduce the volume of data. First, only the truncated versions of the methods were used. Second, correlations were omitted as being redundant with RMSDs. 
Third, maximum difference as a measure was omitted as unreliable. Fourth, RMSDs for the three grades were examined separately and then averaged, with the three grades treated as three replications of the study.

RMSDs Compared

The resulting differences between the methods based on the RMSD measure are graphed in Figures 7, 8, 9, and 10. The graphs are two-dimensional representations of relative distances (RMSDs) which are multi-dimensional, so the distances in the graph are a distortion of the actual RMSDs. The graphs were produced by the SPSS-X procedure ALSCAL (Young, Takane, & Lewyckyj, 1988). In those cases where a method does not appear in the graph, the method is coincident with the truncated version of the method. In Figure 7, for example, OLS (which would have been graphed as "7") is coincident with TLS ("8").

The relative positions of the methods were consistent across the three grade levels and overall. Three methods, EQP (depicted as "6"), TLS ("8"), and RAS ("9"), are graphed as the vertices of a triangle which includes the other methods. EQP, TLS, and RAS consistently had large RMSDs among themselves. In each graph NO ("1") was midway between TLS ("8") and RAS ("9"). The simple linear methods ("2" through "5") were about two-thirds of the way from TLS ("8") to EQP ("6"). PCM ("0") was generally within the triangle, closest to RAS ("9").

[Figure 7: Graph of average RMSDs for the grade 5 data. Figure 8: Graph of average RMSDs for the grade 8 data. Figure 9: Graph of average RMSDs for the grade 11 data. Figure 10: Graph of average RMSDs for the three grades combined. In each plot the methods are labeled 1 = NO, 2 = MN, 3 = TMN, 4 = LI, 5 = TLI, 6 = EQP, 7 = OLS, 8 = TLS, 9 = RAS, and 0 = PCM.]

These graphs demonstrate that the adjustment methods differed in how they adjust scores, but in a systematic way. The relative differences among the methods followed a consistent pattern across the three grade levels of the real-life data set.
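The scaling step behind Figures 7 through 10 can be reproduced approximately from any symmetric matrix of RMSDs. The sketch below uses scikit-learn's MDS in place of the SPSS-X ALSCAL procedure named above, and the tiny three-method distance matrix is hypothetical.

    import numpy as np
    from sklearn.manifold import MDS

    # Hypothetical symmetric RMSD matrix for three methods (e.g., NO, TLS, EQP).
    rmsd = np.array([
        [0.00, 0.17, 0.25],
        [0.17, 0.00, 0.28],
        [0.25, 0.28, 0.00],
    ])

    # Metric MDS on the precomputed dissimilarities yields 2-D coordinates whose
    # pairwise distances approximate the RMSDs, much as in the ALSCAL plots.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(rmsd)
    print(np.round(coords, 2))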
Unfortunately, knowing how the adjustment methods differ gives no additional information about the papers' true scores, nor does it provide a basis for choosing one method over another. Passing Rates Although the actual writing assessment did not involve pass/fail decisions, a similar competency exam in the district requires an average rating of 3.5 for a passing score. The data were analyzed to see what difference the various adjustment methods would have on the number of students who pass or fail using a 3.5 standard. Because of rounding, any student with an average adjusted score of 3.25 or greater is considered to have passed the exam. Table 5 lists the number of papers passing and failing at each grade level using the 3.25 criterion. The table also indicates the number of papers which were "helped" or "hurt” by score adjustment compared to the pass/fail decision with unadjusted scores. A paper is "helped" if it passes when scores are adjusted, but fails when scores 72 Table 5 Pass/Fail Decisions for Adjusted Scores Relative to Unadjusted Scores for the Writing Data Grade 5 Passing Failing Helped Hurt NO 72 52 0 0 MN 71 53 3 4 TMN 71 53 3 4 LI 71 53 3 4 TLI 71 53 3 4 EQP 76 48 8 4 OLS 67 57 0 5 TLS 67 57 0 5 RAS 84 40 13 1 PCM 67 57 0 5 Grade 8 Passing Failing Helped Hurt NO 85 56 0 0 MN 83 58 5 7 TMN 83 58 5 7 LI 82 59 3 6 TLI 82 59 3 6 EQP 87 54 5 3 OLS 85 56 0 0 TLS 85 56 0 0 RAS 81 60 0 4 PCM 85 56 0 0 Grade 11 Passing Failing Helped Hurt NO 71 51 0 0 MN 66 56 5 10 TMN 66 56 5 10 LI 68 54 5 8 TLI 68 54 5 8 EQP 70 52 6 7 OLS 70 52 1 2 TLS 70 52 1 2 RAS 72 50 3 2 PCM 71 51 0 0 73 are unadjusted. A paper is "hurt" if it fails when scores are adjusted, but passes when scores are unadjusted. Mean equating changed 34 decisions, with more papers "hurt" (21) than "helped" (13). Equipercentile equating "helped" 19 papers and "hurt" 14. Linear equating "helped" 11 papers and "hurt" 18 across the three grades. The Rasch extension method "helped" 16 papers and "hurt" 7 overall, but adjustment "helped" more papers in fifth grade (13-1), and "hurt" more in eighth grade (0—4). The least-squares methods "helped" only 1 paper and "hurt" 7. PCM adjustment changed only 5 pass/fail decisions, all of which "hurt" fifth grade papers. Not surprisingly, whether truncation was used made no difference on any decision, because truncation only affects extreme scores and not those in the middle of the rating scale near the cutoff point. This analysis demonstrated that the choice of adjustment methods affects the dichotomous pass-fail decisions made from ratings. However, the analysis does not provide a basis for preferring one method over another. Preferring one method over another requires a measure of the correctness of the decisions made from the adjusted scores. (The terms "helped" and "hurt" referred only to the direction a decision changed, and not to the correctness of the decision.) Determining the correctness of any decision requires an external criterion of success. If the school district had another measure of whether students were "competent" writers, then the classifications based on essay scores could be compared to those of the other measure. Without such an external criterion, only relative comparisons of the adjustment methods are possible. 74 Simulated Data The simulated data sets allowed a comparison of adjustment methods in a context where true scores were known. In all, 32 data sets were simulated and analyzed. 
As an example, Table 6 lists the scores that raters assigned for one of the 32 sets. The patterns of rater scoring were as expected from the parameters in the model. Raters 1, 2, and 3 were lenient; raters 4, 5, and 6 were of average stringency; raters 7, 8, and 9 assigned scores averaging less than the overall mean. Raters l, 4, and 7 assigned scores with a large variance; raters 2, 5, and 8 gave scores with a variance similar to the overall score variance; raters 3, 6, and 9 assigned scores mostly to the middle of the scale and had a low score variance. Root Mean Squared Errors (RMSEs) For each simulated data set, the accuracy of the adjustment methods was measured with the RMSE by comparing adjusted scores with the true scores expected from the parameters of the model. Table 7 lists the overall RMSEs for all data sets and for each of the ten adjustment methods. Because only six of the methods were used for the data sets with only one rater per paper, and because those data sets had higher RMSEs than the other data sets, they were analyzed separately. The RMSEs for the remaining 24 data sets were averaged, both overall and by each of the facets which varied across the simulated data sets. The results of this analysis are reported in Table 8. 75 Table 6 Scoring Frequencies for One Simulated Data Set (2+551)* 0 1 2 3 4 5 N Mean SD Overall 48 72 171 252 218 239 1000 3.2370 1.4173 Rater 1 8 4 10 9 14 63 108 3.9074 1.5959 2 2 10 11 20 26 36 105 3.5810 1.3924 3 0 4 15 49 45 13 126 3.3810 0.9331 4 10 7 19 17 15 48 116 3.4138 1.6716 5 5 11 19 32 33 28 128 3.2578 1.3821 6 l 6 18 50 30 2 107 3.0093 0.9120 7 12 8 17 11 10 26 84 2.9167 1.7942 8 10 14 20 27 20 21 112 2.8571 1.5461 9 0 8 42 37 25 2 114 2.7456 0.9348 * This set had 500 papers, a 5~point scale, normally distributed paper quality parameters, and 2 raters per paper with rescoring by a third rater if scores were discrepant. 
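The accuracy summaries that follow rest on the three statistics defined in Chapter Three: the RMSE of the adjusted scores against the model's expected true scores, the Pearson correlation, and the maximum discrepancy, with Fisher's Z used before averaging correlations. A minimal sketch of those computations, with hypothetical scores, is given below; it is illustrative only and not the analysis programs from Appendix A.

    import math

    def rmse(adjusted, true):
        return math.sqrt(sum((a - t) ** 2 for a, t in zip(adjusted, true)) / len(true))

    def max_discrepancy(adjusted, true):
        return max(abs(a - t) for a, t in zip(adjusted, true))

    def pearson_r(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sx = math.sqrt(sum((v - mx) ** 2 for v in x) / n)
        sy = math.sqrt(sum((v - my) ** 2 for v in y) / n)
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy * n)

    def fisher_z(r):
        # Reduces the ceiling effect near 1.0 before correlations are averaged.
        return 0.5 * math.log((1 + r) / (1 - r))

    # Hypothetical adjusted and true scores for a handful of papers.
    adjusted = [3.2, 4.1, 2.8, 3.9, 4.6]
    true = [3.0, 4.0, 3.0, 4.0, 5.0]
    print(round(rmse(adjusted, true), 2),
          round(max_discrepancy(adjusted, true), 2),
          round(pearson_r(adjusted, true), 3),
          round(fisher_z(pearson_r(adjusted, true)), 2))   # 0.23 0.4 0.964 2.0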
Overall RMSEs for the 76 Table 7 Simulated Data Data Set* NO MN TMN LI TLI EQP OLS TLS RAS PCM 1151 0.97 0.83 0.77 0.89 0.79 0.87 1152 0.94 0.86 0.83 0.93 0.85 1.08 1153 0.85 0.85 0.82 0.94 0.90 0.91 1191 1.74 1.51 1.40 1.63 1.51 1.39 1192 1.69 1.57 1.50 1.61 1.49 1.51 1193 1.62 1.42 1.39 1.61 1.57 1.37 1551 0.91 0.80 0.76 0.82 0.77 0.85 1591 1.85 1.66 1.61 1.60 1.53 1.53 2151 0.62 0.58 0.55 0.55 0.51 0.63 0.64 0.61 0.69 0.56 2152 0.74 0.69 0.66 0.73 0.67 0.73 0.87 0.78 0.76 0.69 2153 0.62 0.58 0.57 0.66 0.62 0.64 0.65 0.61 0.66 0.64 2191 1.40 1.40 1.34 1.39 1.33 1.22 1.53 1.50 1.47 1.34 2192 1.36 1.34 1.29 1.35 1.27 1.18 1.53 1.46 1.43 1.38 2193 1.10 1.04 1.03 1.04 0.95 1.00 1.06 1.04 1.19 1.21 2551 0.72 0.65 0.63 0.65 0.61 0.64 0.77 0.74 0.71 0.61 2591 1.28 1.20 1.16 1.17 1.12 1.04 1.30 1.28 1.28 1.11 2+151 0.60 0.55 0.54 0.54 0.53 0.62 0.60 0.59 0.60 0.58 2+152 0.82 0.67 0.64 0.66 0.64 0.72 0.87 0.84 0.78 0.81 2+153 0.63 0.54 0.52 0.53 0.50 0.57 0.67 0.64 0.62 0.62 2+191 1.37 1.27 1.24 1.22 1.18 1.10 1.41 1.39 1.34 1.28 2+192 1.51 1.17 1.14 1.23 1.15 1.05 1.53 1.51 1.52 1.39 2+193 1.46 1.25 1.22 1.21 1.18 1.21 1.51 1.48 1.43 1.37 2+551 0.74 0.64 0.62 0.64 0.61 0.65 0.74 0.73 0.73 0.67 2+591 1.40 1.23 1.21 1.21 1.17 1.13 1.44 1.42 1.36 1.26 3151 0.51 0.42 0.40 0.49 0.43 0.55 0.57 0.51 0.52 0.41 3152 0.55 0.49 0.49 0.54 0.52 0.54 0.58 0.57 0.54 0.50 3153 0.50 0.43 0.42 0.46 0.42 0.48 0.53 0.49 0.52 0.43 3191 1.01 0.95 0.93 0.97 0.93 0.84 1.03 1.03 1.05 0.97 3192 1.22 1.14 1.10 1.17 1.13 0.94 1.22 1.21 1.20 1.11 3193 1.13 1.04 1.02 1.08 1.02 0.89 1.10 1.09 1.14 0.90 3551 0.53 0.48 0.46 0.49 0.46 0.50 0.56 0.54 0.53 0.44 3591 1.13 1.04 1.02 1.06 1.02 0.93 1.16 1.15 1.09 0.94 * Key to Digits in Data Set Number First Digit: Number of raters per paper; 2+ means two raters with rescoring if scores were discrepant. Second Digit: Third Digit: Fourth Digit: Number of hundred papers scored Skew of paper quality distribution; 1 is normal, positively skewed, and 3 is negatively skewed. (100 or 500). Number of points in the rating scale (0-5 or 0-9). 2 is 77 Table 8 Overall RMSEs Averaged by Facet (l-Rater Data Sets Omitted) Total NO MN 0.96 0.87 TMN ALL 0.84 Number of Raters Per Paper NO MN TMN 2 0.98 0.93 0.90 2+ 1.07 0.91 0.89 3 0.82 0.75 0.73 Number of Papers NO MN TMN 100 0.95 0.86 0.84 500 0.97 0.87 0.85 Number of RatingiScale Points NO MN TMN 0-5 0.63 0.56 0.54 0—9 1.28 1.17 1.14 Paper Quality Distribution NO MN TMN Normal 0.94 0.87 0.84 Skew + 1.03 0.92 0.89 Skew - 0.91 0.82 0.80 LI TLI EQP OLS .88 0.83 0.82 0.99 LI TLI EQP OLS 0.94 0.89 0.88 1.04 0.91 0.87 0.88 1.10 0.78 0.74 0.71 0.84 LI TLI EQP OLS 0.88 0.83 0.83 0.99 0.87 0.83 0.81 0.99 LI TLI EQP OLS .58 0.54 0.61 0.67 .17 1.12 1.04 1.32 LI TLI EQP OLS 0.86 0.82 0.82 0.98 0.95 0.90 0.86 1.10 0.83 0.78 0.80 0.92 TLS 0.97 TLS 1.00 1.07 0.82 TLS 0.96 0.98 TLS 0.64 1.29 TLS 0.96 1.06 0.89 0.96 1.02 1.05 0.82 0.97 0.95 0.64 1.29 0.95 1.04 0.93 PCM 0.88 PCM 0.94 1.00 0.71 PCM 0.90 0.84 PCM 0.58 1.19 PCM 0.85 0.98 0.86 78 Overall EQP was the method with the lowest overall average RMSE (.82). TLI and TMN had only slightly higher average RMSEs (TLI-—.83; TMN--.84). MN (.87), LI (.88), and PCM (.88) still represented improvements over NO (.96). The three matrix methods (RAS--.96, TLS~-.97, and OLS——.99) did not do as well as no equating (NO). Overall averages are somewhat misleading because different methods worked better under different scoring conditions. Each facet of the simulated data was considered separately. 
Number of Raters Per Paper

Not surprisingly, scoring with three raters per paper resulted in lower RMSEs than scoring with two raters per paper. More surprising was that scoring with two raters without rescoring discrepant cases had lower RMSEs than when discrepant scores were rescored by a third rater. This result contradicts the traditional view that rescoring in discrepant cases produces more accurate scores. With three raters per paper, PCM and EQP had the lowest RMSEs (both .71), while with two raters per paper, EQP and TLI had the lowest RMSEs (both roughly .88). PCM did notably worse with only two raters per paper than with three. OLS, TLS, and RAS continued to perform worse than NO. MN, TMN, and LI had only slightly higher RMSEs than TLI.

Number of Papers Scored

Most methods had the same average RMSE whether averaging over 100-paper sets or 500-paper sets. The one exception was PCM, which had an average RMSE of .90 on 100-paper sets and .84 on 500-paper sets, indicating that PCM was more accurate with large data sets than with smaller ones. All other methods performed about as well on either size of data set as they did overall.

Number of Rating Scale Points

RMSEs were averaged over both 0-5 rating scales and 0-9 rating scales. As would be expected, the RMSEs for the 0-9 scales were almost twice as great as those for 0-5 rating scales. EQP performed much better on the data sets with more scale points. On 0-9 scales, EQP had the lowest RMSE by far (1.04, compared to TLI's 1.12 and NO's 1.28), while on 0-5 scales EQP did only slightly better than NO (.61 vs. .63), and TLI did the best (.54). Thus the overall advantage of EQP is due primarily to its advantage on 0-9 data sets. Note also that because the 0-9 data sets had greater variance in RMSE, they were weighted more heavily when RMSEs were averaged across sets. A supplemental analysis indicated that if the RMSEs were converted to T-scores before averaging (so all data sets were weighted equally), then TLI would have the lowest average RMSE overall.

Paper Quality Distribution

Simulated data sets varied in their distribution of paper quality parameters. Positively skewed data sets were generated to have a greater number of low quality papers; negatively skewed data sets were generated to have a greater number of high quality papers. The relative effectiveness of the ten methods with each shape of distribution was generally the same as their relative effectiveness overall. The negatively skewed data sets had the lowest RMSEs and the positively skewed data sets had the highest RMSEs across the methods. This trend favoring negatively skewed sets is likely due to ceiling effects. Papers with extreme (either high or low) true scores are more accurately rated because scoring errors can only occur in one direction. The data sets were simulated with more high scores than low scores (like the real data sets), so the extreme scores occur mostly at the high end. In this simulation the negatively skewed data sets had relatively more extreme scores than did the positively skewed data sets and were scored more accurately.

Rater Type

Besides the overall RMSE computed for each data set and method with average adjusted scores, RMSEs were also computed for each of the nine raters for the 32 data sets and 10 methods. Rather than consider each rater separately, raters are grouped into sets of three and rater types are analyzed for differences in RMSE.
One analysis is based on the level of the ratings (lenient, average, or stringent) and the other is based on the spread of the ratings assigned (wide, medium, or narrow). Table 9 lists the RMSEs for lenient, average, and stringent raters, both overall and for each of the scoring facets--number of raters (NR), number of papers (NP), number of scale points (NS), and skewness of the score distribution.

Table 9
RMSEs by Rater Type by Facet

Raters 1,2,3 -- Lenient
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.10  1.12  1.08  1.21  1.12  1.22  1.25  1.15  1.13  0.95
NR=2      1.10  1.18  1.14  1.31  1.19  1.28  1.32  1.16  1.16  0.94
NR=2+     1.03  1.05  1.01  1.07  1.03  1.12  1.08  1.04  1.04  0.97
NR=3      1.17  1.13  1.08  1.24  1.13  1.26  1.34  1.24  1.19  0.93
NP=100    1.10  1.14  1.09  1.23  1.13  1.23  1.26  1.15  1.14  0.95
NP=500    1.10  1.07  1.03  1.15  1.08  1.19  1.20  1.13  1.10  0.93
NS=5      0.86  0.76  0.73  0.85  0.77  0.92  1.00  0.90  0.85  0.74
NS=9      1.34  1.48  1.43  1.57  1.46  1.52  1.50  1.39  1.40  1.15
Normal    1.09  1.11  1.07  1.16  1.09  1.23  1.26  1.16  1.12  0.94
Skew +    1.20  1.25  1.19  1.35  1.28  1.36  1.40  1.28  1.23  1.03
Skew -    1.01  1.01  0.97  1.16  1.00  1.07  1.07  0.99  1.04  0.88

Raters 4,5,6 -- Average
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.17  1.16  1.13  1.16  1.11  1.13  1.25  1.23  1.17  1.02
NR=2      1.22  1.18  1.14  1.19  1.13  1.16  1.38  1.35  1.23  1.08
NR=2+     1.11  1.10  1.07  1.13  1.08  1.12  1.12  1.11  1.11  1.06
NR=3      1.18  1.19  1.17  1.16  1.11  1.12  1.26  1.24  1.16  0.92
NP=100    1.17  1.15  1.12  1.16  1.11  1.13  1.27  1.24  1.18  1.03
NP=500    1.16  1.17  1.15  1.17  1.12  1.14  1.22  1.21  1.12  1.00
NS=5      0.80  0.79  0.77  0.83  0.80  0.84  0.84  0.82  0.80  0.66
NS=9      1.54  1.52  1.48  1.49  1.42  1.42  1.67  1.65  1.54  1.37
Normal    1.14  1.14  1.12  1.15  1.10  1.12  1.23  1.21  1.13  0.99
Skew +    1.26  1.24  1.21  1.19  1.15  1.13  1.40  1.37  1.31  1.11
Skew -    1.13  1.10  1.07  1.15  1.09  1.15  1.16  1.14  1.09  0.99

Raters 7,8,9 -- Stringent
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.54  1.19  1.12  1.14  1.07  1.07  1.55  1.52  1.54  1.43
NR=2      1.58  1.27  1.20  1.22  1.15  1.16  1.61  1.58  1.65  1.45
NR=2+     1.34  1.01  0.95  0.98  0.90  0.98  1.38  1.36  1.30  1.33
NR=3      1.68  1.27  1.20  1.23  1.15  1.08  1.65  1.63  1.67  1.52
NP=100    1.53  1.17  1.11  1.13  1.06  1.08  1.54  1.52  1.55  1.44
NP=500    1.55  1.22  1.15  1.17  1.10  1.06  1.55  1.54  1.53  1.42
NS=5      0.99  0.81  0.76  0.77  0.72  0.76  0.99  0.97  1.03  0.90
NS=9      2.08  1.56  1.48  1.51  1.41  1.38  2.10  2.07  2.05  1.97
Normal    1.53  1.20  1.13  1.16  1.09  1.07  1.54  1.52  1.52  1.41
Skew +    1.59  1.12  1.05  1.16  1.06  1.10  1.66  1.63  1.51  1.53
Skew -    1.49  1.22  1.15  1.09  1.04  1.05  1.44  1.40  1.61  1.38

Overall, five methods (NO, OLS, TLS, RAS, and PCM) had greater RMSEs with the scores of stringent raters than with those of lenient raters. The other five methods (MN, TMN, LI, TLI, and EQP) did roughly the same or slightly better with stringent raters' scores than with lenient raters' scores. In particular, EQP had the least RMSE of the ten methods with stringent raters but ranked ninth with lenient raters. In contrast, PCM was the best method with lenient raters and average raters but ranked only sixth with stringent raters.

The number of raters who scored each paper made little difference in the RMSEs of individual raters, except that the raters' RMSEs were slightly less in those data sets where discrepant raters were removed. The individual raters involved in the sets with rescoring were more accurate than in sets without rescoring. But recall that the average adjusted scores on papers were slightly more accurate without rescoring by a third rater. In general, none of the facets of scoring appears to interact with the relative leniency of the raters.
One minor exception is that stringent raters' scores after adjustment by MN, TMN, LI, TLI, or EQP have slightly lower RMSEs for positively skewed data sets than for negatively skewed data sets. In all other cases the negatively skewed RMSE was lower because of ceiling effects, but here the situation is reversed due to floor effects when stringent raters score low-quality papers.

Table 10 analyzes RMSEs by rater score variance. Generally the RMSEs are greater for raters with high score variance than for raters with low score variance. Three methods--LI, TLI, and EQP--have much lower RMSEs for high variance raters than the other methods and slightly higher RMSEs for low variance raters than the other seven methods. PCM has the least RMSE of the ten methods for low variance raters and medium variance raters but ranks sixth for high variance raters.

Table 10
RMSEs by Rater Spread by Facet

Raters 1,4,7 -- High Score Variance
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.56  1.40  1.30  1.14  1.09  1.17  1.66  1.55  1.58  1.41
NR=2      1.63  1.49  1.38  1.22  1.16  1.22  1.77  1.63  1.66  1.48
NR=2+     1.35  1.20  1.11  0.97  0.93  1.05  1.45  1.38  1.32  1.32
NR=3      1.70  1.51  1.40  1.24  1.17  1.24  1.75  1.65  1.77  1.44
NP=100    1.54  1.38  1.26  1.13  1.07  1.16  1.66  1.54  1.56  1.41
NP=500    1.61  1.47  1.39  1.16  1.15  1.20  1.65  1.59  1.66  1.42
NS=5      1.07  0.96  0.88  0.78  0.75  0.87  1.19  1.09  1.08  0.94
NS=9      2.05  1.84  1.71  1.50  1.42  1.47  2.12  2.02  2.08  1.88
Normal    1.56  1.41  1.31  1.12  1.09  1.17  1.63  1.54  1.60  1.39
Skew +    1.80  1.49  1.35  1.22  1.15  1.22  1.98  1.83  1.81  1.65
Skew -    1.31  1.30  1.21  1.12  1.01  1.13  1.40  1.30  1.33  1.22

Raters 2,5,8 -- Medium Score Variance
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.20  1.08  1.05  1.14  1.09  1.13  1.28  1.24  1.22  1.02
NR=2      1.21  1.14  1.11  1.25  1.19  1.20  1.36  1.29  1.31  1.03
NR=2+     1.13  1.00  0.97  1.04  1.00  1.06  1.14  1.12  1.14  1.06
NR=3      1.25  1.09  1.05  1.14  1.09  1.11  1.34  1.30  1.23  0.99
NP=100    1.21  1.09  1.06  1.16  1.11  1.14  1.30  1.25  1.26  1.04
NP=500    1.15  1.04  1.00  1.10  1.05  1.09  1.23  1.20  1.10  0.98
NS=5      0.83  0.73  0.70  0.77  0.72  0.77  0.86  0.83  0.83  0.70
NS=9      1.57  1.43  1.39  1.52  1.46  1.48  1.70  1.64  1.61  1.35
Normal    1.18  1.08  1.05  1.13  1.08  1.12  1.31  1.25  1.17  1.02
Skew +    1.24  1.07  1.04  1.17  1.14  1.16  1.38  1.35  1.21  1.09
Skew -    1.19  1.09  1.05  1.14  1.08  1.11  1.13  1.10  1.33  0.97

Raters 3,6,9 -- Low Score Variance
          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       1.05  0.98  0.98  1.22  1.11  1.13  1.11  1.11  1.03  0.96
NR=2      1.06  0.99  0.99  1.27  1.13  1.17  1.17  1.17  1.08  0.96
NR=2+     1.01  0.96  0.96  1.17  1.08  1.11  1.00  1.00  0.99  0.98
NR=3      1.07  0.99  0.99  1.24  1.13  1.10  1.16  1.16  1.02  0.94
NP=100    1.05  0.99  0.99  1.23  1.11  1.14  1.12  1.12  1.05  0.97
NP=500    1.04  0.95  0.95  1.22  1.11  1.09  1.09  1.09  0.99  0.94
NS=5      0.75  0.68  0.68  0.90  0.82  0.87  0.77  0.77  0.77  0.66
NS=9      1.34  1.29  1.29  1.55  1.41  1.38  1.45  1.44  1.30  1.26
Normal    1.03  0.96  0.96  1.22  1.11  1.13  1.10  1.10  1.01  0.94
Skew +    1.01  1.06  1.06  1.31  1.20  1.22  1.11  1.11  1.03  0.92
Skew -    1.12  0.94  0.94  1.14  1.03  1.02  1.13  1.13  1.08  1.05

Comparing the facets of scoring, the number of raters per paper did not interact with rater score variance. The number of papers scored made a slight difference, with high variance raters more accurate with 100-paper sets than with 500-paper sets and low variance raters more accurate with the larger data sets. This result is likely artifactual, because the 500-paper sets were all normally distributed while the 100-paper sets were either normally distributed, positively skewed, or negatively skewed. The number of rating scale points did not interact with rater score variance.
The skewness of the distributions made some difference, with high score variance raters doing much worse on positively skewed data sets than on negatively skewed data sets, and with low score variance raters doing about the same on all distribution shapes.

Correlations

Another measure of the relative effectiveness of the ten adjustment methods was the Pearson product-moment correlation. As with RMSE, the data sets with only one rater per paper are analyzed separately, because four of the methods were not used on one-rater data sets. The correlation between average adjusted scores and true scores was computed for 24 data sets and for each of the ten adjustment methods. Before averaging across data sets, the correlations were converted to Z-scores using the Fisher's Z transformation: Z = .5 * ln[(1 + r)/(1 - r)]. Because correlations cannot exceed 1.0, this transformation makes the distribution of correlations more nearly normal and reduces the ceiling effect. The Z-values are averaged overall and by each facet, then converted back to correlations (r = [exp(2Z) - 1]/[exp(2Z) + 1]). Table 11 reports these average correlations and their associated Z-scores.

Table 11
Correlations and Z-Transformed Correlations for the Simulated Data Sets
(Average correlations between adjusted scores and true scores for each method--NO, MN, TMN, LI, TLI, EQP, OLS, TLS, RAS, and PCM--reported overall and by facet: NR=2, NR=2+, NR=3, NP=100, NP=500, NS=5, NS=9, SK=1, SK=2, SK=3.)

Of all the methods, TLI had the highest correlation with true scores, r = .92 (Z = 1.59). In rank order, LI (.92), PCM (.91), EQP (.91), TMN (.91), and MN (.90) all had higher correlations than NO (.89). RAS, TLS, and OLS all had correlations of .88. This order agrees closely with the overall order from the RMSE measure. Recall that EQP had a lower RMSE than TLI, but largely because of its superior performance on 9-point scales, which were weighted more heavily than 5-point scales. With correlations all data sets are weighted equally, and TLI outperformed EQP with equal weighting. In general, though, correlation and RMSE agreed closely, so an analysis of all the facets of the scoring situation with correlations revealed nothing beyond the RMSE analysis. Consequently, correlations for individual raters were not analyzed.

Maximum Difference

Another way to compare the relative effectiveness of the ten adjustment methods is with the maximum difference measure, which is the greatest difference between average adjusted score and true score over all the papers in a data set; it measures the magnitude of error in the worst case in any data set. Table 12 lists the maximum differences separately for all of the 32 data sets and for each of the 10 methods. Table 13 reports averages of these maximum errors across data sets, both overall and for each of the scoring facets.
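In the notation used earlier (x-bar_i the average adjusted score and T_i the true score of paper i, with n papers in the set), the measure is simply

\[
\mathrm{MaxDiff} \;=\; \max_{1 \le i \le n}\,\bigl|\bar{x}_i - T_i\bigr|.
\]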
Table 12
Maximum Score Difference (Observed - True) for All Simulated Data Sets and Each Method

Data Set*   NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
1151        2.39  2.22  2.00  2.28  2.28  2.99
1152        2.59  2.26  2.26  2.93  2.93  3.50
1153        2.06  2.32  2.32  2.80  2.80  2.77
1191        5.02  4.29  4.29  5.52  4.98  3.83
1192        4.19  4.23  3.86  4.99  4.10  4.10
1193        6.18  4.09  4.09  5.82  5.75  3.96
1551        3.41  2.93  2.89  3.13  2.97  3.20
1591        7.74  6.94  6.94  6.60  6.60  6.88
2151        2.13  1.65  1.65  1.34  1.34  1.71  2.10  2.00  2.13  1.99
2152        2.18  2.15  1.83  2.25  1.67  2.04  2.55  2.39  1.85  1.99
2153        2.00  2.25  2.25  2.20  2.20  2.37  2.15  2.15  2.19  1.69
2191        3.80  3.51  3.51  3.62  3.62  3.91  4.11  4.11  3.96  3.59
2192        4.89  6.31  5.40  6.55  5.89  5.85  4.29  4.07  5.36  5.25
2193        3.78  3.46  3.46  4.00  3.04  2.64  4.10  4.10  3.78  4.20
2551        3.65  3.43  3.40  2.90  2.90  2.56  3.80  3.65  3.65  3.46
2591        5.04  4.49  4.49  3.89  3.89  3.70  5.19  5.01  5.13  3.97
2+151       1.50  1.28  1.28  1.19  1.19  1.50  1.53  1.50  1.50  1.41
2+152       1.86  1.95  1.57  1.83  1.71  1.86  1.96  1.91  1.88  1.91
2+153       2.09  2.38  2.38  2.41  2.41  2.24  2.09  1.95  2.22  2.43
2+191       3.57  3.41  3.41  3.05  3.05  2.97  3.66  3.66  3.44  3.52
2+192       3.90  2.98  2.98  4.20  3.08  3.03  4.06  3.89  3.89  3.59
2+193       4.25  3.97  3.97  4.34  4.34  4.69  4.76  4.76  3.96  5.31
2+551       2.45  2.04  2.04  2.14  2.14  2.09  2.43  2.43  2.44  2.23
2+591       6.10  5.80  5.80  4.69  4.69  4.07  6.32  6.16  6.10  5.41
3151        1.73  1.26  1.17  1.52  1.33  1.89  1.91  1.71  1.74  1.24
3152        2.01  1.51  1.51  1.47  1.35  1.51  2.01  2.01  1.97  1.40
3153        1.41  1.26  1.26  1.55  1.55  1.49  1.51  1.51  1.62  1.36
3191        2.86  2.70  2.70  2.66  2.66  2.37  3.04  3.04  3.39  2.57
3192        3.20  2.48  2.48  2.56  2.56  2.57  3.41  3.41  2.85  2.67
3193        3.15  3.10  3.08  3.45  3.19  3.10  3.20  3.19  3.12  3.42
3551        2.70  2.31  2.31  1.98  1.98  1.99  2.70  2.70  2.70  1.92
3591        4.09  4.03  3.66  3.73  3.53  3.35  3.98  3.98  4.14  3.14

* Key to Digits in Data Set Number
  First Digit:  Number of raters per paper; 2+ means two raters with rescoring if scores were discrepant.
  Second Digit: Number of hundred papers scored (100 or 500).
  Third Digit:  Number of points in the rating scale (0-5 or 0-9).
  Fourth Digit: Skew of paper quality distribution; 1 is normal, 2 is positively skewed, and 3 is negatively skewed.

Table 13
Maximum Differences Averaged Across Data Sets (1-Rater Data Sets Omitted)

          NO    MN    TMN   LI    TLI   EQP   OLS   TLS   RAS   PCM
ALL       3.10  2.90  2.82  2.90  2.72  2.73  3.20  3.14  3.13  2.90
NR=2      3.43  3.41  3.25  3.34  3.07  3.10  3.53  3.43  3.51  3.27
NR=2+     3.21  2.98  2.93  2.98  2.83  2.81  3.35  3.28  3.18  3.23
NR=3      2.64  2.33  2.27  2.37  2.27  2.28  2.72  2.69  2.69  2.21
NP=100    2.79  2.64  2.55  2.79  2.57  2.65  2.91  2.85  2.82  2.75
NP=500    4.00  3.68  3.62  3.22  3.19  2.96  4.07  3.99  4.03  3.35
NS=5      2.14  1.96  1.89  1.90  1.81  1.94  2.23  2.16  2.16  1.92
NS=9      4.05  3.85  3.74  3.90  3.63  3.52  4.18  4.12  4.09  3.89
SK=1      3.30  2.99  2.95  2.73  2.69  2.68  3.40  3.33  3.36  2.87
SK=2      3.01  2.90  2.63  3.14  2.71  2.81  3.05  2.95  2.97  2.80
SK=3      2.78  2.74  2.73  2.99  2.79  2.75  2.97  2.94  2.81  3.07

Overall, TLI had the least average maximum difference (2.72), followed closely by EQP (2.73). Next were TMN (2.82) and PCM, MN, and LI (all at 2.90). The matrix methods--RAS (3.13), TLS (3.14), and OLS (3.20)--continued to do slightly worse than no equating (3.10). Note that the relative order of the methods based on the maximum difference measure is almost identical to the order from the other two measures. As with the RMSE measure, EQP's slight edge on 9-point data sets was weighted too heavily in the overall average. Comparing the number of raters, Table 13 indicates that two raters with rescoring was more effective in the worst cases than two raters without rescoring.
Of course, the average score of three raters tends to be closer to the true score than the average score of two raters, and the average maximum difference with three raters was less than with two for each of the ten methods. With two raters, TLI had the least average maximum difference of any method (3.07). With two raters and rescoring, EQP had the least average maximum difference (2.81). With three raters PCM did the best, with an average maximum difference of 2.21.

The maximum error was greater with 500-paper data sets than with 100-paper sets. The worst case of a 500-paper set tends to be worse than the worst case of a 100-paper set, even though the average case is expected to be identical. Based on the maximum difference measure, TMN (2.55) and TLI (2.57) performed the best of any method on the 100-paper sets, and EQP (2.96) did best on the 500-paper sets. The magnitude of these differences suggests that even with adjustment of scores, an occasional paper can receive an average observed score that is up to 3 points different from its true score on a 9-point scale.

The maximum error was greater with 9-point scales than with 5-point scales, clearly because of the greater possible range of scores. TLI had the least average maximum difference of any method for 5-point scales (1.81), and EQP was the best method with 9-point scales (3.52).

In terms of skewness of the paper distribution, no clear trend emerged with regard to the maximum difference measure. NO, MN, OLS, TLS, and RAS had their lowest maximum differences with negatively skewed distributions, TMN and PCM did best with positively skewed distributions, and LI, TLI, and EQP did best with normal distributions. Comparing the methods, TMN had the least average maximum difference for both positively and negatively skewed data sets (2.63 and 2.73), and EQP did the best of any method with normally distributed data sets (2.68).

One-Rater Data Sets

The one-rater data sets were analyzed with only six of the methods (NO, MN, TMN, LI, TLI, and EQP), and the measures were generally worse than with two or three raters. The results from those sets were analyzed separately. RMSEs for the one-rater sets are contained in Table 7 and averaged by facet in Table 14.

Table 14
Average RMSEs for the One-Rater Simulated Data Sets

                    NO     MN     TMN    LI     TLI    EQP
All                 1.320  1.187  1.132  1.251  1.175  1.189

Number of Papers
100                 1.301  1.174  1.116  1.264  1.185  1.189
500                 1.377  1.227  1.181  1.212  1.146  1.188

Number of Rating Scale Points
0-5                 0.916  0.836  0.794  0.888  0.826  0.928
0-9                 1.724  1.539  1.471  1.613  1.524  1.449

Paper Quality Distribution
Normal              1.364  1.198  1.133  1.230  1.149  1.159
Skew +              1.313  1.216  1.162  1.269  1.170  1.295
Skew -              1.239  1.136  1.102  1.274  1.233  1.142

For all one-rater sets, TMN had the least RMSE of the six adjustment methods (1.13). Next were TLI (1.18), MN (1.19), and EQP (1.19). LI (1.25) did only slightly better than NO (1.32). On the 100-paper sets with one rater, TMN performed best (1.12), while TLI did best on the sets with 500 papers (1.15). EQP continued its dominance on 9-point rating scales (1.45), while TMN did best on 5-point scales (.79). The skewness of the paper quality distribution made little difference, as TMN outperformed the other methods on all three distribution shapes. Correlations and maximum differences were not analyzed for the one-rater data sets, because the earlier analyses suggested that those measures provided little new information beyond the RMSEs.
CHAPTER FIVE

DISCUSSION

Comparing the adjustment methods on the writing assessment data was inconclusive, because there was no external criterion for deciding which method was superior. Comparing the adjustment methods on the simulated data was also inconclusive, because there is no certainty that the PCM used in simulating the data adequately models what happens in real-life rating situations, and this may have affected the comparative results. Despite these two limitations, this study answered some questions and raised others. The following sections summarize the results by method; recommendations for future study and current practice are then given.

Summary by Method

No method worked best under all scoring conditions. This section describes the situations under which each of the methods was relatively more or less effective than the others.

No equating (NO)

In the simulated data sets, using the scores that raters actually assigned, without adjustment, did not reproduce true scores as well as most of the other equating methods. For the simulated data sets, the average difference in RMSE between the best method (TLI) and no equating was only .09 points for 5-point data sets. For the real data, each of the other equating methods adjusted scores by at least .14 points compared to no equating. It could be argued that the increment in accuracy from using adjustment methods is not great enough to warrant their use. But when that small average difference is multiplied many times, as it is applied to the hundreds of decisions made about individuals based on ratings, the case for some type of equating seems more persuasive. Although no equating is the most common approach used in education and it is fairly accurate, it may not be accurate enough for the high-stakes individual decisions that are often made in large-scale assessment programs. While the present study focused only on writing assessment, the results are equally applicable to other forms of large-scale performance assessment being considered by some states.

Mean Equating (MN) and Truncated Mean Equating (TMN)

Simply adjusting scores up or down to compensate for the stringency of the particular raters who score a paper makes sense and is relatively easy to do. Truncating scores is also desirable to maintain the outer limits of a rating scale. TMN always matched true scores better than MN. TMN often did better than any other equating method, especially on one-rater data sets. (Scoring with only one rater is not a recommended practice, and the RMSEs were much larger with one-rater sets than with two-rater sets even after adjustment.) While TMN does not take full advantage of all available information in estimating true scores, it does well even with large data sets. However, with small data sets, or if papers are not randomly assigned, rater stringency can be confounded with real differences in paper quality. The TMN method has the advantages of easy computation, political defensibility, and a fairly high level of accuracy.

Linear Equating (LI) and Truncated Linear Equating (TLI)

As with mean equating, truncation always represented an improvement over no truncation. In the simulated data, TLI was likely the most accurate method overall. By compensating for both the level and the spread of rater scoring, TLI does a better job in theory than TMN. On small data sets, where only a few scores are available for some raters, the estimates of rater scoring variance are unstable and linear equating may not work as well as mean equating.
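As a point of reference, both adjustments reduce to a few lines of code. The sketch below is not one of the appendix programs, but it restates the formulas used in EZEQ.BAS (Appendix A): it adjusts a single rating sc assigned by rater j, assuming the rater means MN(), the rater standard deviations SD(), the overall values in element 0, and the scale maximum MP have already been computed from randomly assigned papers.

' Minimal sketch of the TMN and TLI adjustments for one rating.
' MN(j) and SD(j) are rater j's mean and SD of assigned scores;
' MN(0) and SD(0) are the overall mean and SD; MP is the scale maximum.
DECLARE FUNCTION TruncMean (sc, j)
DECLARE FUNCTION TruncLinear (sc, j)
DIM SHARED MN(9), SD(9), MP

FUNCTION TruncMean (sc, j)
    a = sc - MN(j) + MN(0)                          ' shift rater j to the overall mean
    IF a > MP THEN a = MP ELSE IF a < 0 THEN a = 0  ' truncate to the scale limits
    TruncMean = a
END FUNCTION

FUNCTION TruncLinear (sc, j)
    a = (sc - MN(j)) / SD(j) * SD(0) + MN(0)        ' match both level and spread
    IF a > MP THEN a = MP ELSE IF a < 0 THEN a = 0  ' truncate to the scale limits
    TruncLinear = a
END FUNCTION

With only one or two scores from a rater, SD(j) in the second form is poorly estimated, which is one reason linear equating needs more papers per rater than mean equating.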
In data sets where each rater scored at least 20 papers, linear equating did well. A disadvantage of TLI and TMN is the assumption of equally spaced intervals on the rating scale. Methods such as EQP and PCM, which assume only an ordinal scale rather than an interval scale, should work better when raters do not assign scores as if the units were equally spaced. In general, though, TLI is a good method because it is easy to compute, it is not too hard to explain, and it proved accurate on the simulated data in this study.

Equipercentile equating (EQP)

In the simulated data sets, the advantage of EQP was in those sets with 0-9 rating scales. The difference in EQP's accuracy on 5-point scales compared to 9-point scales suggests that the adjustments EQP gave with the 0-3 scales in the real data sets were inaccurate. It may not be possible to make nine distinct categories for evaluating writing, but if raters were encouraged to use half-points in those cases when they have trouble deciding between two scores, a 0-5 scale would become a 0-9 scale and EQP would work better. Alternatively, smoothing techniques (connecting the points with curves rather than with line segments) could make EQP a more accurate method for scales with fewer points. Unlike most methods, EQP has the theoretical advantage of assuming only that scale points are ordered categories, without the stronger assumption of equally spaced units.

OLS, TLS, and Rasch Extension

These three methods, the matrix methods, did not do well in the simulated data sets. It was disturbing that these methods did not appear as effective in reproducing true scores as no equating under any of the scoring conditions. A series of diagnostic steps was performed to better understand why these methods did not perform well in this study.

First, the computer program for these methods was checked for "bugs", but the algorithms for matrix multiplication and inversion worked on sample data, and the estimation procedure for rater effects consisted only of a series of those matrix operations. There were no programming errors that would account for the poor performance of those methods.

Second, the de Gruijter (1984) formulas were checked for errors. There was a minor error in one formula in his paper, which was easily corrected, but his general method of estimating rater differences from assigned score differences seemed sound. It was a different estimation method from that used in other studies that showed matrix estimation methods to be effective (e.g., Cason & Cason, 1984; Raymond & Houston, 1990). As was mentioned earlier, though, those methods worked well with small data sets but would have been unwieldy with the large data sets in this study.

Third, it was possible that the PCM used in the simulation produced data that did not fit the linear model assumed by OLS. To test this, an additional data set was simulated based on the linear model. The linear model regards observed scores as true scores plus rater effects plus a random error term: Xij = Ti + Rj + eij, where Xij is the score rater j assigns to paper i, Ti is the paper's true score, Rj is the rater effect, and eij is random error.

Supplemental Analysis

In this data set, paper true scores were normally distributed with mean 3.5 and variance 1. The raters were from a 3 x 3 design in which one dimension was leniency: three lenient raters added .5 points per paper, three raters were average with a rater effect of 0 points, and three raters gave scores .5 points lower than true scores. Each observed score also included an additional error term.
Three raters (one of each leniency level) had random error distributed normally with mean 0 and SD 1.0, three raters had error terms with mean 0 and SD 0.7, and three raters had error terms with mean 0 and SD 0.4. The data set consisted of 100 papers scored by two raters each on a 0-5 scale. Table 15 lists the frequencies of scores assigned by each rater and overall, as well as the RMSE, correlation, and maximum difference comparisons with true scores for each of the ten methods.

Table 15
Supplemental Data Set Simulated From a Linear Model

Score Frequencies
RATER     0    1    2    3    4    5     N     MN     SD
TOTAL     7   14   38   45   57   39   200   3.240  1.338
1         0    2    7    6    4    3    22   2.954  1.186
2         2    2    5    6    8    0    23   2.695  1.266
3         3    5    9    4    3    1    25   2.080  1.293
4         0    1    3    4    5    4    17   3.470  1.193
5         1    0    4    4    6    2    17   3.176  1.247
6         0    2    5    7    8    2    24   3.125  1.092
7         1    0    1    2    6   11    21   4.142  1.245
8         0    2    2    6    7   11    28   3.821  1.226
9         0    0    2    6   10    5    23   3.782   .883

Comparisons of Adjusted Scores with True Scores
          RMSE   CORR   MAXDIF
NO        .580   .879   1.365
MN        .532   .882   1.391
TMN       .519   .886   1.391
LI        .569   .886   1.399
TLI       .542   .889   1.399
EQP       .539   .895   1.515
OLS       .647   .863   1.378
TLS       .634   .861   1.378
RAS       .545   .886   1.476
PCM       .598   .867   1.553

In this data set TMN did the best, followed by MN, EQP, and TLI. RAS had the next lowest RMSE, then LI. All of those methods were more accurate than NO, but PCM, TLS, and OLS did worse than not equating at all. Of greatest interest was the improved performance of RAS, though OLS and TLS continued to perform unacceptably. PCM did not do as well in this data set as in the simulated data sets generated from the PCM.

After this analysis, it was hypothesized that the matrix methods were not effective because of the nature of the rating scale itself. The linear model assumes continuous linear scales, without rounding or ceiling effects. The observed data, though, must be integers from 0 to 5. The rounding and ceiling effects inherent in the categorical scale are assumed to be part of the error term. The simple linear methods (MN, TMN, LI, and TLI) make the same assumptions of a linear and continuous scale, but apparently they are more robust and not as easily affected by violations of the model assumptions.

RAS did better than OLS and TLS on the supplemental data set. RAS was more accurate because it controls for non-linearity at the extremes of the rating scale before estimating rater effects, whereas TLS controls for ceiling effects after doing the estimation, and OLS does not control for these effects at all.

Nothing in this study recommended the matrix methods over the simpler mean or linear equating methods for score adjustment. The matrix methods may be more accurate in small data sets, where differences in rater means are due to sampling error or to real differences in paper quality. The least-squares methods may also reproduce true scores better when scoring is more continuous and less affected by floor or ceiling effects.

Partial Credit Model

Despite being the model used to generate the simulated data, PCM rarely performed better than the simpler models. The PCM had more parameters to be estimated than any other method and by far took the most computer time to make score adjustments. While the other methods took at most three or four minutes, PCM took four hours or more to iteratively estimate parameters and adjust scores. To estimate its parameters accurately, a great deal of data is needed. PCM did the best in situations with many papers (the 500-paper data sets), more raters (three per paper), and few scale points (the 0-5 scales).
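For reference, the model whose parameters must be estimated is Masters' (1982) partial credit model, the same model used to generate true scores in TRUEGEN.BAS (Appendix A). In the notation used here (an assumption of this summary, not the dissertation's own symbols), a paper of quality theta receives score x from rater j, whose step parameters on an m-point scale are delta_j1, ..., delta_jm, with probability

\[
P(X = x \mid \theta) \;=\;
\frac{\exp\!\Bigl(\sum_{k=1}^{x} (\theta - \delta_{jk})\Bigr)}
     {\sum_{h=0}^{m} \exp\!\Bigl(\sum_{k=1}^{h} (\theta - \delta_{jk})\Bigr)},
\qquad x = 0, 1, \ldots, m,
\]

where the empty sum for x = 0 is taken to be zero. Each rater thus contributes one step parameter per scale point, which is one reason stable estimation requires so many ratings from every rater.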
The scoring situations under which PCM did the best are not realistic. In the 3-rater, 500-paper data sets, each simulated rater scored over 150 papers. Rarely in real-life settings do raters score that many papers. Also, real-life data do not fit the PCM as well as the data simulated from the PCM did. In the one set generated from a different model, PCM reproduced true scores worse than no equating. Although PCM proved to be a flexible model for generating the data, as an equating method it required more scores from each rater than are typically available.

Recommendations

This study would have been more conclusive if it could have determined which adjustment methods best reproduce true scores for real-life data sets. This could be done in one of two ways. First, if it could be shown that the model used in the simulation adequately represents what happens when real people score real papers, then the results of the simulated data study would apply to real settings. Second, if a good estimate of true scores for the real-life data were available, then the adjustment methods could be compared in an absolute sense and not just relative to one another. These two improvements suggest directions for future research.

Better Models of Rater Scoring

The simulation in this study was not adequately linked to real-life scoring. If the assumptions of the model used in the simulated data sets are consistent with real scoring, then the method comparisons for the simulated data are valid. But the PCM is only one model for categorical scoring, and it may not be the best model for the context of essay scoring. More study is needed to determine which mathematical model best represents how real-life raters score real-life papers. The PCM assumes that raters assign an initial score of 0 to a paper and then increment its score until they decide (based on their stringency and the paper's quality) that the score should not be any higher. The linear model assumes that the true score of a paper is adjusted both by an overall rater effect and by an error term peculiar to that particular rater-paper combination; it does not model the categorical nature of the rating scale.

One direction of study for developing a better model would be to attempt to understand the cognitive processes raters actually go through when they assign scores. A way to do this is to have raters "think out loud" as they read papers and assign scores. By analyzing rater thinking, one could better understand how raters decide which score category to assign a paper. Such a study might also clarify how raters differ in the extent to which they value various aspects of paper quality. Holistic scoring provides a one-dimensional measure of paper quality, which is clearly a multi-dimensional phenomenon. An adequate model might entail more than one parameter for paper quality.

Better True Score Estimates for Real Data

Another direction for future research is to get better measures of real-life true scores. In this study, applying the adjustment methods to the writing assessment data showed only how the methods compared to one another; adjusted scores could not be compared to true scores. One way to get true score estimates would be to have a set of papers scored by a team of expert raters. Their average score would be an estimate of a paper's true score.
Then the papers would be rescored by two or three raters from a different team, and the adjustment methods could be compared on how closely they adjusted the second team's scores toward the average score of the expert team.

One surprising finding in the study was that two raters with rescoring by a third rater when scores were discrepant proved less accurate than scoring with two raters without resolving discrepant scores. If better real-life true score estimates were available, that finding could be investigated further. It may be that the average of two discrepant scores is a better measure of a paper's quality than the average of two scores only one of which was discrepant. But it may also be that rescoring eliminates an occasional paper with a large scoring error, as the standard rationale for rescoring would suggest. By examining the maximum error statistics, research could be focused on discrepant cases to better understand why unusual scores occur.

Implications for Practice

Other refinements of the adjustment methods used in this study could be developed and investigated. When raters' scores on a paper are combined, instead of using the simple mean as in this study, any adjustment method could use a precision-weighted mean so that accurate raters' scores are weighted more heavily than those of less accurate raters. Equipercentile equating worked well with 9-point scales, and with smoothing techniques it ought to give better results with fewer scale points as well. Equipercentile equating needs further refinement before it can be recommended over the simpler linear methods in general, but it has the theoretical advantage of assuming that scales are ordinal rather than interval measures. More study is needed on the matrix methods to determine why they did not work well in this study but did in other studies. PCM was effective in the larger simulated data sets, but the evidence did not suggest that the method would work well with data based on other scoring models. PCM is also not recommended as an adjustment method because of the excessive computing time it requires and the amount of data necessary to get stable parameter estimates.

Based on this study, score adjustment methods are recommended in those high-stakes contexts where decisions are based on the scores assigned by only some of a team of raters. The sophistication of the adjustment method used should depend on the amount of data available. With fewer than 8 papers per rater or with non-random assignment of papers to raters, no equating should be used, because estimates of raters' mean scores are unstable. With 8 to 20 papers per rater and random assignment of papers to raters, truncated mean equating is recommended because of the instability of the rater variance estimates. With more than 20 papers per rater, truncated linear equating should result in more accurate score estimates than the other methods. By using statistical methods to equate ratings, researchers and decision-makers can get better measures of traits that are hard to measure.

EPILOGUE

Although this study did not provide definitive answers, it does provide some insights into Ivan's situation as described in the introduction. If papers were randomly assigned to raters and raters' mean scores differed significantly, some statistical adjustment of scores would be advisable. The simulated data sets in this study indicated that the simpler linear methods did a good job of reproducing true scores.
There was little evidence that the more sophisticated matrix methods or item response theory methods resulted in improved adjusted scores. Because truncated linear equating compensates for rater differences in both the level and the spread of the scores they assign, and keeps scores within the limits of the rating scale, it is recommended over the other linear methods.

In addition to a post hoc procedure such as truncated linear equating, the district could try to improve the training of raters. Training should make raters as homogeneous in their scoring as possible; the more raters agree, the less statistical adjustment will affect scores. In practice, raters vary in how they assign scores even when they are trained extensively. If individual raters are not consistent in how they assign scores, then no amount of statistical adjustment can make invalid scores valid. But if raters assign scores consistently, rater effects can be controlled by statistical adjustment.

LIST OF REFERENCES

References

Andrich, D. (1978). Application of a psychometric rating model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.

Angoff, W. H. (1971). Norms, scales, and equating. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.

Braun, H. I. (1986). Calibration of essay readers (Final Report) (Program Statistics Research Tech. Rep. No. 86-68). Princeton, NJ: Educational Testing Service. (ERIC Document Reproduction Service No. ED 274 673)

Braun, H. I. (1988). Understanding scoring reliability: Experiments in calibrating essay readers. Journal of Educational Statistics, 13, 1-18.

Breland, H. M. (1983). The direct assessment of writing skill: A measurement review (College Board Report No. 83-6). New York: College Entrance Examination Board. (ERIC Document Reproduction Service No. ED 242 756)

Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: American College Testing Program.

Cason, G. J., & Cason, C. L. (1984). A deterministic theory of clinical performance rating: Promising early results. Evaluation and the Health Professions, 7, 221-247.

Cason, G. J., & Cason, C. L. (1989, April). Rater stringency error in performance rating: A contrast of three models. Paper presented at the annual meeting of the American Educational Research Association, San Francisco. (ERIC Document Reproduction Service No. ED 306 254)

Choppin, B. H. (1982). The use of latent trait models in the measurement of cognitive abilities and skills. In D. Spearritt (Ed.), The improvement of measurement in education and psychology. Melbourne: Australian Council for Educational Research.

Coffman, W. E. (1971). Essay examinations. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 271-302). Washington, DC: American Council on Education.

Constable, E., & Andrich, D. (1984, April). Inter-judge reliability: Is complete agreement among judges the ideal? Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA. (ERIC Document Reproduction Service No. ED 243 962)

Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: John Wiley.

De Gruijter, D. N. M. (1984). Two simple models for rater effects. Applied Psychological Measurement, 8, 213-218.

Denny, G. S. (1989). Calibrating for rater stringency: A comparison of three methods.
Unpublished manuscript, Michigan State University.

Ebel, R. L. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 407-424.

Glass, G. V., & Hopkins, K. D. (1984). Statistical methods in education and psychology (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.

Godshalk, F. I., Swineford, F., & Coffman, W. E. (1966). The measurement of writing ability. New York: College Entrance Examination Board.

Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.

Houston, W., Raymond, M., & Svec, J. (1990, April). Adjustments for rater effects in performance assessment: An empirical investigation. Paper presented at the annual meeting of the American Educational Research Association and the National Council on Measurement in Education, Boston.

Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72-107.

Landy, F. J., & Farr, J. L. (1983). The measurement of work performance: Methods, theory, and applications. New York: Academic Press.

Linacre, J. M. (1987a). An extension of the Rasch model to multi-faceted situations. Chicago: University of Chicago.

Linacre, J. M. (1987b, December). A multi-faceted Rasch measurement model. Paper presented at the Midwest Objective Measurement Seminar, Chicago.

Lunz, M. E., Linacre, J. M., & Wright, B. D. (1988, April). The impact of judge severity on examination scores. Paper presented at the annual convention of the American Educational Research Association, New Orleans.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

Paul, S. R. (1976). Models and estimation procedures for the calibration of examiners. PhD thesis, The University College of Wales, Aberystwyth.

Paul, S. R. (1979). Models and estimation procedures for the calibration of examiners. British Journal of Mathematical and Statistical Psychology, 32, 242-251.

Paul, S. R. (1981). Bayesian methods for calibration of examiners. British Journal of Mathematical and Statistical Psychology, 34, 213-223.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York: Macmillan.

Raymond, M. R., & Houston, W. M. (1990, April). Detecting and correcting for rater effects in performance assessment. Paper presented at the annual meeting of the American Educational Research Association and the National Council on Measurement in Education, Boston.

Stanley, J. C. (1961). Analysis of unreplicated three-way classifications, with applications to rater bias and trait independence. Psychometrika, 26, 205-219.

Weare, J., Moore, J., & Woodall, F. (1987). Interrater reliability: A selected and annotated bibliography of articles concerning interrater reliability. (ERIC Document Reproduction Service No. ED 280 898)

Webb, L. C., Raymond, M. R., & Houston, W. M. (1990, April). Rater stringency and consistency in performance assessment. Paper presented at the annual meeting of the American Educational Research Association and the National Council on Measurement in Education, Boston.

Wherry, R. J. (1950). The control of bias in rating: Survey of the literature (Personnel Research Board Report 898). Washington, DC: Department of the Army, Personnel Research Section.

Wherry, R. J. (1952). The control of bias in rating: A theory of rating (Personnel Research Board Report 922). Washington, DC: Department of the Army, Personnel Research Section.

Wilson, H. G. (1988). Parameter estimation for peer grading under incomplete design.
Educational and Psychological Measurement, 38, 69-81. Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press. Young, F. W., Takane, Y., & Lewyckyj, R. J. (1988). ALSCAL [Computing procedure]. In SPSS-X User's Guide (3rd. ed.) [Computer program manual]. Chicago: SPSS Inc. APPENDICES 109 APPENDIX A PROGRAMS FOR GENERATING AND ANALYZING SIMULATED DATA SETS All programs are written using QuickBasic, version 4.5. ' RATER.BAS ' This is the first of a series of programs to compare ' rater calibration methods. ' This program asks for the values of these parameters: ' NUMBER OF RATERS PER PAPER (1, 2, 3, or 2*) ' NUMBER OF PAPERS (100 or 500) ' NUMBER OF SCALE POINTS (5 or 9) ' PAPER QUALITY DISTRIBUTION (Normal, Positive, or Negative Skew) ' It then chains to the second program PARGEN.BAS, ' carrying over the values of the four parameters in F$ COMMON F$ CLS PRINT PRINT "How many raters per paper"; : INPUT MJ F$ = CHR$(48 + MJ) PRINT PRINT "How many HUNDRED papers rated"; : INPUT NP F$ = F$ + CHR$(48 + NP) PRINT PRINT ”How many scale points"; : INPUT MP F$ = F$ + CHR$(48 + MP) PRINT PRINT "Which type of skew (1: normal, 2: positive, 3: negative)"; : INPUT SKEW F$ = F$ + CHR$(48 + SKEW) CHAIN "pargen.bas" END 110 ’ PARGEN.BAS ' PARGEN is the second program in the series. ' PARGEN follows RATER.BAS and precedes TRUEGEN.BAS. ' PARGEN generates paper parameters and stores them in ' C:\THESIS\SIMDATA\F$\F$.PAP ' where F$=wxyz ' w=MJ (Number of raters per paper, O=two + one) ' x=NP / 100 (Number of hundred papers rated) ' y=MP (Number of scale points) ' z=SKEWNESS (1: Normal 2: Positive 3: Negative) COMMON F$ mj = VAL(MID$(F$, 1, 1)) hp = VAL(MID$(F$, 2, 1)) * 100 mp = VAL(MID$(F$, 3, 1)) skew = VAL(MID$(F$, 4, 1)) ' Initialization q = mj * np * mp * skew RANDOMIZE q P12 = ATN(1) * 8 IF F$ = "" THEN F$ = ”NULL" PAPER$ = "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ”.PAP" OPEN PAPER$ FOR OUTPUT AS #1 CLS ' Parameter Generation PRINT "Paper quality parameters (out of"; np; ") :"; SELECT CASE skew CASE 1 FOR I = 1 TO np LOCATE l, 45: PRINT I PRINT #1, SQR(-2 * LOG(RND)) * SIN(PI2 * RND) * 3 + 2 NEXT I CASE 2 FOR I = 1 TO np LOCATE 1, 45: PRINT I PRINT #1, ABS(SQR(—2 * LOG(RND)) * SIN(PI2 * RND)) * 5 - 3 NEXT I CASE 3 FOR I = 1 TO np LOCATE l, 45: PRINT I PRINT #1, -5 * ABS(SQR(—2 * LOG(RND)) * SIN(PIZ * RND)) + 7 NEXT I END SELECT PRINT 111 PRINT "Paper quality parameters loaded into PRINT : PRINT CLOSE #1 CHAIN "TRUEGEN.BAS" END ' TRUEGEN.BAS file ", PAPER$; ' TRUEGEN is the third program in the series. 
' TRUEGEN follows PARGEN.BAS and precedes OBSGEN.BAS ' TRUEGEN generates rater parameters, and combines them with ' the paper parameters to get true scores for each paper, which are ' stored in ' where F$ is the same as in PARGEN.BAS COMMON F$, J() m3 = VAL(MID$(F$, 1, 1)) np = VAL(MID$(F$, 2, 1)) * 100 mp = VAL(MID$(F$, 3, 1)) skew = VAL(MID$(F$, 4, 1)) P$ = "C:\THESIS\SIMDATA\” + F$ + "\" + F$ + T$ = "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + OPEN T$ FOR OUTPUT AS #2 OPEN P$ FOR INPUT AS #3 ' Generate Rater Parameters J(j,i) nj = 9 DIM J(nj, mp) FOR I = —1 TO 1: FOR J = -1 TO 1 "' 3x3 FOR k = 1 TO mp: L = 3 * I + J + 5 J(L, k) = ((—6 + (k - 1) * 12 / (mp - 1)) * NEXT R, J, I ' Compute true scores DIM e(mp), s(nj) BIG = EXP(20) CLS PRINT "True scores generated (out of"; FOR k 1 TO np LOCATE l, 42: INPUT #3, ab ave 0 FOR I = e(O) - T 0 FOR J 1 TO mp 'Compute values in numerator e(J) e(J - 1) * EXP(ab - J(I, J)) “P; PRINT k; 1 TO nj 1 C:\THESIS\SIMDATA\(F$)\(F$).TRU ".PAP" ".TRU" design, NJ=9 2 J + 2 * I) / 1.5 I!) 'If numbers start getting large, divide all by a large constant 112 IF e(J) > BIG THEN FOR L = 0 TO J: e(L) = e(L) / BIG: NEXT L NEXT J 'Get total for denominator FOR L = 0 TO mp: T = T + e(L): NEXT L 'Get the expected score for rater I s(I) = 0 FOR L = 1 TO mp: s(I) = 8(1) + e(L) / T * L: NEXT L 'Get the average score across all nine raters ave = ave + s(I) NEXT I PRINT #2, ave / nj NEXT k PRINT : PRINT CLOSE CHAIN "OBSGEN.BAS" END ' OBSGEN.BAS OBSGEN is the fourth program in the series. ' OBSGEN follows TRUEGEN.BAS and precedes EZEQ.BAS ' OBSGEN selects raters randomly, and probabilistically generates observed scores from each rater and stores them in ' c:\THESIS\SIMDATA\(F$)\(F$).OBs ' in the format rater#;rating;...;rater#;rating COMMON f$, J() MJ = VAL(MID$(f$, l, 1)) np = VAL(MID$(f$, 2, 1)) * 100 mp = VAL(MID$(f$, 3, 1)) skew = VAL(MID$(f$, 4, 1)) nj = 9 OPEN "C:\THESIS\SIMDATA\" + f$ + "\" + f$ + ".085" FOR OUTPUT AS #2 OPEN "C:\THESIS\SIMDATA\” + f$ + "\" + f$ + ".PAP" FOR INPUT AS #1 q = MJ * up + mp * skew RANDOMIZE q DEF fnr (x) = INT(RND * x) + 1 DIM r(3), s(3) CLS PRINT "Number of observed scores generated (out of"; np; ") :"; ' Randomly select MJ raters SELECT CASE MJ CASE 0 raters = 2 FOR I = 1 TO np CASE 1 CASE 2 113 INPUT #1, ab r(l) = fnr(nj): r(2) = r(l) DO UNTIL r(2) <> r(l) r(2) = fnr(nj) LOOP J = 1: GOSUB SCOREGEN J = 2: GOSUB SCOREGEN IF ABS(s(l) - s(2)) > 1 THEN r(3) = r(l) DO UNTIL (r(3) <> r(2) AND r(3) <> r(1)) r(3) = fnr(nj) LOOP J = 3: GOSUB SCOREGEN d1 = ABS(s(3) - s(I)): d2 = ABS(s(3) — s(2)) IF d1 < d2 OR (d1 = d2 AND s(l) > s(2)) THEN PRINT #2, r(l); s(l); r(3); s(3) ELSE PRINT #2, r(2); s(2); r(3); s(3) END IF ELSE PRINT #2, r(l); s(l); r(2); s(2) END IF LOCATE 1, 55: PRINT I NEXT I raters = 1 FOR I = 1 TO np INPUT #1, ab r(l) = fnr(nj) J = l: GOSUB SCOREGEN PRINT #2, r(l), s(l) LOCATE 1, 55: PRINT I NEXT I raters = 2 FOR I = 1 TO np INPUT #1, ab r(l) = fnr(nj): r(2) = r(l) DO UNTIL r(2) <> r(l) r(2) = fnr(nj) LOOP J = 1: GOSUB SCOREGEN J = 2: GOSUB SCOREGEN PRINT #2, r(l); s(l); r(2); s(2) LOCATE l, 55: PRINT I 114 NEXT I CASE 3 END SELECT CLOSE raters = 3 FOR I = 1 TO np INPUT #1, ab r(l) = fnr(nj): r(2) = r(l) DO UNTIL r(2) <> r(l) r(2) = fnr(nj) LOOP r(3) = r(l) DO UNTIL (r(3) <> r(2) AND r(3) <> r(1)) r(3) = fnr(nj) LOOP J = l: GOSUB SCOREGEN J = 2: GOSUB SCOREGEN J = 3: GOSUB SCOREGEN PRINT #2, r(l); s(l); r(2); s(2); r(3); s(3) LOCATE l, 55: PRINT I NEXT I CHAIN "EZEQ.BAS" SCOREGEN: RETURN END ' Subroutine that generates an observed score 
s(j), given ' rater R(j) and ability AB so = O: r = r(J) DO UNTIL RND > 1 / (1 + EXP(J(r, so + 1) — ab)) so = so + 1: IF sc = mp THEN EXIT DO LOOP s(J) = sc 115 ' EZEQ.BAS ' EZEQ is the fifth program in the series. ' EZEQ follows OBSGEN.BAS and precedes EQPEQ.BAS ' EZEQ does no equating —- into ' mean equating —— into ' truncated mean equating —- into ' linear equating —- into ' truncated linear equating —— into ' Files consist of NP lines where each C:\THESIS\SIMDATA\(F$)\(F$).NO C:\THESIS\SIMDATA\(F$)\(F$).MN C:\THESIS\SIMDATA\(F$)\(F$).TMN C:\THESIS\SIMDATA\(F$)\(F$).LI C:\THESIS\SIMDATA\(F$)\(F$).TLI line is of the form ' rater#;rater adj score;...;rater#;rater adj score;overall adj score COMMON F$, j(): X%(), r%() mj = VAL(MID$(F$, 1. 1)) IF mj = 0 THEN mj = 2 np = VAL(MID$(F$, 2, 1)) * 100 mp = VAL(MID$(F$, 3, 1)) nj = 9 DIM X%(npr mj)! r%(npl mj)! n(nj)r s(nj), ss(nj)l mn(nj), Sd(nj) ' x%(i,j) is the score the jth rater gave the ith paper ' r%(i,j) is the number of the jth rater on the ith paper ' n(j) is the number of ratings given by rater j ' s(j) is the sum Of the ratings given by rater j ' ss(j) is the sum of squares of the ratings given by rater j ' mn(j) is the mean rating given by rater j ' sd(j) is the standard deviation of the ratings ' where j=0 represents totals over all raters given by rater j OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".088" FOR INPUT AS #1 CLS PRINT "Loading Observed data (out of"; np; ") :" FOR i = 1 TO np FOR j = 1 TO mj INPUT #1, r, sc r%(i, j) = r: x%(i, j) = sc n(O) = n(O) + 1: 5(0) = 5(0) + sc: ss(O) = ss(O) + sc * sc n(r) = n(r) + l: s(r) = s(r) + 80: ss(r) = ss(r) + sc * sc NEXT j LOCATE 1, 40: PRINT i NEXT i CLOSE #1 PRINT PRINT "Computing means and SDs" FOR j = 0 TO nj mn(j) = S(j) / n(j) sd(j) = SQR((SS(j) - 5(j) 2 / n(3')) / r1(3')) NEXT j 116 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".NO" FOR OUTPUT AS #2 PRINT : PRINT "NO equating —- ("; np; "cases) :" FOR i = 1 TO np adj = 0 FOR j = 1 TO mj PRINT #2, r%(i, j); x%(i, j); adj = adj + x%(i, j) NEXT j PRINT #2, adj / mj LOCATE 5, 40: PRINT i NEXT 1 CLOSE #2 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".MN" FOR OUTPUT AS #2 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".TMN" FOR OUTPUT As #3 PRINT : PRINT "Mean equating -- ("; np; "cases) :" FOR i = 1 TO np adj = O tadj = 0 FOR j = 1 TO mj sc = x%(i, j) — mn(r%(i, j)) + mn(O) tsc = so IF tsc > mp THEN tsc = mp ELSE IF tsc < 0 THEN tsc = O PRINT #2, r%(i, j); sc; PRINT #3, r%(i, j); tsc; adj = adj + sc tadj = tadj + tsc NEXT j PRINT #2, adj / mj PRINT #3, tadj / mj LOCATE 7, 40: PRINT 1 NEXT 1 CLOSE #2 CLOSE #3 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".LI" FOR OUTPUT As #2 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".TLI" FOR OUTPUT AS #3 PRINT : PRINT "Linear equating —— ("; np; "cases) :" FOR 1 = 1 TO np adj = O tadj = 0 FOR j = 1 TO mj aj = r%(i, 1) SC (x%(i, j) - mn(aj)) / sd(aj) * sd(O) + mn(O) tsc = sc IF tsc > mp THEN tsc = mp ELSE IF tsc < 0 THEN tsc = O PRINT #2, aj; sc; PRINT #3, aj; tsc; adj = adj + sc tadj = tadj + tsc NEXT j PRINT #2, adj / mj 117 PRINT #3, tadj / mj LOCATE 9, 40: PRINT i NEXT i CLOSE #2 CLOSE #3 CHAIN "EQPEQ.BAS" ' EQPEQ.BAS EQPEQ is the sixth program in the series. ' EQPEQ follows EZEQ.BAS and precedes OLS.BAS ' EQPEQ does equipercentile equating, storing results in ' C:\THESIS\SIMDATA\(F$)\(F$).EQP in the form rater#;rater adj score;...;rater#;rater adj score;overall adj score COMMON f$, j(): X%(): r%() mj = VAL(MID$(f$. 1. 
1)) IF mj = 0 THEN mj = 2 mp = VAL(MID$(f$, 2, 1)) * 100 mp = VAL(MID$(f$, 3, 1)) nj = 9 OPEN "C:\THESIS\SIMDATA\" + f$ + "\" + f$ + ”.EQP" FOR OUTPUT AS #2 DIM w%(nj, mp) CLS PRINT "Equipercentile equating loading -— (out of"; np; ") :" FOR i = 1 TO np FOR L = 1 TO mj aj = r%(i, L) FOR K = 0 TO mp IF x%(i, L) <= K THEN w%(aj, K) = w%(aj, K) + 1: w%(O,K) = w%(O,K)+1 NEXT K, L LOCATE l, 55: PRINT 1 NEXT 1 bign = w%(0, mp) PRINT : PRINT " equating --" FOR i = 1 TO np sc = 0 FOR j = 1 TO mj aj = 13%(1, j) X = X%(ir j) s w%(aj, x) / w%(aj, mp) t w%(0, 0) / bign IF S <= t THEN adj = O ELSE a = -l WHILE t < s r = t a = a + l 118 t = w%(0, a) / bign WEND adj = a + (s - r) / (t — r) - 1 END IF sc = sc + adj PRINT #2, aj; adj; NEXT j PRINT #2, so / mj LOCATE 3, 55: PRINT i NEXT 1 CLOSE #2 CHAIN "OLS.BAS" ' OLS.BAS OLS is the seventh program in the series. ' OLS follows EQPEQ.BAS and precedes RASCHEXT.BAS OLS does ordinary least squares equating, storing adjusted scores in ' C:\THESIS\SIMDATA\(F$)\(F$).OLS in the usual format ' rater; rater's adjusted score; ...; overall adjusted score ' The program follows the algorithm in de Gruijter (1984). DECLARE SUB inverse (x!(), N!, b!()) DECLARE SUB matmult (a!(), mi, N!, b!(), 0!, pl, c1()) COMMON F$ mj = VAL(MID$(F$, 1, 1)) IF mj = 0 THEN mj = 2 IF mj = 1 THEN CHAIN "MORE.BAS" np VAL(MID$(F$, 2, 1)) * 100 mp = VAL(MID$(F$, 3, 1)) nj 9: njl = nj — 1 OPEN ”C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".038" FOR INPUT As #1 MaxT = nj * (njl) / 2 DIM d(MaxT, 1), t1(MaxT), t2(MaxT), r%(mj), s%(mj), N(MaxT, MaxT) teams = O CLS LOCATE 24, 1: PRINT F$, "OLS" LOCATE 1, 1: PRINT "Loading observed data (out of"; np; ") :"; FOR i = 1 TO np FOR j = 1 TO mj INPUT #1, r%(j), s%(j) NEXT j 'Sort by rater number FOR jl = 1 TO mj - 1 FOR j2 = jl + 1 TO mj IF r%(j1) > r%(j2) THEN SWAP r%(j1), r%(j2): SWAP s%(j1), s%(j2) NEXT j2, jl 119 FOR j1 = 1 TO mj — 1 FOR j2 = jl + 1 TO mj flag = 0 FOR T = 1 TO teams 'see if this is an existing team IF t1(T) = r%(jl) AND t2(T) = r%(j2) THEN d(T, 1) = d(T, 1) + s%(j1) — s%(j2) N(T, T) = N(T, T) + 1 flag = 1 EXIT FOR END IF NEXT T IF flag = 0 THEN 'create a new team teams = teams + 1 T = teams d(T, l) = d(T, 1) + s%(jl) - s%(j2) N(T, T) = N(T, T) + 1 t1(T) = r%(jl) t2(T) = r%(j2) END IF NEXT j2, jl LOCATE 1, 55: PRINT i NEXT 1 CLOSE 1 K = teams DIM a(K, njl), Atrans(njl, K), Theta(nj, l), NA(K, njl) DIM AtNA(njl, njl), AtNAi(nj1, njl), AtND(njl, 1), Nd(K, K) FOR i = 1 TO K IF t1(i) < nj THEN a(i, t1(i)) = a(i, t1(i)) + 1 ELSE FOR j = 1 TO njl a(i, j) = a(ir j) ‘ 1 NEXT j END IF IF t2(i) < nj THEN a(i, t2(i)) = a(i, t2(i)) + l ELSE FOR j = 1 TO njl a(il j) = a(ir j) _ 1 NEXT j END IF d(i, 1) = d(i, 1) / N(i, i) FOR j 1 TO njl Atrans(j, i) = a(ir 1) NEXT j NEXT 1 PRINT : PRINT "Computing Nd"; 120 matmult N(), K, K, d(), K, 1, Nd() PRINT : PRINT "Computing A'Nd"; matmult Atrans(), njl, K, Nd(), K, 1, AtND() PRINT : PRINT "Computing NA"; matmult N(), K, K, a(), K, njl, NA() PRINT : PRINT "Computing A'NA"; matmult Atrans(), njl, K, NA(), K, njl, AtNA() PRINT : PRINT "Computing (A'NA) inverse"; inverse AtNA(), njl, AtNAi() PRINT : PRINT "Computing Theta=(A'NA)inverse(A'Nd)"; matmult AtNAi(), njl, njl, AtND(), njl, 1, Theta() FOR j = 1 TO njl Theta(nj, 1) = Theta(nj, 1) - Theta(j, 1) NEXT j OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".038" FOR INPUT AS #1 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".OLs" FOR OUTPUT AS #2 OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".TLs" FOR OUTPUT AS #3 PRINT : PRINT "Creating output files (out of"; np; ") :"; FOR i 
= 1 TO np scl = O: sc2 = 0 FOR j = 1 TO mj INPUT #1, r, 5 adj = s + Theta(r, 1) PRINT #2, r; adj; scl = scl + adj IF adj > mp THEN adj = mp ELSE IF adj < 0 THEN adj = O PRINT #3, r; adj; sc2 = s02 + adj NEXT j PRINT #2, scl / mj PRINT #3, sc2 / mj LOCATE 15, 55: PRINT 1 NEXT 1 CLOSE CHAIN "RASCHEXT.BAS" SUB inverse (x(), N, b()) lin = CSRLIN LOCATE lin, 40: PRINT ”(out Of"; N; ") :”; ' Create b(), identity matrix DIM a(N, N) FOR r = 1 TO N: FOR c IF r = c THEN b(r, c) NEXT 0, r 1 TO N: a(r, c) = x(r, c) 1 ELSE b(r, c) = 0 FOR row = 1 TO N 'Make diagonal element 1 121 d row) IF d 0 THEN 'Switch with another row flag 0: row2 row WHILE flag = O row2 row2 + 1 IF row2 > N THEN PRINT "no inverse exists": IF a(row2, row) <> 0 THEN FOR col 1 TO N SWAP a(row2, col), SWAP b(row2, col), NEXT col flag 1 d a(row, a(row, col) col) a(row, b(row, row) END IF WEND END IF FOR col 1 TO N a(row, col) = a(row, col) / d b(row, col) - b(row, col) / d NEXT col 'Subtract multiples of ROW to get zeros in other positions FOR 1 1 TO N IF i <> row THEN m = a(i, row) FOR j = 1 TO N a(i, j) = a(i, j) - m * a(row, j) b(il j) = b(j—r j) — m * b(row, j) NEXT j END IF NEXT i LOCATE lin, 55: PRINT row NEXT row END SUB SUB matmult (a(), m, N, b(), O, p, c()) BEEP: STOP IF N <> 0 THEN PRINT "Matrices not compatible——multiplication fails": STOP lin CSRLIN LOCATE lin, 40: PRINT "(out of"; m; FOR 1 1 TO m FOR j 1 TO p C(i, j) 0 FOR K = 1 TO N C(ir j) C(i, NEXT K, j LOCATE lin, NEXT i H) = j) + a(i, K) * b(K, j) 55: PRINT i 122 END SUB ' RASCHEXT.BAS ' RASCHEXT is the eighth program in the series. ' RASCHEXT follows OLS.BAS and precedes PCM.BAS ' RASCHEXT does Rasch extension equating, storing adjusted scores in ' C:\THESIS\SIMDATA\(F$)\(F$).RAS in the usual format ' rater; rater's adjusted score; ...; overall adjusted score ' The program follows the algorithm in de Gruijter (1984). 
' RASCHEXT.BAS
' RASCHEXT is the eighth program in the series.
' RASCHEXT follows OLS.BAS and precedes PCM.BAS
' RASCHEXT does Rasch extension equating, storing adjusted scores in
' C:\THESIS\SIMDATA\(F$)\(F$).RAS in the usual format
' rater; rater's adjusted score; ...; overall adjusted score
' The program follows the algorithm in de Gruijter (1984).
DECLARE SUB inverse (x!(), N!, b!())
DECLARE SUB matmult (a!(), m!, N!, b!(), o!, p!, c!())
COMMON F$
mj = VAL(MID$(F$, 1, 1))
IF mj = 0 THEN mj = 2
IF mj = 1 THEN
PRINT "Program terminates--"
PRINT "other methods require at least 2 raters": END
END IF
np = VAL(MID$(F$, 2, 1)) * 100
mp = VAL(MID$(F$, 3, 1))
nj = 9: njl = nj - 1
OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".OBS" FOR INPUT AS #1
MaxT = nj * njl / 2
DIM d(MaxT, 1), t1(MaxT), t2(MaxT), r%(mj), s%(mj), N(MaxT, MaxT), den(MaxT), num(MaxT)
teams = 0
CLS : LOCATE 24, 1: PRINT F$, "RASCH"
LOCATE 1, 1: PRINT "Loading observed data (out of"; np; ") :";
FOR i = 1 TO np
FOR j = 1 TO mj
INPUT #1, r%(j), s%(j)
NEXT j
'Sort by rater number
FOR j1 = 1 TO mj - 1
FOR j2 = j1 + 1 TO mj
IF r%(j1) > r%(j2) THEN SWAP r%(j1), r%(j2): SWAP s%(j1), s%(j2)
NEXT j2, j1
FOR j1 = 1 TO mj - 1
FOR j2 = j1 + 1 TO mj
flag = 0
FOR T = 1 TO teams 'see if this is an existing team
IF t1(T) = r%(j1) AND t2(T) = r%(j2) THEN
den(T) = den(T) + s%(j1) * (mp - s%(j2))
num(T) = num(T) + s%(j2) * (mp - s%(j1))
N(T, T) = N(T, T) + 1
flag = 1
EXIT FOR
END IF
NEXT T
IF flag = 0 THEN 'create a new team
teams = teams + 1
T = teams
den(T) = den(T) + s%(j1) * (mp - s%(j2))
num(T) = num(T) + s%(j2) * (mp - s%(j1))
N(T, T) = N(T, T) + 1
t1(T) = r%(j1)
t2(T) = r%(j2)
END IF
NEXT j2, j1
LOCATE 1, 55: PRINT i
NEXT i
CLOSE 1
K = teams
DIM a(K, njl), Atrans(njl, K), BHat(nj, 1), NA(K, njl)
DIM AtNA(njl, njl), AtNAi(njl, njl), AtND(njl, 1), Nd(K, 1)
FOR i = 1 TO K
IF num(i) = 0 OR den(i) = 0 THEN
d(i, 1) = 0
ELSE
d(i, 1) = LOG(num(i) / den(i))
END IF
IF t1(i) < nj THEN
a(i, t1(i)) = a(i, t1(i)) + 1
ELSE
FOR j = 1 TO njl
a(i, j) = a(i, j) - 1
NEXT j
END IF
IF t2(i) < nj THEN
a(i, t2(i)) = a(i, t2(i)) + 1
ELSE
FOR j = 1 TO njl
a(i, j) = a(i, j) - 1
NEXT j
END IF
FOR j = 1 TO njl
Atrans(j, i) = a(i, j)
NEXT j
NEXT i
PRINT : PRINT "Computing Nd";
matmult N(), K, K, d(), K, 1, Nd()
PRINT : PRINT "Computing A'Nd";
matmult Atrans(), njl, K, Nd(), K, 1, AtND()
PRINT : PRINT "Computing NA";
matmult N(), K, K, a(), K, njl, NA()
PRINT : PRINT "Computing A'NA";
matmult Atrans(), njl, K, NA(), K, njl, AtNA()
PRINT : PRINT "Computing (A'NA) inverse";
inverse AtNA(), njl, AtNAi()
PRINT : PRINT "Computing BHat=(A'NA)inverse(A'd)";
matmult AtNAi(), njl, njl, AtND(), njl, 1, BHat()
FOR j = 1 TO njl
BHat(nj, 1) = BHat(nj, 1) - BHat(j, 1)
NEXT j
OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".OBS" FOR INPUT AS #1
OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".RAS" FOR OUTPUT AS #2
PRINT : PRINT "Creating output files (out of"; np; ") :";
FOR i = 1 TO np
sc = 0
FOR j = 1 TO mj
INPUT #1, r, s
adj = EXP(BHat(r, 1)) * s / (1 - s / mp * (1 - EXP(BHat(r, 1))))
PRINT #2, r; adj;
sc = sc + adj
NEXT j
PRINT #2, sc / mj
LOCATE 15, 55: PRINT i
NEXT i
CLOSE
CHAIN "again.BAS"

SUB inverse (x(), N, b())
lin = CSRLIN
LOCATE lin, 40: PRINT "(out of"; N; ") :";
' Create b(), identity matrix
DIM a(N, N)
FOR r = 1 TO N: FOR c = 1 TO N: a(r, c) = x(r, c)
IF r = c THEN b(r, c) = 1 ELSE b(r, c) = 0
NEXT c, r
FOR row = 1 TO N
'Make diagonal element 1
d = a(row, row)
IF d = 0 THEN 'Switch with another row
flag = 0: row2 = row
WHILE flag = 0
row2 = row2 + 1
IF row2 > N THEN PRINT "no inverse exists": BEEP: STOP
IF a(row2, row) <> 0 THEN
FOR col = 1 TO N
SWAP a(row2, col), a(row, col)
SWAP b(row2, col), b(row, col)
NEXT col
flag = 1
d = a(row, row)
END IF
WEND
END IF
FOR col = 1 TO N
a(row, col) = a(row, col) / d
b(row, col) = b(row, col) / d
NEXT col
'Subtract multiples of ROW to get zeros in other positions
FOR i = 1 TO N
IF i <> row THEN
m = a(i, row)
FOR j = 1 TO N
a(i, j) = a(i, j) - m * a(row, j)
b(i, j) = b(i, j) - m * b(row, j)
NEXT j
END IF
NEXT i
LOCATE lin, 55: PRINT row
NEXT row
END SUB

SUB matmult (a(), m, N, b(), o, p, c())
IF N <> o THEN PRINT "Matrices not compatible--multiplication fails": STOP
lin = CSRLIN
LOCATE lin, 40: PRINT "(out of"; m; ") :"
FOR i = 1 TO m
FOR j = 1 TO p
c(i, j) = 0
FOR K = 1 TO N
c(i, j) = c(i, j) + a(i, K) * b(K, j)
NEXT K, j
LOCATE lin, 55: PRINT i
NEXT i
END SUB
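The adjustment applied in RASCHEXT.BAS, adj = EXP(b) * s / (1 - s / mp * (1 - EXP(b))), multiplies the score odds s / (mp - s) by EXP(b), so raw scores of 0 and mp are left at the endpoints while intermediate scores are pulled up or down according to the sign of b. The short program below is not part of the numbered series; it simply tabulates the transformation for a few hypothetical rater parameters.

' RASDEMO.BAS -- illustrative only, not one of the thesis programs.
' Tabulates the Rasch extension adjustment
'     adj = EXP(b) * s / (1 - s / mp * (1 - EXP(b)))
' for a few hypothetical rater parameters b.  The formula multiplies
' the score odds s/(mp - s) by EXP(b), so 0 and mp map to themselves.
mp = 5
DIM b(3)
b(1) = -.5: b(2) = 0: b(3) = .5   'hypothetical severity parameters
PRINT "raw", "b=-.5", "b=0", "b=.5"
FOR s = 0 TO mp
PRINT s,
FOR K = 1 TO 3
adj = EXP(b(K)) * s / (1 - s / mp * (1 - EXP(b(K))))
PRINT adj,
NEXT K
PRINT
NEXT s

With b = 0 the middle column reproduces the raw scores, which is a convenient check on the formula.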
' PCM.BAS
' PCM is the ninth program in the series.
' PCM follows RASCHEXT.BAS and precedes MORE.BAS.
' PCM does partial credit model adjustment, storing adjusted scores in
' C:\THESIS\SIMDATA\(F$)\(F$).PCM in the usual format
' rater; rater's adjusted score; ...; overall adjusted score
' The program follows the PAIR unconditional maximum likelihood
' estimation algorithm described in Wright and Masters (1982).
COMMON F$
eps = .002: MaxIt = 50: big = EXP(20)
mj = VAL(MID$(F$, 1, 1))
IF mj = 0 THEN mj = 2
IF mj = 1 THEN
PRINT "Program terminates--"
PRINT "other methods require at least 2 raters": END
END IF
np = VAL(MID$(F$, 2, 1)) * 100
mp = VAL(MID$(F$, 3, 1))
nj = 9: njl = nj - 1
OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".OBS" FOR INPUT AS #1
DIM nk%(nj, mp, nj, mp), r%(np, mj), s%(np, mj), d(nj, mp), dt(nj, mp)
CLS : PRINT "PARTIAL CREDIT MODEL"
LOCATE 3, 1: PRINT "Loading observed data (out of"; np; ") :";
FOR k = 1 TO np
FOR j = 1 TO mj
INPUT #1, r%(k, j), s%(k, j)
NEXT j
FOR j1 = 1 TO mj: a = r%(k, j1): b = s%(k, j1)
FOR j2 = 1 TO mj: c = r%(k, j2): d = s%(k, j2)
nk%(a, b, c, d) = nk%(a, b, c, d) + 1
NEXT j2, j1
LOCATE 3, 55: PRINT k
NEXT k
CLOSE 1
LOCATE 11, 1: PRINT "Iterating rater parameters (out of"; nj; ") :"
DO
FOR i = 1 TO nj
FOR x = 1 TO mp: w = x - 1: F = 0: f1 = .01
FOR j = 1 TO nj
FOR z = 1 TO mp: y = z - 1
n1 = nk%(i, w, j, z): n2 = nk%(i, x, j, y): bign = n1 + n2
pi = 1 / (1 + EXP(d(i, x) - d(j, z))): F = F + bign * pi - n2
f1 = f1 + bign * pi * (1 - pi)
NEXT z, j
a = d(i, x) + F / f1
IF a < -15 THEN
dt(i, x) = -15
ELSEIF a > 15 THEN
dt(i, x) = 15
ELSE
dt(i, x) = a
END IF
NEXT x
LOCATE 11, 40: PRINT i
NEXT i
LOCATE 24, 1: PRINT F$;
max = 0: ave = 0
FOR i = 1 TO nj
FOR x = 1 TO mp
dif = ABS(dt(i, x) - d(i, x)): IF dif > max THEN max = dif
ave = ave + dif: d(i, x) = dt(i, x)
NEXT x, i
LOCATE 7, 20: PRINT "Maximum parameter shift was"; max
LOCATE 9, 20: PRINT "Average parameter shift was"; ave / nj / mp
v = v + 1
LOCATE 5, 20: PRINT "Iteration : "; v
LOOP UNTIL max < eps OR v = MaxIt
OPEN "C:\THESIS\SIMDATA\" + F$ + "\" + F$ + ".PCM" FOR OUTPUT AS #2
LOCATE 13, 1: PRINT "Loading adjusted scores (out of"; np; ") :"
FOR xi = 1 TO np
CEN = 0: del = 5: a = 0
DO
FOR i = 1 TO 3
u(i) = CEN + (i - 2) * del
p(i) = 1
FOR m = 1 TO mj
e(0) = 1: T = 0
FOR j = 1 TO mp
e(j) = e(j - 1) * EXP(u(i) - d(r%(xi, m), j))
IF e(j) > big THEN FOR L = 0 TO j: e(L) = e(L) / big: NEXT L
NEXT j: FOR L = 0 TO mp: T = T + e(L): NEXT L
p(i) = p(i) * e(s%(xi, m)) / T
NEXT m, i
IF ABS(CEN) > 14 THEN EXIT DO
IF p(1) < p(2) AND p(2) < p(3) THEN
CEN = u(3): IF a = 0 OR a = 3 THEN a = 3 ELSE a = 2: del = del / 2
ELSEIF p(1) > p(2) AND p(2) > p(3) THEN
CEN = u(1): IF a = 0 OR a = 1 THEN a = 1 ELSE a = 2: del = del / 2
ELSE
a = 2: del = del / 2
END IF
LOOP UNTIL del < eps / 2
' At this point, CEN is the paper quality estimate for the xi-th paper.
' Use CEN to compute the expected score over all raters as well as
' the expected score from the actual raters based on the parameters.
ave = 0
FOR i = 1 TO nj
e(0) = 1: T = 0
FOR j = 1 TO mp
e(j) = e(j - 1) * EXP(CEN - d(i, j))
IF e(j) > big THEN FOR L = 0 TO j: e(L) = e(L) / big: NEXT L
NEXT j
FOR L = 0 TO mp: T = T + e(L): NEXT L
sc(i) = 0: FOR L = 1 TO mp: sc(i) = sc(i) + e(L) / T * L: NEXT L
ave = ave + sc(i)
NEXT i
FOR j = 1 TO mj: r = r%(xi, j)
PRINT #2, r; sc(r);
NEXT j
PRINT #2, ave / nj
LOCATE 13, 55: PRINT xi
NEXT xi
CLOSE
CHAIN "MORE.BAS"
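The closing loop of PCM.BAS turns the estimated paper quality CEN into an expected score for each rater: under the partial credit model, category j receives weight exp(j*CEN - (d(i,1) + ... + d(i,j))), and the expected score is the weight-averaged category. The stand-alone sketch below is not one of the thesis programs; it evaluates that expectation for one hypothetical rater whose step parameters are invented for illustration.

' PCMDEMO.BAS -- illustrative only, not one of the thesis programs.
' Expected score under the partial credit model for one hypothetical
' rater with step parameters d(1..mp), evaluated at several paper
' quality values, mirroring the final loop of PCM.BAS.
mp = 4
DIM d(mp), e(mp)
d(1) = -1.5: d(2) = -.5: d(3) = .5: d(4) = 1.5   'hypothetical steps
FOR CEN = -2 TO 2             'paper quality values to try
e(0) = 1: T = 0
FOR j = 1 TO mp
e(j) = e(j - 1) * EXP(CEN - d(j))   'unnormalized category weight
NEXT j
FOR L = 0 TO mp: T = T + e(L): NEXT L
sc = 0
FOR L = 1 TO mp: sc = sc + e(L) / T * L: NEXT L
PRINT "quality"; CEN; "-> expected score"; sc
NEXT CEN

Raising the quality value moves the expected score toward mp, which is the behavior the adjustment relies on.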
' MORE.BAS is a program that automates the process of moving from one
' simulated data set to the next. It bypasses RATER.BAS
COMMON F$
IF F$ = "1152" THEN F$ = "1153": CHAIN "PARGEN.BAS"
IF F$ = "1153" THEN F$ = "1191": CHAIN "PARGEN.BAS"
IF F$ = "1191" THEN F$ = "1192": CHAIN "PARGEN.BAS"
IF F$ = "1192" THEN F$ = "1193": CHAIN "PARGEN.BAS"
IF F$ = "1193" THEN F$ = "1551": CHAIN "PARGEN.BAS"
IF F$ = "1551" THEN F$ = "1591": CHAIN "PARGEN.BAS"
IF F$ = "1591" THEN F$ = "2191": CHAIN "PARGEN.BAS"
IF F$ = "2191" THEN F$ = "2192": CHAIN "PARGEN.BAS"
IF F$ = "2192" THEN F$ = "2193": CHAIN "PARGEN.BAS"
IF F$ = "2193" THEN F$ = "2551": CHAIN "PARGEN.BAS"
IF F$ = "2551" THEN F$ = "2591": CHAIN "PARGEN.BAS"
IF F$ = "2591" THEN F$ = "3191": CHAIN "PARGEN.BAS"
IF F$ = "3191" THEN F$ = "3192": CHAIN "PARGEN.BAS"
IF F$ = "3192" THEN F$ = "3193": CHAIN "PARGEN.BAS"
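Throughout the series the data set name F$ doubles as a description of the simulation condition: the first digit is the number of raters per paper (a 0 is read as 2), the second digit times 100 is the number of papers, and the third digit is the top point of the score scale; every program assumes a pool of nj = 9 raters. The fourth digit distinguishes data sets within a condition and is not decoded in the listings above. The snippet below is not part of the series; it only makes that decoding explicit for one of the codes chained through by MORE.BAS.

' FCODE.BAS -- illustrative only, not one of the thesis programs.
' Decodes a simulation data set name the way the analysis programs do.
F$ = "2193"                  'one of the codes used in MORE.BAS
mj = VAL(MID$(F$, 1, 1))     'raters per paper
IF mj = 0 THEN mj = 2
np = VAL(MID$(F$, 2, 1)) * 100   'number of papers
mp = VAL(MID$(F$, 3, 1))     'maximum score-scale point
nj = 9                       'raters in the pool (fixed)
PRINT "Data set "; F$
PRINT "  raters per paper:"; mj
PRINT "  papers:"; np
PRINT "  scale points: 0 to"; mp
PRINT "  rater pool size:"; nj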
APPENDIX B

DISTRICT WRITING ASSESSMENT DATA

WRITING ASSIGNMENT

Implementation

All students were given the same instructions, and essentially the same rules applied to students at all grade levels. The instructions provided were as follows:

Write a paper to your principal stating one rule you'd like to have changed, modified or strengthened in your building. State your idea, reasons for the change and how the school would be improved by this change. The purpose of this writing sample is to assess your writing skills. All students in your grade will be writing on this or a similar topic. It is important that you do your best.

General Rules

1. Place your name on the upper right hand corner of your paper. Do not write on this sheet.
2. Dictionaries are available for your use.
3. Remember to include an introduction, body and conclusion which would convince your principal to make a change.
4. Do not write this as a letter.

APPENDIX C

DISTRICT WRITING ASSESSMENT DATA

MODIFIED HOLISTIC SCORING CRITERIA

5 point = Very Good
The paper is creative, well organized with a good command of vocabulary. The paper clearly develops a topic from beginning, to middle, to end. Ideas are supported with details and flow smoothly. Errors in sentence structure and mechanics may be present but they do not detract from the overall impression of the paper.

4 point = Good
The paper shows some creativity, organization is evident, with fairly good command of vocabulary. The paper develops a topic from beginning, to middle, to end. Ideas are generally supported with details for a minimum interruption of flow. Errors in sentence structure and mechanics may be present but they do not substantially detract from the overall impression of the paper.

3 point = Adequate
The paper has deficiencies but demonstrates enough overall strengths in sufficient degree to be judged competent. A deficiency(s) may be found in:
-- organization
-- command of vocabulary
-- supportive details
-- smooth flow from one idea to another
-- development of topic
-- sentence structure
-- mechanics

2 point = Inadequate
This paper is disorganized and has limited vocabulary. The topic is addressed but is poorly developed and may lack a beginning, middle, or end. Ideas are poorly supported with few details. The paper does not flow smoothly. Errors in sentence structure and mechanics are frequent enough to detract from the overall impression of the composition.

1 point = Poor
This paper has an absence of organization and poor vocabulary. The topic is addressed but not developed. Ideas are not supported with detail. The paper does not flow. The errors in sentence structure and mechanics are frequent and serious enough to detract substantially from the overall impression of the composition.

0 point = Not Acceptable
The paper is illegible or totally unrelated to the topic.