I D" “'2'." " "t. (2". U8,“ T3'M "'W i u 333,1133 33 . 12,33.“ 3,,9 331,33 3'§I~3'33P33333""""33" “3,3,, 3|,“ ,, {‘3 “1533,? Eff,” :3 .133 EDW'! $11,, :Zi‘m 3:33:33 33,.“ 3.: "Z 333‘: "3 ""' ‘ , 11'3": 'l,,,"3133; 1‘:3"'.'E?""' 3,3'1'3'3'3333 1,. :l l , 5mg], ,1,“ (3:23“, I :Ffl‘i‘ 1:? $3,, ",3 3:3 " 3. ' I, . Lki'j' lg“! !‘I,, i.‘ y, “:5, ,EW 1 'A'HK' ,v 'h' ,,,§ l,, '3‘] fl'liJ‘f .3». . ”LE:- ... fq‘iz." - t ‘ '3',“ ,‘Wx ,_ to .1. 5. g' ufifig§vs3%Z3érfinfipfi3"33‘w,3, op 35:: 3‘45 :- ’3": A :‘l . qr i. ‘3. ' 3' ; 3,1333% , 3. ‘Z'3 532‘ l '3'3‘ '13 I??? .- 3?" §!":' “Fifls'gn ":pb’ Egigo .z;i.:s ..., . . 'zz‘m'f HEW", 3,3,, I. :3! “9‘“ 13,1333}: 3333'?k‘I'E""-3:zfil~3, . -.J 7 33, . . i 3 <‘ '3 ‘3 33~’33332:43: aq . 3,: :b"33333 ;;3.33 333”" 3 1.33 333131 .333 "3.33.3333 3333333. 3:..33’13323‘233' 233333.3’3‘33333 3 3; 3-7. 3:. .. 3-3323 Z3 3313.3 323 3 x3333, .333 3:333:32. 1333 3., 3'33 '3'"“333' '39'b3hwt m33333'~'33ZRW333 -‘; ,xL=,313.‘,, 33'5“”: 3‘5135,"'3:;'; 3333932333,:gt 333533 Hg, £133: Z 3‘ .., '331'35'33. r 3 333 f';"'?“33313‘ "3 333' ' . 5 3-3 33333333233 3333;333:3333: 4.3" 33,33 3331333233 ,.. “A ',.x~ 4:,“ “'58, "":gi,"" c,, :. "',",,, p, ,1; .d.’ 2...“; f 1' ~ , ...-.. ... ‘ ..- :7." -'?-‘-‘-v- . r -.mu- 3 i, 3 w ”...- 9 "Y‘ *‘d’t‘l ...- ....- oo - . —~ -o« .a i“— -— . J o .- < . '7.- -:d‘ "I. .. ”k“: .v-- .... av row-or - ”:1... 2...: "THE. -. w"! 1 .nm- ..h. o ...- .... 3'”. dc w' . _. _ . 4.3.2. . .an ',,, , 3.33335'333333333 ,v- v. . .01... a I. v ' “4.77 J.." . $1.": ,.._ .... j'r.‘ ..." 2.... a. ..- r ...—3v 49'..- «:3 M. .I“ ”r.-.- J I a "OWIIO .3 n ~ «00-- .. 3‘ .z' -. 111$” ——.— ...-3:)" our . . xv.» -..-emu '~ ., .... ‘ . “it 4:. n. ,5 0., '3: i, g' "‘2" -' ib't'd H I “3:6" ,:,P 'I} 'l‘, I “3' "3:33. ‘ L'l' 'i” :2“ ii“ 3.,“ '33:“ 31333333335;.33333'3‘33'33‘3313333 '3" 33,3531“ ' 'E'M ["3 fig," "— 2' " I , - A ,_ ..., , - g... ...':H.MI . ’3’...-—.~ a." ”O “‘0’ o .— F- n ... , n . goo-w r -.-a a or. > ...-4 arm—«v u n— n u .3 grade 10 test, where each row represents a content strand, each column a process strand, and the number in each cell represents the number of items testing each pair of content/process skills. Concept- Rental Bsti- Coupu- Appli- ualiza- Arith- nation tation cations tion letic whole Nunbers 2 1 1 0 0 Fractions 8 2 7 2 12 Geo-etry 9 o 2 o I Measurement 6 0 2 0 6 Statistics 5 1 I 1 3 Algebraic Ideas I 0 3 5 7 Problel Solving 18 Calculators 6 Figure 1.1 Framework for Grade 10 MEAP Mathematics Test Most items fall within both content and process strands. Some items, however, were not separated into content and process 4 (namely, those items measuring either Problem Solving or Calculators), and are classified only as content. In May, 1991, a group of mathematics teachers and teacher educators met with members of the MEAP staff to develop a common understanding of a ”marginally capable candidate'" (MCC). These educators made predictions about how this hypothetical group of students would perform on each individual test item (cf. Appendix A for specific details about the modified Angoff standard-setting activity). Prior to receiving the test results, these same educators, together with other teachers trained by them, made judgments about the performance of their students who had just taken the tests. The analyses described in this study compared actual student performance with mastery-state judgments made by teachers. This was viewed from two different perspectives. 
First, the test performance of a large sample of students whose scores were near the "passing" score (and therefore represented marginal performance) was compared with the predicted performance of marginal candidates as hypothesized by the content area specialists. Then, the teachers' estimates of how each student from their own classroom(s) would perform were compared with the students' individual test results. (A Marginally Capable Candidate is defined as a just-barely-qualified examinee who demonstrates the minimum acceptable competence in the domain being tested.)

Research by Jones (1987) and Saliba (1990) suggests that judges modify their predictions after being shown actual item difficulty data. Their predictions of student performance on test items are generally modified to conform more closely to the empirical data presented. As a result, the between-judge variance of the ratings is reduced after reconsideration. Busch and Jaeger (1990) have shown that there is no consistent pattern of increasing or decreasing the panel-established cut scores after empirical data are presented.

There are other factors which create variability in the standard-setting process besides inconsistencies between the predictions of judges and the actual performance of examinees. These include disagreement among judges (inter-judge variability) and inconsistencies within the predictions made by each judge (intra-judge variability). The work of Hunter and Schmidt (1990) suggests that one source of discrepancies between ratings of student performance and ratings of item difficulty is the tendency of judges to assess student performance based upon characteristics not pertinent to the rated skill, such as dress and behavior. In addition to the above factors, drift affects judges, and fatigue affects examinees on long tests. Also, when a test covers a diverse set of skills such as those represented by the eight content strands of the MEAP mathematics tests, an additional consideration is that judges' ratings of item difficulties may be affected by the type of halo effect which involves misconceptions among the judges about the relative difficulties of the various skills.

With these concerns and considerations in mind, the following questions were posited. How internally consistent are the predictions of individual judges? How consistent were the judges' ratings with the actual performance of "marginally capable" students (i.e., students whose scores were within a standard error of measurement of the panel-established cut score)? How do judges' predictions of item performance compare with those made by other judges? How do judges' predictions of the test performance of their own students compare with the students' actual performance? How do judges' predictions about the relative difficulties of the test strands compare with actual test strand difficulties?

A second set of questions concerned the examinees themselves. Does student performance on the various test strands differ among groups which have received different instruction on the strands? Do students show fatigue, as indicated by decreased performance on the last half of the test items? Similarly, do judges demonstrate drift by over- or under-predicting examinee success on items which appear later in the test compared with those appearing earlier?

By considering these questions, patterns have been observed which should help to explain the mechanisms which are in effect during the standard-setting process.
By isolating and describing these mechanisms, some recommendations were identified for consideration when judges are faced with the task of setting a passing score on a test.

Chapter 2. Review of the Literature

Norm-Referenced and Criterion-Referenced Interpretations

Two methods are commonly used to interpret scores from educational tests: norm-referenced and criterion-referenced (Jacobs & Chase, 1992, pp. 10f; Mehrens & Lehmann, 1987, p. 14). Norm-referenced interpretations are based upon "adding meaning to a score by comparing it to the scores of people in a reference (or norm) group" (Mehrens & Lehmann, 1987, p. 15). Criterion-referenced interpretations, on the other hand, are based upon comparisons of individual performance against an absolute standard which has been established for "some specified behavioral domain or criterion of proficiency" (p. 15). The process of making norm-referenced interpretations is relatively straightforward: the test developer selects an appropriate norm group, administers the tests, and then uses the results to compare examinees with the norm group. There seems to be a broad consensus on how to interpret norm-referenced test scores, particularly since there are standard methods for computing such derived scores as percentile rankings and grade equivalents, which have become widely accepted among test users, with notable exceptions such as Cannell (1987).

Millman and Greene (1989) prefer the term domain-referenced, since interpretations will "represent information within a specific domain about the absolute level and particular strengths and weaknesses of examinees' performance" (p. 341). One form of a domain-referenced test which is not addressed in this study is objective-referenced.

Criterion-referenced interpretations, on the other hand, are complicated by the fact that usually only two results are of interest: Did the examinee demonstrate mastery of the domain, or not? (On some occasions there has been interest in multiple cutoffs, as has been the case with the National Assessment of Educational Progress (Forsyth, 1991), but in most cases the primary interest has been dichotomous.) The process of establishing this "absolute standard" causes much concern among those who use the test results for making decisions about examinee performance. Much of the dissension among psychometricians and test users results from differing views and confusion related to the purpose(s) of testing. For example, is the function of the test to document growth, or to make mastery decisions? If it is the former, then there is no need to establish arbitrary passing scores, since the scores themselves provide a greater amount of information. If it is the latter, then how does one differentiate between a master and a nonmaster on the basis of a single test score? Rowley (1982) observed that, "If 300 whiskers makes a beard, then do 299 not make one?" It would certainly be impossible to differentiate between the two cases, since they are part of a continuum which precludes clear separation into "bearded" and "non-bearded." The same can be said about performance on a test.

There are several critical steps to be considered when developing tests which will be used for making norm-referenced or criterion-referenced interpretations about examinee performance. The first is to establish the objectives to be measured by the test. Next, a table of specifications is developed to be followed by item developers in constructing the test.
The third step is to try out the items with a sample of pupils for whom the test was developed and verify the psychometric properties of the items to ensure that they validly and reliably measure what they purport to measure. The procedures for accomplishing these steps are well established and generally agreed upon by the measurement community (Crocker & Algina, 1986, Chapter 4; Ebel, 1972, Chapter 5; Gronlund & Linn, 1990; Mehrens & Lehmann, 1987, pp. 226f; Thorndike, 1982, Chapters 2-4).

An additional critical step is to establish an acceptable passing score in those situations where a test is to be used for making decisions about mastery. This is perhaps the most tenuous part of the entire test development process, since there is no single correct way to do it, and its outcome determines how many examinees will achieve "satisfactory performance." The situation is further complicated by a lack of consensus about what is meant by a "valid passing score" (Mehrens & Lehmann, 1987, pp. 126f). Nitko is critical of the idea of setting a passing score on a criterion-referenced test, stating that "One confusion about criterion-referencing is the misconception that the term means using a cut off score or a passing score" (Nitko, 1984, p. 21; emphasis in original).

Beyond the purely technical issues surrounding the establishment of performance standards, there remains the question of whether a single score on any test instrument can be relied upon for making important decisions. Shepard (1980) observed,

With a good test, valid distinctions can be made between those who are well above or well below the standard, but pass-fail decisions near the cutoff will have poor validity because a continuum of performance has been "arbitrarily" dichotomized (p. 448).

The fundamental problem is highlighted by Johnson and Zieky (1988), who noted,

A score ... [represents] a probable range or band of scores, rarely provided to the test taker. We know that the size of the band is a function of the reliability of the test, which we can and should estimate. But there is still the question of validity --- is the score an appropriate, meaningful measure of the construct for the person tested and the purpose of the testing (p. 3).

Clearly, it is not necessary to select a "passing score" to dichotomize the results of a test when the test is only being used to obtain diagnostic information about an individual's knowledge, skills and abilities, or when the results are not intended to be used for making mastery decisions about individual examinees. These activities are part of formative evaluation, where the decisions to be made relate more to progress toward a goal than to final judgments of mastery. The issues become more important when the stakes are higher: for example, after many years of training, a candidate is being tested for a license to practice the profession for which the training was taken (Shimberg, 1981). The expanding use of state-administered tests for making high school graduation decisions has attracted attention to the debate over the importance and relevance of testing (Rudman, 1985, p. 28). Because of legal, technical and political issues, many critical factors are involved in standard-setting under various conditions. Complex checklists have been developed to be used as aids in the standard-setting process (Arrasmith & Hambleton, 1987; Hambleton & Powell, 1983).
Even those who concede that tests are generally valid predictors of job performance have criticized the sole use of test results for making selection decisions, since members of dominant social groups tend to perform better on typical employment examinations than members of minority groups (Hartigan & Wigdor, 1989). This is particularly acute when a cutoff score is used. As a result, tests have been criticized as being biased and/or unfair to minorities. This is primarily a political issue for which there is no simple technical resolution. Many approaches for settling the controversy have been proposed (Cleary, 1968; Cole, 1973; Thorndike, 1971), but all have been criticized on practical as well as theoretical grounds (Novick & Ellis, 1977).

Issues related to standard-setting have gained additional interest because of the current movement toward using "performance assessment" as a substitute for paper-and-pencil tests (Mueller, 1991). Compared with traditional objectively scored tests, performance assessment is especially sensitive to problems of between-rater and within-rater variability. The validity and generalizability of the results have also been called into question (Fremer, 1991; Linn, Baker and Dunbar, 1991; Mehrens, 1992), and their use on state-mandated tests has been challenged on numerous grounds (Beck, 1991; Phillips, 1992).

Establishing standards for performance assessments involves several complicating factors in addition to the usual psychometric elements entering into an objectively scored paper-and-pencil test. One of these factors is task complexity, where the response to a prompt can take on many different forms which express multiple levels of information processing and use. Although this is considered to be one of the advantages of performance assessment, it does contribute to the difficulties in developing reliable and valid scoring. Another is cognitive demand, where multiple facets of a problem must be held simultaneously in the mind of the examinee during the performance. Observer bias, leniency, and rater drift also play a part in performance assessment. Performance assessment is uniquely subject to the "halo effect," in which the rater tends to appraise diverse characteristics as if they were common attributes. Supervisor ratings of employee productivity may, for instance, be affected by irrelevant characteristics of the examinee such as dress and manner. De Meuse (1987) stated that "the effects of three classes of non-verbal variables (demographic cues, physical appearance, non-verbal behaviours) on performance appraisal ... are significant and varied" (p. 207). Hunter & Schmidt (1990) noted that the "idiosyncrasy of that supervisor's perceptions is a part of the error of measurement in the observed ratings. Extraneous factors that may influence human judgment include friendship, physical appearance, moral and/or life-style conventionality, and more" (p. 65). Finally, Trevisan (1991) observed that performance assessment is affected by differing standards which can vary over time. Even if rigorous scoring guides are developed, there is no assurance that they will be carried out reliably by different raters. Developers of performance assessment have attempted to minimize extraneous variance, and thereby increase reliability, by producing detailed scoring protocols. However, the more the scoring criteria are tightened, the more the assessment becomes restricted and subject to the same criticisms as its multiple-choice counterpart.
O’Leary & Hansen (1983) report that experience in the field of employee performance assessment, which has been standard practice for decades, does not provide much hope for objectivity in the judgment of mastery. E l] . S! 3 3'5 !!° As the use of test scores has become increasingly common in screening and selection, the acceptability of using test scores as indicators of success or failure has come under much debate (McAllister, 1991). Even when highly trained professionals are involved in establishing the passing scores 15 for these tests, it is difficult to obtain agreement.as to how this should be done.“ Although standard-setting methods have received considerable attention, as yet no consensus has emerged as to which method or approach is most appropriate. In his paper on standards, Glass (1978) criticized the "common notion ... that a minimal acceptable level of performance on a task can be specified" (page 237) . He supported this criticism by stating that ”the language of performance standards is a pseudo- quantification, a meaningless application of numbers to a question not prepared for quantitative methods” (page 238), and added further that the result is "to ask for greater precision than the circumstances permit" (page 258). Since scores and performance exist on a continuum, there can be no uniquely defined "scientific" cut point above which everyone is a master and below which everyone is a nonmaster. If a dichotomy is demanded, it must be established arbitrarily. In the introduction to his report on standard-setting procedures, Cramer (1990) observed that "In spite of much 'The debate over defining unique criterion-referenced proficiency levels in the National Assessment of Educational Progress (NAEP) is a case in point, where experts and the public experienced great difficulty in understanding or agreeing upon what was meant by terms such as "basic," "proficient,” and "advanced" (FairTest, 1991). Forsyth (1991) is especially critical of the NAEP science scale, which he describes as having "purported" criterion-referenced characteristics since the test "mixes dimensions" in "an ill- defined domain" which "can lead reporters, legislators and even professional educators to draw very questionable conclusions from the NAEP results" (pages 5f). In other words, if a test domain is not well defined, it is not appropriate to use the test to assess proficiency. 16 research in the area, we are still far from agreement on the ’right’ way to perform this increasingly important task" (p. l). Halpin & Halpin (1987) summed up the situation well: Given that different standards result when different methods are used, standard setters are left with a fundamental, unsolved problem: They must decide how best to set standards. Unfortunately, at present, there are little or no scientific grounds for choosing among different procedures. Needed is research indicating how well the different standard-setting methods serve their purpose which, ideally, is to separate the masters from the nonmasters, to pass those who are qualified and fail those who are not . . .. (p. 977) Plake and Melican (1989) isolated the essence of why there is such divergence in establishing standards for tests. "Judgmental standard setting methods are, by definition, subjective evaluations by .... specialists about the test (or item) performance of minimally competent candidates (MCCs)" (p. 45). 
When the test developer has been asked to "draw a line," its placement will always be open to criticism because extremely small differences in scores around the cut-off will cause one examinee to pass while the other will fail. A logical question might be, if a student who answers 70% of the items correctly on a test is considered to be a "master" of the subject, is it reasonable to say that one who receives a score of 69% correct is a "nonmaster"? (Rowley, 1982). Shepard (1980) suggested establishing three zones of mastery: those who are clearly masters, those who are clearly not masters, and those whose mastery state cannot be established by the test score. Shepard's suggestion was explored in this research.

Glass (1978) found flaws in six common methods used for determining the criterion in a criterion-referenced test. Logical perfection aside, passing scores are still demanded by many educators and policy makers who wish to make decisions about individual or group performance in a variety of settings. In such situations, there are many issues which enter into establishing acceptable passing scores. Some of these include: (1) What percentage of examinees "should" be expected to pass the test? (2) What are the relative costs of failing someone who "should" pass, as opposed to passing an examinee who "should" fail? (3) How important is it to establish a high (or low) standard?

Rudman (1985) criticized the tendency of standard setters to identify minimums which eventually become maximums as educators orient their teaching toward the narrow content of the tests, and argued for setting multiple standards which would recognize excellence as well as the achievement of minimal competencies. When faced with a situation in which standards are demanded, there remain numerous dilemmas which frame the standard-setting situation. If a high percentage of examinees achieve the standard, it might be argued that the test is too easy. If a large proportion of examinees fail the test, some will charge that the test is unfair or arbitrary. In determining the relative costs of "false positives" or "false negatives," how does one determine the utility of someone being denied entry into a profession because of a spuriously low test score, or the cost of having someone admitted into college who does not possess the skills necessary for success in the program of study? Numerous studies have demonstrated that judges can set reasonably consistent passing scores using one of several commonly used methods, when provided with consistent instructions. Across methods, however, there is typically a wide range of results (as will be explored later).

State and Continuum Models

Although it is difficult to conceive of a measurement of an important competency which does not result in a range of scores rather than a simple dichotomy, standard-setting methods have been classified as "state" or "continuum." State models view competence as all or nothing: the examinee either has the skill or does not have it, as in the determination of whether or not a normal child can take a step, speak a word or ride a bicycle. Macready and Dayton (1980, pp. 494f) discussed the relative merits and limitations of state models, noting that their application has been made difficult because of errors attributed to guessing, forgetting, cheating, and differential cognitive processing brought about by variations in learning, such as rote memorization and nonhomogeneous domain specifications.
Continuum models are based on the observation that most skills vary over a wide range. This makes the determination of a "passing score" logically impossible, since there is virtually no difference between an examinee at any given score and an individual who correctly answers one item more (or less). To be perfectly defensible, the establishment of a standard on a continuum-type competency requires that the items of the test can be ranked in the same order of difficulty for every individual tested, as in a Guttman Scale (Thurstone, 1928; Gardner, 1962), so that "mastery" can be defined as the score which demonstrates that the skill has been attained. Although the continuum model leads logically to the conclusion that "mastery" cannot be uniquely determined, there are social and political reasons for pursuing the task of setting standards for acceptable performance on tests, and this has led psychometricians to attempt to find reasonable and practical methods for establishing such standards.

Berk (1986) identified at least 38 different approaches which have been used to set passing scores, which he categorized into three levels according to: (1) the acquisition of the underlying trait or ability (state or continuum), (2) the methods (judgmental, empirical, or mixed), and (3) the major aim (e.g., to set a new standard or to adjust an existing one). He found that 30 of the 38 approaches were of the continuum type; 17 of the continuum models were variants of the approaches developed by Angoff (1984) and Ebel (1972); and 10 of those relied strictly on expert judgment.

Cross, Impara, Frary and Jaeger (1984) observed that "all methods that have been proposed for setting standards can be classified ... into two groups, those that are based upon judgments of test content, and those that are based upon the test performance of known groups of examinees" (p. 114). Hofstee (1983) argued that ultimately all standards are relative and are based on a real or fictitious norm group. One could set an absolute standard (one established based upon judgments of test content, independent of considerations of actual examinee performance) which could be so high that none would succeed or so low that all would pass. Conversely, a relative standard (one established based upon the test performance of known groups of examinees, independent of considerations of test content) could be established based on the percentage of passes desirable. Methods which use a combination of these approaches have been developed by various psychometricians (Beuk, 1984; De Gruijter, 1980, 1985; Hofstee, 1983). Hofstee's method used minimum and maximum acceptable cutoff scores and failure rates to determine a line which (ideally) intersects the curve established by the measured cumulative score frequency of the examinee group. Beuk's method introduced the variability of acceptable cutoff scores and failure rates as a means for determining the degree of judges' preference for absolute ratings. De Gruijter introduced Bayesian statistics by having each judge establish estimates of the uncertainty of the cutoff scores and failure rates proposed by that judge. All three of these methods represent compromises between a relative standard based upon empirical data and an absolute standard based upon judgments about test content.
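To make the mechanics of Hofstee's compromise concrete, the following sketch (Python; the function name, variable names, and data are hypothetical illustrations, not the procedures used in this study) intersects the judges' line with an empirical failure-rate curve. The line is taken, as the method is commonly described, as running from (k_min, f_max) to (k_max, f_min).

```python
import numpy as np

def hofstee_cut_score(scores, k_min, k_max, f_min, f_max):
    """Hofstee-style compromise cut score (illustrative sketch only).

    scores : percent-correct scores for the examinee group
    k_min, k_max : minimum / maximum acceptable cut scores (percent correct)
    f_min, f_max : minimum / maximum acceptable failure rates (proportions)
    """
    # Empirical failure-rate curve: f(k) = proportion scoring below a cut of k.
    ks = np.arange(0, 101)
    f_of_k = np.array([(scores < k).mean() for k in ks])

    # Judges' line runs from (k_min, f_max) down to (k_max, f_min).
    slope = (f_min - f_max) / (k_max - k_min)
    line = f_max + slope * (ks - k_min)

    # Take the cut score where the empirical curve comes closest to the line,
    # restricted to the range of cut scores the judges found acceptable.
    mask = (ks >= k_min) & (ks <= k_max)
    idx = np.argmin(np.abs(f_of_k[mask] - line[mask]))
    return ks[mask][idx]

# Hypothetical example: 1000 examinees; judges tolerate cut scores of 50-70
# percent correct and failure rates between 5 and 30 percent.
rng = np.random.default_rng(0)
scores = rng.normal(65, 12, size=1000).clip(0, 100)
print(hofstee_cut_score(scores, k_min=50, k_max=70, f_min=0.05, f_max=0.30))
```

Restricting the search to the interval [k_min, k_max] keeps the compromise within the range of cut scores the judges declared acceptable.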
Standard-Setting Methods

Various methods have been developed for determining appropriate cutoff scores for criterion-referenced tests (Angoff, 1971; Berk, 1986; Beuk, 1984; Hofstee, 1983; Jaeger, 1982b; Nedelsky, 1954; Livingston & Zieky, 1982). In a review of standard-setting methods, Hambleton, Powell, and Eignor (1979) identified approximately 30 different methods for setting cut scores. All of these approaches require experts to make judgments about (1) an absolute standard based on the expected performance of a hypothetical group of examinees on certain test items, or (2) a relative standard based on declaring as "masters" the highest "n" percent of the examinees (where "n" may typically be between 50 and 90 percent). Judgmental standard-setting methods require the deliberation of experts who are knowledgeable about both content and examinee behavior, and who have a clear understanding of the competency expected for the circumstances. When considering various judgmental standard-setting methods, it is necessary first to understand how various approaches work. Three major classifications will be described and discussed here.

Methods based on judgments about test items. Although a more complex method was developed by Ebel (1972) and used for several years, the Nedelsky (1954) and Angoff (1971) methods are currently the most commonly used for setting standards based upon hypothesized performance of "competent" examinees on test items (Jaeger, 1989). Between these two methods, Jaeger (1990) reported that "limited empirical evidence ... suggests that Angoff's method, more often than not, yields standards that are more reliable than those produced by Nedelsky's method" (p. 17).

In the Ebel method, judges are asked to specify two pieces of information about each item: its perceived difficulty (easy, medium and hard) and its relevance (essential, important, acceptable and questionable). The items are sorted "into cells using a two-way classification grid where the relevance and difficulty of the item are the two dimensions considered. Once all test items have been sorted, the items in each cell are considered by the individual judge or group of judges, and the proportion of items within each cell that should be answered correctly by an examinee who has achieved a minimum acceptable level of proficiency is specified. The product of this proportion and the number of items in each cell is calculated. The examination standard or passing score is then derived by summing these products across cells" (Andrew & Hecht, 1976, pages 46f).

The Nedelsky method is limited to multiple choice items, and requires subject matter experts to identify the distractors that a minimally competent candidate would be expected to eliminate. The reciprocal of the number of remaining options is the minimum passing level of the item. These estimates are then added to determine the passing score.

In their purest forms, the Angoff and Nedelsky methods require judgments about predicted examinee performance on test items and therefore lack a link between actual examinee performance and the levels of competence anticipated by the judges.

Methods based on judgments about examinees. Livingston and Zieky's (1982) Contrasting Groups method and Borderline Group method are frequently used for setting standards based on the performance of groups of specific individuals (Arrasmith, 1986). Because these methods are predicated upon judges' perceptions of the competence of actual examinees, they possess a kind of "face validity" and are, therefore, intuitively appealing.
"People in our society are accustomed to judging other people’s skills as adequatetor inadequate for some purpose ... . Therefore, making this type of judgment is likely to be a familiar and meaningful task" (Livingston & Zeiky, 1982, p. 31). Although there may be broad agreement that this task is familiar, there are many who would argue that it is not meaningful. These methods are no less judgmental than, e.g., the.Angoff or Nedelsky methods since they require the judgment of examinee competence by teachers or other expert observers. The experience of many years of developing performance appraisals for business and industry would indicate that such judgments are usually unreliable and easily confounded by characteristics of the judge and the examinee which are not 24 germane to the subject of the rating (Hunter & Schmidt, 1990: O’Leary 8 Hansen, 1983). one on - 9- (“g ‘11... ._ 0.”. 1‘0 ‘ ' ;. ‘ ”-1.: . In an effort to bridge the gap between judgments about test items and judgments about examinees, Hofstee’3 (1983), Beuk“ (1984), and De Gruijter“ (1985) developed newer and slightly more complex methods for setting passing scores. These are ”compromise” approaches, since judges must decide on ”In the Hofstee method, judges are required to specify the minimum and maximum percentages of failing examinees (fun and f..., respectively) along with the minimum and maximum acceptable jpercentages of items that. minimally' competent candidates should be able to answer correctly (km, and k_,, respectively). Upon a graph which shows the percentage of candidates who would fail at each given score, f(k), is superimposed a line which connects (k.,,,, f...) and (kw, fun). The intersection of this line and the graph of f(k) is the passing score. “In the Beuk method, judges are required to specify the knowledge level that a "minimally competent” candidate should possess, expressed as a minimum percentage of items answered correctly on the test, and the expected pass rate for that score, expressed as a percentage of the examinees passing. The means and standard deviations for both ratings are computed across all judges. These are denoted as km, and SR for the passing score and v“, and s, for the passing rate. Upon a graph which shows the percentage of candidates who would pass at each given score, v(k), is superimposed a line which passes through the point (vw, k...) with a slope sv/sk. The intersection of this line and the graph of v(k) is the passing score. ”In the De Gruijter method, judges are required to specify an ”ideal” passing score k. and a corresponding failure rate f0, and to estimate the uncertainty of each (uk and u,, respectively). Using the ratio r = u,/u,,, all combinations of k and f on the ellipse r’(k - k0)2 + (f - f,,)2 = d2 are considered equally plausible. All that is necessary is to determine which combination (f, k) on the empirical curve f(k) produces the smallest value of d, thereby ascertaining the value of k which is the "best" compromise passing score. 25 acceptable ranges of cutoff scores and then use actual test performance data to arrive at a "best” compromise. As a group, judges tend to over-estimate examinee performance in the absence of data from actual test administrations. As a consequence, when theoretical judgments about item content are combined with empirical results from real examinees, passing scores generally result which are lower than would have been obtained from purely judgmental methods, therefore producing slightly higher passing rates. 
For a variety of reasons, none of these compromise methods has achieved widespread use. The present study is an attempt to combine the strengths of both judgmental approaches (i.e., judgments about items and judgments about examinees), but to do so in a manner which is not as complicated as the compromise methods just described. The aforementioned approaches may be viewed, accordingly, as empirical (or examinee-oriented), theoretical (or test-oriented), and policy-based (or compromise) approaches to the standard-setting process. Since there is very little agreement among practitioners as to which approach is most acceptable, the research undertaken in this study weighed the strengths of each approach and assessed how they could be combined to optimize the setting of passing scores in the final outcome.

Although there are many methods for determining passing scores on tests, no single "best" method has emerged. The subjective nature of standard-setting procedures, along with the possibility that multiple standard-setting methods invoke different cognitive processes, can lead to unacceptable variations in outcomes. Comparative studies have shown that the failure rates resulting from cutoff scores derived using different methods can vary greatly. For example, on the Louisiana Grade 2 Basic Skills Test, the failure percentages ranged from 0% to 29.75% using the results from different standard-setting procedures (Mills, 1983). Because of such variability, Jaeger (1989, p. 500) recommends that standard-setters use different methods, and then set a standard which is a compromise among them.

Factors Which Affect the Establishment of Standards

Characteristics of the test. Any well-developed test is generally optimized for use with students who possess a relatively broad range of abilities. This ensures that the test will provide a maximum amount of information about the population for which it was designed. The distribution of test scores, relative to the region of maximum information for the test, can have an impact on test results. A test for which most of the items are too easy (or too difficult) for the examinee group will exhibit greater error of measurement than one which is optimized for the ability of the group. In a study of the effect of item difficulty distribution on the precision of measurement near the cutoff score, Julian (1985) found that the number of false passes and false failures was related to whether the test's area of maximum precision was above or below the cutoff score. "The easier test passed fewer persons who should have failed and the more difficult test failed fewer students who should have passed even when the tests had similar total error rates" (page 108). Depending upon the relative cost of false positives and false negatives, then, the test developer could reduce the more critical type of error by designing an easier or more difficult test.

Another test characteristic which affects standard-setting is its reliability. Since reliability sets an upper limit on the validity of decisions made using test scores, this is particularly important. Tucker (1946) showed that the maximum validity for a 100-item test occurs when the point-biserial correlation between items is less than 0.3 (p. 11). He demonstrated similar results for tests of various lengths.
Nunnally (1970) argued that "when items have low correlations with one another and each correlates positively with the criterion, each item adds information to that provided by the other items, and when scores are summed over items, a relatively high correlation with the criterion will be found" (p. 204).

Characteristics of the standard-setting method. Brennan and Lockwood (1979) stated that the Angoff method established a less variable standard than the Nedelsky method, indicating that judges using the Angoff procedure were in greater agreement than judges employing the Nedelsky procedure. Reaching a similar conclusion, Behuniak, Archambault and Gable (1982) reported that "judges using the Angoff procedure were in greater agreement than judges employing the Nedelsky procedure" (p. 254). They found this to be true for tests of both reading and mathematics. As noted by Harasym (1981), one of the weaknesses of the Nedelsky method is that, depending upon the number of options used, the p-values are limited: for example, on a multiple-choice test item with four alternatives the possible p-values are 1.00, .50, .33, .25 and 0. The Angoff method, by contrast, yields a continuum of p-values ranging from 1.00 to 0. This means that, in the Nedelsky procedure, all very easy items will be assigned a p-value of 1.00, and all other items will have p-values in a restricted range represented by numbers which are only moderately different from one another.

One consideration was whether mathematical learning develops along a single dimension, with the strands acquired in a presumed sequence (... -> algebra -> geometry). This could be verified by comparing the developmental scales across the grades tested (i.e., Grades 4, 7 and 10). A second consideration was how the performance predicted by the judges on the strands compared with the actual student achievement measured by the tests. If these differ, then the judges' ratings are at odds with the unidimensional assumption. One explanation for such an occurrence might be that an instructional effect has altered the presumed unidimensional nature of student development. If, on the other hand, mathematical learning proceeds independently along many strands simultaneously, it would be necessary to examine development along each strand independently. In order to explore the existence of as many as 8 unique strands of mathematical learning, Phase III of the analysis included a factor analysis of the student test results. In the earlier pilot testing of the MEAP instruments, factor analysis determined that there was only one major strand of mathematical development (Rigney, 1990). Since the data used for this study came from a larger, more diverse population than was included in the pilot testing, and since "weak" or ambiguously worded items from the pilot were omitted from the final forms of the tests, it seemed possible that a more complex factor structure might emerge than had been apparent during pilot testing.

Reliability and validity of the judges' ratings. Pearson product-moment correlations between judges' individual predictions of item difficulties and the consensus predictions provide an assessment of the reliability of the standard-setting process. Also, if the ratings established by one judge have a much higher correlation with the consensus ratings than those of the other judges, it might indicate that this judge wielded a higher level of influence in the consensus process than the other judges. Correlations between the consensus ratings and actual p-values obtained by marginally capable candidates, along with correlations among individual judges' ratings, represent a measure of the validity of the ratings.
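A minimal sketch of these reliability- and validity-type correlations follows (Python). The arrays are hypothetical stand-ins for the judges' predicted p-values, the empirical p-values of the marginal group, and the consensus ratings; here the consensus is approximated as the mean of the individual predictions, which is an assumption made only for illustration (in the study the consensus was set by the panel itself).

```python
import numpy as np

# Hypothetical data: rows are judges, columns are test items.
judge_preds = np.array([
    [0.70, 0.55, 0.40, 0.85, 0.60],   # judge 1's predicted p-values
    [0.65, 0.50, 0.45, 0.80, 0.55],   # judge 2
    [0.75, 0.60, 0.35, 0.90, 0.65],   # judge 3
])
consensus = judge_preds.mean(axis=0)                    # stand-in for panel consensus
empirical = np.array([0.68, 0.47, 0.52, 0.83, 0.58])    # marginal-group p-values

for j, preds in enumerate(judge_preds, start=1):
    r_consensus = np.corrcoef(preds, consensus)[0, 1]   # reliability-type evidence
    r_empirical = np.corrcoef(preds, empirical)[0, 1]   # validity-type evidence
    print(f"Judge {j}: r with consensus = {r_consensus:.2f}, "
          f"r with marginal-group p-values = {r_empirical:.2f}")
```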
Angoff (1991) noted that, in the absence of a clear criterion, differences between standards set by judges and performance by actual "marginally capable candidates" do not necessarily indicate invalidity. The judges had been instructed to rate the items based on the assumption that the students had received several years of instruction consistent with the objectives which underlie the tests. If instruction has any value in promoting learning, then it must be assumed that instructional differences can affect the order in which cognitive skills develop in students. The extent to which judges use their own students as models for the "marginal" examinee is not known, but it would be reasonable to assume that experience with "real" students would have a strong impact on the judges' ratings. When the predictions and actual performance do not agree, then, the differences could be a result of erroneous judgments about (a) how hypothetical "marginally capable" individuals will perform on the items, (b) how individual "real" students will perform on the test, or both. In the absence of a clear criterion, it is difficult to determine if the source of the invalidity is in the judges' perceptions of the students, or the students' perceptions of the items. The judges' predictions of the performance of students in their mathematics classes served as the criterion used in this study, because these predictions were based upon the actual performance of real students, while the predicted p-values were based on hypothetical marginally capable students. This decision strongly affects the conclusions reached in this study since, having established a clear criterion, it is now possible to examine possible sources of invalidity.

Comparisons were made using data collected from judges. In cases where a judge was not teaching a grade in which the tests had been administered, the judge was asked to select for inclusion in the study one or more teachers (of Grade 4, 7 or 10) who had been trained to understand and use the instructional model upon which the tests were based. The teachers made mastery/nonmastery classifications of their students prior to receiving test results. The tests also made mastery/nonmastery classifications of the same students. These sets of results were compared using the chi-square test. These same data enabled a comparison between the modified Angoff method employed in the MEAP standard-setting process (cf. Appendix A for a description of the approach followed) and Livingston and Zieky's (1982) Borderline Group and Contrasting Groups methods.

For this analysis, modifications of the customary Borderline Group and Contrasting Groups methods were used. Instead of Borderline Group and Contrasting Group frequency plots, cumulative frequency plots were used, as follows:

(A) It is common to select the mode of the Borderline Group when this method is used to establish a cut score. The mode is difficult to justify if the distribution of the scores is not smooth (Mills, 1983). This can be especially problematical if the data are multi-modal. For the modification of the Borderline Group method, the median was selected as the cut score.

(B) The intersection of the frequency plots for the "master" and "nonmaster" groups is selected as the cut score in the traditional Contrasting Groups method. This equalizes the false positive and false negative errors at the cut point only.
Selecting the appropriate intersection is complicated when multiple intersections of the frequency plots occur (Mills, 1983). Livingston and Zieky (1982) suggested calculating conditional probabilities of mastery for each raw score and smoothing the resultant probability function by hand. The cut score is then taken at the point where the probability of mastery is 50 percent, thereby equalizing the false positive and false negative errors. It seemed desirable to eliminate the subjectivity of this smoothing process. Therefore, for the modification of the Contrasting Groups method, the cumulative frequency of the non-mastery group was plotted beginning at the highest score and proceeding down to the lowest, whereas the cumulative frequency for the mastery group was plotted beginning at the lowest score and proceeding up to the highest. The point at which these curves intersected was selected as the cut score, thereby equalizing the false positive and false negative errors.

Evaluation of the test items. In a classroom testing situation, the teacher should not be interested merely in finding out how well the student performed on the particular test which was used. The more interesting (and important) question is, "How well can the results of this narrow test be generalized to the broader domain of interest?" For instance, if a student can correctly answer 10 out of 10 items on a computational test, does this mean that the student has mastered 100% of the skill which is called "math computations"? When the Michigan Educational Assessment Program developed the items for the mathematics tests which were used as the basis for this study, a sincere effort was made to represent a wide spectrum of mathematics content and ability. The item-writers were not merely interested in how well the students could perform on a particular sample of items, but rather in how well the students could apply their knowledge about mathematics to a wide range of situations.

A well-developed test can be used to make inferences to the broader domain of interest. Such a test must contain items which are psychometrically sound. For example, when the scored responses (1 = "correct" or 0 = "incorrect") to an item correlate positively and significantly with the total test score (i.e., when the point-biserial correlations are high), the item is apparently measuring the same trait as the rest of the test. Items with very low (or negative) point-biserial correlations are often viewed as problematical, since their primary effect is to contribute "noise" (measurement error) to the test results. Nunnally (1970) and Tucker (1946) both alluded to the notion that the optimal point-biserial correlation should be approximately 0.3; higher values indicate that the item does not contribute anything new to the rest of the test, whereas lower values indicate that the item is measuring a different trait from the rest of the test.

In the analysis of the items used in this study, statistics characteristic of both classical test theory and Item Response Theory were used. From the perspective of classical test theory, it seems reasonable to ask whether judges should be expected to make valid predictions about examinee performance on items which have extremely low empirical p-values and/or very low (or negative) point-biserial correlations. If an IRT model is being used in the analysis, extremely poor fit statistics may indicate that an item is not functioning properly and therefore should be excluded from the test, exempted from the standard-setting process, or weighted low.
(Poor fit may also be an indication that the assumption of unidimensionality is being violated by the item; this could mean that instruction has altered the trait acquisition, or that the item itself is not part of the trait represented by the rest of the test.) If these are the best items available to represent important content in the test, it may be necessary to keep them in the item pool. In the standard-setting process, however, when the statistics for an item are poor, the following procedure is suggested (Mehrens, 1993): (1) make a rank-order list of the items, based upon their empirical p-values; (2) note the items which, on the list, are just above and below the item with poor fit statistics; and (3) compute the average predicted p-value for these two adjacent items. This average should be used instead of the judges' predicted item difficulty in determining the passing score using the modified Angoff method.

In his paper on the internal consistency of judge performance, van der Linden (1982) noted that, aside from latent trait theory, there is no method for measuring the consistency of a judge's prediction of marginally capable candidate performance on test items, because item difficulty is a nonlinear function of examinee ability. Examinee responses to a Guttman scale, for example, are extremely nonlinear when plotted against examinee ability. Even within the sphere of latent trait analysis, if responses to an item do not fit the measurement model, there is no way to specify the relationship between the probability of a correct response and the ability of the examinee. In the tests used in this study, several of the items had IRT fit statistics which were extremely poor, indicating that the responses to the items were inconsistent with what would have been expected given the overall performance by examinees on the test. There were items, for instance, where the judges predicted that 60% of the examinees should get the correct answer, whereas only 10% actually did so. Since 10% is considerably below the chance level (25% for an item with four response choices), it is clear that the examinees viewed the items quite differently from what the judges had anticipated. On the other hand, there were items which the judges thought would be missed by 25% of the examinees, but which were missed by only 5% in the actual test administration. In many cases, very poor fit statistics indicated that the student responses were essentially unpredictable.

There are those who reason that "tests should lead curriculum and instruction," and therefore that "standards cannot be based on current performance" (Rigney, 1992). On the other hand, standards which are too remote from current performance may be subject to the criticism that they are excessively arbitrary and judgmental. In a discussion of the development of scales in psychology, Thurstone (1928) observed that the concept of measurement requires the existence of an appropriate scale of measurement, where the elements of the scale form an ordinal set of attributes to be assessed. He went on to note,

If the scale is to be regarded as valid, the scale values of the statements should not be affected by the opinions of the people who help to construct it. This may turn out to be a severe test in practice, but the scaling method must stand such a test before it can be accepted as being more than a description of the people who construct the scale.
At any rate, to the extent that the present method of scale construction is affected by the opinions of the readers who help sort out the original statements into a scale, to that extent the validity or universality of the scale may be challenged (pp. 547f).

Since the purpose of a test is to place the examinees on a scale, the words "items" and "judges" can be substituted into the above quotation in place of "statements" and "readers." In practice, the judges' ratings of the perceived difficulties of the items could be used to sort the items into an order of ascending (or descending) difficulty. These orderings could be compared both within and across judges to determine if the items appear to be rated consistently. Residuals between estimated p-values which exceed, say, one or two standard errors of measurement could be used to detect "misfitting" items, since their difficulties are affected too much by individual judges' opinions. Eliminating (or substituting empirical data for) judges' predictions for severely misfitting items could increase the probability that judges can set more consistent (and, therefore, potentially more valid) standards for tests.

Correlations Among Judges' Ratings

There are two major purposes for correlating the ratings established by a panel of expert judges. One is to investigate possible halo effects, i.e., to determine if the judges were influenced by factors other than the perceived difficulties of the items being rated. Halo is observed in "what seems to be inflated correlations between dimensions of ratings ..., i.e., correlations higher than warranted by actual ratee behavior" (Borman, 1983, p. 128). Another reason for correlating ratings is to assess the impact of systematic distortions in the judges' ratings, caused by "the absence of relevant cues for a rater" which leads to "nonrandom distortion in the direction of semantic, 'what goes with what' relationships between dimensions" (Borman, 1983, pp. 133f). For instance, judges may assume that students who do well on algebra-related items will also do well on those related to geometry, since students tend to take these courses as a sequence. Although the skills required to perform well in algebra and geometry may prove to be highly correlated, these subjects are not necessarily part of the same unidimensional trait, since one involves relationships among numerical quantities and the other involves relationships among spatial figures. Depending upon the extent to which a judge's views of reality are systematically distorted, there may be situations in which the responses of examinees to the test items will not be in agreement with expert predictions.

Inter-rater reliability is considered to be essential in assessing rater performance. Brennan and Lockwood (1980), in discussing the relative merits of the Nedelsky and Angoff methods, observed that "the validity and practical utility of these approaches, and similar approaches, for practical decision making may rest heavily upon the extent to which raters agree in their judgments" (p. 220). Several methods for comparing ratings between and among judges have been described in the literature. The most common of these involves computing the Pearson product-moment correlation (Ebel, 1972, p. 411) and/or the rank-order correlation of the ratings (Guilford, 1951, p. 395).
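A brief sketch of these two agreement measures is given below (Python; the judges' ratings are hypothetical). The same pairwise correlations, once averaged, also feed the Spearman-Brown reliability estimate discussed later in this section.

```python
import numpy as np
from scipy import stats

# Hypothetical predicted p-values from three judges for the same five items.
ratings = np.array([
    [0.70, 0.55, 0.40, 0.85, 0.60],
    [0.65, 0.50, 0.45, 0.80, 0.55],
    [0.60, 0.58, 0.35, 0.90, 0.62],
])

n_judges = ratings.shape[0]
for a in range(n_judges):
    for b in range(a + 1, n_judges):
        pearson = np.corrcoef(ratings[a], ratings[b])[0, 1]      # linear agreement
        rho, _ = stats.spearmanr(ratings[a], ratings[b])         # rank-order agreement
        print(f"judges {a + 1} & {b + 1}: Pearson = {pearson:.2f}, Spearman rho = {rho:.2f}")
```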
High correlations among judge ratings could indicate either that the judges agreed substantially about the relative difficulties of the items, or that their ratings were affected by some factor(s) other than perceived item difficulty. Judges who have access to actual test results, for example, may be influenced by the empirical data, thereby resulting in higher correlations between judges (Saliba, 1990, pp. 70f; Jones, 1987, pp. 51-55). "Judges presented with item difficulty indices moved their item reevaluations in the direction of the p-values. This trend was uniformly observed across groups of judges and across the majority of the test items" (Saliba, 1990, p. 97). Similarly, judges who collaborate in rating the item difficulties prior to participating in the consensus-rating exercise may influence each other's judgments, thereby producing spuriously high inter-rater correlations. The standard-setting process used in this study was structured in a manner which prevented judges from working together until the final consensus session, and during the standard-setting meeting the judges were not provided with p-values from previous administrations of the tests.

Some experts in the standard-setting field have argued, however, that judges should be provided with the "prior knowledge" of real data, and should be encouraged to work together to reduce idiosyncratic differences. For instance, in a discussion of research involving the Ebel method, investigators concluded that large error variances obtained for the Ebel standards "were due to disagreements among judges regarding the expected success probabilities assigned to groups of items." Their solution to this problem was to provide "reasonable probabilities" for the raters (Jones, 1987, p. 10). There are studies which demonstrate that, given empirical item statistics, judges will generally "move their item reevaluations in the direction of the normative feedback" (Saliba, 1990, p. iv). This "team" effect is worthy of further study.

The correlation between a predictor and its associated criterion is often used as a validity coefficient, while correlations between (among) predictors may be used as reliability coefficients (Pedhazur & Schmelkin, 1991, p. 37). When the relationship between two sets of ratings is linear, the product-moment correlation is used (Ebel, 1972, p. 411). When the relationship is non-linear, other methods (such as curvilinear regression) may be indicated (Pedhazur & Schmelkin, 1991, p. 37). The reliability of the judgments of a panel of judges can be computed from the average inter-judge correlation through the Spearman-Brown prophecy formula

r_kk = (k · r̄) / [(k − 1) · r̄ + 1],

where r̄ is the average of the off-diagonal elements of the correlation matrix for all k judges. Ebel (1951, pp. 411f) computed the reliability of the average ratings of k judges using the formula r_kk = 1 − v_e/v_i, where v_e and v_i are the error and item variances, respectively. Guilford (1954, pp. 396f) provided formulas to use for partitioning the variance into the components due to items, raters and error. All of these measures of reliability were used in this study.

Internal Consistency of the Judges' Ratings

Three different analytical approaches were used in this study to measure the internal consistency of the judges' ratings. These are described below.

Correlation with empirical and consensus p-values: Each judge's p-values were correlated with those obtained by the Marginally Capable Candidates group and also with the consensus ratings established by the panel of judges.
Since the probability of a correct response (p-value) is not a linear function of ability (the relationship between p-value and ability is assumed to approximate an ogive), use of linear correlation may yield inaccurate results.

Consistency index based on latent trait theory: An index of consistency is needed to provide a nondimensional measure of how well each judge predicted the actual test performance of the hypothetical marginally capable candidate. Latent trait analysis, also known as Item Response Theory (IRT), provides a functional relationship between empirically-derived psychometric characteristics of each test item and the ability of the examinee. It is possible, therefore, to calculate an expected probability that a person with a given ability will succeed on a given test item. By performing these calculations for each of the items comprising an examination, and summing the probabilities, one can determine the most likely raw score which would be obtained by this hypothetical person. IRT was used to estimate how closely each judge's p-values compared with a postulated latent trait using an analysis of the residuals, i.e., by computing the absolute differences between each judge's ratings and the IRT model.

In a standard-setting situation, each judge envisions the hypothetical ability of a typical marginally capable candidate. Since the abilities envisioned by different judges are rarely identical, different raw scores will result when the predicted probabilities of success for the items are summed. These differences represent the inter-judge inconsistency in the setting of the standard, analogous to the between-groups variance in ANOVA. For the within-judge variance, van der Linden postulates that (for items which fit the latent trait model) the intra-judge variance can be computed from the residuals between the latent trait model and the judge's predictions.

Once a test has been administered, there are computer programs (such as BICAL, LOGIST, or BIGSTEPS) which can be used to compute a "raw score to ability" conversion table for each test. The item calibrations (difficulty ratings), obtained from actual test administration(s), can be used to compute the modelled probabilities that the marginal candidate (envisioned by any given judge) will select the correct answer for a given item. These are then compared with the predictions established by the judge. The steps in this process are:

1. The judge estimates, for each test item which has been found to fit the IRT model reasonably well, the probability that a marginally capable student will obtain the correct answer.
2. These values are summed to produce the expected raw score for the marginal examinee.
3. The score thus determined is looked up in the "raw score to ability" conversion table to determine the ability value, θc, for the marginal candidate.
4. This value, θc, is substituted into the latent trait model for each test item to determine the probability that a person with that raw score would obtain the correct answer to the item.
5. These probabilities are then compared with the judge's predictions to see how closely the latent trait model was approximated, and the residual between the latent-trait modelled p-value and the judge's prediction is computed.
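The five steps can be sketched as follows, assuming a Rasch (one-parameter logistic) model with hypothetical item difficulties and hypothetical Angoff ratings. The raw-score-to-ability lookup table produced by programs such as BICAL or BIGSTEPS is approximated here by solving for θc numerically; this substitution is an assumption made only for the sake of a self-contained example.

```python
# Sketch of steps 1-5 above under a Rasch (one-parameter logistic) model.
# Item difficulties b[] and the judge's Angoff ratings are hypothetical; in
# practice the difficulties come from program calibrations and theta would be
# read from the program's raw-score-to-ability table.
import numpy as np
from scipy.optimize import brentq

b = np.array([-1.2, -0.5, 0.0, 0.4, 0.9, 1.5])             # item difficulties (logits)
judge_p = np.array([0.80, 0.70, 0.55, 0.50, 0.40, 0.30])   # step 1: judge's predicted p-values

def rasch_p(theta, b):
    """Modelled probability of success on each item at ability theta."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

expected_raw = judge_p.sum()                                # step 2: expected raw score

# Step 3: find theta_c whose modelled expected score equals the judge's raw score.
theta_c = brentq(lambda t: rasch_p(t, b).sum() - expected_raw, -6.0, 6.0)

model_p = rasch_p(theta_c, b)                               # step 4: modelled p-values
residuals = np.abs(judge_p - model_p)                       # step 5: judge versus model

print(f"theta_c = {theta_c:.3f}")
print("absolute residuals:", np.round(residuals, 3))
```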
In an effort to create a nondimensional consistency index, van der Linden normalized the mean residuals by dividing the sum of the residuals by the sum of the maximum possible residuals. The maximum possible residuals are those which would be obtained if the judge predicted that the examinees would all get a difficult item correct, or that all examinees would get an easy item wrong. The steps for computing this index are as follows:

1. Compute the absolute difference between the judge's predicted p-value and the IRT-modelled p-value.
2. Find the average of the differences obtained in Step 1; this is called E_j (the mean error for judge j).
3. Compute the maximum value of the difference between the IRT-modelled p-value and the value of 1 or 0.
4. Find the average of the differences obtained in Step 3; this is called M_j (the mean maximum possible error for judge j).
5. Compute the difference between E_j and M_j and divide by M_j; this is the consistency index C_1j (the subscript 1 indicates that the Angoff method is being used).

C_1j can also be written as

C_1j = (M_j − E_j) / M_j

There is some bias in this method of determining consistency, however. Because the logistic curve is non-linear, small differences in ability can lead to relatively large variations in the probability of success in the region where the Item Characteristic Curve (ICC) has the greatest slope, i.e., where the ability of the examinee and the difficulty of the item are matched (p = 0.5). (An Item, or Test, Characteristic Curve is a plot which represents the probability of success, as a function of ability, on an item or test.) This also happens to be the region of greatest reliability (and maximum information content) of the item. In cases where the items are either too easy or too difficult, however, the slope of the ICC is small, so the reverse situation applies (large differences in ability lead to small variations in the probability of success).

In the standard-setting process, the difficulty of the test is fixed. The judges are not modifying the items, but are merely trying to predict how many items will be correctly answered by a marginally capable candidate. If Judge A believes the test to be very easy (therefore envisioning the marginal examinee to be very high-scoring), then the magnitude of the residuals between predicted and actual performance is likely to be small (because of the low slope of the logistic curve at high values of ability relative to the item difficulties). A similar result would be obtained by Judge B, who believes that the test is very difficult (leading to a low-scoring marginal candidate). If, however, Judge C believes the test is matched to the ability of the marginal candidate (who will thereby obtain a score near 50 percent), the residuals will tend to be larger (because of the high slope of the logistic curve near p = 0.50). This does not necessarily mean that Judge C is not as consistent as Judges A and B; the difference could merely be an artifact of the shape of the logistic curve and the difficulty of the test relative to the abilities of the hypothetical marginal students as envisioned by the three judges.

For this study, the consistency index was used with the caveat that the results may be biased against a judge who places the passing score near 50 percent. Note further that many facets of the content of the MEAP tests are, for many teachers, relatively new and therefore perceived as "hard." As a result, many judges may have been tempted to assign p-values close to 0.5 for a large number of the items. This would tend to depress their consistency indices. Despite this possibility, it will be shown that most of the judges displayed excellent internal consistency in their ratings.
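Continuing the same hypothetical example, the consistency index of Steps 1 through 5 reduces to a few lines; the values of judge_p and model_p below are illustrative.

```python
# Sketch: consistency index from Steps 1-5 above.
# judge_p and model_p correspond to the quantities in the previous sketch.
import numpy as np

judge_p = np.array([0.80, 0.70, 0.55, 0.50, 0.40, 0.30])   # judge's predictions
model_p = np.array([0.82, 0.68, 0.57, 0.47, 0.36, 0.27])   # IRT-modelled p-values

E_j = np.abs(judge_p - model_p).mean()          # Step 2: mean error for judge j
M_j = np.maximum(model_p, 1 - model_p).mean()   # Step 4: mean maximum possible error
C_1j = (M_j - E_j) / M_j                        # Step 5: consistency index

print(f"E_j = {E_j:.3f}, M_j = {M_j:.3f}, C_1j = {C_1j:.3f}")
```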
Comparison of Test Classifications with Mastery Ratings

The primary purpose of this study was to determine the nature and extent of the discrepancies among multiple standard-setting methods. Statisticians have developed various measures of association to quantify the relationship between variables. The Pearson chi-square test is one of the most widely used measures of association, and is used primarily to test the hypothesis that two dichotomous variables are independent (Norusis, 1990). Once several students, for whom mastery/nonmastery states had been established, had taken the tests and received their scores, the effect (on the classification of students by the test) of setting hypothetical values of the cut score over a wide range of raw scores was explored. It was only necessary to vary the cut score and observe the effect it had on the value of Chi-square (or the Phi coefficient) and the error rate obtained for each hypothetical value of the cut score. The "optimal" cut score, relative to the judges' ratings, was defined as the value of the cut score which maximized Chi-square (or the Phi coefficient), or which minimized the error rate. For a given test, it seems reasonable to assume that these will occur at the same cut score, since the maximum Chi-square (or Phi coefficient) implies the best relationship between judge ratings and test scores. Logically, then, this would seem to imply that the cut score which provides the best match between judge ratings and test results would produce the minimum number of classification errors as well. By plotting Chi-square, the Phi coefficient, and the error rate against the values of the cut score, this hypothetical relationship was explored.

Even when the "optimal" cut score (or, possibly, the optimal range of cut scores) has been established, the task of standard-setting is still incomplete. Experts are very reluctant to rely upon a standard which has been determined "blindly" through statistics. This is why psychometricians have been unable to arrive at a single "best" method for establishing cut scores. Consider, for example, the case where the judges' mastery ratings of examinees are taken to be relatively accurate representations of the "truth," as would be the case with the Contrasting Groups or Borderline Group methods. In these cases, statistics based on these "known" mastery states are used. Despite this, most standard-setting is done using a variant of the Angoff method, where data from examinees play at most a minor role (Jaeger, 1989). If the judges merely wish to use the test to separate probable masters from probable nonmasters, they may be willing to settle for a low cut score which simultaneously provides a reasonably significant Chi-square or Phi to those obtained using the average correlations with the Spearman-Brown prophecy formula, both for the individual rater judgments and the panel judgments. This demonstrates that simple correlations can be used to identify those judges who contribute most positively to the standard set by the panel, and to determine how much confidence can be placed in the final outcome of the standard-setting process. For instance, in high-stakes situations it may be desirable to use panels of experts whose judgments yield reliabilities higher than the values of 0.75 to 0.77 obtained in this situation.
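The cut-score sweep described above might look like the following sketch. The scores, teacher ratings, and the 2×2 table layout are simulated assumptions used only to show the mechanics of computing Chi-square, Phi, and the error rate at each hypothetical cut score; the optimal cut can then be read off as the value that maximizes Chi-square or minimizes the error rate.

```python
# Sketch: sweeping hypothetical cut scores and computing chi-square, phi, and
# the misclassification rate for a 2x2 table of teacher-rated mastery versus
# test classification.  Scores and ratings are simulated, not actual MEAP data.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
scores = rng.integers(30, 90, size=200)             # simulated raw scores
is_master = (scores + rng.normal(0, 8, 200)) > 60   # simulated teacher ratings

for cut in range(50, 71, 5):
    passed = scores >= cut
    table = np.array([[np.sum(passed & is_master),  np.sum(passed & ~is_master)],
                      [np.sum(~passed & is_master), np.sum(~passed & ~is_master)]])
    chi2, p, _, _ = chi2_contingency(table, correction=False)
    phi = np.sqrt(chi2 / table.sum())
    error_rate = (table[0, 1] + table[1, 0]) / table.sum()   # false pos + false neg
    print(f"cut={cut}: chi2={chi2:.1f}, phi={phi:.2f}, error={error_rate:.1%}")
```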
Contrasting Groups and Borderline Group cut scores: For the Grade 4 students, mastery/nonmastery decision data were obtained from members of the standard-setting panels whose students had taken the MEAP mathematics tests for which standards were being established. In cases where a judge was not teaching a grade in which the tests had been administered, the judge was asked to select one or more teachers who had been trained to understand and use the instructional model upon which the tests were based. These teachers, in turn, rated their own students as "masters," "nonmasters," or "marginal." All of the students were rated prior to the receipt of test results by their teachers.

The cumulative frequency distributions of scores obtained by the three rated groups (master, nonmaster, and marginal) are displayed in Figure 4.2 below. For the Contrasting Groups method, the cumulative frequency of the nonmastery group was plotted beginning at the highest score and proceeding down to the lowest, whereas the cumulative frequency for the mastery group was plotted beginning at the lowest score and proceeding up to the highest. The point at which these curves intersected was selected as the Contrasting Groups cut score, thereby equalizing the overall false positive and false negative errors. For the Borderline Group method, the cumulative frequency for the marginal group was plotted beginning at the lowest score and proceeding up to the highest. The point at which this curve reached 50 percent was selected as the Borderline Group cut score, thereby splitting this group in half.

Figure 4.2 Contrasting Groups and Borderline Group Method (cumulative frequency versus raw score for the master, nonmaster, and marginal groups, Grade 4)

A "false positive" error is defined as an event in which an examinee who had been rated as a "clear nonmaster" by the teacher/judge is classified by the test as a "master." A "false negative" error occurs when an examinee who has been rated as a "clear master" is classified by the test as a "nonmaster." Most of the error at the panel-established cut score of 69 is of the "false negative" type, where students who had been rated as "clear masters" by their teachers were classified by the test as "nonmasters." Had the cut score been set below the lowest score obtained by a "clear master" (a score of 43 on this test), there would be no false negatives, but the number of false positives would be large (69%). Had the cut score been set above the highest score obtained by a "clear nonmaster" (a score of 83 in this case), the false positives would be eliminated, but the false negatives would be large (over 70%). The false positive and false negative error rates are equal at a score of 63, as shown by the intersection of the frequency plots at that score in Figure 4.2. By definition, then, the Contrasting Groups method places the cut score at 63, the point where the frequency curves intersect. The Borderline Group method sets the cut score at the median of the marginal group (a score of 61). The tetrachoric correlation for this data set is 0.92 (Davidoff & Goheen, 1957). This is considerably larger than the maximum phi coefficient (0.80).
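A sketch of how the two cut scores can be located from rated groups of students follows; the score distributions are simulated, and the curve intersection is approximated by equating the two error rates over a grid of candidate cut scores.

```python
# Sketch: locating the Contrasting Groups and Borderline Group cut scores from
# teacher ratings and raw scores.  The data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(1)
master_scores = rng.normal(70, 8, 150)     # students rated "clear master"
nonmaster_scores = rng.normal(55, 8, 100)  # students rated "clear nonmaster"
marginal_scores = rng.normal(62, 6, 60)    # students rated "marginal"

# Contrasting Groups: find the score where the percent of masters failed by the
# test equals the percent of nonmasters passed by the test (equal error rates).
cuts = np.arange(30, 91)
false_neg = np.array([np.mean(master_scores < c) for c in cuts])     # masters failed
false_pos = np.array([np.mean(nonmaster_scores >= c) for c in cuts]) # nonmasters passed
cg_cut = cuts[np.argmin(np.abs(false_neg - false_pos))]

# Borderline Group: the median score of the marginal group splits it in half.
bg_cut = np.median(marginal_scores)

print(f"Contrasting Groups cut = {cg_cut}, Borderline Group cut = {bg_cut:.0f}")
```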
Figure 4.3 shows the results of varying the cut score from 30 to 80 and computing Chi-square, the Phi coefficient, and the error rate for the comparison between test-derived mastery states and teacher-assigned mastery states for the Grade 4 students. These results indicate a very strong relationship between mastery ratings and test results over a wide range of cut scores (p < 0.01 from a score of 34 through 80). The optimal cut score, based upon the highest chi-square and lowest error rate, is 60 (out of 92). At this score, the error rate is less than 10%, the chi-square is 88 (p < 0.0001), and the phi coefficient is 0.80. The actual cut score set by the panel of judges was 69, which produces an error rate twice as large as that produced at the optimal score of 60.

Figure 4.3 Chi-square, Phi (×100), and Error Rate (%) versus Cut Score for the Grade 4 Test

Figure 4.2 also shows that the panel-established cut score of 69 misclassifies 36% of the "clear masters" and 6% of the "clear nonmasters." The standard error of measurement (SEM) on this test is 7 raw score points. By reducing the panel's cut score by 0.9 SEM, the false positive and false negative errors are equalized; reducing it by 1.3 SEM minimizes the error rate and maximizes Chi-square. These results for all three standard-setting methods are summarized in Table 4.5 below.

Table 4.5 Cut Scores and Error Rates for Three Standard-Setting Methods, Grade 4 Test

Method                  Cut Score   False Positives   False Negatives
Angoff (consensus)         69             6%               30%
Contrasting Groups         63            12%               12%
Borderline Group           61            16%                8%

Instructional alignment: Was there any evidence of differences in test performance based upon the emphasis placed on various strands in the mathematics instruction reported by teachers? A correlation of the student performance (strand scores) with the instructional alignment was performed. Strands which were reported as having the most emphasis were coded "1." Those with the least emphasis were coded "-1." Others were coded "0." If student test performance is improved because of the reported instructional alignment, these correlations should be positive. The results for the Grade 4 test are shown in Table 4.6 below.

Table 4.6 Correlations Between Reported Instructional Emphasis and Strand Performance, Grade 4 Test

Content Strand                 Correlation
Whole Numbers & Numeration        0.04
Fractions                         0.10
Measurement                      -0.23**
Geometry                          0.14*
Statistics & Probability         -0.14*
Algebra                          -0.11
Problem Solving                   0.32**
Calculators                      -0.13

Process Strand                 Correlation
Conceptualization                 0.23**
Mental Arithmetic                -0.06
Estimation                       -0.32**
Computations                     -0.12
Applications                      0.09

NOTE: ** Significant at p < 0.001; * Significant at p < 0.01

There are six significant correlations: for Measurement, Geometry, Statistics, Problem Solving, Conceptualization, and Estimation. Half of the significant correlations (Geometry, Problem Solving, and Conceptualization) are positive and half (Measurement, Statistics, and Estimation) are negative. The mean correlation is -0.01. As a whole, then, it can be concluded that instructional emphasis, as reported by the Grade 4 teachers, had no consistent effect upon student test performance.
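Correlations of the kind reported in Table 4.6 have the general form sketched below; the scores and emphasis codes are simulated, and the coding (+1, 0, −1) follows the scheme described above.

```python
# Sketch: correlating students' strand scores with the reported instructional
# emphasis of their classroom (+1 most emphasis, 0 neutral, -1 least emphasis).
# The data are simulated for illustration only.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_students = 300
emphasis = rng.choice([-1, 0, 1], size=n_students)                   # teacher-reported emphasis code
strand_score = 60 + 2.0 * emphasis + rng.normal(0, 10, n_students)   # strand score (percent correct)

r, p = pearsonr(emphasis, strand_score)
print(f"alignment correlation r = {r:.2f} (p = {p:.4f})")
```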
There are several ways such a result could be interpreted. Assuming that the teacher self-reported instructional alignment was accurate, one could surmise that instruction had no effect on test results. Another interpretation is that student learning was more closely related to curriculum than to instruction, and that a better indicator would be curriculum (or textbook) alignment rather than instructional alignment. A third logical explanation is that the teacher self-report was not an accurate reflection of what was actually taught in the classroom. Independent of the choice of an explanation, it can be concluded that the data collected showed no consistent relationship between reported instructional alignment and student test performance.

Rater drift: Was there any evidence of drift among the Grade 4 judges? It is conventional in test development to arrange the items from the easiest at the beginning of the test to the most difficult at the end. The items on the MEAP mathematics tests were grouped around themes with little or no regard for item