1‘ m-p:.‘,}’§7""””“"""

g.ﬁ
£3.59 .‘~
i ‘7

‘ I

f. ”if
333%”;
.mrig‘r ‘

. ..
91:”.1:,m
'sz‘ 'r‘

1.3%

‘1 I

«Hark $191?

WM “c

Hays!“ , :
, .

1vW-.

. > "5* .7 {.1 a $575.”. ﬁr!
> , ‘ 5“ ‘ tn F»
. I: "91' ' x h‘ ‘3:
wk gt“: ~42” “

'31.}: fat
3 - a.» .n
t“,

1

» ' 19
,J v, . 4.3; ' ' ' 1337'
r, , , “

“N.
. 1153'
€an p?§%‘ ‘1',
‘ ‘:“:ng ? JJH’u,AﬁW
.. .5

~ I". . ‘ -
ing’ISJ-zgsm gu-
.. v

I-(«§~ ‘1') J.
" 'S‘EWL.
*‘TA'TJZL" u
.. W mw ‘
.In_ ‘I
5!

.mmmilr
no” ‘ . #1:”).11;
V E. ‘ . > h. n/ ~.'
'_ . _ _‘ x 7 , r .. d ‘3:
‘Lubslirsi’téi'h" “ ' "i
a?» . ““49-
3 " ' '

V\
A“::‘{
”3‘3““

q'ubﬁ:
In

My. 2*
r

’3‘... a

421m! 322%?“
‘ «5’15:

9‘
u

’4.

a

an *
Q“;

,_I >. .7
, _"‘)§5Y"ri

. A l
,
:1; 34'
..

lumen gr"
La;- 2*

2’38..-

.4.T‘J2L1' '
.

s: r
-m 2:-
new:

.~
..
v

N O- F'V‘ '” —.
:aﬁiﬁ

s
.0“..-
..

' fir ». h
.1 .%?W(
* ﬁggh
, , z“-

"a: ..';: ~
: ”15.3.53“: "

75.. .
«‘23. ‘5'

~

' K.

k

at
~|l‘n
1-";
sz‘zﬁx “k,” a- ,
—~ . ~: "EFH' ‘-
a

ac? £351;
var.“

.. u

m'l.‘

‘. “4 .
MM'E'M

ill; .3
M “2
7.29“". . g I“! ‘KY

3" 77.?

,LWns

\ mm , 7
‘r’

Vsp .
:‘ekv‘
1"».-

:z‘m“

53’"?

' 72%“) n
P’

:3
34:11:33.“.
.92" xw‘HM"
m

-,.,..
-5.

“
kn‘a‘: 2
. .

vl-n‘".
u.~ :-
up

 

 
    

llllllllllllllln

 

LIBRARY
Michigan State
g University

 

 

ﬁmbmmMUmamem
dissertation entitled
An Investigation of One Alternative to the Group-process
Format for Setting Performance Standards on a Medical

Specialty Examination
presented by

Gregory J. Cizek
has been accepted towards fulﬁllment

of the requirements for

Ph.D. (kgmehi Measurement,

Evaluation, and Research Design

g5‘§as::§%3;;2§:é£24zr25&444«/
jor professor

0-12771

 

”511;“... Afr .‘ - ' '- 'OP: '. 1 .-..-..

 

 

PLACE iN RETURN BOX to remove this checkout from your record.
TO AVOID FINES return on or before date due.

 

DATE‘DUE DATE DUE DATE DUE

 

 

 

 

 

 

 

 

 

, ‘ A??? 0 5 2007
1 M 9 07
ﬁHN—G—B—ZBCB
710 1 '11 n
‘ 1r; 1 r. U ‘
l .
L3: 3 any-g;

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

MSU is An Affirmative Action/Equal Opportunity Institution
snowman-pd

 

wm~n.__

 

 

 

AN INVESTIGATION INTO ONE ALTERNATIVE TO THE GROUP-PROCESS
PROCEDURE FOR SETTING PERFORMANCE STANDARDS ON A
MEDICAL SPECIALTY EXAMINATION

BY

Gregory J. Cizek

A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
DOCTOR OF PHILOSOPHY
Department of Counseling, Educational Psychology,
and Special Education

1991

 

0c
\
’7

\s

‘1) f) -

 

AETRACI‘

ANINVESPIGATIOIINIOQIEAUIERNATIVETO'BIEWMCESS

Gregory J. Cizek

missuidyexaminedonevariationofthetraditioralgmip-process
procedure for establishing passing standards on a medical specialty
examination using the Angoff methodology. 'Ihe variation consisted of
requiring subject-matter experts to provide Angoff ratings
indepeniently, without group interaction or other sources of
information. 'Ihe study also sought to isolate the effect of group
interaction and information—sharing thrulgh carparison to a group-
process condition and a condition in Whidl independent item reviewers
were provided with distrihrtions of other the independent reviewers'
ratings.

'Iherewereseveralmajor findinginthestudy. Itwasobserved
that the independent procedure produced a nonsignificrantly higher
passingstandardthanthegmip-pmcessproceduredid. 'Iheabsenceof
statistical significance, however, did not exclude large practical
cortseqtiernesfortheinterestedgroips, suchastheexamjneesandthe
standard setting board. These practical consequences are described and
discussed. Also, it was observed that individual item reviewers'
ratings were more variable in the independent condition carpaned to the

gram-process procedure. 'Ihe independent condition was also 1&5

 

 

costly to inplanerrt. Item reviewers in both conditions produoed

ratings that exhibited less than desirable accuracy in terns of
estimating the performance of the hypothetical minimlly-caxpetent
91'0“!)-

'Ihe provision of additional information to the independa'xt group in
theformofdistrihxtionsoftheirowninitial itemratingsresultedin
subsequent ratings that were significantly higher and less variable,
hitdidmtresultinmrepreciseestinatesofperfomarneforthe
minimlly cmpetent group. However, independent raters amarerrtly
utilized the additional information provided as distributions of
ratings. It was found that krmledge of a reviewer's initial rating
andthegroup'sinitialmeanitemratingwasamderatelygood
predictor of a reviewer's subsequent ratings.

Inplicatiors for future design of standard setting procedures and
policy catsiderations are discussed.

 

. on,
. aimiﬁ
. F i

i... .

 

 

I appreciate the patience and encouraganerrt of those who have

helped me with this project: Stephen Raudenbush, Irvin lamann, Diana
Pullin, William We, David labaree, and Stephen Yelon of Michigan
State University.

I am most grateful to my wife, Rita, and our children, Caroline,
David, and Stephen for their unfailing love and support, and to my
parents for their enduring confidence.

IthankGod forhisblessings, as surely evidermdtomethrough

these people who have given so Mich.

iv

 

MEGFCINTENB

M PACE
cnpteri—pmmen ..... .......... 1
I Introduction ........ . ................. .. ...... 1
Background ......... ....... . .............. 4

Need ................ ..... ................. ......... . 5

Purpose ........... ........ 10
aiapterZ-Review of PreviousResearch . .......... ..... 12
Methodological Developnent .......... . 12
Inter-methodological Research 19
Intra-methodologicalResearch ..... . ..... 21
Chapter3-StmdyDesign ........ 30
Experimentl ......... ..... 30

Empirical Treatments . ....... ....... 31
ControlGroup ......... 32
TreatmentGroup ........ .. ....... ........ 36

Subjects .......... ......... 37

Consent ......... ..... ... ..... ............ ..... ... 38
ValidityOoncerns ...... 38
Imtnmientation ..... . .............. .. ...... .. 40

Statistical Analyses ........... 42

Ebrperimen‘tZ .................... 52

Empirical Treatment ...... ....... 53
ValidityConoerns ....................... 55

Iretrumentation .................... ....... 56

Table of Contents (corrt'd)

Statistical Analyses ...................... 56
Chapter 4 - Results ................ . ...... 60
Ecperimerrtl ........... 60
Between-groupMeanDifferences 60
Within-group Differences 67

Relationship betweenGrotmandIndependent
Ratings 69
Relationship toObtained Item Statistics 74

Relationship between E and E' and Reviewer
Generalizability Analyses 78
Betseen-conditionMean Differences 89

Relationship between With-information and
tic-information Ratings .......... 97
Decision Consistency 99

Relationship of Ratings to Obtained Item
Statistics ...... 101
RegressionAnalysce . ..... ...... 104
Carbined Reslllts .......... ..... . ........ .......... 107
ChapterS-Diswssim .......... ..... 111

WWI-WOOOOOOOOOO ..... .0... ..... .0... 111
mmtjmsamvariability OOOOOOOOOOOOOOOOOOO. 111

vi

 

 

Table of Contents (cont'd)

Relationship of Ratings to Obtained Item
StatisticsandReviewer diaracteristics
GeneralizabilityAnalyses . ........
CostAnalysis .....
Relationship of Ratings to Obtained Itan
RegressionAnalysis .....
Discussion of Combined Analysis
Simmary of Findings and Implications
IdmitatioreardSuggestions forFutureResearch
Appendix A - Inter-methodological Carparison of Standard-
settinngoedures InvolvingOneorMore Absolute
Starriard-setting Methodologies
AppendixB-Passing ScoreMeetingInformational Materials
AppendixC-Sanple ItemRating Collection Form ......
Appendix o- Sample Post-meeting passing Score Study
AppendixE-nata Law for Ecperimentl
AppendixF- Sanple Rating Form for Experimentz

m-StOfkfm OIOOOOOOOOOOOOOOOOO0.00.00.00.000.0.0.000...

vii

113

113

115

116

117

117

118

120

121

124

137

142

143

149

150
151
152

153

 

Table

Table

Table

Table

Table

Table

Table

Table

Table

Table

Table

1

2

3

5

6

7

9..

10

IISI'CT'IREUB

Description of Practice Items Used in Passing
Score Starinmup Training Session 34
Descriptive Statistics for Independent and Group-

Process Reviewers Across 200 Items 61

- Test for Significant Mean Differences between Indepen-

dentandGroup-process ConditionPassingScores 65
Wized Block ANOVA Results for Independent arri
Group-process Conditions ........ 68
Intercorrelation Matrix of Ratings frm Independent

andGroup—process Corriition Reviewers ....... 70

- Indices of Decision Oorsistency for Independent

andGroup—prooess Conditions ..... .......... 73

- Absolute and Relative Errors of Specification for

Item Reviewers in Independent and Group-process
Conditions ....................... 76
Sunmary of Generalizability (G-study) Results

for IndependentandGroup—proosss Conditions 79
Summary of Generalizability Analyses (d—study)
Results .. ...................... . ............ ...... 81

- Carparison of Costs for Conducting a Passing

Score Study under Group-process and Independent
mitiors ... IIIIIIIIIIIIIIIIIIII t. ..... 0.0.0.... 86

11 - Descriptive Statistics for No-information and

With-information Reviewers across 100 Items ...... 91

viii

 

Venice’s} : ': ‘ -.

 

maximize- ..T

    

66
..........._

. _
. ,u.,_ . e ‘
[05 .‘I‘)(...'. .4“. .L or z
3129')?" -

)r. .1 .1n

n v I 6.2.3.1'. 1“}!

 

List of Tables (cont'd)

Table 12 - Test for Significant Mean Difference between
No-infonation and With-infornetim Condition
PassingScores .. 94

Table 13 - Repeated Measures AMNA Results for No-infornation
andWith-informtim Conditions 96

Table 14 - Irrtercorrelation Matrix of Ratings fran
tic-information and With-information audition

Table 15 - Indices of Decision Consistency for No-informatim
andWith-information Conditions 100

Table 16 - Absolute and Relative Errors of Specification for
Itan Reviewers in No—information and With-
information Conditions 103

Table 17 - Regression Analyses for Individual Reviewers in
Erperiment 2 106

'I‘ablelB-anparisonofﬁbcperinentlardbcperinentz
SuggestedPassingStandards ..... 109

 

 

 

LISPWFI“

Figlnel-PlotofIIﬂeperderrtarderp-processmrditim
MM'W 0.00.00.00.00......OOOOOOOOOOOOO 62
Figure 2 - Plot of bio-information and With-informtim

MiM'm 0.0.0.0..........OOOOOOOO... ..... 92

 

 

 

 

I.PK)BLEM

lgtroduction

The licensure and certification processes represent the efforts
of govermental and private entities to ascertain and recognize the
competence of individuals in the practice of a profession or trade.
Licensure, as canmonly understood, is the granting, by a governmental
entity, of the right to legally practice a profession or trade. The
right, or license, is granted pursuant to the individual's
demnstrated acquisition of the knowledge or skills required for _sa_fe
practice. licensure programs are conducted by governmental entities
in their effort—and charge—to protect the public against unsafe
practice. Certification is the process by which non-goverrmiental
entities, commonly professions or associatiors, confer a credential.
'Ihe credential is also usually only conferred upon the individual
after demonstration by the individual that a specified level of
knowledge or skill has been acquired (Shimberg, 1981) .

As reported by Nafziger and Hiscox (1976), over 2000 occupations
employ some type of licensure or certification procedures. That
number is surely increasing, even leading some to label Americans "the
credential society" (Collins, 1979) .

Additionally, many entities which once issued permanent licenses
or certificates have now begun to reassess the concept of lifetime

credential. Instead, time-limited certification or re-credentialing

 

   

f;
.._i

,

 

undue-gr: 5 new“.
Mivtfrri m: nary:

3'0 MEI Lei 312?: V,

4

season .mman

‘

’ijr-Fenis'e'ni :o 33,5:33:

fﬁtaimo-‘xz ‘3')
I. \
0,.
5'!

rL- f'

Iﬂfﬁi'a'T‘... was-:2; 1 ﬂow.-

(‘7‘- ! "

 

AALA

 

 

 

concqits have begun to be seriously entertained and often implemented,

especially in rapidly-among technical fields such as the medical
professiors (American Board of Medical Specialties, 1987) .

The carpetence required of a candidate for licensure or
certification is usually stated in terms of requisite knowledge,
skills, and abilities. Verification that the individual has acquired
the knowledge, skills, and abilities is often linked to one or more of
three canponents: a minimum educational attaimnent, a minimum practice
crexperiencerequirernent, andaminimumlevel ofperformanceonan
cbjectivet$t. 'nteaaxnimtionsusedaspartofthethirdcanponent
are increasingly criterion-referenced ones.1 Hambleton, Swaminathan,
Algina, and Coulson (1978) have defined such tests as ones that are
"used to ascertain an individual's status (referred to as a detain
score) with respect to a well-defined heavier domain" (p. 2). These
tests consist of items that are a "representative set of items from a
clearly-defined domain of behaviors measuring an objective" (p. 3).

'meprcsentresearchfocissesonthelastofthethree
carponents in licensure and certification testing programs—the
criterion-referenced test. Specifically, this research examines one
particular test score of unique interest—that score from which
emanates inferences of mastery or competence—the passing score.

The passing score on an criterion-referenced examination

represents the establishment of a standard of performance judged to

 

1 It is recognized that terns such as "criterion-referenced,"
"domain-referenced," and "norm-referenced" precisely describe test
score interpretations and inferences rather than the instruments
themselves. However, imprecise use of these terms in referring to
instruments is ubiquitous—even among measurement specialists
(Cronbach, 1989) . 'Ihis relaxed usage, though imprecise, is followed
throughout this manuscript for purposes of ease and clarity.

 

3

beacceptable. Itisthelowestscorethatpermitstheexamineeto
receiveﬂielioettseorcrederitial. Smetimes, thoughlessandless
so, thepassingscoreissetinamfmmamier. 'nntis,
thepassingscoreisfbtedrelativeto,ordeperdentupon, the
performance of sane group. For exanple, a rpm-referenced or
"relative" approach to standard-setting might result in requiring
aamineestoscoreatorabovethe85thpercentile, oratorabove
sanemmberofstardarddeviatimsawayfrunaverageperformamem
the examination.

thiever, because the forms of licensure and certification
prograns has increasingly becane that of assessing examinees'
cmpeternewithrespecttoapre-judgedstandardofperformance,
norm-referenced standard-setting procedures have been called into
question in terns of their propriety for the stated purpose. In
their place, "absolute" or criterion-referenced methods of
establishirtgpassirtgstaiﬂardsitavebecanemreccmnon. 'Iheabsolute
methodologies, while boasting of greater intuitive and political
appeal, still face diallenges with respect to the validity of
inferencesthataremadeasaresultoftheirresultingstandards
(Jaeger, 1979) . Specifically, the possibility of establishing a
standardthat results inthefailure ofatrulyccupetentperson (a
"false negative") or results in the passing of a truly incompetent
person (a "false positive"), is of particular concern.

Criterion-referenced standard-setting methodologies have clearly
not yet accanplished technical perfection: nuch work remains to be
done in this area (Hanbletcn, et al, 1978; Angoff, 1988). The prcsent
mseardiaddressesoneaspectoftheprocessbywhidistandardsare

    
    
     
   
   
   
  
 
   
   
   
   
  

mimic:
to ‘(ﬁir’icv (r; ::
maxim 9-1 ‘_ '

s (ril'dickze‘a: t: . L
5) {0'13} molar-h.

oar-5am it 3:11;: . .~

«which M—‘m £51.31"

‘ d 03 art-s2: 3.2:: v:

#m cm new: .- 7 , .-
9 Formula liaise \‘c 1 1'- . ‘ : ‘ s _ .. -

.‘

.. '. X

 

set on a criterim—referenced certification examination in a medical
specialty.

Lek-ems

Since at least 1954 when Nedelsky sought to derive "absolute
grading standards for objective tests" (Nedelsky, 1954, p. 3), the
probl- of how to establish passing standards on criterion-referenced
educational assessments has persisted. Nedelsky's early work
prompted investigation of alternative standard setting procedures
designed to establish passing standards that differed from the
doninant norm-referenced approaches of the tine. Nedelsky's
objective, and that many of contemporary researchers in the field of
standard setting, was straightforward:

"The passing score [should] be based on the instructor's

judgnent of what constitutes an adequate achievement on the

partofasttxientandnotcmtheperformancebythestudent
‘ relative to his class or to any other particular group of

students" (Nedelsky, 1954, p. 3).

The past three and one-half decades have witnessed the
introduction of many alternative methodologies that have shared the
sane dajective—ncvenent away fran the daninant norm-referenced, or
relative, approaches. Among the proposed "absolute" methods as they
are sometimes called, the most well-io'iown are those proposed by
Nedelsky (1954), Angoff (1971), Reel (1972), and Jaeger (1982).

other nethods have also been introduced that have tried to
achieve a caupmnise between the absolute and relative approaches.
Proposals by Bank (1984), deGruijter (1980), and Hofstee (1983)
represent attempts to synthesize absolute and relative nethods.

Taken together, all of these methods represent efforts to

 

 

 

 

fonalize a set of rules for establishing passing standards in a less

arbitrary, or at least more justifiable, fashion than traditional,
norm-referenced practice has offered. 'Ihe methods rely primarily on
the use of subject matter experts' (hereafter dolled "SMEs")
judgments concerning one or both of two critin elements: a
conceptualization of the "barely-passing," "mininally-ccmpetent," or
"borderline" examinee; and, an expectation regarding the level of
contenthmledgearﬂskillthatsuzhanacamineeshouldpossess
(Livingston & Zieky, 1982).

After initial research efforts to derive absolute and, later,
carpmnisenethodsofestablishingpassingstaniards, asecondstream
ofresearchdeveloped. 'Ihissecondlineofirqﬁryfomssedmainly
on differences pm methodologies (Mills & Melican, 1988).
Investigations carparing two or more methods characterized this
secondphaseofstarxiard-settingixquiry. AppendixAlistssomeof
these ﬁlter-nethodological investigations.

Recently, however, athirdphaseofreseardlneedhasenerged.
Researd'l in this phase is characterized by attempts to identify

sources of variation within standard—setting methods.

ﬁeld
'Iheproposedreseardiiscloselyalignedwiththethirdphaseof
research into standard setting methodologies and focusses on one
method—the Angoff method. The Angoff method and its variations
(sometimes called "Modified Angoff" procedures) are derived from the
work of Angoff (1971) and others. The Angoff methods require SME‘s to

serve as item reviewers and to scrutinize each item in an

 

 

 

6

examination, usually prior to the administratim of the examination.
'meitenreviewersarethenaskedtojudge, foreachitem, the
proportion ofminimallycmpetentexamireeewhowillanswerthe item
correctly. ‘Ihe item reviewers' judgments, in the form of
proportions, are cannonly referred to as "Angoff ratings."

'memw-preferredneansofobtainingtheitemreviewers'
judgments utilizes a W format. In this format, the panel
of SMEs is convened in a single location, provided with training in
the starriard-setting methodology, and directed to provide their
ratings foreachiteminatest. 'megmlp-processformatisoften
preferred because, predictably, item reviewers do not produce
identical ratings and the group-process format provides a means of
resolving the differences in ratings. Most researchers agree that
this reduction of variability is desirable (Jaeger, 1988; Meskauskas,
1986; Snith, Smith, Richards, & Earnhardt, 1989). However, it is
cannon that an extensivepportion of a grwp's meeting time is devoted
to discussions about individual test ite'ns, debate, and, when
applicable, to corsensus-readling regarding the ultimate rating for
each test item.

Several problems arising frcm this format necessitate the
investigation of alternatives to the traditional group—process
format. Norcini, Lipner, Iangdon, & Strecken (1987) summarized two '
of the problems, including: the tediousness of the task of reviewing
individual items and reaching consensus ratings (especially when a
large number of items is involved): and, the expense of empaneling a
sufficiently large group of SMEs in one location for, perhaps,

several days. 'Ihese problems are especially evident in the area of

 

 

 

professional licensure and certification where hundreds of

credentialing prograns employ criterion-referenced standard-setting
methodologies, nest of these relying on subject matter experts
participation in a traditional gram-process format to obtain item
ratings.

Another frequently encan'rtered problem is simply arriving at a
single block of time that is available for each SME on the panel of
itan reviewers. This problem has been characterized by Iockwood,
Halpin, and McLean (1986) as one of the "situational'constraints"
(p. 6) in the standard-setting process. Hambleton (1978, p. 282)
specifically addresses the problem of time resource availability as
one of the four primary considerations in selecting a standard-
setting methodology.

In addition to the need for research to suggest alternatives for
addressingtheproblenscreatedthroxghuseofthegroup-process
fontatinstarxiard-settingstudies, researchisneededtoecaminethe
effect on resultant standards when such alternative strategies are
tried. Many researchers have conducted comparative studies of
standard-setting methodologies which employ a group-process format.
Also, most have offered an opinion concerning the appropriateness of
the group-process technique. For example, Brennan and Iockwood (1980)
opine:

"Scmetimes...it is suggested that a cutting score be

determined by a reconciliation process. For example, after

the five raters in this study completed the Angoff

procedure, theywereinstructed, asagroup, toreconcile

their differences on each item. One typical result of
using a reconciliation process is that certain raters tend

to dominate, or to influence unequally, the reconciled

ratings... There is a certain logic to using a

reconciliation process that appears to be compelling. It
might be argued that the ideal of using either the Nedelsky

 

 

 

 

 

orﬂlek'goffprocedtneisforraterstoagreemevery
itan. 'Iherefore, why not force them to concur? One
arglnnentagainstthislogicisthatforcedconsensusismt
agreement, although forced consensus may effectively hide

. Also, a reconciliation process does not
guaranteethatthesameaxttingscorewillresulteaditine
a study is replicated" (p. 235-236) .

AltlnlghBrennananiIoclmood'sranarksgobeyozﬁtheeffectof
grum—processardexterdintotherealmofrequiringconsensusofthe
expert group, their logic is equally applicable to the traditional
group-process condition. 'Ihat is, after appropriate training of item
reviewers, the condition of group-process may not be mcessary,
desirable, or efficient for use in all standard-setting procedures.

Jaeger (1988) offered his opinion on another aspect of achieving
agreenent among iten reviewers:

"Achieving consensus on an appropriate standard for a test
isanadmirablegoa1(certain1yguaranteedthroughtheuse
ofasingle judge), butit shouldnotbepursued atthe

expense of fairly representing the population of judges
whosereconmendatiorsarepertinenttothetaskof
establishing a workable and equitable test standard"

(p. 29).

Maslow (1983) has remarked that lmowledge about "the optimal
size and structure for the group of judges" is "basic to improving
practice in standard setting (p.104), and that "the research
literature gives only brief and unsteady guidance here" (p. 105) .
While some investigation of the issue of optimal group size has begun
(Smith, Smith, et a1, 1989), the issues surrounding optimal group
structure remain largely unaddressed.

Meskauskus, (1986) has appropriately, and succinctly, noted that
"[there] is a need to explore the determinants of intrajudge and
interjudge variance in de " (p. 200).

Mills and Barr (1983) reported that:

 

 

 

"While general information cmcerm'ng procedures for
implementing the methods and alwlating cut-off scores is
available, specific guidelines are less well established.
Issues of training, groupinteraction, independent ratings
vs. discussion all affect the methods, but little is
available in either discussion or guidelines concerning
these and other implementation issues" (p. 2-3).

In 1984, Fitzpatrick perceived the need for research m
standard—setting procedures in an integrative work applying research
intheareaofsocialpsydlologytotheproblemsofstandardsetting.
The need persists, as Fitzpatrick (1989) notes; specifically there is
aneedbythoseinvolvedinstandard—settingresearchtoixwestigate
the effects of group processes:

"We must ask whether it is desirable that the decisions

that [item reviewers] make be affected by interpersonal

comparisons, by cognitive learning through the exchange of

information, or by both types of processes" (p. 321) .
Powssinginonthesocialaspectstlataffectgroup—basedstandard—
setting methodologies, Fitzpatrick goes on to argue that:

"standard-setting procedures should be designed to both

minimize the effects of social comparison and maximize the

effects of certain informational influences on the

decisions to be made" (p. 322) .

In summary, Fitzpatrick specifically urged that:

"procedures proposed for reducing the impact of undesirable

influences standard-setting context should be

investigated.in Whether or not the suggested procedures will

be effective can only be decided by further research" (p.

325).

Unfortunately, scant attention has been paid to these, and
similar, aspects of intra—methodological variation. Specifically, as
Mills and Barr (1983) and Fitzpatrick (1984) have both remarked,
little evidence has been brought to bear on the effect of the
presence or absence of the group-process condition. Fewer still

appealing alternatives to the group-process format have been

 

 

 

 

10

proposed. alrry (1987) has summarized the existing state of affairs
aptly:
"Mmst all of these authors [on standard setting]
acknmrledge that the expert group process will have

significant impact on the validity of the outcome, few have
examined the dynamics involved" (p. 1).

m

'Ihe present research attempts to identify an efficient variation
of the traditioral group-process method for use with the Angoff
approach to establishing passing standards on a certification
examination.

Using the Angoff (1971) method, the present research campares
twoproceduresforestablishingpassingstaniardsonamedical
specialty certification examination. The two procedures used are:
1) the traditional group-process method; and, 2) an "independent"
condition in which item reviewers provide their item ratings in
isolation (i.e., without the effects of group-process). An attempt
ismadetodeterminewhether, afterbothgroupsofitemreviewersare
provided with initial training in the Angoff method, results obtained
from the group-process condition differ from those obtained in the
isolation condition.

The primary focus of the Angoff standard-setting method is to
identify a passing score for an examination. Accordingly, the
primary focus of this research is to establish whether there is
variation in the passing scores that result from exposure to the two
conditions. It is hypothesized that variation will be observed
betww the two conditions, but that the magnitude of variation will

be small. Additionally, it is hypothesized that the isolation

 

 

 

 

11

condition will provide a suitable, efficient alternative to ﬁle
traditional group-process method of collecting SME's' Angoff ratings
for test itets.

 

 

II. mammal!

The setting of absolute performance standards on criterion-
refereaced educational assessments is a pervasive activity in the
American educatioral system (Hambleton, 1978) and represemts an
ongoirg line of inquiry in the field of educational measurement.
Criterion-referenced starriard-setting methods are currently utilized
by groups responsible for industrial personnel selection, educatioral
and training program evaluation, professional licensure or
certification in medical, allied health, arri msiness fields, arxi
other national, state, and regional credentialing programs
(AERA/APA/NQGE, 1985: Meskauskas, 1986).

AdaptirgtheconceptualviewsggestedbyuillsarriMelican
(1988) , research on criterion-referenced standard-setting can be
viewed as having proceeded in three distinct phases: 1)
Methodological Development; 2) Inter-Metl'lodological Research; and, 3)
Inna-Methodological Research. An overview of these three phases
serves as an organizational framework for reviewing previous research
and is presented in the following pages.

Methodolgical Develgmgt
As one author has noted, mentions of criterion-referenced passing
standards are found in early historical accounts of testing

12

 

 

 

13

situations:

"Averyearlyminimalcerpetencyu-waswhenthecilead
Guards challenged the fugitives from mariam who tried to
crosstheJordanriver. 'Areyouamerberofthetribeof
Ephriam?’ theyasked. Ifthemanreplied thathewasnot,
then they demanded, 'Say Shibboleth.I But if he couldn't
pronolmcethe'sh'andsaidsibbolethinsteadof
Shibboleth he was dragged away and killed. So forty-two
thousard peqale of Eglriam died there at that time"
(Judges 12: 5-6, me Qvgg‘ Bible, quoted in Mehrens,
1981, p.1).

Since that time, so-called "high-stakes" tests, (‘t'hwgh not gm:
high), haveremairedproninertintheassessmentofccnpetence, and
research efforts have been directed at refining the theoretical and
appliedaspectsof settingpassingscoresonslchtests. Inareview
of existing standard-setting methodologies, Berk (1986) reported that
at least 38 methods of establishing or adjusting performance standards
have been proposed. Berk (1980; 1986) and many others (Glass, 1978:
Hambleton & Eignor, 1980; Hambleton, Swamirathan, Algina, & Coulson,
1978; Jaeger, 1989; Livingston & zieky, 1982; Meskauskas, 1976;
Meskauskas & Noroini, 1980: Millman, 1973; Mills and Melican, 1988;
and, Shepard, 1980a) have also developed several similar catalogues
and classification schemes to organize the various methodologies.

Again from a historical perspective, Nedelsky's (1954) work
probably represents one of the first attempts to promote absolute, or
criterion-referenced standards of performance on educational
assessments. As late as the 19705, norm-referenced methodologies
dominated as the preferred standard setting approach. In a 1976
article, Andrew and Hecht reported that:

"At present, the most widely used procedures for selecting
. . .pass-fail levels involves norm-referenced
considerations in which the examination steward is set as

a function of the performance of examinees in relation to
one another" (Andrew & Hecht, 1976, p. 45).

 

 

 

 

14

A noticeable shift began to occur during the 19705 and 19805, when
considerable attention to establishing absolute passing standards
resulted from an increasing popularization of criterion-referenced
testing (Glaser, 1963: Popham & Husek, 1969), or—as sate have termed
it—a shift to a focus on educational "outputs" (Isvin, 1978; Rothman,
1989) . Since that time, many entities responsible for establishing
passing standards have reevaluated their use of norm-referenced
meﬂmodologies and have opted for implementation of absolute or
compromise methods (Hambleton, 1978; Fabrey, 1988; Mills & Barr,
1983). In 1983, Francis and Holmes reported that "the more
traditional norm-referenced approach is being seriously questioned"
(p. 2). Meskauskas (1986) described the evident trend away from norm-
referenoed and toward absolute (or, "content-referenced") standard
setting methodologies in the area of licensure and certification
testing and offered this advice:

"For those credentialing agencies still using normative
Irecemendthatplanstochangeoverto
content—referenced standards be initiated" (p. 198).

Nedelsky's ‘ work in search of an absolute standard-setting
meﬂmdologytlmsrepresentsamarkedmmirgpointinstardard-setting
technology and research. When using the Nedelsky method, subject
matterecpertscarefullyinspectthecontertanditetsinan
examination and judge, for each item in the test, the option or
options that a hypothetical minimally-carpetent examines would rule
outas incorrect. the reciprocal ofthe remainingrmmmberof options
becomes each item's "Nedelsky rating"; the sum of the ratings—or same
adjustment to the sum—is used as a passing score.

Further research and other now—popular methods of establishing

15
absolute passing standards on criterim-refereeed eaminations
followed—thong; not quickly (Scriven, 1978)—after Nedelsky's 1954
piblicatim. Angoff (1971) prcposed a method that, like Nedelsky's,
requiredSMEBtoreviemtestitemsanitoprovideestimatiomoftre

proportimofasubpowlatimofeamixeesmowouldarswertteiteis
correctly:

"Asystematicprocedurefordecidingmtl'eminimmmraw
scoresforpassixgarrihorersmightbedevelopedas
follows: keeping the hypothetical 'minimally acceptable
person' inmini,orecmldgothrulghthetestitemby
itemarddecidewhethersudlapersmcouldanswer
correctly each item tmder consideration. If a score of
aeisgiven foreach itemansweredcorrectlybythe
hypotheticalpersonarriascoreofzeroisgivenforeadl
itemansweredinoorrectlybythatperson,tteemmofthe
itemscoreswillequaltherawscoreearnedbythe
'minimally acceptable person'." (Angoff, 1971, pp. 514-
515)..

In practice, a footnoted variation to the procedure Angoff originally
proposed has dominated applications of the Angoff method:

"A slight variation of this procedure is to ask eadl judge
to state the probability that the 'minimally acceptable
person' would answer each item correctly. In effect,
judges would think of a mmber of minimally acceptable
persons, ireteadofaﬂyeeeldlpersm,andwulld
estimate the proportion of minimally acceptable ' persons
Michelldanswereadlitemcorrectly. 'Ihesumofthese
probabilities would then represent the minimally
acceptable score." (Argoff, 1971, p. 515).

A third absolute method was proposed by Ebel (1972), who also
noted that norm-referenced methods had serious drawbacks:
"The obvious drawback of this approach is that it allows
'of competence of the examinees at a spelelc testing."
(Ebel, 1972, p. 494).
Ebel's methodology also involves the judgments of subject matter
experts. ‘Ihe‘r'belmethodrequiresamstomakedecisionsaboutthe

difficulty of individual test items and about the criticality of test

 

 

l6

cmtent areas.

Odierabsolutemethodologieshavealsobeenprtposed, samequite
recently. One alternative based on rating test specifications was
proposed by Gargelosi (1984). Iockwood, et al (1986) prcposed a
methodof averagilgtheresultsofvarimsstandard-settingapproadles
inordertoata"true" standard, orpreciseestimateofsaneextant
parameter. AmthermethodologyhasbeenproposedbySdlom, Rose!)
and Jones (1988) in response to perceived weakness in the Angoff
approach. Sdloon,Rosen,arxiJonesalsodidsemepreliminary
invetigation into their "Direct Standard Settirg netted" (moms,
Rosen, & Schoon, 1988), but it, like other alternatives to the Angoff,
Ebel, and Nedelsky methodologies, has not received widespread
acceptance or general use.

A second wave of proposed starriard-setting methodologies followed
early attempts at determining absolute passing standards.
Predictably, thesecondwave aspiredtoidentifyamiddleground
through the development of methodologies that would strike a
cmpromise between plrely norm-referenced (relative) approadles and
absolute methods. Illustrative of these carpromise efforts are
methodologies suggested by Beuk (1984), Grosse and Wright (1986),
Hofstee (1983), and deGruijter (1980). Overviews of these
methodologies are provided in deGruijter (1985) and Mills and Melican
(1986).

The cmpronise methodologies have failed to overtake the earlier
absolute proposals, however. currently, in the area of licensure and
certification testing, the Angoff, Ebel, and Nedelsky approaches are

still the most prevalent methodologies for establishing passing

 

   

17
standards, particularly the Angoff and ﬁnal approaches (Hambleton,
1978: Berk, 1986).

Albeit a ubiquitaw task, the establishment of passing starriards
is not necessarily an easy one. Referring specifically to licensure
and certification testlng' programs, the M fg mm erg
W remark mat:

"Defining the level of competence required for liceising
orcertificaticniseeofthe importantand

difficult tasks facing those responsible for slob
programs" (AERA/APA/Nom, 1985, p. 63).

In a discussion of absolute starriard-setting however, it should
alsobemtedtlatcorsiderabledisagreetentedstscoreerningjust
Mammabsolute starriard-settingproceduresare. Glass
(1978) calls decisionmaking within the absolute standard-setting
process "judgmental, capricious, and esseitially unexamined" (p. 253),
andfurthermtestlat"tomykmwledge, everyattempttoderivea
criterion score is either blatantly arbitrary or derives from a set of
arbitrary premises" (p. 258). Similarly, Beuk (1984) has noted that
"setting standards...is ally partly a psychometric problem (p. 147).
Hofstee offers support for the idea that:

"a [stardom-setting] solution satisfactory to all persons
involved does not exist and...the choice between
alternatives is ultimately a political, not a scie’ttific,
matter" (1983, p. 109).
Jaeger claims, flatly:

"All starriard-setting is judgmental. No amount of data
‘ collection, data analysis, and model building can replace
the ultimate judgmental act of deciding which levels of

performance are meritorious or acceptable and which are
uracceptable or inadequate" (1979, p. 48).

Shepard identified the essence of the problem of arbitrariness in the
so-called absolute methods:

 

18

"[N]one of the [starriard-setting] models provides a
scientific meals for discovering the 'true' standard.
'Ihisismtmlyadeficiereyofthemrrentmethodshrt
is a permanent and insolvable problem because the
underlying carpetencies measuredare and
not dichotomous" (1980b, p. 67; cf, Shepard, 1978, p. 62).

Eve'IEoel, whosestandard—settingmethodhasretairedmlar,
resigned himself to the fact that a certain ammt of subjectivity
remains in "absolute" standard-setting methods:

"Asecondpopularbeliefisthatwhenatestisusedto
passorfailsaneone, thedistinctimbetweelﬂleuao
artccmes is clearert and unequivocal. 'Ihis is almost
never true. Determiration of a minimum acceptable
performame always involves same rather arbitrary and not
wholly satisfactory decisions" (Reel, 1972, p. 492).
Hambleton summarized the overwhelming consensus of cpinion:
Watisclearistlatallofﬂiemeuiodsaremm
thispointhasbeenmadeor impliedbyeveryorewhosework
Ilavehadanopporumitytoread. 'mepointisnot
disputed by anyone I am aware of." (1978, p. 281).

However arbitrary and problematic (deGruijter & Hambleton, 1984;
Shepard, 1980b), standards are still essential for making certain
inferences and, accordingly, credentialing decisions. The need for
validstarriard—settingisespeciallyapparentintheareasof
certification and licensure, where ensuring the p.1blic's protection
agaixstunsafepracticeistherealandnecessarydargeofthe
responsible entities (Lerner, 1979; Maslow, 1983; Shepard, 1983). As
Ievin has remarked:

"Unless all forms of certification are eliminated,
however, a stardard is still needed whether the
performaree is sufficient to receive the certification"
(1978, pp. 306-307).

In stmmary, while ambivalence retains over the degree of
arbitrariness inherent in absolute standard-setting methods, their

intuitive appeal, ease of implementation, and perceived advantages in

 

termsofbothpsydmetricpzrpertiesanddefereibilityoverthe
previwslymlarmm—referereedamroadmshavebeendocmentedby

mmerous researdlers (Berk, 1986; Cross, Impara, Frary, & Jaeger,
1984; Klein, 1984; Meskauskas, 1986). 'Ihe use of absolute standard-
settim methods continles to become increasingly widespread. Research
into development of new methodologies, partiwlarly cmprunise
approaches, and alpirically-based methods of adjusting standards
(Hambleton, 1978) and into assessing the validity of the resultant

standards (Jaeger, 1979; Kane, 1985) continues.

MW
Having gained increasing acceptance by the measurement profession
generally, absolute methods of establishing passing standards began to 7
realize widespread use in the determination of cut-off scores on
educational, licensure, and certification tests (Gross, 1985). A
logical second phase of research developed: investigation of the
psychometric properties of the various starriard—setting procedures.
Itﬁssecordphaseofresearohisdlaracterizedlargelybyattemptsto
cmparetwoormorestardard—settirgmethodologiesintermeoftheir
reliability and ability to identify an "acceptable" standard. As late
as 1988, Smith and Smith reported that:
"lmdloftheworkintheareaofstandardsettinghasbeen
concerned with comparisons of different methods for
establishing a criterion." (p. 259).
In testament to the proliferation of inter-methodological
research, Berk (1986) reports that in the five-year period, 1981-1986,
22$hldieswereconductedtocamparestaniardsresultingfrunthe

application of different starﬂard-setting methodologies. Ebctensive

 

20

descriptions of the various inter-methodological cmparison studies
are provided elsewhere (Berk, 1986; Jaeger, 1989). A partial listing
ofinterhmethodologicalisalsoprwidedinthismrkasAmendixA.
(Because the preset research is limited to applications of ore
absolutestardard-settirgapproadl, AppendixAlistsonlythose
studies reporting cmparisons involving one or more absolute standard-
setting methodologies.)
(we result of the wealth of interhmethodological research appears
certain: Different standard-setting methodologies yield different
standards (Arrirew & Hecht, 1976; Brennan & Iodcwood, 1980; Koffler,
1980.- and Skaklm & Kling, 1980). Different methods even produce
differentperformarnestarﬂardswhenappliedtoﬂiesametestsbythe
same group of experts (Mills, 1983; Mills & Barr, 1983). More
tentative and method-specific conclusions apply to studies when
different groups of experts, apply the same methodology to the same
test (Cross, et a1, 1984; Fabrey & Raynorxi, 1987; Jaeger, 1988, 1989;
Rock, Davis & Werts, 1980).
A second result of the inter-methodological research effort is
also compelling: The Angoff approach seems to be the preferred
absolute stardard-setting methodology by several criteria. Mills and
Melican (1988) report that,
"the Angoff method appears to be the most widely used.
The method is not difficult to explain and data collection
and analysis are simpler than for other methods in this
category" (p. 272).

Similarly, Klein (1984) noted that the Angoff method is preferable

"because it can be explained and implemented relatively easily"

(p. 2). Rock, Davis and Werts (1980) concluded that "the Angoff

cutting score seems to be satewhat closer to the 'mark'" (p. 15).

 

21

delta: and Hecht (1981), in their cmparison of the Angoff, ﬁnal, and
Nedelskymethodologies, r'eporttletmkgofftedmiqueardtte
Angoff coteensls techniques are superior to the others" (p. 15).
Cross, et a1 (1984) found that the Angoff method "yielded the most
defersihle starriards" (p. 113). Berk (1986) ooncluied that "the
Angoffmethodappearstooffertlebestbalareebeueentedmical
adequacy and practicability" (p. 147). Meskauskas (1986) states that,
"tlepresentmethodofdloice forstandard-settingistteklgoff
method (p. 199). Finally, in their study ccnparing the Arqoff and
Nedelsky methods, Smith and Smith (1988) report "an urge to say, 'Yes,
the Angoff approach is more valid'" (p. 272).

MW

'nlelileofirrmliryjoiredbythepresentresearchisamly
emerging one (Mills & Melican, 1988) that seeks to identify sources of
variation within and efficient refinements of existing standard-
setting methodologies. Few systamtic researdl efforts have been
directed at this critical facet within the field of startled-setting
research. As Smith and smith reported bluntly, "little work has been
done to explain why differences in standards occur" (1988, p. 259).
Smithandamith (1990)proceededtopursleaeaspectofuxy
differences in standards might occur in an investigation where item
reviewersusingthekrgoffmethodwereaskedtoattendtomly
specified characteristics of reading comprehension items.
Unfortunately, the authors reported somewhat discouraging results, and
asked:

"Where does this leave us? First of all, perplexed, as
usual. Second, reluctant to recommend glvmg judges

 

.U‘w

v—r

"'v"-~m- : -5.

 

22

informatim about what deract'leristics to use or ignore"
(p. 22) .

Historically, the recessity of idertifying sources of intra-
methodologicel variation has never been totally overlooked by research
effortsinthearea. Nedelskyata'icerecognizedthereedtoidentify
andreducesouroesofvariatiminthemethodheprcposed. Ore
hypothesized source of variation—and, possibly invalidity—was the
trainim of item reviewers. Early on in the search for absolute
standards, Nedelsky warned:

"'Ibmekeaproperjuigmentofthiskim, requirestimeand
considerable pedagogical and test-wise sqhisticatiom
withrosponsesmoreheterogetnlsthanintheecamplecited
areiliable judgment may be impossible." (Nedelsky, 1954,

Indeed, the prcper training of qualified item reviewers has been
repeatedly emphasized by those involved in standard setting research
as crucial to the validity of the process (Francis 8 Holmes, 1983;
Jaeger, 1979, 1989; Klein, 1984; Scriven, 1978) . For example, in
their procedural guide to several popular standard-setting
methodologies, Livingston and Zieky (1982) restate the necessity of
reducing variation and invalidity of judgments made by SMFs, devoting
extensive portions oftheirmamaltodescribirgthepropertraining
of judges. Smith, et a1. (1989) state succinctly: "Variability in the
judgmental process reeds to be reduced" (p. 7).

Aside from admonitions concerning the training of item reviewers,
attention to other intra-methodological considerations has been
slight, but growing. A beginning, though sophisticated atht to
identify other sources of intro-methodological variation was put forth

by smith and Smith (1988) who compared sources of information used by

 

23
ArgoffaniNedelskyitemreviewersintheitemratingprocess. Elbe
primarydajectiveoftheirworkwastopinpointdiffereeesbetween
the two methodologies; this, it would still be categorized in the

second-phase of research efforts. I-Iwever, it also clearly repraelts
aeofthefirstatteiptsatideitifyingjgmjmm
variation because of its mique investigation into which item
diaracteristicsaresalienttoitemreviewerswithintheArgoffand
Nedelskymethods.

Salmders, Ryan, and Huynh (1981) also investigated two variatias
of the Nedelsky approach, differing only in the extent to which item
reviewers were permitted to respond "undecided" when considering
whether minimally-campetemt examinees muld rule out an item's option
as incorrect. 'Ihey found that the two conditions "produoe[d]
essentially equivalent results" (p. 209) .

Another investigation into the Nedelsky procedure by Gross (1984)
1edtheauthortosugg$tarefinerentinthetestconstructim
processthatwalldmaximizetheconsistencyoftheNedelsky
methodology.

Flake and Melican (1986) found that, with the Nedelsky method,
item reviewers for a mathematics test made fairly consistent item
ratings, regardless of test length or difficulty. Dillon (1990) found
no strong relationship between the position of an item in an
examination and the Angoff rating reviewers assigred to the item.

Saunders, et al, (1981) and Halpin, Sigmon, and Halpin (1983)
found significant within-method differences in item reviewers' ratings
due to differences in the reviewers' own levels of achievement in the

subject areas, althaxgh Behuniak, Arohambault, and Gable (1982)

 

24
reported firdirg m such differences. Mills and Malian (1990)
reported that little or no differences in passing standards were
observed for randanly equivalent panels of iten reviewers.
Noreini, Shea, and Kanya (1988) reported fairly high consisteicy
inexperts' estimates ofborderlinegruipperfornarcevmenusingﬂie

Angoff method on a medical specialty emiratim. Helicon and Mills
(1987) reported increased p-values and higher intercorrelatiors among
item reviewers' item ratings when reviewers were provided with
knowledge about the other reviewers' ratings.

Garrido and Payne (1987) studied two variations of the Angoff
method under two equities-with and withcxrt iten performance
information provided to the iten reviewers. In this experiment,
mitraireditenreviewerswereaskedtoirdepedeitlyprovideratings
for 20 itens. The provision of iten performance informatim (p-
values) resultedinhigheraveragepassingstardardsandresultedin
reduced interjudge variability. War, the authors note that the
high correlation between "With-Data" judges' ratings and erpirieal p-
values (r =. .98) called into question "the creditability of the judges
in their performance of the judging task" (p. 7). 'lhe author's further
murdered:

"Did the presentation of such intonation influence the
juigestotheecbentthattheydisregardedtheirmm
judgments and relied soley on the item difficulty index in
determining their probabilities?" (p. 8).

(Interesting, Skaknn (1990) also famd that the provision of item
performance data—even purposefully incorrect iten performance data—
has the effect of reducing variability in iten ratings.)

In another rwent study, Friedman and Ho (1990) invatigated the

 

25

relationship between interj'udge variation (cmsensus) and intrajudge
variation (corsistency) am found that procedures aimed at inproving
consensus (such as the provision of iten performance information) "did
not have an adverse affect on intrajudge caisiste'ncy; in
fact...teohniques designed to improve consensus also improved
consistency“ (p. 10). 'meauthorsalsoutilizedsever'alprocedures
designed to evaluate the effect of eliminating judges with poor
interralomsistenyanijuigeswithpooragreenertwiththegruup:
however,noneofthemethodsapprearedtoappreciablyalterthe
overall passing standard.

Nether study of mesa-methodological variation is reported by
Curry(1987) warnestigatedastandard-settiigmfora
certification examination. Using the Nedelsky starxiard-set'ting
method, and a group-process format, Curry found significant variation
in reviewers' ratings of itens resulting in a large percentage of
items requiring extersive group interaction (i.e., iterations of the
ratirgprocess) toadiieveconsensusonitenratings. Whilenotinga
strmg"grwppresstwardsanorm" aniacriticalreedWorehoethe
normative press involved in the use of an expert group" (Garry, 1987,
p. 2), Carry does not provide a strong rationale for why the variation
in ratings mist be reduced, or infornation about what effect, if any,
the initial variation in ratings would have upon a resultant passing
score.

Fitzpatrick (1989) reviewed several steward-setting researdi
efforts talc-hing specifically on the group-process format and
reported:

"Discussion among groupmenbers is thought to elicit
informational influences through the exdiange of argtments

 

26

setting process is an important topic for future research"
(p.322).

Fitzpatrickdidmtreportmresearchtotesttheeffectsofthe
group-process formtinstarriard-setting. However, shedidptroceedto
strongly suggest that "further studies of starﬂard-setting that
involve strucuired discussion or other methods of controlling biased
argtmentatim clearly are warranted" (p. 323).

lastly, a study of intra-methodologicnl variation was conducted by
Norcini, Lipner, langdon, and Strecker (1987). Using the Angoff
method, Norcini, et a1. explicitly tested "the camun notion that a
group setting is most appropriate for inplerentation [of the Angoff
nethodJ." (Norcini, et a1., 1987, p. 56). 'Ihe research of Norcini,
liner, et al. was an attempt "to determine whether more efficient
variations of the process will provide consistent and accurate
results" (p. 56) .

Norcini, et al. found relatively small differences between the
passing scores obtained using three varieties of the Angoff method,
each variation differing only in the extent to which iten reviewers
were exposed to a group—process format. (Each of the modifications of
the Angoff method used by Norcini, et a1. involved the use of
normative feedback with the 'iten reviewers; that is, reviewers were
provided with the correct responses to the itens under review, as well
as with enpirically—obtained difficulty indices (p-values) for each
iten.) Norcini, et a1. argued that tentative support had been
provided for the notion that two of the three variations of the Angoff

method resulted in aweptable passing standards and less interrater

 

27

variation.

'niebasicconclusimofthereseardrbyNorcini, etal. is
stzaightforward:

"In conclusion, this work implies that judgments gathered
after an initial traditional group-prom session can
provideamedianienforsettirgcittingscoresusinga
modified Angoff method and make more efficient use of
meeting time." (Nomini, et al., 1987, p. 63).

Onetraiblingaspectoftherseard'ireportedbyNorcini, etal.
is the failure to control for possible training or "practice" effects
intheitenrevieoers. Intheirstudy, SME‘swereaskedtoreviedtest
itemsineachofthreeconditicns. Inthefirstconditicn, thegroup
reviewedmaterials sentthraxghthemaildescribingtheAngoffmethod
tobeused. Next, thereviewersatterxiedagrcupneetingmierethe
method was ﬂirther described, definitiors of a"minima11y ccmpetent
examinee," etc., were discussed, and ten practice itets were reviewed.
Following this training, the iten reviewers then received a booklet
ccmtaining the actual test items, answer key, normative information
consisting of iten performance statistics, and further review of the
Angoff procedures necessary for completing their iten ratings. 'Ihese
features are characterized by Nomini, et al., as the "Before-
Meeting" condition.

The secorxi cordition (called "hiring-Meeting") was characterized
by the same group of item reviewers participating in another meeting
to review the Angoff procedure and definitions. Following this
review, a traditional group—process Angoff procedure was conducted,
with normative information again provided.

The third, and final, condition (called "After—Meeting") was

conducted approximately one month following the "airing—Meeting"

 

28
condition. In the "After-Meeting" conditim, the sane group of item

reviewerswereegainse'rtapacketofiismictimalnaterials, aset
of iters, answer key, and normative information, and were asked to
provide iten retires.

Norcini, etal., reportthattleresultirepassingscoreecbtaired
in each of the three conditions did vary, though not significantly
[F(2,10) = 2.04, p = .181]. Also reported is an unsurprisire
reduction in the variation of iten ratings frun the "Before-Meeting".
condition to the "After-Meetire" condition. Standard deviations of
the iten reviewers' retires were 5.8, 2.4, and 1.7 for the Before-,
Dlr'ing-, and After-Meeting conditions, respectively.

'mese results might imply, as Norcini, et al., suggest, that
Areoff iten retires collected from iten reviewers performire
indeperdent iten reviews are as reliable as those collected usire a
traditional group—process format. However, a weaker conclusion also
seens tenable: A sirele group of iten reviewers usire the Areoff
methcdtendstobecomelessvariableintheiritenretireswhen
affordedrepeatedeqaosuretothenethodardpermittedgreater
opportunities for practice. Additionally, Ncrcini, et a1. , reported
that, for the retires geerated in the Before-Meetire condition, all
oftheitenreviewers failedtotakeguessireintoaccotmtwhen
providing their retires. 'Ihe reviewers were, however, instructed to
account for examiree guessire for retires they subsequently provided
in the mrire- and After-Meeting conditions (presumably usire p = .20
or p = .25 as the lowest retire possibility). This factor could well
have contributed substantially to the reduction in variation observed

across conditions.

 

29

Insunnary, aretl'ertestoftrepropositiorsprtforthbyNorcini,
eta1.seeiswarrentedarﬂisofferedinttepreeeitsmdy.

 

III.SIUUIH'SIGI

'mepresentreseardrhasbdopirposes: 1) todeterminewhether
iten reviewers, usire the Areoff (1971) method of assigning
probabilities to examination iters, produce different retires as a
resultcfexposuretcatreditiorelgrcup—prccessconditicnandan
isolation condition, and 2) to investigate the effect of knowledge of
other iten reviewers' initial Areoff retires on a subsequent retire of
tiesaneitexs.'lwoexperinentstoaddressthesequesticrsare
presented.

Experimrt l

'Ihe design for the first experiment is are which: 1)rerr1an1y
assigred iten reviewers to each of the two conditions; 2) obtained the
reviewers' retiresmaccmmnsetofiters;and, 3) cmparedthe
resultant ratings.

'Ihe design for Experiment 1 is analogous to the "Pcsttest-Only
Control Group Design" presented by Campbell and Stanley (1963, p. 25).
InthenotetionsuggestedbyCenpoellarriStanley,thistrue
experimental design can be symbolized as follows:

GROUP 1: R 01 [control group - (group-process condition”
GROUP 2: R X 02 [treatment group - (independent condition”,

where:

30

 

31

R indicates randan assignment to a condition,

X iniicetes the administration of a treatmert, and

0 indicates an observation or data collection.

Inthepresentresearch, itenreviewerswererandanlyassigredto
ore of the two conditions—isolation or group-process. The
traditional group-process condition is analogous to a "no treatment"
or control group, and the isolation condition represents a new
treatment. The above design, called "greatly underused in educational
and psychological research" (Campbell and Stanley, 1963, p. 26), has
the advantage over other design choices of offering strore resistance
tofactorsthatwouldweakentheintenialvalidityofthereseard).
'Ihat is, the experimental design—primarily due to the initial randan
assigrnnent to the two conditions—offers a strore potential for
discoveriretruedifferereesbetweenﬂetwogmips' retiresafterthe
treatment has been administered, if such differences exist.

Although, "'knmling for sure' that themgroups were 'equal'"
(Campbell & Stanley, 1963, p. 25) before the experimental treatment is
administered is impossible due to the lack of pre—rendan assignment
comparisons, many of the factors that could weaken the study's
internal validity (particularly, selection) are effectively controlled
for through randanization.

Epirical Imam
aibjectsinthepresentresearchweredividedintotwogroupsand
were exposed to two differing conditions. For pirposee of clarity,
Groipl—thegroupthatwasexposedtothetreditiorelgmip-process
condition—will be referred to as the control group; that is, the

 

 

32

group—process condition can be conceived of as a "no treatment"
caditim.Gru1p2—thegm1pthatwasexposedtotheinieperﬂent
odditim-winherefenedtoasmetreaunmtgmip;theiniepemmt
condition represents the amlication of a new treatment. Precise
descriptions of the characteristics of the control and treatment
graipsareinportantardarepresentedbelow.

m

Fadisubjectintheoontrolgmipwasmailedadaecriptimofthe
Angoff (1971) methodology for establishing passing scores
approximately one nonth prior to a meeting at which the actual item
ratings were collected. A copy of these irstructional materials is
included as Appendix B. Ammdmately two weeks prior to the passing
scoremeeting, eachsubjectinthecontrolgmlpwasteleﬁioredbythe
investigator and questioned concerning hisl tmderstanding of the
mailed materials and feelings of preparedness to undertake application
of the Angoff methodology.

A whole-group meeting, including subjects in both the treatment
andcontrol groups, wasconductedbytheinvestigatoronthedayof
the passing score meeting. At this meeting, the packet of
informational materials which was mailed to subjects prior to the
meeting served as a foundation for review of important concepts and
definitions. Together, both the treatment and control groups then
participated in performing practice ratings for 10 non-operational
tat itezrs. The practice itens were drawn from a recently

 

1All subjects (treatnentardcontrol groups) inthepreeent
weremale.

 

33

administeredtestfomfromthemedicsl specialtyprogrammdersmdy
andwerechosentobereproeentativeof itais famdintheupoaning,
operational test form. Practice items covered a represaitative range
of difficulty, discrimination, and format. Table 1 provides a

description of the 10 practice items.

 

34

Table 1

Description of Practice Iterrs Used in Passing Score Study

Item No. Difficulty Discrimination* Worm *

1

2

10

Group Training Session

.75

.21

.84

.40

.22

.16

.94

.36

.16

.31

.15

.27

.27

.01

.13

.34

.22

’U 'U "U 'U 2 'U *U 'U ’0 “U

Item 1333*“

3" 3’ 3’ 3’ 3’ N N 3* 3’ 3’

Notes: * - Discrimination indices reported are point biserial

** - Key to wording: P = positively worded item; N = negatively

*** - Notation for item types is consistent with those suggested

correlations .

worded .

in Hubbard (1978) .

 

35

Mofthepracticeitenswasaocmpaniedbyitenanalysis
information corresponding to the itan's recmt use. Provision of item
analysis information is consistent with the suggestions of nany
researchers in starﬂard-setting (Berk, 1986; Omeway, 1979; Jaeger,
1982: Iivingstm & Zieky, 1982; and Shepard, 1976; 1979; 1980a: 1980b:
1983) for inclusion of normative information to item reviewers so that
more reasonable item probabilities (ratings) are obtained.

Afterboththetreatnnentarrica'rtrolgmlpscmpletedratingthe
10practice itaie, allnerdoersofbothgroupswerepolledtodetermine
theirper'ceived familiarityandcanfortwithproceedinginthe
application of the Angoff methodology to the operational test form.
Qiestionsandanswersardabriefdiswssimmoderatedbythe
investigator followed.

After questions and clarifications, subjects assigned to the
cartrolgruip(gm1p-processcm1ditim)raminedinthegruipsetting
forthereieimeroftheneetirgtine. Abookletomtainingthe
operatiaialtestitemswasdistrimtedtoeadiwbjectinthecmtrol
group. No additional information except an iniicator of each item's
keywasprovidedtotheoontrolgmip. 'Ihegroupwas, however,
encwragedtoutilizeeadictherandtheirpacketsofmailed
informational mterials on the Angoff method as needed. 'Ihe
investigatorrexainedwiththegroup—processconditimgrwpto
mmitorthedisaissimof itemsinthatgroup, andtoobservethe
frequencyofdiscussion, thecontentofdiscussion, andtheextentto
whididismssimwasdaninatedbyoreormregmipmerbers.

Subjectsinthecontrolgroupwerethenaskedtorecordtheir
ratings foreadztestitenonaratingsheetthatmsprovided.

36
abjectsinﬂwwrtrolgrwppmoeededumghuntestitasasa
group, pausing frequently to discuss difficult item wordings, review
theircmcepuxalizatimofthemhﬁmally-cmpetmtenaminee,andto
ampere item ratings for gustionable item. I-Imever, no forced
Wforitemratingswasrequired,mrwasanyitemrevieder
amnagedtodiangelﬁsitenratim.

In the present research, iten diffiwlty intention (p-values)
wasmtprovidedtosubjectsineitherthetreamentorcmtrol
groups. Althalghsanereseardiershavearguedthatitandiffiwlty
infonnation (pwalues) shculd be provided to item reviaders when
ratingtestitansinordertoincreasethecaisistencyofratirgs
(Cross, Impara, et a1, 1984; Norcini, Shea, & Kanya, 1988: and
albkaviak & mff, 1986), such informatim was not presented to item
reviaoersinthissuxiybecnuseallitetsintheto-be-administered
testformbeingreviewedmrenev (previously untested) items for
mid: performance data were not available.

Ratingsheetsaniallmterialsmrecollectedfraneadrmbject
inthecmrtrolgroupwhenthegrmphadcmpletedtheirratingsfor
eachitem. Finally, abjectsintheoontrolgroiprespadedtoa
brief questionnaire to obtain descriptive information on the subjects
ardiniicatorsoftheirperoeptia‘ecmnemingthepassingscoresuﬂy
methodology.

W

Each subject in the treatment group (isolation condition) has
exposedtcexperiencesidenticaltoﬂwseencamteredbymbjectsin
ﬂwecontrolgruiptmtilthetinethetreaunentwasadministered.

 

37

Specifically, aibjectsinthetreamentgrumreceivedthesamepadoet
of infer-national materials mailed ammcimately one month prior to the
passing score study meeting, received a follow-up telephone call
madmtelywoweekspriortothemeeting,andparticipatedinﬂ1e
whole-group practice session and discussiore m the day of the
nesting.

At the conclusion of the practice session, subjects in the
treatmentgruiphereeadiprovidedwiththesameboonetoftestitas
assubjectsinthecontrolgmlp. Subjectsinthetreamrtgrumwere
asked to use the booklets and previously mailed informational materials
toprovideratirgsmanacccmpanyingratingsheetforeadiitaninthe
testform. Wensubjectsinthetreatnentgruipwereaskedmm
diswsstheirmtingswiﬂiotherueaurentgrummatbers,mabersof
the control group, or other professional colleagues. Rather, sabjects
inthetreatuentgmipmreaskedtocorsiderarxiprwideuieirratirgs
iniependentlyardtoreturntheircatpletedratingfomstothe
investigator. Iﬂcethesubjectsinthecontrol group, subjects inthe
treaunentgraipcmpletedardreumed,alaigwiththeirratirgs,the
post-meeting follow-up questionnaire. All materials were returned by
thetreaulentgralptotheimestigatorwiuiintwodaysofthewnle-
groupmeeting.

meets
Subjects forthepresmtreseardrwerelOmbersoftheWritta-i
Damnation Cannittee of a national medical specialty certificatim
Board. WafﬂieWrittenExaminationOtzmnitteearedzargedwiﬂx

establishing performance standards for the Board's examinatiors.

 

 

38

Subjectsvererecognizedcontentexperisinthemedical specialtyarea
and represented various areas of subspecialty with the profession; each
was also a member of the profession's academy. None of the subjects
possessed expertise in criterion-referenced standard setting
methodologies. Also, each subject indicated that he had not

participated in a previous standard-setting study.

Consent

Each member of the Written Examination Ccmnittee agreed to
participate in the present research. The Board's permission to
coniuctthesuriywasgranteduimighexeartionofaoontractwithﬂie
American College Testing Program, Inc., to perform various assessment
services. The contract specifically covered the conduct of a passing
scorestudyfortheBoard. Pennissiontousedatadrtainedinthe
corductofthepassingscorestudyforreseardipurposeswasobtaired
by the American College Testing Program, Inc., and by the investigator
in correspondence with the Ehrecutive Director of the medical specialty
Board. Also, individual subjects were contacted by mail to request
their participation in the study and each subject provided his

consent.

Validity Concerns
For the medical specialty board under study, length of service on
the Board is long, and changes in conposition of its Written
Examination Carmittee are slight from year-to-year. Also, all numbers
ofthestandardsettingbody(n=10) wereincludedinthes‘tmdy.

'Ihus, external validity within the medical wielg gm is

 

 

39

substantial. Ectemalvalidityisweakermlenviewedmnediml
specialty licensure and certification groups. However, similarities
in the canposition of, experiences, and roles played by other medical
specialtygrmpssggeststhat resultsoftheproposedresearchmaybe
generalizable to other medical specialty groups as well.

A second validity concern also relates to the canposition of the
grape. Inthepresentstudy, all subjectswerenele; thequestionof
wetter fanale iten reviewers would respond differentially to the
treatment (ie, the isolation condition) is not answered by the
proposed research.

Internal validity concerns (previously discussed) are sanewhat
ameliorated due to the randcn assignment of five subjects to each of
thetwoconditions (TreatmentGruip, n=5; ControlGroup, n=5).
'Iwo additional concerns exist, however. First, there is the possible
effect of subjects' knowledge about the plrposes of the study.
Subjectsinboththetreaunentarricontmlgralpsweremadeawareof
mtcorﬂitionﬂieoﬁiergroup'smemberswereexposedto. Itis
likely that the subjects were able, with that hmledge, to surmise
theintentofthestuiy. Itisunknown, however, whateffectsuch
knowledgewill haveontheresults ofthepresent research, thoughno
systanatic bias in either group's item ratings is expected. Second,
differences between treatment and control groups could be magnified
(ordepressed) asaresultofthedaninationofdiswssionbyoneor
moreindividual ratersinthecontrolgroup. Forexample, adaninant
persualityinthecmtrolgmlpcwldinfluenceuieratingsofothers
suchthat ratingsappeartobe less variable thattheyvmldhavebeen
intheabserneofthedaninantpersonality. However, careful

 

4O

attentimwaspaidtothisconcembyuieirwestigator,whomoderated
the group-process condition. Although discussions about the
difficulty of individual items and the concept of the minimally-
cmpetentcnrﬂidatewerecmmlinthecontrolgmlp,thedisaissias
wereparticipatedinbyallgrorpmanbers;mdaninanceofdisaissim
orhegenm1ycverideasexpressedwasdoserved.

Afinalintenialvaliditycmceniedisraisedbytherelatively
small sanple sizes involved in the study (n = 5 per grwp),
Specifically, the power of the present study to detect true
differencesbetweenthegroups, shouldsudldifferences exist, isonly
modest. 'Ihus, if statistically significant differences between the
grozpsaremtobserved,strorgstataientsconcerningthepreserneor
absenceoftruedifferencescamntbenade;thatis,thehypotheses
thattnledifferencesbetweenthegmmsdorntexistardthattnle
differences were simply not detected (a type II error ocwrred) would
remainequallytenable.

Instrumentation

mreeirstnnuei'rtswereusedtorecordobservatiorisinttiepresent
research. First, a rating form to collect iten reviewers' estimates
ofﬂleproportionofminimally-cmpetenteianineesmwwﬂlansmran
item correctly was used. The same item rating collection form was
usedbyboththetreamientarﬂcontrolgrulps. Asanpleitanratin;
collection form is reproduced in Appendix C.

1hesecondirstnmentusedvesaquestiormiredesignedtoelicit
certain information from the item reviewers. Information on the

following variables was desired:

 

41

- length of service on the Written Examination Cmmittee

- type of professional practice setting (e.g., clinic, university,
private practice) .

Additionally, questions using Idkert—type response choices (Likert,
1932) were asked, concerning:
~perceptiorsoftheadequacyofuainirqinthestaniard-settin;
nethcdology;

- perceptions of item reviewers' canprehension of the standard-
setting methodology;

- perceptions of the ease of implementaticm of the standard-
setting methodology; and,

- confidence that application of the standard—setting methodology
would result in acceptable (accurate) separation of minimally-
canpetent/mt minimally-competent examinees.

Information fromthequestiormairewasgatheredinordertoobtain
denographicdlaracteristicsofthecontentexpertpanelandto
identify other variables that might be related to precision and
variability in iten ratings. 'Ihe questionnaire was developed by the
investigator following recommendations set forth in Babbie (1973) and
Schaeffer, Merdemall, and Ctt (1979). 'Ihe questionnaire is
reproducedinAppeniixDaniwasadministeredtoboththetreaunent
and control groups.

'methirdinstnmentusedintheprtsentreseardiwasthenedical
specialty examination itself. The examination is used by the medical
specialty board as a carponent of its certification process. One form
of the examination is administered annually to amroximately 750

residency program graduates. ‘Ihe examination consists of 200

 

42

previaisly untsted multiple-choice qmsticts (types A and K) with
five option choices. 'Ihe examination is developed by the medical
specialty board based on test specifications that include eleven
subtest classifications. Previous analyses of the eleven subtest
areas has revealed high subtest interoorrelatiors (some exceeding 1.00
when corrected for unreliability) suggesting a fairly unidimersicnal
examination (cizek, 1989) . However, on this certification examination,
examinesspassorfailthetestbasedmtheirtotaltestsccrea'ily.

Previous administrations of examination forms have revealed the
test to be quite reliable: KR-ZO indices of internal consistency
(Kuder & Richardson, 1937) for the past eight annual administrations
of the test (1982-1989) have been .92, .93, .92, .92, .92, .92, .92,
.92, respectively.

ﬂtistical m

'lhepurposeofthestatistical analysesalployedinfbcperimaitl
wasto identifyanydifferencesbetweenthetwogroupsthatwulldbe
observableasaresultoftheirexposuretothetwocorditions (group-
process and isolation). Of primary interest is whether the conditiore
result in different passing scores. In each case, an individual iten
reviewer's passing score is defined as the sum of his ratings for each
of the 200 itens. The passing score for each condition is defined as
theaverageofthepassingscoresforeachofthereviewersinthe
condition. 'lhese definitiore can be represented notationally as:

 

 

 

 

 

XI

XI

XI

.jc

43

isthepassirgscoreforareviewerj, inconditionc:
istheratingofitemibyreviewerjincmditimc:
istheindexforitens(i=l...200);

istheirdexforitemreviewers(j=1...5);ani
istheindexforconditims(c=1,2).

j=1 .jc

is the passing score for a condition, and

isdefinedasabove.

'merealsoexistsameanratingforeadiitanwithead'igruxp,

which is obtained by averaging the individual reviewers' ratings for

theitan. 'Ihatisthereexists, foranitem, i,ameanratingacross

reviewers in a condition, represented by § such that:

i.c

 

 

 

44

where: x againrepresentstheratingofitemibyreviewerj
ijc
inconditionc.

Investigation into possible effects on rating nears and variance
producedbyexposmetothetwoconditiorsisofprimaryinportarcein
thepresentresearoh. Recallthatthepassingscoreforeadl
cmditionisthesmnoftheaverageditemratingsforeadlitanfrun
cad) reviewer assigned to that condition. A test for significant

differencebetheenthemconditimmeans,§ and i wasperformed
..1 ..2

usingproceduresforcorriuctingaone—way analysis ofvariance (ANCVA)
asoutlinedinGlassandHopkins (1984). 'Ihetestwasconductedto

.‘ .

determine if the treatment (isolation) condition resulted in a
differentpassingscorethanthatresultirgfrunthegchp-process
condition.

Althcnghtheprimarypractical interestofthereseardlwasin
ascertainingwteﬂlertrerewerebetheeirgrmpmeandifferences,ﬂle
possible existence of within-group mean differences (ie, variation in
passing scores assigned by ixdividual reviewers) was also of interest.
Specifically, do reviewers within a condition vary significantly in
tteiniividualpassingscorestheysuggest? ‘Ihemeanpassingscore

of each reviewer within conditions, that is, the five § and the
.jl

five 35 , were observed for within-corriition mean differences
.j2

using separate randomized block ANCVAs to test for such differences
with reviewers' (p = 5) ratings blocked by itens (n = 200).

'Iheseconiquestionofprimaryinterestwas:0idassigrmenttotre

 

45

two conditions result in differential variability in reviewers' iten
ratings? AreviewongpendixEstnvsthat, acrossjuigeswithin
caditials, variability in item ratings can be observed. 'Ihis
variability of ratings for an item, i, across raters in cmditim 1,
2

can be represented rotationally as $1.1 . Two columre of these item
rating variances (one column for each condition) are shown in Table 1.

'neresultsoftherorandanizedblochmVAs (above)were
cmbined fortl'enextanalysis. AnF-testusingtheratioofthewo
errorvariameewasconductedtothelikelihoodofl'mogereityof
within-condition variances. 'lhetestalsoprovidedameansof
answeringthesecondquestimofprineryinterest: didassignmentto
the two conditions affect the within-block variability of reviavers '
ratings?

No correlation coefficients were also calculated on the

cariiticnmeanitanratings (ie, onthex sand; 8) toanswer
1.1 1.2

thequestion: Dothetwomethodsofratingitems (independentand
group-process) produce similar orderings of item ratings? In this
case , the Pearson product-mt correlation coefficient was
calmlatedtoassesstreextenttowhidlalirearrelatiorehipexisted
between the ratings of reviewers assigned to each ccmdition. Also,
the rank order correlation coefficient was calculated to obtain an
irdicatimoftheextenttowhidithetmcmditioreproducesimilar
rankings of Angoff values.

An intercorrelation matrix of reviewers' ratings based on the 200-
itan set was also calculated. The interoorrelation matrix lends
itself to 1) visual examination of the row entries, and 2) statistical

 

 

46

testing for differences between within-gram mean correlaticns. For
enaane, eadirwcanbevisuallyexamiredtoverifythehypothesis
that a reviaier's ratirgs stmld correlate more highly with other sane-
grwpreviaaers' ratingsthantheystnlldwiththeratingsofrevialers
assigned to the other cordition. More specifically, it was
hypothesized that the mean of the group-process reviewers'
intercorrelatiore slmld exceed that of the independent cariitim,
primrilyduetothesharingof informatimthatoccnirsduringthe
group-processccmdition. Toadiressthishypothesis,atestfor
differences in the nean intra-group correlations was performed.
'Iwonethodswereusedtoevaluatethecaiparabilityoftheeao
conditions using an additional source of data—the expirical item
performance statistics frun administration of the aminatim for
which itemswererated. 'Ihefirstevaluationwasbasedaltheextent
to which the two conditions resulted in dependable classification
(pass/fail) decisions. As the
W (1985) state:
"estimates of the reliability of licensure or

certification decisions should be provided". . . and "the
reliability of the decision of whether or not to certify

is of primary importance" (p. 65) .
Two estimates of decision consistency were utilized, 3° and it.

 

These estimates of decision consistency, using randomly parallel
tests, are elegantly defined by Millman (1979). Millman has
characterized so as "the proportion of individuals classified the same
wayoneach administration [ofatest]" andhedefinesfcasWhe
proportion of the total number of agreements [in classification] above
the dame level of agreement" (p. 86). It is also possible to

conceive of these two indices as an indicator of classification

 

 

 

47

consistency ( £0 ), and an indicator of the relative contribution of
the test to that level of classification consistency ( 3E ).
Procedures are available for obtainirg estimates of decisim
ccnsistencyusingonlyoneformofatest, andtheseprocedureswere
usedinthepresentreseardl. Detailedexplications oftheprocedures
have been provided by l-Iuynh (1976) and Subkoviak (1976; 1984; 1988).
'niesecondevaluationconsistedoftwowaysofexaminingthe
relaticxehip between reviewers' ratings and item statistics obtained
from the actual administration of the examination. For one analysis,
individual iten reviewers' ratings for each item were compared with
alpirically-obtained difficulty indices (pwalues) derived from the
administration of the zoo-item test. Modified p—values (symbolized

p ) were used for this comparison. The modification consisted of
i

calmlatin; the p-values based upon the performance of "minimally-
caupetent" examinees only, rather than on the total group, following
the suggestions of others (see, for example, Kane, 1984; 1986; Ddiauro
& Powers, 1990; Cramer, 1990). For this analysis, minimally—
carpetenteuamireesweredefiredasuiosescoringwiulintmstardard
errorsofneasurementoftheoperationalpassingscoremthe
seminationZ. The analysis consisted of obtaining an indication of
absolute error, or the extent to which reviewers' item ratings

approximated the items' actual performance in the minimally-calpetent

 

2 Forpdvaluestobecalculatedbasedonlyupontherespozsesof
the "minimally-carpetent" group, an external criterion was
needed. That is, theminimally—ccmpetentgrulpcouldmtbe
establishedwith referencetothepassingstandardbasedupon
the Angoff ratings. For the examination under study, the
actual operational passing standard was established using the
Bank (1984) nethodology, thereby avoiding a circular
definition of carpetence.

 

48

gram. Fbllowixgtheomoepmal framarorkalggestedbyothers (vander
Lirrlai, 1982; Subkoviak & Huff, 1986; Friedmn & Ho, 1990) the variable

E was created to reflect error, or misspecificztim of iten performnoe
bythereviewers. 'nms, theabsoluterootmeansquarederror (RISE) of
@ecifioation for a reviewer, j, in coalition c, is represented by:

 

/ 200 2
E = / Z ( X " P ) (n " 1)
.jc / ijc i // i
/ i=1

where x istheratingofitemibyreviewerjinomditimc
ijc

andp isthemodifiedpdvalue for itemi (describedabave).
i

Aseoordanalysiswasomductedtodataihanindicatimof
relative error, or the extent to which reviewers' ratings apprmdmated
group mean iten ratings. Thus, the relative RISE of specification for
areviewerj inconditioncisgivenby:

 

/200 _ 2
E'=/ (x -x)/<n-1)
.jc / Z ijc i.c / i

/ i=1

wl'leretheelanentsaredefinedasabove. Overall,theabsoluteand
relativeerroranalyseswereainedatdeteminingwhetherthetwo
treatments differed in the extent to which they affected reviewers'
approxinaticnsofmeangmlpratingsorapprmdmatimsoftheacmal
performance oftheminimally—cmpetentexamineegmlp.
FurtheranalysesvereooniuctedusingEarriE'asdescribedabove
todetermineiftheaoalracy30fitanreviewers'ratingsms

 

3 'meterm"aoalracy"isusedsanevmatinaoalratelyinthis
context. Strictly speaking, accuracy would apply more
appropriately to a situation in which reviewers ' estimates

 

 

49

significantly related to the variables identified in the Post-Meeting
glestimnaire (i.e., with length of service on the examinatim
cmmittee,practicesetting,andperoeptiorsoftheadequacyof
training, peroeptiors of the adequacy of mterials, and peroeptiors of
the ease of inplanenting the Angoff methodology). For this analysis,
an interoorrelation matrix was produced using each reviewer's responses
to questionnaire itens and average errors of specification (E and E')
asinput. 'Iheplrposeofthisanalysisisstraightforward, asking: Are
any of the questionnaire variables significantly related to reviever
acclracy (as operationalized in E and E') or to each other?
Generalizability analyses were also conducted, following
procedures set forth by Cronbach, Gleser, Nada, and Rajaratran (1972)
and Brennan (1983) . Generalizability analyses were conceptually mite
appropriateforthepresentreseardibecauseofwoprimry
characteristics. First, generalizability theory allows for
differentiation between multiple sources of rating variation (error)
inthetwoproceduressmdied. Further,italsoallowsfor
cmparisons of the relative magnibade of the variances. 'Ihis second
characteristic is especially useful in designing subsequent
measurement procedures of improved dependability urxier different
caibinatiors of items, raters, and procedures. For exanple,
generalizability results might point to changing the number of iten

 

calldbecarparedtoahmnstarﬂardorto"tnrth." However,

becausetheconstructofmininal—ompetenceismtafixed,

knowable parameter, it is not precisely correct to discuss
’ parameter

context refers not to the ability of item reviewers to
approximate truth, but to their ability to approximate an
admittedly weak proxy for it. For the sake of clarity,
however, this relaxed usage of the term "accuracy" will be
followed.

 

50

reviewers using one condition to achieve a level of dependability
associated with the other cordition. In that serse, the
generalizability study (G—study) results using "real" data from the
prtposedreseardloalldomplenerrtthesimlatimstudiesdaleby
anith, et a1 (1989) regarding the optiml mmlber of item reviewers for
inplenenting a passing score methodology.

In the generalizability study (following rotational cmvt-mtions
suggestedbyBrennan, 1983), thefacetstobeexamihedwerethoseof

 

itens and raters. 'Ihe generalizability study (Gr-study) design used is
formlly referred to as a canpletely crossed, randm factor 1 x r
design where:

i isthe indicator fortne item facet (rarrian) and,

r is the indicator for the item rater facet (random).

In the i x r design, items are cmpletely crossed with raters.
'Ihe G—study design employed yields three estimable effects
(variance outpatients):

- the effect associated with items, 6‘ (i) ;

- the effect associated with raters, 6' (r); and,
- the effect associated with an item by rater interaction,
‘ 2
0‘ (ir).
The following generalizability analyses were of primary interest.
First, examination of the relative magnitude of the variance
catponentswasperformedtorevealtheexterrttowhidiitats, item

reviewers, prowdures, and interactions contribute to variability in

 

-—‘_‘_—‘

 

 

51

item ratings. 'lhe analysis of variance omponeuts associated with
thesefaoetsshedsadditionallightmtheprimrqustim:1sthe
alternative methodology (i.e., the indeperdent cmdition) a viable
alternative to the traditional group-process format for gathering item
ratings? For exainple, a omparatively large variance carponent
associated with procedures mild suggest that raters do assign
differentratingstoitetsdependingontheprooedureused
(independent or group-process) .

Secondly, one indioes describing the magnitude of error associated
with the item ratings were calculated. An index of absolute error

2
varianceisgivenby 6'(A) whidlisthesumofallvariance
caiponentsexoeptthat for itens, that is:

2 2 2
a’(A) - d"(r)+ o" (ir).

Aniniexofrelativeerrorvarianceisgivenby 03(J),whiohis
thesmofallinteractimvariarnecmpomntstrntcm'rtainﬂleobject
ofneasuretentirdex—inuﬁscase,theiniexforitems—ardissham
below. (Notethatfortheixrdesign,mlytheirinteraction
varianceccnporentoontainsﬂleobjectofneasurmtirﬂex.)

2 2
fur) = 0"(ir).

2
Ageneralirriexofmeasurementdependabilityisgivenbylip,
where:

2 2 / 2 2
EP =6’(i)/(o’(i) +6'(A) )-
The generalizability analysis (G—sbidy) also provided the

estimates of variance carponen’cs necessary for subsequent decision

 

52

studie (d-smdies). One of the major goals of obtaining estimates of
variarneoanponentsinaG—suldyissothat"theseestimatescanbe
used to design efficient measuranent procedures for queratia'ial use"
(Brerman, 1983, p. 63). ﬁlms, d-studies were designed to observe how
reconfiguration ofthemeasurementprocesscalldreduceerrorvariance
(and, accordingly, increase £19) to sane acceptable level. For
exanple, a relatively large variance component associated with raters
would sunaort the recamexﬂation that additional item reviewers be
utilizedinfuulreusesoftlleiidepmdartcorlditimtocmpexsate for
its omparatively reduced dependability and the number of additional
reviewers necessary could be established. It would also be of
substantial interest to learn, for example, that Um adiitia'ial
reviewers in the independent condition would yield as dependable a
standard setting procedure as in the group-process condition.

Finally, a cost analysis was performed. Because the independent
condition is offered as a cost-effective alternative to the
traditional group-process condition, an investigation of that
hypothesis was warranted and of considerable interest. Specifically,
theoostsintermsoftime, travelexperses, postalfees, andon-site
ecpenses were examined for conducting the standard-setting methodology
under each of the two conditiors.

Emerimart 2
Inthefirstexperiment, each itenreviewerwasexposedtooneof
two cmditiors—indeperxierrt or traditional group-process. In the
irdependent condition, item reviewers were able to draw on only

internal sources of information—their own experiences, expertise, and

 

‘ﬂ 'mm .4-" \-

 

53

maisofaminimllyompetentexaminee. Onﬂleotherhand,
thosereviewersassigredtothegrup—prooessomditimwere
influencedbyboththeirownintemalresalreesaswellasthe

'me information gleaned frun the group interaction is, however, of
two different types (Fitzpatrick, 1984; 1989). First, there are what
can be called "p.1re" informational influences; that is, relevant
intonation about the difficulty of the iten, the appropriateness of
the iten, the conception of the minimlly-ompetent examinee, etc.,
thatiscbtainedfromothers inthegroup.

The second kind of information gained is less pure. Information
ofthesocialcanparisontypeislessappropriatetothestaniard
settingtask. 'Ihiskindof influenoemuleitenratl'ngsoonsistsof
irriividual'sresponsestopressuretoconformtotheopiniorsof
others, for various social reasons. 'lhe second experiment attenpted
toassesstheextenttowhichpurewr, gevant) informational
influences affect the Angoff ratings of item reviewers.

gypirical meant
'Ihedesignforthesecoxriexperinentwasanalogalstothem
Group Pretest-Posttest Design" presented by canrbell and Stanley
(1963, p. 7), and is symbolized as follows:

GRCUPl- o X o [treatientgroupﬁndepexrientcnnditimn
1 2

where:
0 indicates and observation or data collection, and
X indicates administration of a treatment.

Inthesecorristudy,subjectswerethesane(n=5)itemreviewers

 

“participatedintheinieperﬁentconditimfortheratingofitam
in Ebcperimentl. 'mefirstobservatimocxsistedoftheseelbjects'

ratingsmthefirstlooitemfrcmﬁbiperimtl.

'1hetreatmentinﬁbcperinent2wasdefinedasthepresentatim,to
eadiitanreviewer,ofthedistributionofitenratingsfrunthe
initial rating oftheloo item. 'Ihat is, eachreviewerwasptrovided
with all five ratings (one frun each reviewer, including himself) for
eachoftheloo item. Reviewerswerethenaskedtorevievthe
distributicmofratimsforeadiitenanitoprovideaseccniratin;
foreachoftheloo item.

'No precautions were taken. First, in the distributims of
ratings presented reviewers, a reviewer's initial rating was not
identified. Second, motherinformation, suchasthemeanrating for
each item or item difficulty value, was given to the reviewers. 'Jhese
precautions were taken in order to (1) avoid the possibility that
reviewers would autcmatically provide a second rating that was
identical to their initial rating, and (2) to avoid the possibility
thatreviewersvmldsinplyselectthegroupmeanoftheinitial
ratingsfortheseoondrating.

Fortheseoondexperiment,allmaterialsweremailedtotheiten
reviewers. Each subject was given the same informational packet
provided for the initial rating collection, a booklet containing the
testitemtoberated,andaformmwhidltorecordhisratings.
'meboddetcontairedthesamefirstlooitemfruntheinitialrating
task. Subjectswereagain askedtoprovideAngoff ratings forthe
setofitam,arxiwereihstructednottooonsultwiﬂlothergrulp
meibersabelttheratings ortooonsultoﬂlerinformational sources,

 

55

such as textbooks, resident physicians, etc.

1511'! :

'memostprunirentvalidityooncernsassociatedwithomducting
theseconiexperinentoentermaspectsofintenlalvalidity.
AIMSeveraltlueatstovalidinferenoesofcalmeardeffectare
controlled for in the experimental design, one threats to valid
inferences warrant attention.

First, Misaconcern. 'Iheinitial ratingsprovidedbythe
iten reviewers represented the first attempt for each reviewer at
application of the Angoff methodology. 'Ihe second set of ratings
collectedaspartofrbiperinerrtZrepresentthereviewers' second
experience with the methodology. (No reviewer used the methodology
between the data collection for Ecperiment 1 ani the data oollectim
for Experiment 2.) It is possible that experience with the
nethodologyoolldaocamtforsmedlangeinthesecaﬁitenratings.
It is believed, however, that the extensive training arrl practice with
the methodology prior to collection of the initial ratings was
sufficient and, therefore, little subsequent change would take place
simply due to experience with the methodology.

Secondly, as with Experiment 1, statistical conclusion validity
isofsaneconoern. 'lhemodestsanplesizeenployedinthesecond
emerinentneansthatafailuretodetectsignificantdlargesinﬂie
group's item ratings could be attributable either to the possibility
thatmtruednargeoccurredortothepossibilitythatatypen

errorwascrmnitted.

 

 

56
Wm

nesanehsmmsrtatimwasusedinmperimentztocollectthe
securisetofratimfrunitenreviewersaswasrsedtooollecttheir
ratingsinmperimentl. InEbtperinentz,however,therating
oollectimformwereshorteredtoacomdatetheratirrpforloo
itemardtheoriginal fiveratingsforeadlnarginwereprintedm
theformnexttoeadlitennmber.hsanpleratingfonnfor
Wzmhypometicaldistrihxtiaeofitenratingsis
providedinAmendixF.

mm

Statistical analyses ofthedata collectedforﬁhqaerimentZwere
aimed at identifying any differences in reviewers' ratings resulting
fronemoenetothetreatnent. Ofprinaryirrterestwaswxetl'ierthe
provisim of information (the distrihrtims of ratings) to iten
reviewers affected the reviewers' subsecpent ratings.

Analyticmethodsusedmthedataoollected formperinentzhere
similartothoseusetoanalyzethedatafrunmperinentl. Asin
Willem: 1, primaryinterestcenteredmanycbservablediffereioes
in mean ratings and on the possibility of differextial variability in
ratirgsasaresultoftheadditional informationprovidedtoiten
reviewers. An investigation into possible differences in variability
vastn'dertakenusihgatestforequalityofvariamesinpaired
observations as described in Glass and Hopkins (1984, pp. 268-269).
'nleseresultsadiressedtheqmstionofwhethertheprovisimof
additielal information to iten reviewers has a variance reducing
effect, a polarizing (variance increasing) effect, or no effect on

 

 

 

57

variation in the overall item ratings.

Arepeatedmeasuresanalysisofvariancewasalsoperformed (two
observations per subject). line repeated measures desigrn treated item
astheobjectsofmeaelrenent (n=100),withraters (n=5) being
observeddifferent conditiors (n=2). 'nneanalysespresentedinthe
nertdnapterfoansonthemaineffectsofomditimsardraters. A
significant effect of conditions would point to the fairly unambiguous
conclusion that exposure to the two conditions produces different
passing scores. A significant effect of raters (across conditions)
would similarly irriicate that different raters promnce different
passing scores. A rater by condition interaction effect was also
testedandaplotofthefiveraters' meansundereadioftheuwo
conditions was constructed to help interpret the interaction effect.
In Simmry, these analyses should provide information identifyirng
which factors contribute to differences in item ratings under the two
conditions--independent/no-information and independent/with-
infornnetion.

Further investigation into the effects of the treatments on
resulting passing scores parallel the analyses proposed in szeri-
ment 1. For example, the Pearson product-moment correlation and rank
order correlation coefficients were canprted to observe the
relationship between item ratings under the "no-information" and
"with-information" conditiors.

Decision consistency analyses were also performed to help
ascertain whether the provision of additional information actually
results in any improvement in the categorical classifications (ie,

pass/fail decisiors) that are trade on certification arnd licersure

 

 

58

examimticns. Asinthefirstexperiment, thetmprocedureswere
evaluated using 005151sz of the classification decisions as the
criterion. The classification corsistency indices of 3° and I? were
calculated and cmpared for the two conditions.

As in mperinent 1, absolute error of specification (E) was used
as the criterion to assess whether the information or no-infornetion
ccniition resulted in more accurate approodmtions by the reviewers of
the empirical (modified) pwalues. Relative error of specification
(E')wasalsoexamined, ingratingsmdereadxcoruitimtouae
condition mean ratings for each item.

Finally, to discern the extent to which iten reviewers querate
under an implicit "opinion revision" model, a regression analysis was

cmducted. 'Ihe linear model for the regression equation was:

=a+Bx +13 +e
2

y
ij 1 ij

H-XI

where: y istherevised(2nd)ratingforanitemibyareviaaerj
ij
intheindeperderrtgmlp:

x is the original (lst) rating for item i by reviewer j:

§ is the original group mean rating (across reviewers for
item 1:
a is a constant: and,
e is an error term.
The winion revision model postulates that the eventual (with
information) item rating for a reviewer is predicted from lowledge of

the reviewer's original rating ( x ) and the effect of knowledge of
ii

 

59

thegroup'soriginalratings(§ ). Suchamodelisusefulbecause
i

it further illuminates the effect of pure informational influences on
reviewers' subsequent item ratings.

 

IV.RFSUI£I'S

Etperimitl

m Mean Differences

Ofprimaryinterestinﬁbqaerinentlwasvmetherexposmetothetwo
conditions (i.e., the independent rating of items or the use of the
group-process method) resulted in differing overall passing standards.
Table 2 provides descriptive statistics comparing the ratings produced
tmierthetvnconiitimeandFigurelprovidesaplotofthegmuparﬂ
iniependent reviewers' means.

Visual inspection of the individual reviewer means listed in Table
2 suggests some interesting observations. First, each conditions
apparently contains one or more outliers. For example, while the
reviewer means and standard deviations for the independent condition
appear to be fairly similar (High to low range of means equals 11.00)
the variability of Reviewer 5's ratings is quite large canpared to the
rest of the reviewers in the independent condition. Similarly, in the
group-process condition, Reviewer 10 produced an overall mean rating
that was substantially lower that the other group-process condition
reviewers. Of note also is that the variability of Reviewer 6's
ratingsissatewhatgreaterthattheoﬂierreviewersinhisgmlp,
although still not as large as the variability exhibited by Reviewer

5. Interestingly, Reviewer 10, who produced the lowest overall

60

 

 

61

Table 2
Descriptive Statistics for Iniependent and Group—Process Reviewers
Across 200 Itene

 

 

Indepenient Condition Group-Process Condition
Starﬁard Staniard
w. m Mi gtion M Rev; E; 1E!) Mm ° ﬁg
1 60.23 17. 67 -. 255 6 45.57 21.91 .646
2 49.23 17.96 .252 7 60.68 16.56 -.561
3 57.23 16.66 -.201 8 64.13 15.24 -.771
4 51.18 17.26 .019 9 50.55 17.61 .140
5 58.79 26.55 -.217 10 34.13 14.41 1.373

Means 55.33 51.01

 

Ilean Rating

 

62

 

 

 

 

 

Reviewer Number

Figure 1
Plot of Indeperdent and Group—Process Condition Reviewers' Means

 

 

63

rating, also produced sane disproportionately low ratings within his
own distribution of ratings, as evidenced by his relatively large
positive value for skewness (1.373) .

As a first step in exploring for possible differences between the
ewe coalitiors, two variables were created (mDEPRA'IE aal MIPRATE)
torepresenttheoverall ratia; foreadiitenwiuiineadicoalitim.
INDEPRMErepresentstheoverallratingforeadlofthezooitens
providedbyreviewersintheialependentcondition.GRaJPRATE
representstheoverall rating foreachofthezoo itemsprovidedby
reviewers in the group-process condition. In each case, INDEERATE
“Wmobtairedbycelwlatingtheneanratingforeadx
iten across reviewers within the respective coalitions. This
procedure resulted in 200 pairs of ratings (one pair for each iten).
A correlation between the 200 pairs of overall ratings provided ualer
each condition (i.e., between INDEPRATE and MPRATE) was also
calculated and found to be .712, which was significantly different
fran zero at p < .001. albstantively, the magnitude of the
correlation seens to indicate that the reviewers in the two conditions
tendedtoagreeontheoverall ratings foriters.

The overall condition means calculated for the group-process
condition and the independent conditions were 51.01 and 55.33,
respectively. It shouldbemtedthatthesevaluesrepresenta
proposed passing score for each condition, expressed as a percentage.
That is, application of the standard proposed by reviewers in the
group-process coalition would result in a passing percentage of
approximately 51% compared to the approximately 55% correct standard
that would result from application of the standard based on the

 

 

 

inlepealerrtratings. 'Ihisneans,interusofrawscoretmits,that
the indeperleat conditim mean of 55.33% correct would reqaire
examineestorespaidcorrectlytolllitelsinordertopassthe

examination, “measthepassingstandardsuggestedbythegrulp'
process coalition (51.01%) would be only 102 items correct. It is
fairly obvious then, that regardless of whether the difference in the
caidition meats proves to be statistically significant, the nine-point
rawscorediffererneinpassingstaalardselggestedbythe‘uwo
coalitions is clearly of practical significance.

A test for a statistically significant difference between the two
ererallcarlitimnearswasconductedusingaaae-myanalysisof
variance (ANOVA). ‘Ihe reexlts of the significance test are presented
in Table 3. Despite the practical significance of the difference
betweexthetwosuggestedpassimstandardethereeiltsofthe
analysis of variance failed to reveal a statistically significant
difference between the two passing scores. The F test for significant
meandifferencesbetweentheoverall reviewerpassingscoreneanswas
nonsignificant, with F (1,8) = 0.55.

 

 

 

65

Table 3
Test for Significant Mean Differences between Independent
and Grown-Process Condition Passing Scores

 

INIEPRA‘I'E W

Mean 55.33 51.01
Starﬂard Deviatim 12.81 11.78
r INDEPRATE, W = .712 p < .001
Mean Difference ‘ = 4.318
Standard Deviation of 4.85 12.05

Overall Reviewer

Hears
m W e: E
Coalitions 46.70 1 0.554 rs
Error 84.30 8 -

'notal 80.12 9 -

 

 

 

66

Itisinterestingatthispointtomtethatthepresenoeof
profound practical significance in the absence of statistically
significant differences is a sanewhat miccnmm finding. It is
regularly observed that statistically significant findings can be of
little practical inportance (see, for exanple, Glass and Hopkins,
1984, p. 215-216). In the present study, however, although the mean
difference in conditim passing standards was not statistically
significant, a substantial effect on pass/fail classifications could
reentfrunevensmalldifferernesinﬂleeiggestedpassingstardards.

It should be noted that while the nonsignificant F-test reallt does
not indicate that the treahnent was ineffective, neither can the
presence of practical significance be confidently attributed to the
ecperime'ttaltreatment. 'lhedifferencesbetweenthetreatmentaal
cartrolgroupsggdbeduetothetreatmelt, althoughbecauseofthe
small sanple sizesused, thatclaimcamotbesubstantiatedbasedupm
thisexperinent. Inotherwords, thecbserveddifferencescouldbedue
solely to randan error and different groups of iten reviewers could
produce different results.

'nle extent of classification charges that would be seen if the
ialepexdentandgroup—processstandardsmreappliedtotheactual
distributim of scores observed for this examination was explored.
Application of the group-process condition standard (approximately 102
itenscorrect) wouldhaveresultedinapassingrateof93.0%anda
corresporrlingfailurerateof7.0%. 0ntheotherhaal,hadthe
indepeale'tt coalition standard been applied (requiring approximately
111 itens correct), the passing rate would have been 85.8% aal the
failure rate (14.2%) would have nearly doubled carpared to the

 

67
ialeperlent ccnditim failure rate.
Finally, to ascertain whether overall iten rating variances for the

engrolpswerehanogeleas,anF-testusingtheratioofthe
variancesofINDEPRATEardmmATEwasperfomed. Inthiscase,the
ratio of thevariances was 1.18, whidldidmtemoeedthecritical
value of 1.32 for alma= .05 wiﬂudf=199rnmeratoranddeaaninaton
'niisfinliagsuggestsﬂlattheoverallratirgsweremtmorevariable
mdereitherofthetwocmﬂitions.

mm

Separate raalanized block AMVAs [with n = 200 itels (raalan) aal
n=5raters (raalan)] wereperfornedtoleamiftheoverallratirgs
of individual reviewers within a coalition differed significantly fran
ead'lother. Additionally, theresults ofthetwoANOVAswerelsedto
adlressﬂleqiiestimofvmeuieretposuetoeiﬂlerﬂleialepeﬁeitor
group-process condition affects the variability of reviewers' ratings.
Plotsofthegrulpandindependentraters' overallmeansareprovided
inFigurelandANOVAresultsarepresertedin'I‘ableL Inspectionof
Figurelrevealsthat ratersinbothcoalitionswerevariable intheir
ratings. 'lheresults ofthetworaalanizedblockANOVAspresentedin
Table 4 present a similar piculre. 'lhe ANOVAs reveal a significant
effect for raters in both the group-process (F 4,796 = 143.12, p <
.001) and independent (F 4,796 = 17.17, p < .001) coalitions,
indicating that raters within a coalition do produce different ratings
(passing standards). As would be expected, the effect of itens was
also significant in both the group and independent conditiors.

 

Table 4

RaalanizedBlockAMJVAResults forlaleperleltaal

Gram-Process auditions

Ialependent Coalition

 

m mg I

Iteus 820.03 199 2.99*
(810616)

Raters 4701. 19 4 17.17*

Residual 273.79 796

'Ibtal 5795.01 999

*=p<.001

Group-Process Coalition

 

may: ﬁt B

693.64

29016.19
202.74

29912.57

199 3.42*
4 143.12*

796

999

 

 

 

69

Asecadvariabilityissuewasaddressedusingtheresultsfranthe
two randcmized block ANOVAs. Specifically, it was of interest to
learnwietherecpoelretothetwocmditions leadtomorevariable
Wratings. Toassessthiseffect, theratiooftheresidual
variances fran each ANOVA were carpared. Althmgh the residual
variances actually contain two sources of error (interactim effect
plus error), a non-significant finding would lead to the inference
that a within-block effect ms absent.

Because it was hypothesized that the independent coalitim might
lead to greater within-block variability, the variance estimate from
theinleperlertccmditionwasdwsenastheanneratorfortheF—test.
The test revealed a significant F ratio (F 796,796 = 1.35, p < .001),
indicating that the hypothesis of increased within-block variability
for the independent condition renains tenable.

 

As reported earlier in Table 3, a significant Poarson product-
mentcorrelatimbetweengrup-processcaﬁitimadixdeperlent
conditionoverall itemratingswasobserved (rINDEPRA'IE, GUJPRATE=
.712, p < .001). Calculation of the rank order correlation
coefficient yielded similar'results (r W, GUIPRA'IErank =
.702, p < .001). These results indicate that the group-process and
independent conditions provided overall iten ratings of moderately
corresponding linear relationship and rank.

'me interoorrelation matrix of all reviewers' iten ratings was also

producedaalispresentedin'rables. Visualirspectionofthe
correlations did not imaediately leal support to the hypothesis that a

 

 

and Gram-Process audition Reviewers

Intercorrelation Matrix of Ratings fran Independent

 

Group-Process Cbrrlition Reviewers Independent (tradition Reviewers

 

IRl ‘R2 1R3

 

-...J

3

R6 R7 R8 R9 ‘R10

.425

.331

.398

.301

.299

.336

.290

.347

.306

.345

.451
.247
.413
.351

.175

.283

.172

.317

.233

.353

.328
.292
.443
.351

.414

 

888%

.337

.270

.446
.340

.388

.J

 

71

reviewer's ratings would generally correlate more highly with other
within-condition reviewers' ratings than they would with ratings
provided by reviewers assigned to the other coalition. In fact, the
correlations ofhighestaallewestmagnitalewerecbservedm
conditions (highest: r 1,8 = .451; lowest: r 2,9 = .172). However,
all correlations were significantly different fran zero (p < .01) .

To test for significant differences between the average within-
condition correlations, the within-coalition correlatias (enclosed by
dashes in Table 5) were first transformed using Fisher's r to Z
transformation. After transformation, a mean correlation for each
condition was calculated and a test for significant difference between
the mean ialepealent aal group-process correlations was coalucted.
Although the mean correlations differed (mean group-process condition
correlation = .348; mean ialependent condition correlation = .311) the

difference was noisignificant.
Eision Oonsiieggz

'nleextenttowhichexpoelretothegmlp-processcoalitionaal
exposure to the independent coalition results in differing levels of
classification consistency was also examined. Indices of
classification consistency, so aal it , were calculated and are
presentedinTable 6. Asthetable shows, thegroupprocesscondition
edlibited a slightly greater index of overall corsistency than the
independent condition (80 = .958 and .930, respectively). However, it
should be noted that the contribution of the examination itself to

consistency of classification decisions was slightly reduced ualer the

 

 

 

72
A
grum—processcarlitimascmparedtotheindeperleitcarlitim(k=
.647 and .681, resmectively).

 

73

Table 6
Ialices of Decision Consistency for Independent and
Gram-Process Conditions

0 O O A A
easing: W B: Is
Independent 111 .930 .681
snip-process 102 .958 .647

 

 

74

ati

Forreviewers inthegrulp-processaalialeaerﬂertrating
canditiors, overall iten ratings foreachofthezoo itenswere

 

omparedtotheitendiffialltyvaluesdatainedfrmacmal
administration of the etaminatim. For all analyses cmparing the
reviewers' ratings to obtained difficulty indices, however, modified p-
values (MDDP) were used. The modification consisted of calculating the
p-valusbasedmlyoftherespmsesofeuaminees(n=217)whose
totalscorewaswithintwostandardermrsoftheoperationalpassing
score.

First, correlations were calculated between the overall independent
aalgroup-processcoalitimratings (DHJEPRATEaalGRcXJPRATE)aal
mop. Correlations were also mlculated for ialividual iten
reviewer's ratings and 1430?. For both conditiors, indivimlal
reviewer's ratings were found to be only weakly related to the
modified p-values. All individuals' correlations with LDDP were
significantly different from zero at p <.001, but ranged only fran a
low of .306 (for independent condition Reviewer 2) to a high of .423
(for group-process coalition Reviewer 5) . Overall condition iten
ratia; correlatiors with m were only sanewhat greater. The
correlation for the overall group-process coalition ratings and
modified p—values was slightly but not significantly lower than the
correlation between independent condition ratings and modified p-
values (r = .535 and .544, respectively).

lIwoialiceswerealsocreatedtoreflectthedegreeofagreement
between reviewers' ratings and two important criteria. The first
variable, E, was created to reflect the extent of agreement between a

 

 

75

reviewer's ratings and the modified p-values. The variable E can be
conceptualized as an index of absolute error of specification. 'me
secaadvariable, E', reflectsthedegreeofagreenentbetweeia
reviewer's ratirgsandtheneanratingsprovidedbyreviewerswithina
particular coalition. Cmprtatimally, E' for a reviewer j in
ca'ditimcvescalculatedbyaveragingtheratingsofreviewers
except reviewer j within coalition c. This variable, 8', can be
caroeptnalized as an index of relative error of mecification.
Table7presentstheobtainedvaluesofEaalE' forthefive
reviewers in each condition. Examination of the table leads to sane
interesting observations. First, cmpariscm of the values of E aal 13'
across conditions ialicates that, in general, iten reviewers exhibited
disconcertingly large errors of specification, although it is clear
thattheyweremrdibetteratestimatingrmotherrevieaersintheir
curlitimwmldrateitersthantheywereatpredictinghowthe

hypothetical minimally-carpetent group would perform. Secoal, when

 

Table 7

Absolute and Relative Errors of Specification for Iten Reviewers
in Ialependent aal Group-Process Conditious

 

 

Miller E 51 Mr E E:

1 23.46 14.31 6 28.10 15.99

2 26.42 15.58 7 23.49 15.74

3 23.72 13.51 8 24.56 17.75

4 24.53 13.26 9 25.33 13.47

5 27.90 19.59 10 32. 37 19.86
man 25.21 15.25 26.77 16.56
Starﬂard 1.90 2.59 3.57 2.39

Deviation

 

 

 

 

 

e""a‘1\’lat.:i.rgtheoverall performance ofthetwocmditias, itarpears
that. ﬁreialepe'de‘rtcmrlitimresultsinsligntlyinprovedaccuracy

of. Specification in both the absolute and relative sense.

 

Resporsesprovidedbytheloitenreviewersmtl'iepost-Meeting

mestimirewereusedtohelpassesswmeﬂsranybadcgrqmdvariable
ordiaracteristic oftheitenreviewerswasassociatedwithirmeased

precisim in estimation of iten ratings. Correlations were calculated
behweenbackgmnﬂvariables (cbtained frantheQuestiainaire) alrithe
two ialices of error of specification (E aal E').

Forthetotalgroupofreviewersetployedinthesuldy,m
badgrotmd variable or reviewer characteristic appeared to be related
to acoaracy in specification. Correlatims between all variables
including years of service on examination omittee, perceptions of
helpfulness of the informational materials, perceptions of ease of
inplenenting the passing score methodology, etc., and error of
specification variables, E aal 13', were all snall and not
significantly different frcm zero, ranging fron, -.401 to .026 with
most of the correlations near zero.

misresultisatoncediscmragingandmsuprising. Cntheone
hand, theresult ialicatesthatthestudydidmtisolateanyrelevant
reviewer characteristics that might help predict which reviewers would
producethemostaccirate itemratings. Cntheotherhand,this
finlingalsoprovidessaneevidencethatthehanogeieasbadcgmmds
aalperceptions ofthepanelof reviewersutilizedwerenotlikelyto
have contributed to the variability in item ratings.

 

a,”

a, u. 4,; m ml: W651”) 437%; ..

 

\.‘

 

78

“we separate completely crossed (items x raters) generalizability
Marseswereperformed,oneead1forthegmip-processaal
ixﬂqaealentratiagcoalitiors. AsmmaryofﬂsG—studyresultsis
presentedin'l‘ablea. 'Ihetableshowsthatthevariancecomponents
foritensaalthevariamecmponentsfortheitensbyraters
interactimaresomewhatsimilarinmagnimdeacrossconditiasard
similar to each other, ialicating that each of these two factors
contrihrtes roughly equallytothedependability ofthemeasurenent
process (i.e., to the rating of iteus). Ingeneral, standard errors
fortheItersandItersxRaters variancecomporentswererelatively
small, ialicating that they were fairly well estimated and that
subsequentd-stulyanalysesarelikelytobefairlyacclrate. Less
confideseinthepreciseestimationcanbeassignedtothevariance
catponents associated with raters in the presence of the relatively
largestandarderrorsforthesecaiponents.

 

Table 8

3‘1er of Generalizability (G-study) Results for Independent and
Group-Process Conditions

 

 

Variance Standard Variance Staalard
met if Md: m d_f M m
Item 199 109 . 25 16 . 59 199 98 . 18 13 . 99
Raters 4 22 . 14 13 . 57 4 144 . 07 83 . 76
Items x 796 273 . 79 13 . 71 796 202 . 74 10. 15

Raters

 

 

80

of particular practical note, however, is the relatively large
variance cmpment (aal associated large standard error) for raters in
the irriependentandgroup—procossoonditicns. mevarianoecoupment
for ratersintheindepexlentcmditionislargerﬂianismrmally
desirable, relative to the variance competent for itens (about one-
fifth itsmagnitiide). 'nievariancecarponentfor raters inthegroup-
process coalition is extremely large (144.07) relative to all of the
othergrop—processvariamecmpmem's, ecceedingevelthemagniuide
of the variance component for itens (98.18). Taken together, the
negnitaleofthesevariancecmporentselggeststhat,werethe
ialepeidentorgruip-processitenratingprocessestobeutilizedin
futurestudies, betterratertrainingoranincreaseinthemnnberof
raters would be profitable in terns of improved dependability.
Table9presentsaemnaryofd-suxlyresults. Theentriesinthe
table inclule estimated variance outpatients for five to 20 raters for
each ofthetwocoalitiors. Asonewould expect,thevariance
canponents associated with raters aal raters by items interaction
decreaseasthemimberofratersimreases (withthemmberofitens
held constant). Absolute error variance [6"(A )1, relative error
variance [cruf )1, indexof dependability (rpz ), aal signal-to-noise
ratio are also presented in the table. Absolute error variance
represents the variance of the difference between the unit of
neasurenent's observed aal "true" scores or, in this case, the
variance of the difference between the observed aal "true" iten
ratings. Relative error variance is another type of error variance,
whosenegnitrledepeidsofﬂledifferencesbetweenobservedaaltnle

score variances relative to population means for observed and true

81

Table 9

Sunmary of Gameralizability Analyses (d-Sthy) Milt-s

No. of V.C.
rm I_t_ar_s
5 109.25

6 109.25

7 109.25

8 109.25

9 109.25
10 109.25
11 109.25
12 109.25
13 109.25
14 109.25
15 109.25
16 109.25
17 109.25
18 109.25
19 109.25
20 109.25

v. C.

4.43
3.69
3.16
2.77
2.46
2.21
2.01
1.84
1.70
1.58
1.47
1.38
1.30

1.23

 

WM

Absolute Relative Mean
V.C. Error S/N Index of Error S. E.

yarianss _arianss B§_i9 .DEEEBQ; YBIiénQEQHEQB
54.76 59.19 54.76 2.00 .649 5.25 2.29
45.63 49.32 45.63 2.39 .689 4.46 2.11
39.11 42.27 42.28 2.79 .721 3.90 1.98
34.22 36.99 34.22 3.19 .747 3.48 1.87
30.42 32.88 30.42 3.59 .769 3.16 1.78
27.38 29.59 27.38 3.99 .787 2.90 1.70
24.89 26.90 24.89 4.39 .802 2.68 1.64
22.82 24.66 22.82 4.79 .816 2.51 1.58
21.06 22.76 X 21.06 5.19 .828 2.35 1.53
19.567 21.14 19.56 5.59 .838 2.22 1.49
18.25 19.73 18.25 5.99 .847 2.11 1.45
17.11 18.50 17.11 6.38 .855 2.02 1.42
16.11 17.41 16.11 6.78 .863 1.93 1.39
15.21 16.44 15.21 7.18 .869 1.85 1.36
14.41 15.58 14.41 7.58 .875 1.78 1.34
13.69 14.79 13.03 7.98 .881 1.67 1.29

Table 9 (Cont'd.)

i
q

82

Summary of Generalizability Analyses (drstudy) Results

No. of V.C.
Esters Items
5 98.18

6 98.18

7 98.18

8 98.18

9 98.18
10 98.18
11 98.18
12 98.18
13 98.18
14 98.18
15 98.18
16 98.18
17 98.18
18 98.18
19 98.18
20 98.18

 
   
    

Absolute Relative Dkana
V.C. V.C. Error Error S/N Index of Error S.E.
Eaters 1.818. yarianse yarianss Ratio Depend. yarianse Egan
28.81 40.55 69.36 40.55 2.42 .586 29.51 5.43
24.01 33.79 57.80 33.79 2.91 .629 24.67 4.98
_20.58 28.96 49.54 28.96 3.39 .665 21.22 4.61
18.01 25.34 43.35 25.34 3.87 .694 18.63 4.32
16.01 22.53 38.53 22.53 4.36 .718 16.61 4.08
14.41 20.27 34.68 20.27 4.84 .739 15.00 3.87
13.09 18.43 31.53 18.43 5.33 .757 13.68 3.70
12.01 16.90 28.90 16.90 5.81 .773 12.58 3.55
11.08 15.60 26.68 15.60 6.29 .786 11.65 3.41
10.29 14.484 24.77 14.48 6.78 .799 10.85 3.29
9.60 13.52 23.12 13.52 7.26 .809 10.16 3.19
9.00 12.67 21.68 12.67 7.75 .819 9.56 3.09
8.47 11.93 20.40 11.93 8.23 .828 9.03 3.00
8.00 11.26 19.27 11.26 8.72 .836 8.55 2.92
7.58 10.67 18.25 10.67 9.20 .843 8.13 2.85
7.20 10.14 17.34 10.14 9.69 .850 7.74 2.78

 

 

 

 
   
    

i
g

83

Scabres. 'meindexofdepeniability (sanetimesalsoreferredtoasa

Wimbﬂity coefficient) in this case takes into account the

batman-rater variance ooupcnent, since systematic bias in reviewers'
ratings would ccm’ribute error to the estimation of an item's "true"
rating. 'me sigral-to-mise ratio is the ratio of true score variance
to error variance arrl provides amther "reliability-like" irriex. For
exanple, a sigml-to—noise ratio of 2.00 would indicate that the
estimtedtmiversesccrevarianoeistwiceaslargeastheestimated
error variance.

Boﬂuﬂaegroip-processardirdepenientccrditiorsappeartoproduce
moderate signal-to-noise ratios and indioes of dependability with as
few as five raters. Of signifioant interest, halever, are the
relatively large variance carpcrexrts for raters in the group-process
condition and the correspondingly larger absolute error variance.
Interestingly, due to the relatively snaller rater variance cmpcnent
associated with the uriependent condition, higher overall indices of
dependability for the irdependent condition compared to the group-
process condition were observed. A practical application of this
relationship means, for example, that to achieve the same level of
deperriability found with only eleven raters in the independent
condition (approximately .80) would require 14 raters using the group-
process procedure.

Table9also incltﬂestheneanerrorvarianceardthestardard
error of the mean for five to 20 raters in each of the two conditions.
'Ihese values provide information on how well the passing score is
estimated across the samples of itens and raters. 'lhe irriependent

condition rating procedure appears to result in a more precise estimate

 

 

 

 

‘-

 

84
Ot the passing score than the group-process ratings: for a ZOO-item
mSt. ﬁzestandarderrorofthemeaniniicatesﬂiattheconfidane
intenralarundthepassingscorewmldvaryfrmaboutthreetofive
items for the independent oorditim (.0129 x 200 = 2.58; .0229 x 200 =
4.58) but fran six to 11 itans for the gram-process condition (.0278 x
200 = 5.56; .0543 x 200 = 10.86).

Inevaluatiigtheserosults,astamarderroroftheorttirgscore

 

of lasthantwoitatswasusedasamininallyacceptablestandard,
following criteria suggested by Norcini, et a1. (1987) . Applying this
criterion, only the ratings generated using the irdepenient condition
with at least seven raters would approach acceptability for standard
setting procedures (see Table 9). Homever, for both conditions,
signal-to—noise ratios and indices of dependability tended to be
smaller than would be desirable for standard setting.

m Analﬁis

As Nomini, et al. (1987) and Indonod, et al. (1986) have noted,
factors other than psychcrnetric concerns don influence decisions
regarding the conduct of passing score studies. Specifically, the cost
of enpaneling item reviewers (MS) may be prohibitive in many cases.
Because differing costs would likely be associated with implementation
of the group-process or iirlependent conditions, an examination of the
relative costs for each condition was undertaken.

For the following cost analyses, several assumptions were made.
First, itwasassmtedﬁmtagroxmofSME‘s(n=10)weretobe
empaneled to provide ratings for a ZOO-item examination. For analysis

of the group-process condition, it was assumed that nine of the ten

 

\

 

i

85

Mademvmldmmtravel,lodging,arﬂneale2qaersesinorder

tptraveltothepassirgscorestudysiteandparticipateinaniten
review procedure lasting two days.

Two variatiors of the irdependent rating condition were explored in
theanalysis, inadditiontothegroup—processcondition. Forcne
variation (hereafter called the "without-meeting" condition), it was
assumed that the panel of reviewers would be nailed informational
materials explaining the passing score methodology, then the test items
tobereviewedwouldberated irriepenientlyarrirehmiedbymailtoa
central site. The second variation of the independent condition
(hereafter called the "with-meeting" condition), assumes that reviewers
wouldtravel toasingle site foraone—halfdayneetinginorderto
becane familiar with the passing score methodology. Reviewers in this
condition would then receive a booklet of test items to be rated, would
return to their cities of origin, and would return their ratings by
mail.

'I‘ablelOpresentsasmnar-yofcostcatparisons forthegroup—
process condition and the two variations of the indeperrient condition.
CostsestimatedinTable loarebaseduponfigurespublishedinthe
grants Travel Index for 1988, the most recent year for which
complete information was available. (To adjust for inflation, figures
listedintheMwereincreasedbyafactorof 1.1236. The
adjustment factorassumesauniform 6%peryearincreaseincostsdue
to inflation and was applied to all travel expense categories.)

 
     
 

 

86

Table 10
CmparismofCostsforOorductjrgaPassirgScoresuﬂymﬂer

Group-Process and Independent Conditions

Group-Process Coalition

Independent Oonditicms

 

Time 2 days/2 nights
Air Travel $4000.00
Lodging 1309.99
Moals * 1014.84
Transportation ** 200.00
Informational 12.50

Mailirg +
Test Itaxs n/a

Mailing ++

Test Itasteturn n/a
Mailirg++

UIHEIS $6537.33

Notes:

with Meeting

1 day/0 nights 0 days/0 nights

$4000.00
n/amu
507.42
200.00

12.50

85.00

85.00

$4889.92

*=inc1udestaxandgratuity.

**=assnnes$10.00perpersonead1waytoandfrunneeting

site .

*1": = assumes travel to and from meeting site in one day.

+ = first-class postage costs only.

H- = secure-method postage costs only.

Without Moeting

n/a
n/a

85.00

$182.50

 

 

87

The Mprovidescostsassociatedwithtraveleaqaerseoategories
rankedbymajorcity. ‘Ibdeterminecostsforthisshidy, themedian
city listedintheLJQexwasused. Foreuanple, formealcosts,
Harrisburg, Pennsylvania represented the median of 100 U.S. cities for
maloosts. 'Ihus, mealoostswereestimatedusirgtlledatasqplied
for Harristnrg; however, for other travel categories, the median najor
city within the category was used (i.e., not necessarily Harrisburg).

Itshouldfurtherbemtedthatexpensesassociatedwiﬂitravelard
consultation by a psychometrician or testing organization
representative have not been included in the following analyses.
Bemuse licensure and certification boards vary in the extent to which
they utilize in-house psychometric services or contract with external
camsultants, itwasdecidedtoonitumisvariablecostfronead:
ccndition presented. Also excluded because of wide variability are
expersesforconferenoeroanrentalandequiprentrentalforthe
group-process and with-meeting conditions. Like the area of
psychonetric services, organizations vary widely in the extent to
which they utilize "hate office" facilities or conduct meetings off
site. It is recognized that the exclusion of these experses probably
resﬂtsinadcwmardbiasintlmoverallcostestimtesfortlm
group-process and with-meeting conditions.

'No additional assumptions should be noted. First, because air
travel costs are extrately variable, depending on the city of origin,
destination, class of service, and time of week, the costs for air
travel were estimated to be $400.00 per person using figures obtained
frun a national travel service agency for round-trip weekend travel to

and frun "Anywhere, U.S.A." Also, the wit-h-neetirg variation assumes

 

 

88

that reviaaerswwldmtreqmieovemigntlodginginorderto
participate in the meeting. In situations where overnight lodging is
required due to distance travelled, flight connections, etc., lodging
costsmlldbeirulrred. 'Itms, thewith—neetingcostlistedin'l‘able
10 represents a lover bound estimate for that variation.
Emulationof'rableloSlggests, basedupontotalcostsforeadl
of the three conditiors, that the 'Witlmt-meetirq" oordition is by
fartheleastcostlymethodofcorductirgapassingscoresuldy.
Totalestinatedcostsforﬂlethreeconditiousarezerotp-process
condition, $6537.33; With-meeting condition, $4889.92; ard Without-
meeting condition, $182.50.
mile it is true that the without-meeting condition is the least
costlywayofconductingastarﬂardsettirgprocedurewlenonly
mtaryexperdituresareconsidered,therearecertainlyother
factors that should be discussed. For example, earlier in this
sectionitwasobservedthatthegroup—processconiitimsard
independent conditions resulted in statistically and practically
meaningfuldifferencesinpassirgscores. 'Ihisiscertainlynota
factor that should be ignored. Arnther factor beside mtary cost
thatstnuldbeconsideredistheoostinternscftime. Formany
professions, it is quite difficult to identify SME‘s who would be
willing to forego two days of personal time or time away from
professional activities in order to participate in a passing score
study. For this reason, the "Without-meeting" condition, which would
not require set-aside nesting time, could be viewed as the most
econonical. However, it should be noted that no data were collected

asapartofExperinentltoaddressthepsydlmetricpropertiesof

 

 

89

the Without-meeting condition. For that reason, this alternative an
cnly be evaluated in terns of its econcmic feasibility and no
occlusions regarding the accuracy or variability of without-meeting
procedure results can be offered.

Insmmary, itwasobservedthatexpectedsavingsinternsoftime
and financial resources were observed for the "With-nesting" and
"Withort—meeting" variations of the iniependent condition when
carpared to the group-process condition. However, it should be
furthermtedthatanysavingsirnuredmlderanynetlwdwouldrestnt
in trade-offs that should be considered when those responsible for
standard setting actually select a procedure. Also, sane
investigation of actual results from a without-meeting standard
settixgsuﬂyseenswarrantedbeforeanystatementsregardirgits
propriety should be made.

Etperirentz

Winn Man Differences

0fprinaryinterestin£b¢perinent2waswhetherexpcsureto
additional information (i.e., the distribution of the reviewers' om
initial item ratings) would result in differing overall passing
standards. Table 11 provides descriptive statisties comparing the
ratings produced under the two conditions: "no-information" and "with-
information." The no-infcrmation condition is defined as the
iniependent provision by reviewers of Angoff ratings for the 100
items. These ratings were collected as part of Dcperiment l. The
with-information condition is defined as the inieperxient provision of

 

 

90

amusetratirgsforthesanelooitasbythesanerevieders,who
were subsequently provided withthedistrihltion of ratings generated
under the rue-information condition. Figure 2 shows a plot of the
reviewers' mearsacross 100 itals, observedundereadl oftheseuvo

coalitions .

 

 

Table 11

Descriptive Statisties for No-Information and With-Information

Reviewers Across 100 Itans

 

 

First mting Second Rating
(No Information) (With Infornation)
Standard Stardard
Miser ﬁn Miation Lew m Deviation S_kg
1 57.35 17.56 -.141 62.50 18.46 -.382
2 52.95 19.93 .015 62.64 18.45 -.784
3 55.75 17.02 -.053 61.99 21.77 -.288
4 51.80 18.11 -.092 59.75 14.64 -.450

5 56.64 25.70 -.076 53.50 22.59 -.024

 

92

 

 

 

 

“-
”-
2' u-
=
l
C
C
C
a «-
5‘.
521
--I-- Win
a I I I I I

 

Rovlowor Number

Figure 2
Plot of No-Infonnation and With-Information Reviewers' Meats

 

 

 

 

93

As showninbothTablellandFigure 2, overallratingsproduoed
With knowlédge of the other reviewers' ratings are sanewhat greater
ard less variable. Again, as in Experiment 1, one reviewer's pattern
ofratirgsdivergedfrmthetrenisuggestedbytherestofthegmlp.
Domination of Figure 2 reveals that Reviewer 5's overall ratings
decreased under the with-information condition conpared to the other
reviewers whose overall ratings increased. Reviewer 5's overall with-
information rating was also quite different fran the fairly tmiform
overall with-information ratings provided by the other reviewers.

'movariableswerecreated (NOINFOandWI'n-ENFD) toreflecteach
itan's overall rating under the two conditions. NOINFO represents the
initial overall rating provided by the reviewers for each of the 100
items. WI'n-IINFO represents the second (with information) rating
provided by the reviewers for the sane items. In each case, NOMO
arriWI'IHINFOwereobtainedbycalculatirgtheneanratirg foreadl
item across reviewers within the run-information and with-informtim
conditions. This procedure resulted in 100 pairs of ratings (one for
each item) .

The overall means for each condition and other descriptive
statistics are presented in Table 12. The correlation between the
overall ratings provided under each condition is also reported in
Table 12. The magnitude of this correlation indicates fairly strong
intra-reviewer agreement between initial and subsequent item ratings.

 

 

94

Table 12
Descriptive Statistics for No-Informtim

and With-Information Cmdition Passing Scores

NOINFD WI'IHINI‘O
Mean 54.90 60.08
Standard Deviation 13.38 14.31
r NOMO, WI'HIDIPO = .890 p < .001

khan Difference 5. 178

 

 

95

It shouldagainberntedﬂlattheoverall calditionmeansrepresent
theproposedpassingscore, expressedasapercentage, thatwould
result frcm each condition. For exanple, the m—information condition
would result in a passing percentage of approximately 54.9% conpared
tothe60.1%thatvm1dresultifthepassirgstarﬂardwere
established using the with-information ratings. This means that the
tic-information standard of 54.9% would require examinees to respond
correctly to approximately 110 item on a full test of 200 item in
ordertopasstheexamination. 'lhepassingstarxiardelggestedby
reviewers in the with-information condition (60.1%) would trarslate
into a passing score of approximately 120 item correct on a ZOO-item
test. Regardless of the statistical significance of the difference
betweenthetwoconiitionmeans, the 10 rawscoreunitdifferencein
passing scores over a 200-item examination is clearly of practical
significance.

'Iotestwhetherthedifferenceinoverall conditionmeanswas
statistically significant, a repeated measures analysis of variance
(ANOVA) was conducted. The full model specified three factors: Item
(n = 100); Raters (n = 5); and Conditions (replication) (n = 2).
ResultscftherepeatedneasuresANOVAarepresentedinTablen.
Despite the substantial practical significance noted earlier, results
of the repeated measures ANOVA failed to reveal a statistically
significant difference between the two conditions. However,
inspection of Table 13 shows an expected significant effect for item
and a significant effect for raters. Clearly, the results irriicate

that both items and raters affect overall passing scores.

 

96

Table 13
Reeated Measures ANOVA Results for No-Information and

With-Information Cowﬂitionm

marge W W or 1"
W
Raters 9526.25 2381.56 4 11.71*
11'” . Sub. I
Conditions 881.75 881.75 1 0.77 ns
Raters x Conditions 4557.75 1139.44 4
Item 179325.73 1811.37 99 8.91*
Item x Raters 80505.22 203.30 396 .72
Item x conditions 10940.29 110.51 99 .39 no
I x R x c, e 111660.91 281.97 396
m 397398.00 397.80 999

*=p<.001

 

97

A test for homogeneity of variances with paired (dependent)
diservations was also performed. The calculated statistic (t = 1.46)
did not exceed the critical value of 1.98 at alpha = .05 with 98
degreesoffreedan. 'Ihisresultfurthersuggeststhatoverall iten

ratings were not more or less variable mlder either condition.

gationship between With-Information and No-Eormatim him

As shown in Table 12, a significant Pearson product-mm
correlation between tic-information ratings and with-informatim
ratings was observed ( r NOINFO, WI'IHENFO =.890, p < .001).
Calculation of the rank order correlation coefficient yielded similar
results (r NOINFOrank, WI'IHINI‘Orank = .871. p < .001). These results
indicate that the lac-information and with-information conditions
provided iten ratings that were highly similar.

An interccrrelation matrix of item reviewers' first and second
ratingswasalsoproducedardispresentedin'rableu. Visual
inspection of Table 14 reveals that reviewers' first arrl second
ratings (i.c. , under Ila-information and with-informtim conditions)
are generally moderately correlated, ranging fron a high of .759 (for

Reviewer 5) to a low of .485 (for Reviewer 4) with a mean of .673.

 

 

Interoorrelation Matrix of Ratings frun No-Informtion

and With-Information Condition Reviewers

tic-Information Reviewers

(Initial Rating)

Table 14

With-Information Reviewers
(Second Ratlng)

 

R11 R21 R31 R41

 

R11'\—- .389 .367 .249

\

R21 “e:- .305 .259
m1 ‘ ‘ v: .338
R41 \ ‘7‘
R51

R12

R22

R32

R42

-n-————————

.650

.522

.455

.276

.323

R22

.484

.749

.449

.473

.449

R32

.306

.372

.722

.296

.436

R42

.255

.474

.251

.485

.244

R52

.273

.399

.452

.559

.759

 

.433

.483

—
\

.321

.516

.294

.401

.517

.519

.320

\-—-..—--——————

 

99
'IVwogroupscfcorrelatiaisareenclosedbydashedlinesinTable
14. 'meemircledvaluescorrespcrdtothecorrelatiambasedmlyon
Ila-information ratings (upper left) and those based only on with-
infornatim ratings (lower right). After transforming these
correlations using Fisher's r to z tramfornatim, a mean correlation
foreadlccrditimvascorprtedanithetwooverallneanswere
cmpared. As hypothesized, the average with-information correlation
exceeded the average no-infornation correlation (.473 > .337);
however, thedifferernebetweenthemomeancorrelatimswasmt
statistically significant.

12 i . g . !

‘meextenttowhidlecposuretothem-informationardwith-
information conditions resulted in differing levels of classification
consistency was also examined. Indices of classification consistency
p and k were calculated for each condition using the passing scores
suggestedbyeach. 'IheresultsareshowninTable 15. AsTablelS
shows, application of the tic-information passing score would result in
a higher overall index of classification consistency ( so = .934 ),
cmpared to the with-information condition index ( so =- .898 ).
Accordingly, the contribution to classification of the examination
itself to consistency of pass/fail classifications was greater under
the with-information condition ( if: = .706 ) compared to the no-

A
information condition ( k = .678 ).

 

l3:
s 1.:
E
A
- is
at
: 5
j“
.3
-E-I
— t
'.—'_
I:
-51.-
.3.
E,
E
‘2.
3
3:-
h.
at.
I
J.
3‘:
if

 

 

 

100

Table 15
Indices of Decision Consistency for No-Information and

With-Information Coalitions

,. A
m’tion W B; Is
No Information 54.9% (110) .934 .678

With Information 60.1% (120) .898 .706

 

101

ati ' f ' t'

For iten reviewers in the Ira-information (NOINFO) and with-
information (WI'IHJNFD) conditions, overall iten ratings for the 100
item were cotpared to iten difficulty indices resulting from the
actual administration of the examination. As in Experiment 1,
modified walues (mop) were used, obtained by calculating each
iten's difficulty based only on the resporses of examinees whose total
scorewaswithintmstardarderrorsofulepassingscore.

Correlaticlm were calculated between the overall NOINFD and
WITI-UNFO ratings and 100?. Correlations were also calculated between
individual iten reviewers' ratings and mop. For both conditions,
individual reviewers' ratings were found to be moderately related to
FDDP. Interestingly, the lowest correlation with FDDP ( r = .197 )
was observed for a reviewer in the with-information condition, while
thehighestcorrelationwithmDP(r= .505)wasobservedfora
reviewer in the Ila-information condition. Also, surprisingly, the no-
information condition produced a higher (thcugh Iran-significantly)
overall correlation with modified mlues ( r = .590 ) than the with-
information condition ( r = .573 ).

'Ihebwoindicescreatedtoreflectthedegreeofagreenentbeoween
reviewers' ratings and certain criteria (E and E') were also
calculated for each reviewer. Table 16 presents the obtained values
of absolute error of specification (E) and relative error of
specification (E') for the five reviewers under Inc-information and
with-information conditions. Comparison of the values displayed in
Table 16 indicates that, generally, absolute errors of specification

are ally slightly reduced through the provision of additional

 

 

102

information. The mean absolute error Of Specification for the with-
information condition (24.12) was quite close to the mean for the no-

infonnaticn condition (24.93). However, relative errors of

specification were also sightly reduced under the with-information

condition (mean = 13.43) canpared to the rue-information condition

(mean = 14.81).

 

 

 

103

Table 16
Absolute and Relative Errors of Specification for Itan Reviewers
in bio-Information and With-Informtion Conditions

tic-Information Condition With-Information Condition

 

were: E E: E E1

1 23.52 14.09 22.48 12.98

2 26.27 14.76 22.89 10.99

3 23.95 13.24 24.95 14.36

4 25.94 13.77 25.84 12.78

5 24.99 18.17 24.46 16.04

Mean 24.93 14.81 24.12 13.43
Standard 1.20 1.96 1.41 1.89

Deviation

 

 

104

In evaluating the effect of the provision of additional
informtim, it is again observed that individual item reviewers were
mare proficient at estimating the overall group rating for the itens
thantheyvereatpredictinghawuiehypoﬂietioalminimally-cmpetert
amines group would perform.

grass—1mm

In order to further evaluate the effect of providing additional
intonation to item reviewers, five regression analyses were
performed. A regression model was developed which reflects the
hypothsis that an individual reviewer's second (i.e., with-
information) rating can be predicted by kmwledge of his original
(without-information) rating and with hmledge of the group's
original mean rating (with the group mean calculated ewluiing the
iniividualreviewer). 'Ihesetmratingswereusedastheiniepenient
variables in the regression equations with the reviewer's revised
(with-information) rating used as the dependent variable.
'Iheoretically, the model assumes that reviewers' make their juignents
about iten ratings based upon their own procedure-related knowledge;
that is, knowledge regarding the hypothetical minimally-cmpetent
eminee group and the difficulty of the itexrs being rated. And,
reviewers take into account information gleaned from other reviewers:
in this case, fran the distribution of reviewers' initial ratings that
wasprovided fortheiruseinthesecondroundofratings.

To assess the likelihood of such an effect, five regression
analyses were conducted, one for each reviewer according to the
proceduredescribedabove. Resultsoftheanalysesarepresentedin

 

 

105
Raw (non-standardized) multiple regression equations are

Presentedinthetable,alm1gwithu1ecorrelatiorsbetvee1thet\o

Table 17 .

ixdqzeniem: variables, the multiple R, and R squared. In each case,
the correlations between the independent variables are low to
moderate, suggesting that the choice of independent variables does not
pose a threat of nulticollinearity. For each regression performed,
analyses of plots of predicted values against residuals revealed no
disconcerting patterns; plots were broadly scattered and all residuals

hadmeansatornearzero.

 

 

1 y = 8.805 + .535(x1)
2 y = 5.745 + .526(xl)
3 y =-1.o73 + .789(x1)
4 y =29.528 + .290(x1)
5 y =-7.161 + .537(x1)

+

+

+

+

106

Table 17

.424(x2) + e
.524(:a) + e
.349(x2) + e
.273(x2) + e

.555(x2) +'e

.425

.461

.456

.480

.476

Regression Analyses for Individual Reviewers in Experiment 2

.715

.827

.750

.537

.807

Notes: xl=origina1 rating for itemibyreviewer j, and

2
MWK

.511

.683

.563

.288

.652

:a = group's original mean rating for item i carputed with

reviewer j excluded.

 

 

107

'me hypothesized influence of additional informtion appeared to be
Wident in each of the regression analyses. For every reviewer,
Values of b and b were tested for significant difference fran zero;
in all oasis, thezt statistic were significant at p < .01. Further,
the moderately high values of multiple R and (with the exceptim of
Reviewer4)themoderatevaluesofquuaredsuggestthatthe
regression model has accounted for at least half of the variation in

reviewers' ratings.
mined Imults

Selectedresults franEbcperiuentlardEbcperimerrtZwerecmbined
to achieve an overall assessment of the effect of the various stardard
setting procedures. First, the gmlp-process ratings frcm Ecperiment
1werereana1yzedtoobtainthepassingstarriardthatwouldrosult
usingrating forthefirst 100 itets only. 'Ihiswasdonesothat
directcatparisorscouldbemadebetweenthepassixgstarﬂards
suggested by the group-process condition, the iniependent/m—
information condition, and the irriepenient/with—information condition,
anithestandardstobeoouparedwondbebasedupmratingsofuie
same 100 items.

Table 18 presents the results of the combined analysis. Several
striking differences between the three procedures are apparent.
First, the mean item ratings for the three procedures differ
corsiderably, from a low of 48.88% (for the group-process condition)
to a high of 60.08% (for the indeperrient/with-infomation condition).

The dramatic impact that differences of this magniulde would have on

 

 

scum classificatim decisions is also shown in Table 18. For

ample, the lowest passing rate (77.4%) was observed for the
independent/with—informatim condition, while the highest passing rate
(95.0%) was observed for the group—process condition. Accordingly,
failure rates also varied dramatically, from 5.0% for the group-
processconiitimtonearly41/2timesasgreatforthe
indeperrierrt/with-informtion condition (22.6%) .

 

 

 

109

Tab1e18
Outparismofmperimentlandmperinentz
SuggestedPassingStarriards

 

-————oondi tic-s

deependerrt Irriependent
Mean Iten Rating 48.88 54.90 60.08
(across reviezers)
Standard Deviation of 11.60 2.41 3.86
Reviewers' Overall
Ratings
Standard Error 5.19 1.08 1.73
Passing Score 97.76(98) 109.80(110) 120.16(120)
(rounded)* .
95% Confidence Interval 88, 108 108,112 117, 124
for Passing Score
Percent Fossing 95.0(5.0) 86.8(13.2) 77.4(22.6)

(Failing) **

* = adjusted to reflect passing standard for a 200—item test.

** =based onpassing score dotainedusing Beuk (1984) method.

 

 

 

110

'Ihe issue of variability among individual reviewers' overall
ratings (i.e, suggested passing standards) is also highlighted by the
results displayed in Table 18. 'me wide variability across reviewers
inthegrulp-processconditionisecprossstatistieallyinthelarge
standard deviation of group-process reviewers' ratings (11.60). This

fairly large value for the standard deviation of group-process
reviewers' ratirgs is also reflected in a correspariingly large
standard error (5.19) and a very wide confidence interval (88, 108).

On the other hand, both of the indepenient conditions (i.e., the
no-information and with-information conditions) displayed
cmparatively smaller standard deviations for reviewers' overall
ratings and correspondingly smaller standard errors and confidence
intervals. Surprisingly, the smallest standard error (1.08) and
narrowestconfidenceinterval(plusormi1m52rawscorem1its)was
observed for the incleperxient/m—informatim condition.

In sunmary, it should be eurhasized that thae fairly large
differences nay—or may not—be attributable to exposure to the
experinental conditiors. Because of the small panels of item reviewers
utilized, it is possible that the results could be explained by randcm
error. Althargh the social interaction hypothesis would predict the
observed results, the failure to achieve statistical significance for
groupneandifferencesdoeenotmlearttheobservationofthese
results due to diame.

 

 

V. WEN

IIhisstudyconsis‘tedoftwoexperiments. 'Ihepurposeofthefirst
eqaerinentwastoexamineorevariatimofthetraditicnalgrulp-
processprocedureof establishingpassingstandardsusingtheArgoff
standard setting methodology. The variation studied consisted of
having iten reviewers generate their Angoff ratings under an
"irriepenient" coalition in which the usual effects of the gralp-
process procedure (e.g., social cauparison, sharing of information,
etc.) Guild be controlled.

'meplrposeoftheseconiexperinentwastoisolatetheeffectof
one souroe of information that item reviewers use in generating their
ratings—lowledye of the ratings provided by other (peer) reviewers.
'Iheresults of eadleqlerinerrtaresmrarizedbelwarrialistof
major findings and inplicatiors of these results is presented.

mperimrtlamry

Mean Ratm and Variabiligy
Ten item reviewers in Experiment 1 provided Angoff ratings for 200
items on a medical specialty certification examination. Before
providing their ratings, reviewers were given informational materials
and participated in a training session to ensure their familiarity
with the methodology. After this, reviewers were randanly assigned to

111

 

 

112

oneoftwoconditions: anjndependentcmditiminwhidlreviewershad
no interreviewer interactions concerning their ratings, and a group-
process condition in which reviewers freely discussed their ratings
for itets, iten difficulty and relevance, and their cuseptiors of the
hypothetical minimally-canpetent candidate group.

Wiretothetwoconditionsproducedvariedresults. 'Ihe
primarqustionofinterestwaswheﬂlerecposmetotheconiitias
would yield differing passing standards. In Experiment 1, the passing
starﬂardsdatainedshmwedﬂntuleirdeperﬂentomﬁitimrosultedina
starriardﬂlatwasarprmdmatelynirerawscorepointshigherﬂlanthe
group-process condition. However, that difference was not
statistically significant. Although the independent condition
standard was higher, overall group item ratings provided by reviewers
in each condition were nearly equally variable and fairly highly
correlated.

Asecondvariabilityissueaddressedinﬁbtperimentlwaswhether
the two conditions resulted in differential ratings for individual
items. As hypothesized, independent reviewers exhibited, on average,
a slightly wider spread of ratings for individual items than did
reviewers in the group-process condition. This result canplements the
earlierobservationofthehigherstarriardsuggestedbythe independent
grwpinthattheabseroeofreviewerinteractioninuleindeperdent
group may have contributed to this result. Conversely, the variability
of the group-process condition ratings for individual itens may have
been reduced due to the effect of group interaction.

It is critical at this point, however, to highlight the failure

to achieve statistical significance for observed differences between

 

113
neanratixgsforthetwcccnditiorsinmperiuentl. Alttnlghthe
resultswmldelrelyresultinlargepracticnlcasequenceeforthe
examines pqnlation, the profession, and the certifying board,
cmfidentstatanentsregardinguiereprodicahnitycfuaereantcarmt
be made. Specifically, the failure to reject the null hypothesis for
group mean difference means that the results could be explained sinply
with reference to randan error: Different groups of iten reviewers
calldbeelpanelledandarriveatidenticalpassirqscoresorevenat
differ'e'rtpassingscor'esinusqpositedirectimasttnseobservedin
thisstudy.

gision 00mm

Boththe ﬁﬂeperientardgmlp-processconditiasyieldedhigh
indicee of decision consistency, as evidenced by the coefficients a
andlt. However, reitherthefactthatbothirﬂiceswerehighorthe
fact that the group-process condition yielded slightly higher
coefficients is particularly mteworthy: These findings can be
explained by sinply noting that the examination itself was highly
reliableardthatboththeinieperrientardgrmp-pmessspassing
scoresweremtveryclosetotheoverallneanscoremthe
examinatim (with the group-process cordition passing standard located
sligl'rtlyfurtherfrantheoverallneanscorethanthestamd

Wbytheirdepeiﬂentgrulp).

Relationship of Ratggs' to Obtained Item Statistics
ﬁg Reviewer Characteristics

Ratings frm itenreviewersintheirriepexrientandgmlp-process
conditions were ccmpared to p—valuee which were calculated using only

 

 

 

 

114

the respcmes of the hypothetical minimally-ccmpetent candidate group.
Althmgh, for all individual reviewers, correlatious between item
ratings and modified p-values were significantly different frtm zero,
all of the correlatiors were uniformly low. men when outlined to
form group average item ratings, correlations with modified pwalues
were underate at best.

Similarly, the magnitude of the variables E and E'
(cornqluwalized as average errors of specification for iten ratings)
imicated that individual item reviewers, in general, exhibited a
fairly large degree of error when attempting to estimate the
performance of the minimally-carpetent group, as evidenced by the
large values of E. It is of snall coxsolation that reviewers could
more accurately provide estimates of their overall group item ratirgs,
as evidenced by the relatively smaller values of E' .

These findings, taken together, all confirm one canon criticisn
of the Angoff standard setting methodology—that item reviewers often
experience some difficulty in accurately concepulalizirg the
minimally-competent examinee group.

Further, precision in estimation of item ratings does not appear
tobedeperrientuponanyofthereviewercharacteristiesneasuredin
this study. For exanple, one might suspect that the more experience a
reviewer had with producing and reviewing test items would lead to
more accurate specifimtion in item ratings. This result was not
deserved. Likewise, neither was a significant relationship observed
betweentheextenttowhichreviewersreportedtotmderstarrlthe
Angoff nethodology or their confidence in its results am! the

precision of their ratings. These results do not rule out the

 

 

Possibility that other reviewer characteristics do cartrihrte

Slbstantially to accuracy in iten ratings; perhaps other significant
badcgmmdvariablesexistthatweremtmeasnedinthissuxiy. 0n
theotherharri, itisalsosmnewhatencwragingthattheneasned
variablesdomtameartoinfluencerevieweracwracy. Ifstandard
settingbodiescanbelessconcenxedabwtthesevariablesvmen
epaneling reviewers, thepool ofpoterrtial reviewers migi‘rtbe larger,
possibly widening to include participation by able reviewers who may
haveot'herwisebeeneccltxled.

Generalizabilig Analw

Generalizability analysa were conducted to investigate differing
sourceeofvariationinitemratingssothatpotentialfuture
applications of either the iniepemlent or group-process procedures
could be developed to yield increased dependability of measurenent
(i.e., depe'ﬂability of item ratirgs). G-study results indicated that
variance carpcments were fairly well estimated (except for the group-
process conditionraters carponerrt) andwouldbeuseful forsubsequem:
d-stnﬂyanalyses. D-studyresults fortheirxieperde'rtandgruzp-
precessca‘riitionsweredotained,varyirgthemmberofreviewers
whileholdirgthemmberof itatsoonstant. 'Iheresultsstxmedthat
slightly increased neasureezm dependability was achieved using under
theiniependentcmditimascmparedtouaegruxp-pmcesscuditim,
with acceptable results for operational purposes achieved with
apprminately 11t015reviewers. 'Ihis firdingiscontrastedwimthe
suggestions ofsanethatat least sixtoseven reviewersbealpaneled

for passing score decisions, although others (cf., Cross, at al.,

 

 

 

116

1984, p. 116) have also smested that eupaneling 15 or more reviewers
is desirable.

D-studyresults alsosggestedthataddingmoreitanreviewers
(or, possibly, more extensive reviewer training) would likely result in
irmeasedneasnerentdeperﬁabﬂitymﬂereitherﬂueinieperdentor
group-process oonditiors, though more so in the group-process
condition. In practice, of course, inzreasing the nuuber of test item
would also, generally, inprove overall dependability. However, with
testlergth forthetestmﬂersbadyalreadyfairlylong (n=200
itens), more and bettertrained reviewers would likely be a more
practical, less costly, and more efficacious method of addressing the

issue of increasing the accuracy of item ratings.

Cost Analﬁis

Becausetheindependentitenreviewprocedurewasproposedasan
efficient alternative to the gram-process procedure, a cost analysis
wasalso conducted. Asexpected, the financial costsassociatedwith
inpleentation of an iniepenient/with—neeting rating procedure were
lowerthanﬂiecostsassociatedwiﬂzcmﬂuctingthetraditiornlgmxp-
process procedure for a 200—item examination. Substantially lower
costs yet were estimated for an iniependent/wittnrt—neetn'g procedure.
However, itismtedthatsarecontroloverthestaniardsettim
process is surely lost when either ixﬂependent condition is utilized.
One potentially inportant element that is excluded from the
iJ'dependent/witlnlt-neeting condition is the ability of reviewers, as
agroup, toarriveatsateooreensusregardjngmeirconoeptimofme

mininally—cmpetent eaminee group—an important aspect of the Angoff

 

 

 

117

“Sundology. And, itismﬂamnl‘mstmﬁardsestablishedusingan
independent/without-meeting procedure would compare to the
iniqendent/with-meeting or group—process procedures eramined in this
research.

Prunisingresultswereobservedforthea'evariatimofthe
irdependent procedure in which iten reviewers assetble only long
enoightoreceivegrulptraiJﬁm,beconefamiliarwiththe
methodology, and develop cannon referents regarding the minimally-
cmpetent group. This variation was also less expensive that the
traditional group-processmethod, mtwouldrequireagreatertime
ccmnitznentmthepartofpotentialitanrevieers. 'misopticn,
however, should probably be oorsidered by groups contamlating the
need for a standard setting study in light of earlier findings
regarding the inportance of reviewer training.

Wit 2 Sumary
m Ratm‘ and Variabilig

Five iten reviewers—the same reviewers who participated as
irﬂepenient item reviewers in Experiment 1—were each provided with
the five ratings generated for each of the first 100 items form the
ZOO-item examination used in Experiment 1. 'Ihe reviewers were asked
to reread the 100 items, to review the distribution of initial ratings
for eadu item, and to independently provide a second rating for each
item. 'Ihis procedure created two conditions: a "No-Infometion"
condition represented by the initial ratings generated irrieperdently
before any mrnative information was provided, and a "With-

Infonration" condition represented by the subsequent ratings generated

 

 

 

118

With knowledge of the distributions of initial ratings for each item.

Fairly consistently, ratings generated under the with-informtion
cariitimwerehigherthanratingsgeneratedbythesamereviewers
urrier the m—informatim condition. Differences between the condition
means were of statistical and practical significance. However,
overall mean iten ratings across reviewers were roughly equally
variable for the I'D-information and with-information conditions,
although at the individual item level, a slight reduction in
variability for the with-information ratings was cbserved.

These findings generally corplenent the finiings presented for
Dcperiment 1. For ample, the provision of additional informtion—
intheformofthedistributiors of itenratings—mayhavehadthe
effect of calmmicating to reviewers a group "expectation" or
conceptualization regarding minimal carpetence levels which they used
in generating their second set of ratings. Accordingly, reviewers
moseratingsneyhavebeelextreueinitiallymresubtlyinducedto
converge on the standard inplied by the distributions of item ratings,
making their subsequent ratings for individual items sanewhat less
variable. miseffectissimilartowhatsanehavetemedthe
"reality check" aspect of the modified Angoff method in which item
reviewers, after providing an initial set of ratings, are given
empirical iten difficulty levels and asked to generate a second
(revised) set of ratings.

Relationship of Ratm' to Obtained Item Statistics
Ratings frcm iten reviewers in the nae-information and with-

infornation conditions were compared to p-values which were calculated

 

119

again using mlyﬂleresporsesofthehypotheticnl minimlly-oalpeterrt
mndidate group. Although, for all individual reviewers, correlations
between iten ratings and mdified pwalues were significantly
different frcm zero, all of the correlations were of low to moderate
nagni‘ulde. Also, average oorrelatiors between ratings provided under
the two cmﬁitions and the modified p-values did rut differ
significantly. 'Ihese results likely mean that the provisim of
additional information did not influence the reviewers to converge on
thestandardthatvmldbesggestedbytheacmalperformameofﬂle
mininally-ccnpetent group (as operationalized in this study). Rather,
reviewers converged on their own-sanewhat inaccurate—conceptim of
thelevel atwhidianappropriateminimlmstandardstmldbeset. In
fact, reviewers in both the no-information and with-informtim
conditions had similar and fairly large absolute errors of
specification. Mean relative errors of specification for the two
coniitiors were also quite close.
Becausethesamereviewerswhoprovidedratirgs forExperimentl
also provided ratings for the second experiment, the results are
sanewhatdepenient; thelowdegreeofaowracyfotmdinmperimentl
is,toscmeextent,carriedoverto£b<perinent2. 'Iheresults
presented here, however, strongly suggest that providing iten
reviewers with additional information in the form of distributions of
initial ratirgs—alttnlgh this has been promoted by other researchers
intheareaofstandardsettirgasameansofdecreasingthe
variability of ratings—does not contribute substantially, if at all,

to the accuracy of those ratings.

 

 

 

 

mew
Severalregressimanalysosmreperformadtoascertainmlat

impact certain factors had on iniividual reviewer's vim-information
ratings. A model was proposed that suggested individual reviewer's
with-infometion ratings could be predicted with knowledge of the
reviewer's initial ideas about an iten (i.e., the reviewer's initial,
Ila-information rating) and lowledge of the initial group opinion
about the iten (i.e., the initial group iten rating gleaned form the
distribution of initial ratings).

In all five cases, reviewers' subsequent ratings were
substantially and signifimntly affected by their initial ratings and
by the group opinion. In each case, the mltiple regression equatim
for an individual reviewer explained approximately 50% of the
variation in reviewers' subsequent ratings. Although this result is
partially encouraging, it does leave considerable roan for inprcvenent
in predictim power-

(me possible factor that nay have moderated this result is the
time period that passed between initial (i.e., m—infornetim) and
subsequent (with-information) ratings. In this study, approximately
fwrweekspassedbe‘meenthetimeinitial ratingsweregatheredand
the time distributiors of initial ratings were mailed to reviauers.
It is possible that during the interim time period, reviewers lost
saneoftheirfamiliaritywithconceptscentraltothestandard
settingmethodologythatthiscontributedmesmrceoferrortothe
secondsetsofratings. Itmlldbeofinteresttolearnifvarying
the time period between Ila-information and with-infometion ratings is

relatedtothedegreetowhidlthesecondsetofratingsmnbe

 

 

 

 

accurately predicted.

Discussim of Cmbined Analysis

mperinentlcalparedAngoff itenratingsgeneratedbytwogmups
ofcontentexpertitanreviewers: catnip-processrevienerswhoshared
opinions about individual items and infometim concerning the
construct of "minimal cmpetence," and independent reviewers who
provided their iten ratings without such interaction (sharing of
information). A deck on the reasonableness of Experiment 1 results
wasundertakeninﬁhtperinent 2, inmidireviewerswereprthibited
from personal interaction, but were provided with information
concerningothers' ratings of items. Suchadleckwasiniicated
primrily because sample sizes used in Experiment 1 were small and
some verification that the effect of providing information could exert
a predictable effect on item ratings was desired. Also, because the
group-process condition was susceptible to possible over-influence by
one or more reviewers with strongly-held opinions, an attenpt was
made—thmugh Ebtperiment 2—to examine how suggested passing standards
might marge when the social influence of individual reviewers was
controlled.

To accanplish this, the sane independent reviewers fran
quaerinentlreratedasubsetofthesaneitenstheyratedaspartof
Experiment 1. However, for their second sets of ratings, reviewers in
Eminent 2 were provided with relevant information in the form of
distributions of their original ratings generated during Dcperinent 1.

Acanbinedanalysisofthetwoexperinentsacrossthethree

 

 

 

 

 

 

 

 

 

122

differing conditions (i.e., group-process, inieperdent/m—informtim,
and indepenient/wiui-infcmtim) presented sane unexpected results.
First, each of the corditia‘ls resulted in different overall passing
standards. ﬂyover, and possibly due to the relatively snell sanple
sizes, differencesbetveenthesggestedpassingstarxiardsweremt
statistically significant. 'mis observation can, in practice, result
in fairly great practical consequernes, though: Application of the
differing passing standards would yield substantially differing
passing and failing rates and, corsequently, would result in different
certification decisions for fairly large proportions of examinees.

The three conditions also displayed differences with respect to
the variability of individual reviewers' passing standards. And, the
observed differences in variability were not always in the
hypothesized direction. Specifically, although the presence of
information (whether in the form of group-process interaction, or
through the provision of initial item ratings only) tended to rault
in less variability across reviewers for individual item ratings, it
did not have a predictable effect on overall passing standards. For
exanple, reviewers overall ratings (i.e., passing starriards) were
least variable under the iniepenient/m—information condition and most
variable in the group—process format—a finding that would not be
expected if the influence of information gained in the group-process
settingexertedits influenceaswouldbeexpected. Also,
surprisingly, in comparing the irriependent/with—infonnation and
indeperldent/m-information conditions, variability in overall ratings
was reduced—though slightly—in the Ito-information condition.

Finally, it was expected that the iniependent/with—infornetim

 

 

 

 

123

conditimwwld resultinasuggestedpassingstaniardthatwould fall
We between the standards suggested by the W
procedure and the iniepenient/m—infornetim procedure. Or,
culceptually, it was expected that the effect of providing the
distributions of initial iten ratings would be to moderate the social
effects of the group—procas condition, while also tapering the wide
variability anticipated for reviewers who rate items independmtly.
'meseexpectedresultswerenotobserved. Instead,reviewerswho
rated itars in ﬁne independent/with—informatim condition provided the
highest overall starriardofthethreegmlps—astaniardhigherthat
eitherthegrulp-processstarﬂardanihigherthantheircwninitial
(no-information) standard.

Inevaluatingtheseresults itisobservedagainthatcautimis
irriicated because of the size of the sample employed. For exanple,
interpretationofsaneofthesefiiﬂirgscan,tosmeextent,be
explainedwith referencetotheextremeratingsprovidedbyasetof
one or nore reviewers in each of the three rating situatiors.
However, extreme or aberrant ratings are a characteristic of most
standard setting applications. In practice, as nany others have noted
in psydnnetric analyses, variation in individual and overall ratings
will certainly be observed and will contribute to the dependability of
the standard setting process.

Fran another perspective, however, the results presented above
help to make explicit some of the often implicit policy consideratiors
in standard setting. Certainly choice of standard setting methodology
is a policy decision, given what is already known about the likely

effects of methodology on the magnitude of resulting passing

 

 

standards. Also, in this study, within the Angoff methodology, three

variations for implenenting that methodology were examined. A policy

decision to utilize one of these or other Angoff variations will also
surelyhaveaninpactoftheresultingsuggestedpassingstaniard.
Finally, policy decisions might arise if corsideratim is given to the
extent of interreviewer variation that is acceptable, either in tents
of individual iten ratings or for overall passing standards across
reviewers. Resultspresentedinthissuriyhavenedethispolicy
consideration especially salient, with the effects of irriividual
reviewers highlighting the need for further attention to variance-

reducingmeasures.

'Ihis section presents a distillation of the twelve key findings
of the eddy, with inplications for future research and standard
setting practice.

1. Finding: tlhegrap—procsswoceﬁnefor
establishingapassingstaniarddidmtnsultina
pasirgscoreﬂnttassigrﬁficantlydifferatfrm
iniepadentcaditim.

Inplication: Because of the failure to achieve
statistical significance, observed differences

betweentheindeperderrtandgrmip—processprocedures

 

 

 

 

 

125

may be attributable to chance. however, in the
contextofstandardsettirg, evenchancedifferences
canpresentpractical concerns. Forennple, itis
oftenapracticalcoreemforstandardsettirgbodies
that the standard setting methodology utilized might
yield a standard that results in an unacceptably high
failure rate. Use of the independent procedure
studied here might heighten such concerts, although
because of the failure to observe statistically
significant differences, the sane result nay not be
daserved in other applications of the procedure. It
is clearly desirable for further research to be done
comparing the independent and group-process
procedures so that assurance regarding any true
differencesbetweentheprocedurescenbeattaired.
Additionally, the praent research addressed
mlygroup-processardindeperdentvariationeoftte
Angoff nethodology; however, of the absolute starriard
setting nethodologies in prevalent usage (Angoff,
noel, Nedelsky), it is often reported that the Angoff
method yields higher passing scores than the others.
It is possible, therefore, that in some instances the
irdeperdent variation of the Angoff nethod described
inthis studywulldyieldpassingstarxiardsthatare
not politically or practically feasible. Replication
of this research with other, larger samples, and

replication using other methodologies seems

 

 

 

 

 

warranted.

2. Fiming: Iargepractimlcasequaneswere
Siege-steam! the grep-emcee; and irﬂeperdart
procedlres.

Inplicatiors: Despite the fact that nean differences
for the irdeperrient and group-procas conditions were
not statistically significant, the observed
differences inpassirgstandards forthetwogroups
would result in substantial practical consequences.
Forexanple, itwasnotedthatthepercentageof
examinees who would pass under the two standards
varied frcm 86.8% for the independent condition to
95.0% for the group-process condition. Accordingly,
the failure rates more than doubled for the
independent condition (13.2%) conpared to the gron-
process condition (5.0%). Standard errors also
varied substantially for the two conditions (groxp-
process = 5.19; irriependent = 1.08).

It is noteworthy that such large differences can
occur in the absence of statistical significance.
Obviously, this result is related to the small sample
of reviewers employed for the rating of itars.
However, it is also worth nentioning that fairly

small samples are often utilized for standard setting

 

 

.J’ ". -

 

 

 

 

 

purposes. Accordingly, it is suggested that the

uncertainty regarding "true" classifications

associated with passing standards established with
judgmental methodologies should becone more prominent
in standard setting discussions. Specifically, the
confidereethatcanbeplacedintheaconracyofthe
passing stardard (i.e., the size of the standard
errors) directly translates into confidence abont
decisions made for irdividual examinees (e.g.,
certify/donotcertify) ardintosonetimesstrog
inferences abort those examinees (e.g., safe to
practice/unsafe practitioner) .

It is recommended that the uncertainty
associated with passing score estimation assune a
more pruninent place not only in deliberations by
stardardsettingentities,hrtalsoinreportirgon
procedureresultstothoseresponsibleforthe
standards, and possibly in reporting to those
affected by the standards (e.g., professional
groups, individual examinees, the public). It is
interesting to note that "information on how to
interpretthereportedscore, ardanycutscoreused
for classification" is listed as a primary standard
in the m for Educational ard Molgigl
W (AERA/APA/NCME, 1985, p. 53). However,
the notion that "where cut scores are specified for

selection or classification, the standard error of

 

   

 

 

 

measnrenentstnnldbereportedforscorelevelsator

reartheortscore"islistedasasecondarystardard
1731mm (p. 22). Secondarystandardsare
describedasthosethatare"likelybeyodreasonalbe
expectation in many situations" (p.33). No mention
ismadeofreportingtheaconracywithwhidnthe
passingscoreisestimatedinthesectiolofthe
m dealing with professional and occupational
licensure ard certification testing (Section 11) . It
might be appropriate for that section of the
gm to be revised to include reporting, to
those affected by the pass/fail decisions, of the
confidence that can be placed in the accuracy of

passing score estimation.

3. Finding: Interrateragreenant (within itans) was
sminathigherfcrﬂnegrunp—procaspmmirreﬂen
fortheirdqnadentprocedure.

Implication: Because the group-process procedure
tended to produce ratings for irdividual itens that
were sonewhat less variable across reviewers than did
the independent procedure, the gronp—process
procedure may be preferrable to the independent
procedure for its variance reducing effect. This
aggestionwolldbetrueifinterrateragreementon

ratings for individual items remains a goal of

 

 

 

 

 

 

stardard setting methodologies. However, for

standard setting applications in which there is
concemthatthegronp-processproceonremightbe
urdulybiasedbyoreornoredoninantraters,the
independentprocedurenaybepreferable.

4. Finding: variatimbemeanreviewers'nears
across all itens was mghly equal for the
Wardgrop—procasproceones.

Implication: The use of either the independent or
group-process procedure does not seem to corstrict
the spread of item ratings when viewed across itens.
Reviewers usirg either procedure appear to be able to
make fairly consistent discriminatios between

relatively easier ard more difficult itans.

5. Finding: Neither itanreviawers inthe
gronp-process audition exhibited desirable levels of

minimllycanpebantexamineegroup.

Implication: Reviewers in both the independent
and group-process conditions were fairly poor at
predicting the p-values that were actually obtained

fron administration of the examination. 'Ihus, it

 

 

 

 

appears that the effect of group interactim, while

increasing precision sanewhat, has no positive effect
on rechnction of rating bias. It is possible that
efforts to realise variation between reviewers may not
be as profitable as efforts to help reviewers
internalize an accurate conception of minimlly
competent performance. 'Ihe provision of iten p-
values during the rating process has already been
suggested by others as a means of increasing
accuracy; however, theresultsofthisshndysuggest
that modified p-valuos, rather than conventional p-
values, stmldbeusedwhenitisdecidedtoprovide

6. Finding: Melivelylargevariangemnpam
formterswerewservedforbomtheirdqnerhltand

grunp—process caditins.

Implication: large variance components
associated with iten reviewers suggests that large
differences in daserved passing scores could be noted
if different groups of iten reviewers are empaneled.
As mentioned earlier (see Finding 2), the attendant
uncertainty with which the passing score is estimted
should probably become a matter of wider
'adcnowledgenrertt, discussion, and reporting.

At least two methods for reducing the magnitude

 

 

 

 

131

of variance associated with reivierers' item ratings
are Imam: improving reviewer training and
increasing the number of reviewers. ‘Ihe inportance
ofappropriaterevierertraininghasbeenstressedby
several researchers in the area of standard setting.
This study acknowledges the criticality of
appropriate training in the passing score methodology
to be used, prior to data collection. Certainly, it
wasalso observedthat increasingthemmberof iten
reviewers leads to increased measurement
dependability. However, it is noted that, even when
the number of iten reviewers is increased—the
solution suggested most frequently to reduce this
source of variation—only modest gains in overall
measurement dependability were observed. An obvious
inpliention of this finding is that standard setting
bodieswunlddowelltoinncreasethemmberof
reviewersaswellasenqnendthetinenecassaryto
ensure that all iten reviewers clearly grasp the
nednnics of the methodology employed and possess
clear conceptions of elements central to the
nnethodology (such as that of the minimally canpetent
candidate, acceptable performance, etc.) .

7. Finding: Forbountheindqnadentandgrup-
menditins, acceptable dqnedability of
masuranantmsdntainedwithlltolsitan

 

 

 

 

132

reviewers.

Inplication: 'nnis finding is contrasted with the
earlier work of Smith, Smith, et a1. , (1988) that
snggestssixtosevenratersasanacceptablemmber
of reviewers for conducting the passing score
methodology, but agrees wiﬂn the recannendations of
others (cf. Cross, et al., 1984).

Certification and licensure entities considering
implenennting a passing score nethodology should
expectresultstobecomeveryunstableasferer iten
reviewers are utilized. Greater numbers of reviewers
should, of course, be utilized whenever feasible (as
suggestedbyCross, etal, 1984) toimprovethe
dependability of the standard setting procedure. For
exannple, this research found that, using the
independent or group—process procedures, a
dependability index of approximately .70 could be
attained with six to eight iten reviewers,
respectively. A dependability index of approximately
.80 was obtained with 11 to 14 reviewers. Although
themmbersofreviewers inthesecasesaresimilar
tothennmbersofrevienerscamnonlyenployedin
standard setting procedures, the corresponding
indices of dependability seem sanewhat low for
decisionmaking purposes. On the other hand, it was
observed that a dependability index of approximately

 

 

 

 

 

.90couldbeattainedusingatleast200rnbre

reviewers for eadn of the procedures. It is,
however, unncannon for sudn a large group to be
epanneled for standard setting procedures, especially
in the areas of health and basiness professions
credentialing. 'Ihis fact points to the unavoidable
needforstandardsettingentitiestocarefullyweigh
theinportanceofprecisepassingscoreestimtim
with their own practical, finnanncial, and logistical
considerations. In most cesa, an increase in
confidence about the passing score can be "bought"
withbetteritemreviewertraining, anincreased
numberof reviewers, orbothofthese. Inanycase,
it is again noted that, regardless of the
configuration of the procedure in terns of training,
number of reviewers, etc., those responsible for
standard setting would do well to recognize,
evaluate, and to report the trade-offs that played a
partinthedeterminationnofhowﬂnestandardsetting
process was configured (e.g., which methodology was
used, what training was implemented, how many

reviewers were enpaneled, etc.).

8. Finding: 'nneindepe'denntcnditimwasfarla
costlytoinnplanenntthanthegrunp—processcxxditian.

Implication: For standard setting bodies that

 

have severely constrained financial resources, the

independent procedure presents an economically
efficient alternative to a nultiple-day meeting for
purposes of establishing a passing standard.
Additionally, if it is desired to increase the number
of item reviewers, the independent procedure also
presents an econanical way to widen participatian.
However, thenneceesity ofsoundreviewertrainingis
not obviated and should again be emphasized. 'lhe
variation of the independent condition in which
revienersareconwenedonlylongennghtoreceive
methodological training and to arrive at consensus an
key concepunal issues before providing their ratings
in isolationn would seen to be a good alternative if

the full (group-process) procedure is nnot possible.

9. Finding: Providing indepedent caditim
thatwereganerallyhignerandlasvariable.

implication: 'nne provision of additional
information to iten reviewers, in the form of
distributions of their original itenn ratings, tends
to douse reviewers to converge on an innplicit
standard of performance. It is nnot )mom, however,

whatdegreeofconfidencecanbeexpressedthatthe

 

 

 

 

provisionn of this kind of additional information will
alwaysresultinsubsequentratingsthatarehigher
than the originnal ratings. Oenfinnatin of the

directionality of the effect canld be adiressed by
furﬁnerreseardn. Itislikelyﬁnongh,ﬁnatﬁnis
result and the rednction inn intro-iten variability of
ratings are fairly depedable outcanes of providing
the additionnal information.

10. Finding: 'nne m—infornnatim and wiﬁn-
innfcrnatim caditicns produced overall iten ratings
ﬁnat were of rughly eqnal variability.

Inplication: 'Ihe provision of additional
infornation did nnot have the effect of reducing ﬁne
spread of iten ratings when viewed across itens.
'Ihus, itanpearsﬁnatrevieersdonotreduceﬁneir
nnotions of a rating "floor" or "ceiling" due to the
provision of additionnal infornnation and can still
make fairly consistent discriminations between itens
they perceive as easier or more difficult for ﬁne

minimally connpetent examines group.

11. Finding: 'Ihe provisian of adiitiael
innfonnatimtoitanreviewersinﬁnefornnof
distrihntiasoforiginalratingscbsmtanpearto
reenltinmrepreciseestinnatsofﬁneperfamnne

ofﬁnemininallymnpetentexamineegroup.

 

 

Implication: Although ﬁne provision of

additional information had ﬁne effect of enabling
reviewers to better amroxinete ﬁne evenonal group
average rating for an iten, it did nnot appreciably
inncreaseﬁnereviewers' accnracyinestinatingﬁne
performance of ﬁne minimally canpetent examines
group. As wiﬁn ﬁne group-process procedure,
additional information of ﬁne kind provided in this
studydoesnnotappeartohaveﬁnedesiredeffectof
helping reviewers to better estimate the "me"
standard. Perhaps, in additicnn to ﬁne provision of
modified p-values during ﬁne rating process,
' procedures for identifying iten reviewers who are
more familiar with the knowledge, skills, and
abilities of ﬁne minimally canpetent examines group
should be investigated and utilized in future

12. Finding: Rimlecgeofoﬁnerital reviewers'
ratingsisasignifinanntsouroeofinfonnetimﬁnat

ownitenn ratings.

Implication: A reviewer's em initial opinionn
about ﬁne difficulty of an item for ﬁne minimally
canpetent examines group is often revised after

exposure to information provided regarding peer

 

 

 

 

 

reviewers' opinions about the item. Alﬁncugh a

reviewer's final rating can be fairly well predicted
from knnowledge of ﬁne reviewer's initial rating and
knowledge of ﬁne initial group cpinion, oﬁner factors
contributingtoﬁnefinnal ratingcertainnlyexist:
furﬁner research is needed to identify which oﬁner
key variables inclinne reviewers to alter ﬁneir
initialratingswhenaniterativeprocessis
utilized. Because of ﬁne relatively enall
'interrater variationn in overall passing standards
observed in this study using ﬁne independent
procedure, an indepedent, iterative procedure
appears to offer sane pranise. Future applications
of indepedent meﬁnodologies using Delphi techniques
mightservetotakeadvantageofﬁnepositive
characteristics of indepedent rating generation
noted in ﬁnis sundy: provisionn of relevant normative
informtion and restriction of unwanted social
canparison and other sources of irrelevant influeces

on iten ratings.

Idmitatiansandagﬁstia’sfcrnnunreRIseardn

The primary limitation of ﬁnis study was ﬁne relatively snall
sample sizes enployed and ﬁne consequent risk of a failure to be able
to identify true differences between groups when such differeces

existed. Specifically, ﬁnis means ﬁnat sane of ﬁne seemingly large

 

 

 

138

group differences inmnecpecteddirectiesﬁnatwereobservedinthis
study could be attributed to chance.

However, despite ﬁnis limitationn in terns of statistical power,
itwasdaservedthatchancedifferenceswouldhavesubstantial
practicnlconsequecesforﬁnoseaffectedbyﬁnepassingscore
estimation procedure (e.g., enanninnees, ﬁne professim, ﬁne public,
etc.). This siunation has been cause for reexamination of ﬁne
distinction between statistical and practical significance in ﬁne
conntextofstandardsetting. Inthisstudy, evenstatistically
ncnsignificannt findings resulted in often strikingly disparate
practical consequences, such as differeces in pass/fail rates and
individual classification decisions. 'Ihis situation has also been
cause forreexaminatimofﬁnesngyestedstandardsfcrdocnmentingand
reportingﬁneattedantuncertaintyinﬁneresultsofjudgnental
passing score meﬁnodologies.

Annother limitationn of ﬁne study was ﬁnat iten reviewers were all
male. Itisnctknownifﬁnesameresultsnmldbeobtainedfor
female reviewers. Also, it would be advisable to investigate ﬁne
applicability of the results described inthis study with other,
different nnedical specialty certification groups, as well as wiﬁn
oﬁner areas altogeﬁner (e.g., teacher licesure examinations, businness
credentialling programs, industrial selection applications, etc.) .

Similarly, ﬁneresultsdescribedinthisstudyweredntainned
using the Angoff standard setting nneﬁnodology. One might wonder if
ﬁnesaneresultswouldhavebeenobservedifanoﬁnerabsolute
meﬁnodology (e.g., Ebel or Nedelsky) or if a cannon variation of ﬁne

Angoff methodology (e.g., one of the "modified Angoff" approaches) had

 

 

139

been utilized.

As mentioned earlier, sane investigation of ﬁne effect of ﬁne
amount of time ﬁnat passes between initial iten ratings, ﬁne provision
of additional informtion, and ﬁle generation of subsequent ratings
seemswarranted. Itwwldbeofirrteresttolearnifﬁneeffectof
providing ﬁne additional information was stable over time or if ﬁnere
were an optimal time period for providing ﬁne data to item reviewers.

Also, resultsofthissttﬂyhaveservedtonakesalienttheoften
implicit policy considerations inherent in standard setting
applications. Among ﬁnese considerations are ﬁne choice of standard
setting methodology, ﬁne manner in which ﬁne meﬁiodology is actually
inplanented, the kind of training provided, ﬁne nunber of item
reviewers utilized, and ﬁne degree of variability that will be judged
acceptable. '{his study has highlighted these concerns, ard it is
suggested ﬁnat entities resporsible for stardard setting begin ﬁneir
investigation and planning for standard setting procedures
sufficierrtlyinadvanoeofﬁnetimeﬁnatresultsareneededsoﬁnat
policy considerations such as ﬁnose listed above can be debated, made
more explicit, and considered when the operational passing standard is
set.

Finally, a remix-ring recamnexdation in this study has been that
itemreviewersrequirebettertraininginﬁnestardardsetting
methodology in order to accurately predict performance of the minimally
canpetent examinee group. If ﬁne ability for item reviewers to "zero
in" on a standard delimiting some "tnle" line between acceptable ard
unacceptable performance is truly desired—ard not just ﬁne ability of

reviewers to provide similar ratings—ﬁnen serious effort should be

 

 

 

140

Wtowardidentifyingmeﬁnodsofassistingreviewerstoancoeed
at that task.

It was suggested earlier ﬁnat providing reviewers with nnodified
p-Naluesdnringﬁneratingprowssandpdrposefullyselecting
reviewerswhoalreadypossessakeenconceptianofﬁnelmowlecge,
skills, experiences, and abilities of ﬁne minimally canpetent exanninee
group wcnuld be beneficial. PCssibly, ﬁne integration of already
accepted and well-researched practiceguidelines fran oﬁner areas of
education would also help to reduce error in item reviewers'
estimates; it is apparent ﬁnat simply increasing ﬁne number of
reviewers is not enough. For example, applying prinnciples of
instructional designardenlistingﬁneassistanceofenqnertsinﬁne
training field for designning and/or conducting ﬁne initial meﬁnodology
orientation sessions might help address ﬁne issue of rating accuracy.

thdoubtedly, ﬁne necessity for setting fair, defensible, and
accurate standards will renain. As long as rneeds for professional
recognition, certification of canpetence, public protection, and
personnel selection exist, criteria will need to be established ﬁnat
delineate acceptable frcm unacceptable performance. As long as
strategies for setting standards ﬁnat hinge on subjective
conceptualizations of a hypothetical group are employed, variability
of human judgments will coexist. This study reaffirms ﬁne notion ﬁnat
reductioninﬁnevariability ofﬁnosejudgnerrts shouldbeagoalof
standard setting applications: Surely, a standard could not be
strenglyarguedtobeavalidstandardifﬁnemeasurenents
contributing to it (i.e., the reviewers' judgments) were not reliable.

However, this study has also demonstrated ﬁnat individual and

 

 

 

 

141

cMllective reviewer judgments about ﬁne "truth" regarding ﬁnat line

dalimiting minimal canpetence can ﬁnenselves be fairly inaccurate.
Perhaps, a redirection of effort, away from (or in addition to)

atterpts to make more consistent subjective degmennts, and toward

attemtstomakesudnjndgmentslesssubjectivewillprovetobea
rewardingreseardnagerda.

 

 

 

 

 

APPENDHA

 

 

 

 

142

Inter-lbﬁnodological Ounparison Studies of Standard-Setting Procedures

Involving One or More Absolute Standard-Setting Methodologies

W

W

m Normative
v. 'fied f
MW
. Direct
mtg. ﬂ, and Hofﬂ

ff Ebel 'Iest ifica-
Wm

nedelslg v. Contrastm' m

em

Andrei & Kednt, 1976
Schoon, Rosen & Jones, 1979
Schoon, Gullion, & Ferrara, 1979

Bdnuniak, Ardnandnault, & Gable, 1982
Brennan & Lockwood, 1980

Wok, Davis, & Watts, 1980
81111131 & Smith, 1988

Suhkoviak & Huff, 1986

van der Linden, 1982

Colton & Hecht, 1981

Halpin, sigmon, 8 Halpin, 1983
Poggio, Glasnapp, & Bros, 1981

Skakun & Kling, 1980

Garrido & Payne, 1987
meernberg & emtih, 1988
Jones, Rosen, & Schoon, 1988
Bowers & Shindoll, 1989

Mills & Barr, 1983

Koffler, 1980

 

 

 

 

 

 

 

 

 

 

 

APPENDIX B

 

 

 

143

Establishing a Standard of Performance

for the

Informational Materials

 

144

ESTABLISHING A SUGGESTED STANDARD OF PERFORHA¥CE
[

FOR THE EXAMINATION
Overview
The [ ] Examination [ g] has been designed to assess the
[ ] knowledge and skills of Board candidates. The Board has

determined that a criterion reference standard of performance should be
established on the [ ]so that candidate scores can be compared to a level of
mastery that a group of experts in [ 1 has judged to be sufficient
for a given level of specialty practice. Setting a sufficient level of
performance, or a standard of mastery, on the [ ] requires experts in

[ ] to determine what constitutes a mastery level of performance for
effective prhctice at the specified level.

There are a number of procedures available for establishing standards of
performance, each based upon subjective judgments of a grOup of content
experts selected to be representative of important perspectives in the
profession. One of the most popular is the Angoff method. In this method,
each expert examines each question in the test and estimates how many
examinees whose level of knowledge is sufficient and acceptable for entry into
practice will respond correctly. When the estimates for all items are summed
and averaged across all experts, the result is the Suggested standard of
mastery for the test.

Before specific instructions for conducting this method are presented, a
brief explanation of the notion of a "sufficient and acceptable level of

knowledge" might be helpful.

 

 

 

145

Notion of an "Acceptable Level of Knowledge"

The purpose of most examinations is to measure an examinee's level of
knowledge in the content area covered by the examination. The purpose of a
suggested standard of performance for the [ ] would be to differentiate
between those physicians who have a sufficient and acceptable level of

knowledge for the safe practice of [ ], and those who do not. In

setting the standard of performance, it is essential to keep in mind the
concept of the examinee whose knowledge is right at a "sufficient and
acceptable level" for safe practice.

It is important to have a clear conceptualization of this hypothetical

examinee. Suppose a group of physicians who seek to practice [ l is
assembled. These physicians are lined up, from the most knowledgeable

physician to the least knowledgeable physician. The challenge is to start at

the end of the line with the physician who has the greatest knowledge and walk
toward the other end of the line to the physician who has the least

knowledge. At some point it is possible to stop and say, "All the physicians
whom I have walked past have a sufficient and acceptable level of knowledge.
All of the physicians whom I have not walked past do not have a sufficient
level of knowledge."

Now consider two physicians in the line: the last one you walked past
(Physician A) and the next one whom you did not walk past (Physician B).
Physician A will be considered prepared to practice as a [ ] .
This is the physician who has the least knowledge of all those considered to
have a sufficient and acceptable level of knowledge. Physician A knows just

enough to practice safely. Physician B has the greatest knowledge of all

 

 

146

those Who will be judged NOT to have a sufficient and acceptable level of
knowledge. This physician does not know quite enough to practice safely as an
I 1 -

Now think again about Physician A, who knows just enOugh to begin
practicing safely as a [ ] --that is, he has a sufficient and
acceptable level of knowledge for entry into the profession. How would this
physician be described? How much does this physician know? What kinds of
problems should be entrusted to this physician? "hat are this physician's
skills? Thinking about the knowledge and skills a physician must have to
perform effectively is important as you participate in the Angoff standard-
setting method. The following descriptors and questions may help to further
conceptualize the borderline examinee:

An [ ] who has a sufficient and acceptable level of knowledge
for entry into practice will:

1) demonstrate a knowledge base sufficient to diagnose and manage
disease.

2) know the boundaries of the specialty and the profession. What types of
problems should such a physician refer to other professionals?

3) make some errors. What types of errors cannot be tolerated?

4) be aware of the standards, laws, and ethical issues related to specialty
practice. What do these include?

Once the notion of a "sufficient and acceptable level of knowledge"

becomes clearer, the Angoff method can be conducted.

Instructions for Conducting the Angoff Method
Suppose that a hypothetical group of 100 physicians who 5312 a sufficient
and acceptable level of knowledge--physicians just like Physician A--are
gathered in a room. These 100 physicians have been asked to respond to each

question in the [ ].

 

 

147

To conduct the Angoff method, you will need to complete two basic steps

for each item in the WQE:

1) Read the item thoroughly. Think about how frequently the knowledge or
skill tested in the item is used in the practice of . Also
think about how critical that knowledge or skill is to the practice of

]. For example, if a piece of knowledge or a skill is always
critical, there will always be a serious adverse effect (for example,
]) if it is not known or is used incompetently. You might expect

that a large percentage of examinees would respond correctly to critical-
knowledge test items.

2) Next, estimate the percentage of sufficiently prepared examinees--
examinees like Physician A--who will answer the question correctly.
this percentage in the blank labeled Item 1 on your rating sheet.

Remember that some of these examinees will answer correctly by guessing.

Write

Please provide your estimates in multiples of 5. If y0u are not familiar

with the content of a particular item and feel uncomfortable about rating it,

you may leave the item blank. Please try, however, to rate as many items as

you can.

Example:

1. Melanin is synthesized from which of the following amino acids?

A. Lysine

B. Leucine
*C. Tyrosine

D. Histidine

E. Phenylalanine

A patient who has insulin-dependent diabetes experiences early-morning
hyperglycemia that is not preceded by hypoglycemia. The insulin dosage
need not be changed, because the hyperglycemia is due to:

A. insulin resistance.
B. waning of the insulin's effect.

C. excessive levels of glycosylated hemoglobin.
D. adrenocorticoid fluctuation.
*E. a surge of growth hormone.

Suppose a rater reads question 1 and determines that it is testing
relevant knowledge that is critical in certain situations, but only

occasionally needed in the practice of [ ]. The rater estimates

 

148

tlLat 40 out of 100 examinees who are prepared for practice will answer this

tvaestion correctly. The rater then reads question 2 and determines that the
information it is testing is fundamental and nearly all adequately prepared

examinees will answer it correctly. The rater's rating sheet would be

completed as follows:

Estimated
Percentage of
Examinees Who Will

Item No. Answer the Item Correctly

1.

£9
2. 3

Once each rater has completed all the estimates for each item, the
estimates will be averaged across raters and then across items. The result
will be the suggested standard for this particular form of the [ ]. To
illustrate, the following hypothetical example involves five content experts

rating a lO-item test. Each rater's estimates are provided below, by item

 

 

number.
Rater
Item Item
No. l 2 3 4 5 Average
1 7O 80 65 70 50 67
2 70 90 60 60 75 71
3 85 9O 70 70 6O 75
4 8O 70 45 70 40 61
5 75 80 50 50 70 65
6 75 30 40 4O 50 47
7 7O 90 45 60 65 66
8 65 80 60 SO 75 66
9 7O 80 60 60 50 64
10 75 40 8O 60 50 61
Rater Average 73.5 73.0 57.5 59.0 58.5 643/10 =
64.32 or
6.43 items

The standard of mastery for this test wOuld be set at 6 items out of 10.
Examinees answering 6 or more items correctly would meet the performance
standard for the test.

 

 

 

 

APPENDHC

 

 

 

149

Sample Item Rating Collection Form

Directions:
Consider a hypothetical group of 100 physicians who have a sufficient and
acceptable level of knowledge for safe practice of [ ]. What

percentage of these physicians will answer each question correctly? Please
enter your estimates clearly beside each item number. Please keep your
estimates in multiples of 5 (e.g., 45, 60, 65,...).

Estimated Percentage of
Examinees Who
Item Will Answer Item
No. Correctly

l

2

 

 

 

 

 

 

APPENDIX D

 

 

 

150

Sample Post-Meeting Passing Score Study Questionnaire

Directions: For questions 1-3, please write your response in the underlined
space provided.

1) Please indicate the number of years you have served
as a [ ] Board Director. years

2) Please indicate the number of years you have served
on the[ ] Written Examination Committee. years

3) Please indicate your primary practice setting.
(e.g. private practice, teaching hospital. etc.)

For questions 4-7, please circle the number of the response that best
characterizes your level of agreement with the statements below.

4) The informational materials were easy to understand and helpful.

1 2 3 4 S
Strongly Disagree Unsure Agree Strongly
Disagree Agree

5) The item rating practice session was clear and helpful.

1 2 3 4 S
Strongly Disagree Unsure Agree Strongly
Disagree Agree

6) The standard setting method used is easy to implement.

1 2 3 ' 4 5
Strongly Disagree Unsure Agree Strongly
Disagree Agree

7) The standard setting method used will result in a standard that adequately
distinquishes between acceptably and unacceptably prepared examinees.

1 2 3 z.

5
Strongly Disagree Unsure Agree Strongly
Disagree

Agree

Please write any additional comments or suggestions on the lines provided.

 

 

 

 

 

Thank you. Please place this questionnaire in the enclosed envelope and
return it to [ ].

 

 

 

APPENDIX E

In.
N
O
N

N
O
H

K
O‘

x
00

PG
[\

x
\O

X
H
I
(I)
H
O
x
an
N
.3
N
m
K
N
N
H
K

a a s - c . u a a o

151
Data Layout for Experiment 1

 

 

ma N.m m «.m. can. mm. mm. an. em. H.m m z.mx mm. an. mm. an. an.

N n N 1
Na N.NNm N.Nw oﬁmx omx mm: nwx omx H.Nwm H.wm me «Nu mmx NNx me
E . ﬂaw ﬂaw 2: or. my. m. a” agwm 3m 2.. 3.. 2.. ~le :x
3 «gnaw». dads II gang. I: .I. l l..
EmuH EouH emuH on mm mm nm em souH souH mm «x mm mm am
_ IL. lll||._

 

 

manh<m N onHanoo mumh<m a ZOHHHQZOO

 

 

 

 

APPENDHF

 

 

 

ISEILJEza
001.

002
003
004
005
006
007
008
009
010
011
012
013
014
015
016
017
018
019
020
021

022

024

152

Sample Rating Form for Experiment 2

Original Ratings

30 6O 60

60
75
40
65
40
59
70
30
30
75
20
30
50
65
40
80
75
20
55
85
60
35

65

30
40
65
65
20
50
40
25
20
50
30
20
60
50
70
80
40
80
65
80
80
40

60

60
25
55
65
45
45
70
25
60
65
35
40
40
30
45
90
65
55
60
7O
30
45

25

70
40
80
85
6O
30
70
30
30
50
30
60
30
7O
60
70

70

50

60
50
70
40
3O

70

90
90
20
75
80
20

So

25
20
15
40
3O
20
.50
80
70
100
20
75
40
75
15
60

80

E

I'll-Hll-Hlllllllllllllfl

 

 

 

Mexican Board of Medical Specialties (1987) . migration to;
W. EVanston, IL: Author.

American mum Research Association, Mariam Psychological
Association, National (Jamil on Measurement in Education (1985) .
ms for educational and magical m’ . Washington,
Dc: American Psychological Association.

Andres, B. J., & Hectrt, J. T. (1976). A preliminary investigation of

two prooedtm for settirg examination stardards. Wm

Wham, 3_6, 45-50.

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L.
'morniike (£21.), may; measurement (pp. 508-600).
Washingtm, DC: American mil on Education.

Angoff, W. H. (1988). Proposals for theoretical and applied
development in neasuranent. Mlied Measurement in Education,
1(3), 215-222.

Babbie, E. R. (1973). alrvg march methods. Belmrrt, ca:

Wadsworth.

Behmiak, 9., Jr., Archanbault, F. x., & Gable, R. K. (1982). Angoff
and Nedelsky standard setting procedures: Iillplioations for the
validity of proficiency test score interpretation. crati
gm Echolgical Measurenmt, 52, 247-255.

 

 

 

154

Berk. R. A. (1980). A framework for methodological advances in
criterion-referenced testing. Applied Mlgical
M, 5, 563-573.

Berk, R. A. (1986). A oorsumer's guide to setting performance
standards on criterim-refererned tests. Review of Mtional
m. 15, 137-172.

Baik, C. H. (1984). Amethod for reaching a omprmise between
absolute arxirelative standards in mninations.m1_gf
Mm. Q, 147‘152-

Bowers, J. J. & Shirdoll, R. R. (1989, Marth). Eggff. Balk, and
Hofstee: A mison of nultiple methods for settgg' a
ﬁg. score. Paperpresentedattheammalmeetingofthe
National (Jamil for Measurement in Imitation, San Francisco,
C3.

Brennan, R. L. (1983). Elements of generalizabiliy thong. Iowa
City, IA: American College Testing Program.

Brennan, R. L., & Ioclmood, R. E. (1980). A oonparism of the
Nedelsky and Angoff cutting score procedures using
generalizability theory. Applied Eggcholgicnl Measurenent, A,
219-240.

Candaell, D. '1'. & Stanley, J. c. (1963). mimental and mi-
mimental design_s for research. Boston, MA: nghton
Mifflin.

Cangelosi, S. S. (1984). Another answer to the cut-off score
question. Educational masurement: Issues and Practice, 3(4),

23-25 .

 

Cizek, G. J. (1989). r' o 9 ' ' of

W. Ioua City, IA: Mexican College Testing
P109131“-

Cbllins, R. (1979). W. New York: Acadanic
Press.

Oolton, D. A. & Hectrt, J. T. (1981, April). Wm
Synposium presentation at the annual meeting of the National
(Jamil on Measuranent in Education, has Angeles, as.

Caraway, L. E. (1979). Setting standards in carpetency—based
education: Sane current practices and concerns. In M. A. Dnﬂa
S: J. R. Sanders (Eds.), Practices and 21131515 in M-
W (pp. 72-88). Washington, DC: National Oclmcil
for Measurenent in Edmation.

Cramer, s. E. (1990, April). Some practigg semions to mg-

ens: Geo 'a char
W. Paperpresentedattheammalmeetingofthe
National Oomcil on masurement in Education, Boston, MA.

Wham, L. J. (1989). Edtmtional measmmment, third edition
[Review of Mtional measurement, 3rd Ed.]. Educational
mm: Issues and Practice, §(4), 22-25.

cram, L. J., Gleser, G. c., Nanda, 11., & Rajaxatnam, N. (1972).
m awning of behavioral measurement: Theory of
ggjg‘llizabilm for scores and profiles. New York: Wiley.

 

 

Cross, L. x., Inpara, J. c., Frary, R. 13., s. Jaeger, R. M. (1984). A

cmparismofthreemethcdsfarestablishirgminimmstaniardsm
the National Teacher Bamiratia'zs. W
we 2.1, 113-129-

0mm L- (1987. April)- W
m. PaperpresentedattheammalmeetirqofﬂIeAmerican
Educational Research Association, Washirgton, DC.

deGruijter, D. N. M. (1980, June). Win
W- Paper Inserted at the Intemational
Wium on Edtmtimal Testing, Antwerp (ERIC Docment No.
ED 199 280).

deGruijter, D. N. M. & Hanbleton, R. K. (1984). On prom-
encamtered using decision theory to set cutoff scores.
MW. §. 1-8.

Dew—imp, G. E. & Pacers, D. E. (1990, April). Internal mm of
W- Paper Inserted at the
amualmeetixgoftheNational Cunnilmlbaslreuentin
Education, Boston, MA.

Dillm, G. F. (1990, April). m relationship of item aim and
MW. Paper presented at
theammalmeetirgoftheAnericanEducatiomlResearm
Association, Boston, MA.

Fabrey, L. J. (1988, April). E'ustment of Moff mm mints.
PaperpresentedattheannualneetingoftheAmerimn
Dilutional Research Association, New Orleans, IA.

 

 

Pabzey, L. J., & Raymmd, M. R. (1987, April). (Lunar—mg of standard

 

for a ' 'fication examina ' . Paper
presentedattheammalneetimoftheNational Camcilm
mm in Education, Washingtm, DC.

Feldt, L. s. & Brennan, R. L. (1989). Reliability. In R. L. Linn
(EL), Mtimﬂ measuranent (pp. 105-146). New York:
Macmillan.

Fitzpatrick, A- R- (1934: April). WW
ﬂy : Effect of gm interaction on ,u_1di' 'viglgl's
m. Paperpresentedattheanmlalmeetingofthe
American Educational Researdu Association, New Orleans, IA.

Fitzpatrick, A. R. (1989). SOCial influences in starﬂazd-settirg: The
effects of social interaction on group judgment-s. Miew of
MM, 5_, 315-328.

Francis, A. S. & Holmes, 5. E. (1980, August). W
mg m‘ E certification and licensure: Defm‘ ' the
M ' lyMQndidate. Paperpresentedattheamnlal
meeting of the American Psychological Association, Anaheim, CA.

Friednan, c. B., & Ho, K. T. (1990, April). mix—1m oomensus and
jmjm consiw: Is it wible to have both E standard
m PaperpresentedattheammalmeetingoftheNational
Courcil on Measurement in Education, Boston, MA.

Garrido, M. & Payne, D. A. (1987, April). AW
33g effect of jgg' Imcwlggge of item data on two forms of
WW- Paper Preeented at the
annualmeetingoftheNational CamilonMeasurementin
Ekiucation, Washington, III.

 

Glaser, R. (1963). Instructional technology and the measurenmt of
learning wtcanes. m, g, 519-521.

Glass, G. V. (1978). Standards and criteria. W
m, 15, 237-261.

Glass, G- V- & W: K- D. (1984)- My:

W- Enslaved Cliffs. M: Prentice-
Hall.
Greenberg, s., a Smith, I. L. (1988, April). Stra 'es f 1

gig m‘ standards inahighstakespr_ofession. Paperpresented
attheamnlmeetingoftheAmericanEducationalReseaxdl
ASSOCiatim, New Orlears, IA.

Gross, L. J. (1985). Setting cutoff scores on credentialing
examinatiors: A refinement in the Nedelsky procedure.
WWW, ﬁ. 469-493-

Grosse, M. E., & Wright, B. D. (1986). Setting, evaluating, and
mintal'ning certification standards with the Rasch model.
mluation g the Health Professions, 9(3) , 267-285.

Halpin, G., Sigmon, G., & Halpin, G. (1983). Minimum carpetency
stardardssetbyﬂueedivergentgmlpsofratersusirgthree
jiﬁgmental procedures: Inplications for validity. w
W. .4}. 185-196.

Embleton, R. K. (1978). m the use of cut-off scores with criterion-
referenced tests in irstructional settings. Journal of

Wm, .12 277-290-

 

 

159

Ilanbletm, R. K. & Eignor, D. R. (1980).

Carpetencytestdevelopnent,

validation and standard setting. In R. M. Jaeger & C. K. Tittle

(Edso). W;
W (99. 367-396). Berkeley, (m: Mccxtduan.
Hanbleton, R. K., Swaminathan, H., Algina, J., & Cculson, D. B.

(1978) . Criterion-referenced testing and measmremart: A review

of technical issues and develqments. Review of Educational
m, g, 1-47.
Harasyn, p. H. (1980) . A oomparism of the Nedelsky and modified
Angoff starriard-setting procedure on evaluation outcane.
tional 1 ical Measuratent, 5;, 725-734.
Hofstee, W. K. B. (1983). 'Ihe mse for canpranise in educational
selectionandgrading. 1118. B. AndersonS:J. S. Helmick
(Eds.), On educational testg' (pp. 109-127). San Francisco,

(1: Jossey-Bass.

 

mlhbard, J. P. (1978). W‘ medical education: The test and the

mience of the National Board of Medical Ebraminers.
Philadelphia, PA: Lea and Febiger.

Huynh, H, (1976) . On the reliability of decisions in domain-
referenced testing. Journal of Educational Measurement, p,

253-264 .

Jaeger, R. M. (1979). Measurement consequences of selected standard-

setting models. Inn. A. Bunda &J. R. Sanders (Eds.),
mctices and problems in W—based education (pp. 48-
58). Washington, DC: National Council for Measurement in

Education.

 

 

160

Jaeger, R. M. (1982). An iterative structured jtxigment process for
establishing standards on coupetency tests: Theory and
application. Mticnal EValan'on 1g Pollgz' ”is, g, 461-
475.

Jaeger, R. M. (1988) . Use and effect of caution irdices in detecting
aberrant patterns of starﬁard—settirg judgments. ”lied
W m mtim, ;, 17-31.

Jaeger, R. M. (1989). Certification of student carpetence. In R. L.
In'm (Ed.), WM (pp. 485-514). New York:
Macmillan.

Jones, J. P., Rcsen, G. A., & Schoon, c. G. (1988, April). A
Ell-JIM ' wiggtion of the direct m settg'

m. PaperpresentedattheammalmeetingoftheAmerican
Educational Research Association, New Orleans, IA.

Kane, M. T. (1984, April). _Str_‘atgies in validatm’ licersure
W.Paperpresentedattheam1almeetimofﬂle
American Educational Research Association, New Orleans, IA.

Kane, M. T. (1985) . Strategies for validating licensure examinations.
InJ. c. Forume (Ed.), MW
licensure (3;). 45-63). San Francisco: Jossey—Bass.

Kane, M. T. (1986). The interpretability of passing scores (ACI‘
Tedmical Bulletin No. 52). Iowa City, IA: The American
College Testing Program.

Klein, L. W. (1984, April). Practical considerations in the design of
mm settgg' studies in health muons. Paper presented
at the annual meeting of the American Educational Research

Association, New Orleans, IA.

 

 

 

161

Kleinke, D. J. (1980, April). 1 ' the f andNedel
techni to the National Li ' ' tions in

 

gpggtecture. Paperpresentedattheamnialneetingofthe
National Camcil for Measurement in Education, Boston, MA.

Koffler, S. L. (1980) . A cmparison of three approaches for setting
proficiency standards; Jamal of Educational m, 11,
167-178.

lender, G. F. 5. Richardson, M. w. (1937). The theory of the estimation
of test reliability. Matt—ﬁlm, 2, 151-160.

Ierner, B. (1979). Tests and standards today: Attacks,
camterattacks, and responses. InR. T. Ienrm (Ed.), M
gm ions for am and measuranent; Mive m on
m (pp. 15-31). San Francisco, CA: Jossey—Bass.

Levin, H. M. (1978). Educational performance standards: Image or
substal'loe? gournal 9f mtional Measur'atent, 1;, 309-319.
Likert, R. A. (1932). A technique for the measurement of attitudes.

W: 2(140).

Linn, R. L. (1978). Dennis, cautions, and suggestiors for setting
standards. Elm; of Educational lbasuralent, g, 301-308.

Livingston, s. A., & Zieky, M. J. (1982). Passm' scores. Princeton,
NJ: Educational Testing Service.

Iodmood, R. E., Halpin, G., & McLean, J. E. (1986, April).
m-settgg' process. Paper presented at the annual
meeting of the American Educational Research Association, San

Francisco, CA.

 

162

Maslow, A. P. (1983). Standards in occupational settings. In S. B.
Arnierson & J. s. Helmick (Eds.), W (pp.
91-108). San Francisco, (A: Jossey-Bass.

Mehrens, W- A- (1981: Febnnry). W
W. Presentation to the Midiigan School Testing
Conference, Ann Arbor, MI.

Melican, G. J. & Mills, C. N. (1987, April). m effect of Mlge
of other jpges ratm‘ of item Q'fficulgz in an i‘ggtive
MUSE ﬂieNedelﬂardAmpffmethods. Paperpresented
attheanmalmeetingoftheAmericaanucationalReseardi
Association, Washingtm, DC.

Meskauskas, J. A. (1976) . Evaluation models for criterion-referenced
testing: Views regarding mastery and standard-setting. 3e_vie-w of
W. _4_§r 133-158-

Meskauskas, J. A. (1986). Setting standards for credentialing
examinatiors: An update. Evaluation and the Health
Professions, 3, 187-203.

mskauskas, J. A. & Norcini, J. J. (1980). Standard-setting in
written and .interactive (oral) specialty certification
examinations: Issues, models, methods, challenges. Evaluation
and the Health Professions, 3, 321—360.

Millnan, J. (1973). Passing scores and test lengths for danain-
referenced measures. Wm. Q. 205-

216.

 

163

Millman, J. (1976) . Reliability and validity of criterion-referenced

test scores. InR. Traub (Ed), Mdirections for m’ and
measurement: Methodolgical develmpg (pp. 75-92), San

Francisco, CA: Jossey-Bass.

Millman, J. (1989) . If at first you don't succeed: Setting passing
scoreswhenmorethanoneatterptispermitted. Educatiﬂl
M, E, 5'9-

Mills, C. N. (1983) . A comparison of three methods of establishing
cut-off scores on criterion-referenced tests. Journal of
mtional Measurement, _2_0, 283-292.

Mills, c. N. & Barr, J. E. (1983, April). A mison of standard
settm' methods: Do the same j'pges establish the same
standards with different methods? Paper presented at the
anrmalmeetingoftheAmericanEducationalReseardl
Association, Montreal, Canada.

Mills, c. N. & Melican, G. J. (1986, March). AMY
investigtion of three mmise methods for establm ' alt-
pffscores. Paperpresentedattheannualmeetl'ngofthe
National Ccmcil for Measurement in Education, San Francisco,
CA.

Mills, C. N. & Melican, G. J. (1987, April). The effect of lmow1e_dge
of other jges' ratmg' of iten difficul’py in an iterative
m uslm' the ﬂyoff and Nedelsm procedures. Paper
presented at the annual meeting of the American Educational

Research Association, Washington, DC.

 

164

Mills, C. N. & Melican, G. J. (1988). E‘stinating and adjusting cutoff
scores: Features of selected methods. W
m: l: 261.275-

Mills, C. N., & Phliczan, G. J. (1990, April). m‘valence of cut-off

ares derived frcm randomly e_o:__nivalent ms. Paper'
presertedattheamlualmeetil’ngoftheNatia'lal Councilor:
lbasurenent in Edumtion, Boston, MA.

Nafziger, D. G. & Hiscox, N. D. (1976, April). Am
mtional licensure and certification w. Paper
presented at the annual meting of the National Council for
Measm‘enent in Edumtion, San Francisco, CA.

Nedelsky, L. (1954). Absolute grading standards for objective tests.
mtiorial and Mlgical Measurement, 13, 3-19.

Nomini, J. J., Lipner, R. s., Iangdon, L. o., & Strecker, c. A.
(1987) . A canparison of three variations on a standard-
setting method. Journal of Educational Measurement, 2, 56-

64.

Norcini, J. J., Shea, J. A., & Kanya, D. '1'. (1988). The effect of
various factors on standard setting. Journal of Educational
Measurement, _2_5, 57—65.

Plake, B. s., s. Melican, G. J. (1986, April). Effects of item context
on intrajpge consisﬂy of e_xpe_rp jm via the Nedelglg
standard settm method. Paper presented at the annual meeting
of the National Council on Measurement in Education, San

Francisco, CA.

 

165

POggio, J. P., Glasnapp, D. R., S: Eros, D. S. (1981, April). by

itionothe ff andNedel

 

mm- Paper presented at the annual
meeting of the American Educational Roseardl Association, los
Angeles, (3..

Popham, W. J. & Husek, T. R. (1969). Implications of criterion-
referenced measurenent. Journal of Educational Measurement, g,
1-9.

Rock, D. A., Davis, E. L., & Warts, c. (1980, June). An ggir'cal
mison of jm amroaches to mm‘
3% (E18 Research Report). Princeton, NJ: Edlmtional
meeting Service.

Rotlnnan, R. (1989, Novalber 8). States turn to student performme as
new measure of school quality. mum Week, pp. 1, 12-13.

Saunders, J. c., Ryan, J. P., & Huynh, H. (1981). A comparison of two
waysofsettingpassingscoresbasedontheNedelskyprocedure.

lied 01 ical Measurement, 5, 209-217.

Sdleaffer, R. L., Mendenhall, w., and ott, L. (1979). Elm
W- Boston. MA: Daxhlry-

Schoon c. G., Guillion, c. x., & Ferrara, p. (1979). Bayesian
statistics, credentialing examinations, and the determination
of passing points. Evaluation in the Health Professions, 2,,
181-201.

 

 

166

Sdm'l, c. G., Rosen, G., & Jones, J. p. (1988, April). mm
methodo ies in ' of cut
and a disoassion of an alternative meﬂlodolm: The direct
W. Paper presented at the annual meeting of
the American Educational Research Association, New Orleans, 1A.

Scriven, M. (1978). How to anchor standards. W
m, 1.5, 273-275.

Shepard, L. (1979). Setting standards. Inn. A. mnda & J. R.
Sarders (£235.), motioes and Mans in M—based
wig; (pp. 59-71). Washington, DC: National (Jamil for

Shepard, L. (1980a). Standard setting issues andmethods. Applied
Mlgical Measuranent, _4_, 447-467.

Shepard, L. (1980b). 'Dechnical issues in minimum caupetency testing.
In D. C. Berliner (Ed.), Review of researdl in education, (pp.
30—84). Washington, DC: American Educational Research
Association.

Shepard, L. (1983) . Standards for placement and certification. In S.
B. Anderson and J. s. Helmick (Eds.), On educational testy
(pp.61—90). San namisco, as: Jossey-Bass.

Shepard, L. (1984). Setting performance standards. In R. A. Berk
(Ed.) , A ﬂ'de to criterion-referenced test construction (pp.
169-198). Baltimore, MD: Johns Hopkins University Press.

Shinberg, B. (1981) . Testing for licensure and certification.
MW. 16(10): 1138-1146-

 

 

 

167

 

Skakun, E. N. (1990, April). effect of mis' orma 'on ' '
m’ ions m ﬁlm standards. mperpresel'ltedattheamnlal

meetingoftheNatimal Oamcil onMeasurement inEducatia'l,
Boston, MA.
Skaklm, E. N. s. Kling, s. (1980). Comparability of methods for
‘ setting standards. Journal of Educational Measurement, g,
229-235.

anith, R. L. & Smith, J. K. (1988). Differential use of item
information by judges using the Angoff and Nedelsky procedures.
Journal of Educational Measm'emerrt, 2_5, 259-274.

smith, J. K., Smith, R. L., Richards, c., s. Earnhardt, s. (1988, u.‘
March). n_1e 91m numberof ng touse in settgg' Egg
m. PaperpresentedattheamlualmeetingoftheAmerimn
Educational Reseanrh Association, San Francisco, (28.

mm, M. J. (1976) . Estimating reliability frun a single
ministration of a mastery test. Jourrnl of Educational
Measurement, Q, 265-276.

Subkoviak, M. J. (1934) . Estimating the reliability of mastery-
nonmastery classification. In R. A. Berk (Ed.), A guide to
giterion—referenced test construction (pp. 267-291) .
Baltimore, MD: Johns Hopkins University Press.

Subkoviak, M. J. (1988) . A practitioner's guide to cmprtation and
interpretation of reliability indices for mastery tests. Journal

of Educational Measurement, 25, 47-55.

 

 

168

Subkoviak, M. J. & Huff, K. J. (1986, April). Intrajgge
the ff Nedel methods 0
gem. PaperpresentedattheammalmeetingoftheNational
Council for Measuranent in Education, San Francisco, (A.

Van der Linden, w. J. (1982) . A latent trait method for determining
theintrajuiye inconsistencyintheAngoffaniNedelslq
techniques of setting standards. Journal of Educational
m, 19, 295-308.