This is to certify that the dissertation entitled "An Investigation of One Alternative to the Group-process Format for Setting Performance Standards on a Medical Specialty Examination," presented by Gregory J. Cizek, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Measurement, Evaluation, and Research Design.

Major professor

AN INVESTIGATION INTO ONE ALTERNATIVE TO THE GROUP-PROCESS PROCEDURE FOR SETTING PERFORMANCE STANDARDS ON A MEDICAL SPECIALTY EXAMINATION

By

Gregory J. Cizek

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology, and Special Education

1991

ABSTRACT

AN INVESTIGATION INTO ONE ALTERNATIVE TO THE GROUP-PROCESS PROCEDURE FOR SETTING PERFORMANCE STANDARDS ON A MEDICAL SPECIALTY EXAMINATION

By Gregory J. Cizek

This study examined one variation of the traditional group-process procedure for establishing passing standards on a medical specialty examination using the Angoff methodology. The variation consisted of requiring subject-matter experts to provide Angoff ratings independently, without group interaction or other sources of information. The study also sought to isolate the effect of group interaction and information-sharing through comparison to a group-process condition and to a condition in which independent item reviewers were provided with distributions of the other independent reviewers' ratings.

There were several major findings in the study. It was observed that the independent procedure produced a nonsignificantly higher passing standard than the group-process procedure did. The absence of statistical significance, however, did not exclude large practical consequences for the interested groups, such as the examinees and the standard-setting board. These practical consequences are described and discussed. Also, it was observed that individual item reviewers' ratings were more variable in the independent condition compared to the group-process procedure.
The independent condition was also less costly to implement. Item reviewers in both conditions produced ratings that exhibited less than desirable accuracy in terms of estimating the performance of the hypothetical minimally-competent group.

The provision of additional information to the independent group in the form of distributions of their own initial item ratings resulted in subsequent ratings that were significantly higher and less variable, but did not result in more precise estimates of performance for the minimally competent group. However, independent raters apparently utilized the additional information provided as distributions of ratings. It was found that knowledge of a reviewer's initial rating and the group's initial mean item rating was a moderately good predictor of a reviewer's subsequent ratings. Implications for future design of standard-setting procedures and policy considerations are discussed.

ACKNOWLEDGMENTS

I appreciate the patience and encouragement of those who have helped me with this project: Stephen Raudenbush, Irvin Lehmann, Diana Pullin, William Mehrens, David Labaree, and Stephen Yelon of Michigan State University. I am most grateful to my wife, Rita, and our children, Caroline, David, and Stephen, for their unfailing love and support, and to my parents for their enduring confidence. I thank God for his blessings, as surely evidenced to me through these people who have given so much.

TABLE OF CONTENTS

Chapter 1 - Problem
    Introduction
    Background
    Need
    Purpose
Chapter 2 - Review of Previous Research
    Methodological Development
    Inter-methodological Research
    Intra-methodological Research
Chapter 3 - Study Design
    Experiment 1
        Empirical Treatments
        Control Group
        Treatment Group
        Subjects
        Consent
        Validity Concerns
        Instrumentation
        Statistical Analyses
    Experiment 2
        Empirical Treatment
        Validity Concerns
        Instrumentation
        Statistical Analyses
Chapter 4 - Results
    Experiment 1
        Between-group Mean Differences
        Within-group Differences
        Relationship between Group and Independent Ratings
        Relationship to Obtained Item Statistics
        Relationship between E and E' and Reviewer Characteristics
        Generalizability Analyses
    Experiment 2
        Between-condition Mean Differences
        Relationship between With-information and No-information Ratings
        Decision Consistency
        Relationship of Ratings to Obtained Item Statistics
        Regression Analyses
    Combined Results
Chapter 5 - Discussion
    Experiment 1
        Mean Ratings and Variability
        Relationship of Ratings to Obtained Item Statistics and Reviewer Characteristics
        Generalizability Analyses
        Cost Analysis
    Experiment 2
        Relationship of Ratings to Obtained Item Statistics
        Regression Analysis
    Discussion of Combined Analysis
    Summary of Findings and Implications
    Limitations and Suggestions for Future Research
Appendix A - Inter-methodological Comparison of Standard-setting Procedures Involving One or More Absolute Standard-setting Methodologies
Appendix B - Passing Score Meeting Informational Materials
Appendix C - Sample Item Rating Collection Form
Appendix D - Sample Post-meeting Passing Score Study Questionnaire
Appendix E - Data Layout for Experiment 1
Appendix F - Sample Rating Form for Experiment 2
List of References

LIST OF TABLES

Table 1 - Description of Practice Items Used in Passing Score Study Group Training Session
Table 2 - Descriptive Statistics for Independent and Group-process Reviewers Across 200 Items
Table 3 - Test for Significant Mean Differences between Independent and Group-process Condition Passing Scores
Table 4 - Randomized Block ANOVA Results for Independent and Group-process Conditions
Table 5 - Intercorrelation Matrix of Ratings from Independent and Group-process Condition Reviewers
Table 6 - Indices of Decision Consistency for Independent and Group-process Conditions
Table 7 - Absolute and Relative Errors of Specification for Item Reviewers in Independent and Group-process Conditions
Table 8 - Summary of Generalizability (G-study) Results for Independent and Group-process Conditions
Table 9 - Summary of Generalizability Analyses (D-study) Results
Table 10 - Comparison of Costs for Conducting a Passing Score Study under Group-process and Independent Conditions
Table 11 - Descriptive Statistics for No-information and With-information Reviewers across 100 Items
Table 12 - Test for Significant Mean Difference between No-information and With-information Condition Passing Scores
Table 13 - Repeated Measures ANOVA Results for No-information and With-information Conditions
Table 14 - Intercorrelation Matrix of Ratings from No-information and With-information Condition Reviewers
Table 15 - Indices of Decision Consistency for No-information and With-information Conditions
Table 16 - Absolute and Relative Errors of Specification for Item Reviewers in No-information and With-information Conditions
Table 17 - Regression Analyses for Individual Reviewers in Experiment 2
Table 18 - Comparison of Experiment 1 and Experiment 2 Suggested Passing Standards

LIST OF FIGURES

Figure 1 - Plot of Independent and Group-process Condition Reviewers' Mean Ratings
Figure 2 - Plot of No-information and With-information Condition Reviewers' Mean Ratings

I. PROBLEM

Introduction

The licensure and certification processes represent the efforts of governmental and private entities to ascertain and recognize the competence of individuals in the practice of a profession or trade. Licensure, as commonly understood, is the granting, by a governmental entity, of the right to legally practice a profession or trade. The right, or license, is granted pursuant to the individual's demonstrated acquisition of the knowledge or skills required for safe practice.
Licensure programs are conducted by governmental entities in their effort—and charge—to protect the public against unsafe practice. Certification is the process by which non-governmental entities, commonly professions or associations, confer a credential. The credential is also usually only conferred upon the individual after demonstration by the individual that a specified level of knowledge or skill has been acquired (Shimberg, 1981).

As reported by Nafziger and Hiscox (1976), over 2000 occupations employ some type of licensure or certification procedures. That number is surely increasing, even leading some to label Americans "the credential society" (Collins, 1979). Additionally, many entities which once issued permanent licenses or certificates have now begun to reassess the concept of a lifetime credential. Instead, time-limited certification or re-credentialing concepts have begun to be seriously entertained and often implemented, especially in rapidly-changing technical fields such as the medical professions (American Board of Medical Specialties, 1987).

The competence required of a candidate for licensure or certification is usually stated in terms of requisite knowledge, skills, and abilities. Verification that the individual has acquired the knowledge, skills, and abilities is often linked to one or more of three components: a minimum educational attainment, a minimum practice or experience requirement, and a minimum level of performance on an objective test. The examinations used as part of the third component are increasingly criterion-referenced ones.¹ Hambleton, Swaminathan, Algina, and Coulson (1978) have defined such tests as ones that are "used to ascertain an individual's status (referred to as a domain score) with respect to a well-defined behavior domain" (p. 2). These tests consist of items that are a "representative set of items from a clearly-defined domain of behaviors measuring an objective" (p. 3).

¹ It is recognized that terms such as "criterion-referenced," "domain-referenced," and "norm-referenced" precisely describe test score interpretations and inferences rather than the instruments themselves. However, imprecise use of these terms in referring to instruments is ubiquitous—even among measurement specialists (Cronbach, 1989). This relaxed usage, though imprecise, is followed throughout this manuscript for purposes of ease and clarity.

The present research focusses on the last of the three components in licensure and certification testing programs—the criterion-referenced test. Specifically, this research examines one particular test score of unique interest—that score from which emanates inferences of mastery or competence—the passing score. The passing score on a criterion-referenced examination represents the establishment of a standard of performance judged to be acceptable. It is the lowest score that permits the examinee to receive the license or credential.

Sometimes, though less and less so, the passing score is set in a norm-referenced manner. That is, the passing score is fixed relative to, or dependent upon, the performance of some group. For example, a norm-referenced or "relative" approach to standard-setting might result in requiring examinees to score at or above the 85th percentile, or at or above some number of standard deviations away from average performance on the examination.
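To make the arithmetic of such relative rules concrete, the following minimal Python sketch computes the two norm-referenced cut scores just described. The scores and pass rules are invented for illustration and are not drawn from any examination discussed in this study.

    import numpy as np

    # Hypothetical raw scores for 500 examinees on a 200-item test.
    rng = np.random.default_rng(seed=1)
    scores = np.clip(rng.normal(loc=140, scale=15, size=500), 0, 200).round()

    # Relative rule 1: require a score at or above the 85th percentile.
    cut_percentile = np.percentile(scores, 85)

    # Relative rule 2: require a score no more than one standard
    # deviation below the group's mean performance.
    cut_sd = scores.mean() - scores.std(ddof=1)

    print(f"85th-percentile cut: {cut_percentile:.0f}")
    print(f"mean - 1 SD cut:     {cut_sd:.0f}")

Under either rule the standard is a function of the particular group tested, so an examinee with a fixed level of competence could pass with one group of fellow examinees and fail with another; this property underlies the criticism discussed next.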
However, because the focus of licensure and certification programs has increasingly become that of assessing examinees' competence with respect to a pre-judged standard of performance, norm-referenced standard-setting procedures have been called into question in terms of their propriety for the stated purpose. In their place, "absolute" or criterion-referenced methods of establishing passing standards have become more common. The absolute methodologies, while boasting of greater intuitive and political appeal, still face challenges with respect to the validity of inferences that are made as a result of their resulting standards (Jaeger, 1979). Specifically, the possibility of establishing a standard that results in the failure of a truly competent person (a "false negative"), or results in the passing of a truly incompetent person (a "false positive"), is of particular concern.

Criterion-referenced standard-setting methodologies have clearly not yet accomplished technical perfection; much work remains to be done in this area (Hambleton, et al., 1978; Angoff, 1988). The present research addresses one aspect of the process by which standards are set on a criterion-referenced certification examination in a medical specialty.

Background

Since at least 1954, when Nedelsky sought to derive "absolute grading standards for objective tests" (Nedelsky, 1954, p. 3), the problem of how to establish passing standards on criterion-referenced educational assessments has persisted. Nedelsky's early work prompted investigation of alternative standard-setting procedures designed to establish passing standards that differed from the dominant norm-referenced approaches of the time. Nedelsky's objective, and that of many contemporary researchers in the field of standard setting, was straightforward: "The passing score [should] be based on the instructor's judgment of what constitutes an adequate achievement on the part of a student and not on the performance by the student relative to his class or to any other particular group of students" (Nedelsky, 1954, p. 3).

The past three and one-half decades have witnessed the introduction of many alternative methodologies that have shared the same objective—movement away from the dominant norm-referenced, or relative, approaches. Among the proposed "absolute" methods, as they are sometimes called, the most well-known are those proposed by Nedelsky (1954), Angoff (1971), Ebel (1972), and Jaeger (1982). Other methods have also been introduced that have tried to achieve a compromise between the absolute and relative approaches. Proposals by Beuk (1984), deGruijter (1980), and Hofstee (1983) represent attempts to synthesize absolute and relative methods.

Taken together, all of these methods represent efforts to formalize a set of rules for establishing passing standards in a less arbitrary, or at least more justifiable, fashion than traditional, norm-referenced practice has offered. The methods rely primarily on the use of subject matter experts' (hereafter called "SMEs") judgments concerning one or both of two critical elements: a conceptualization of the "barely-passing," "minimally-competent," or "borderline" examinee; and, an expectation regarding the level of content knowledge and skill that such an examinee should possess (Livingston & Zieky, 1982).
After initial research efforts to derive absolute and, later, compromise methods of establishing passing standards, a second stream of research developed. This second line of inquiry focussed mainly on differences between methodologies (Mills & Melican, 1988). Investigations comparing two or more methods characterized this second phase of standard-setting inquiry. Appendix A lists some of these inter-methodological investigations. Recently, however, a third phase of research need has emerged. Research in this phase is characterized by attempts to identify sources of variation within standard-setting methods.

Need

The proposed research is closely aligned with the third phase of research into standard-setting methodologies and focusses on one method—the Angoff method. The Angoff method and its variations (sometimes called "Modified Angoff" procedures) are derived from the work of Angoff (1971) and others. The Angoff methods require SMEs to serve as item reviewers and to scrutinize each item in an examination, usually prior to the administration of the examination. The item reviewers are then asked to judge, for each item, the proportion of minimally competent examinees who will answer the item correctly. The item reviewers' judgments, in the form of proportions, are commonly referred to as "Angoff ratings."

The now-preferred means of obtaining the item reviewers' judgments utilizes a group-process format. In this format, the panel of SMEs is convened in a single location, provided with training in the standard-setting methodology, and directed to provide their ratings for each item in a test. The group-process format is often preferred because, predictably, item reviewers do not produce identical ratings, and the group-process format provides a means of resolving the differences in ratings. Most researchers agree that this reduction of variability is desirable (Jaeger, 1988; Meskauskas, 1986; Smith, Smith, Richards, & Earnhardt, 1989). However, it is common that an extensive portion of a group's meeting time is devoted to discussions about individual test items, to debate, and, when applicable, to consensus-reaching regarding the ultimate rating for each test item.

Several problems arising from this format necessitate the investigation of alternatives to the traditional group-process format. Norcini, Lipner, Langdon, and Strecker (1987) summarized two of the problems: the tediousness of the task of reviewing individual items and reaching consensus ratings (especially when a large number of items is involved); and the expense of empaneling a sufficiently large group of SMEs in one location for, perhaps, several days. These problems are especially evident in the area of professional licensure and certification, where hundreds of credentialing programs employ criterion-referenced standard-setting methodologies, most of these relying on subject matter experts' participation in a traditional group-process format to obtain item ratings.

Another frequently encountered problem is simply arriving at a single block of time that is available for each SME on the panel of item reviewers. This problem has been characterized by Lockwood, Halpin, and McLean (1986) as one of the "situational constraints" (p. 6) in the standard-setting process. Hambleton (1978, p. 282) specifically addresses the problem of time resource availability as one of the four primary considerations in selecting a standard-setting methodology.
In addition to the need for research to suggest alternatives for addressing the problems created through use of the group-process format in standard-setting studies, research is needed to examine the effect on resultant standards when such alternative strategies are tried. Many researchers have conducted comparative studies of standard-setting methodologies which employ a group-process format. Also, most have offered an opinion concerning the appropriateness of the group-process technique. For example, Brennan and Lockwood (1980) opine:

"Sometimes...it is suggested that a cutting score be determined by a reconciliation process. For example, after the five raters in this study completed the Angoff procedure, they were instructed, as a group, to reconcile their differences on each item. One typical result of using a reconciliation process is that certain raters tend to dominate, or to influence unequally, the reconciled ratings... There is a certain logic to using a reconciliation process that appears to be compelling. It might be argued that the ideal of using either the Nedelsky or the Angoff procedure is for raters to agree on every item. Therefore, why not force them to concur? One argument against this logic is that forced consensus is not agreement, although forced consensus may effectively hide disagreement. Also, a reconciliation process does not guarantee that the same cutting score will result each time a study is replicated" (p. 235-236).

Although Brennan and Lockwood's remarks go beyond the effect of group-process and extend into the realm of requiring consensus of the expert group, their logic is equally applicable to the traditional group-process condition. That is, after appropriate training of item reviewers, the condition of group-process may not be necessary, desirable, or efficient for use in all standard-setting procedures.

Jaeger (1988) offered his opinion on another aspect of achieving agreement among item reviewers:

"Achieving consensus on an appropriate standard for a test is an admirable goal (certainly guaranteed through the use of a single judge), but it should not be pursued at the expense of fairly representing the population of judges whose recommendations are pertinent to the task of establishing a workable and equitable test standard" (p. 29).

Maslow (1983) has remarked that knowledge about "the optimal size and structure for the group of judges" is "basic to improving practice in standard setting" (p. 104), and that "the research literature gives only brief and unsteady guidance here" (p. 105). While some investigation of the issue of optimal group size has begun (Smith, Smith, et al., 1989), the issues surrounding optimal group structure remain largely unaddressed. Meskauskas (1986) has appropriately, and succinctly, noted that "[there] is a need to explore the determinants of intrajudge and interjudge variance" (p. 200). Mills and Barr (1983) reported that:

"While general information concerning procedures for implementing the methods and calculating cut-off scores is available, specific guidelines are less well established. Issues of training, group interaction, independent ratings vs. discussion all affect the methods, but little is available in either discussion or guidelines concerning these and other implementation issues" (p. 2-3).

In 1984, Fitzpatrick perceived the need for research on standard-setting procedures in an integrative work applying research in the area of social psychology to the problems of standard setting.
The need persists, as Fitzpatrick (1989) notes; specifically, there is a need by those involved in standard-setting research to investigate the effects of group processes:

"We must ask whether it is desirable that the decisions that [item reviewers] make be affected by interpersonal comparisons, by cognitive learning through the exchange of information, or by both types of processes" (p. 321).

Focussing in on the social aspects that affect group-based standard-setting methodologies, Fitzpatrick goes on to argue that:

"standard-setting procedures should be designed to both minimize the effects of social comparison and maximize the effects of certain informational influences on the decisions to be made" (p. 322).

In summary, Fitzpatrick specifically urged that:

"procedures proposed for reducing the impact of undesirable influences in the standard-setting context should be investigated. Whether or not the suggested procedures will be effective can only be decided by further research" (p. 325).

Unfortunately, scant attention has been paid to these, and similar, aspects of intra-methodological variation. Specifically, as Mills and Barr (1983) and Fitzpatrick (1984) have both remarked, little evidence has been brought to bear on the effect of the presence or absence of the group-process condition. Fewer still appealing alternatives to the group-process format have been proposed. Curry (1987) has summarized the existing state of affairs aptly: "Almost all of these authors [on standard setting] acknowledge that the expert group process will have significant impact on the validity of the outcome; few have examined the dynamics involved" (p. 1).

Purpose

The present research attempts to identify an efficient variation of the traditional group-process method for use with the Angoff approach to establishing passing standards on a certification examination. Using the Angoff (1971) method, the present research compares two procedures for establishing passing standards on a medical specialty certification examination. The two procedures used are: 1) the traditional group-process method; and, 2) an "independent" condition in which item reviewers provide their item ratings in isolation (i.e., without the effects of group-process). An attempt is made to determine whether, after both groups of item reviewers are provided with initial training in the Angoff method, results obtained from the group-process condition differ from those obtained in the isolation condition.

The primary focus of the Angoff standard-setting method is to identify a passing score for an examination. Accordingly, the primary focus of this research is to establish whether there is variation in the passing scores that result from exposure to the two conditions. It is hypothesized that variation will be observed between the two conditions, but that the magnitude of variation will be small. Additionally, it is hypothesized that the isolation condition will provide a suitable, efficient alternative to the traditional group-process method of collecting SMEs' Angoff ratings for test items.

II. REVIEW OF PREVIOUS RESEARCH

The setting of absolute performance standards on criterion-referenced educational assessments is a pervasive activity in the American educational system (Hambleton, 1978) and represents an ongoing line of inquiry in the field of educational measurement.
Criterion-referenced standard-setting methods are currently utilized by groups responsible for industrial personnel selection, educational and training program evaluation, professional licensure or certification in medical, allied health, and business fields, and other national, state, and regional credentialing programs (AERA/APA/NCME, 1985; Meskauskas, 1986).

Adapting the conceptual view suggested by Mills and Melican (1988), research on criterion-referenced standard-setting can be viewed as having proceeded in three distinct phases: 1) Methodological Development; 2) Inter-methodological Research; and, 3) Intra-methodological Research. An overview of these three phases serves as an organizational framework for reviewing previous research and is presented in the following pages.

Methodological Development

As one author has noted, mentions of criterion-referenced passing standards are found in early historical accounts of testing situations:

"A very early minimal competency [test]...was when the Gilead Guards challenged the fugitives from Ephraim who tried to cross the Jordan river. 'Are you a member of the tribe of Ephraim?' they asked. If the man replied that he was not, then they demanded, 'Say Shibboleth.' But if he couldn't pronounce the 'sh' and said Sibboleth instead of Shibboleth, he was dragged away and killed. So forty-two thousand people of Ephraim died there at that time" (Judges 12:5-6, The Living Bible, quoted in Mehrens, 1981, p. 1).

Since that time, so-called "high-stakes" tests (though not quite that high) have remained prominent in the assessment of competence, and research efforts have been directed at refining the theoretical and applied aspects of setting passing scores on such tests. In a review of existing standard-setting methodologies, Berk (1986) reported that at least 38 methods of establishing or adjusting performance standards have been proposed. Berk (1980; 1986) and many others (Glass, 1978; Hambleton & Eignor, 1980; Hambleton, Swaminathan, Algina, & Coulson, 1978; Jaeger, 1989; Livingston & Zieky, 1982; Meskauskas, 1976; Meskauskas & Norcini, 1980; Millman, 1973; Mills and Melican, 1988; and, Shepard, 1980a) have also developed several similar catalogues and classification schemes to organize the various methodologies.

Again from a historical perspective, Nedelsky's (1954) work probably represents one of the first attempts to promote absolute, or criterion-referenced, standards of performance on educational assessments. As late as the 1970s, norm-referenced methodologies dominated as the preferred standard-setting approach. In a 1976 article, Andrew and Hecht reported that:

"At present, the most widely used procedures for selecting...pass-fail levels involves norm-referenced considerations in which the examination standard is set as a function of the performance of examinees in relation to one another" (Andrew & Hecht, 1976, p. 45).

A noticeable shift began to occur during the 1970s and 1980s, when considerable attention to establishing absolute passing standards resulted from an increasing popularization of criterion-referenced testing (Glaser, 1963; Popham & Husek, 1969), or—as some have termed it—a shift to a focus on educational "outputs" (Levin, 1978; Rothman, 1989). Since that time, many entities responsible for establishing passing standards have reevaluated their use of norm-referenced methodologies and have opted for implementation of absolute or compromise methods (Hambleton, 1978; Fabrey, 1988; Mills & Barr, 1983).
In 1983, Francis and Holmes reported that "the more traditional norm-referenced approach is being seriously questioned" (p. 2). Meskauskas (1986) described the evident trend away from norm-referenced and toward absolute (or, "content-referenced") standard-setting methodologies in the area of licensure and certification testing and offered this advice: "For those credentialing agencies still using normative [standards], I recommend that plans to change over to content-referenced standards be initiated" (p. 198).

Nedelsky's work in search of an absolute standard-setting methodology thus represents a marked turning point in standard-setting technology and research. When using the Nedelsky method, subject matter experts carefully inspect the content and items in an examination and judge, for each item in the test, the option or options that a hypothetical minimally-competent examinee would rule out as incorrect. The reciprocal of the remaining number of options becomes each item's "Nedelsky rating"; the sum of the ratings—or some adjustment to the sum—is used as a passing score.

Further research and other now-popular methods of establishing absolute passing standards on criterion-referenced examinations followed—though not quickly (Scriven, 1978)—after Nedelsky's 1954 publication. Angoff (1971) proposed a method that, like Nedelsky's, required SMEs to review test items and to provide estimations of the proportion of a subpopulation of examinees who would answer the items correctly:

"A systematic procedure for deciding on the minimum raw scores for passing and honors might be developed as follows: keeping the hypothetical 'minimally acceptable person' in mind, one could go through the test item by item and decide whether such a person could answer correctly each item under consideration. If a score of one is given for each item answered correctly by the hypothetical person and a score of zero is given for each item answered incorrectly by that person, the sum of the item scores will equal the raw score earned by the 'minimally acceptable person'." (Angoff, 1971, pp. 514-515).

In practice, a footnoted variation to the procedure Angoff originally proposed has dominated applications of the Angoff method:

"A slight variation of this procedure is to ask each judge to state the probability that the 'minimally acceptable person' would answer each item correctly. In effect, the judges would think of a number of minimally acceptable persons, instead of only one such person, and would estimate the proportion of minimally acceptable persons who would answer each item correctly. The sum of these probabilities would then represent the minimally acceptable score." (Angoff, 1971, p. 515).
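As a concrete illustration of the arithmetic underlying the Nedelsky and Angoff judgments just described, the following minimal Python sketch converts one reviewer's judgments into a passing score under each approach. The judgments shown are invented for a hypothetical five-item, four-option test and are not taken from the examination studied in this dissertation.

    # Nedelsky: for each item, a reviewer judges which options the
    # minimally competent examinee would rule out as incorrect. The
    # item rating is the reciprocal of the number of options remaining.
    options_remaining = [2, 3, 1, 4, 2]   # options NOT ruled out, per item
    nedelsky_cut = sum(1.0 / k for k in options_remaining)

    # Angoff (the footnoted variation): for each item, a reviewer states
    # the proportion of minimally acceptable examinees expected to answer
    # correctly. The sum of the proportions is the passing score.
    angoff_ratings = [0.60, 0.45, 0.90, 0.30, 0.70]
    angoff_cut = sum(angoff_ratings)

    print(f"Nedelsky passing score: {nedelsky_cut:.2f} of 5 items")
    print(f"Angoff passing score:   {angoff_cut:.2f} of 5 items")

With a panel of reviewers, each reviewer's ratings are summed in this way, and the panel's recommended passing score is typically derived from the individual reviewers' totals, for example by averaging them.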
A third absolute method was proposed by Ebel (1972), who also noted that norm-referenced methods had serious drawbacks: "The obvious drawback of this approach is that it allows...of competence of the examinees at a specific testing." (Ebel, 1972, p. 494). Ebel's methodology also involves the judgments of subject matter experts. The Ebel method requires SMEs to make decisions about the difficulty of individual test items and about the criticality of test content areas.

Other absolute methodologies have also been proposed, some quite recently. One alternative based on rating test specifications was proposed by Cangelosi (1984). Lockwood, et al. (1986) proposed a method of averaging the results of various standard-setting approaches in order to arrive at a "true" standard, or precise estimate of some extant parameter. Another methodology has been proposed by Schoon, Rosen, and Jones (1988) in response to perceived weakness in the Angoff approach. Schoon, Rosen, and Jones also did some preliminary investigation into their "Direct Standard Setting Method" (Schoon, Rosen, & Jones, 1988), but it, like other alternatives to the Angoff, Ebel, and Nedelsky methodologies, has not received widespread acceptance or general use.

A second wave of proposed standard-setting methodologies followed early attempts at determining absolute passing standards. Predictably, the second wave aspired to identify a middle ground through the development of methodologies that would strike a compromise between purely norm-referenced (relative) approaches and absolute methods. Illustrative of these compromise efforts are methodologies suggested by Beuk (1984), Grosse and Wright (1986), Hofstee (1983), and deGruijter (1980). Overviews of these methodologies are provided in deGruijter (1985) and Mills and Melican (1986). The compromise methodologies have failed to overtake the earlier absolute proposals, however. Currently, in the area of licensure and certification testing, the Angoff, Ebel, and Nedelsky approaches are still the most prevalent methodologies for establishing passing standards, particularly the Angoff and Ebel approaches (Hambleton, 1978; Berk, 1986).

Albeit a ubiquitous task, the establishment of passing standards is not necessarily an easy one. Referring specifically to licensure and certification testing programs, the Standards for Educational and Psychological Testing remark that: "Defining the level of competence required for licensing or certification is one of the important and difficult tasks facing those responsible for such programs" (AERA/APA/NCME, 1985, p. 63).

In a discussion of absolute standard-setting, however, it should also be noted that considerable disagreement exists concerning just how "absolute" absolute standard-setting procedures are. Glass (1978) calls decision-making within the absolute standard-setting process "judgmental, capricious, and essentially unexamined" (p. 253), and further notes that "to my knowledge, every attempt to derive a criterion score is either blatantly arbitrary or derives from a set of arbitrary premises" (p. 258). Similarly, Beuk (1984) has noted that "setting standards...is only partly a psychometric problem" (p. 147). Hofstee offers support for the idea that: "a [standard-setting] solution satisfactory to all persons involved does not exist and...the choice between alternatives is ultimately a political, not a scientific, matter" (1983, p. 109). Jaeger claims, flatly:

"All standard-setting is judgmental. No amount of data collection, data analysis, and model building can replace the ultimate judgmental act of deciding which levels of performance are meritorious or acceptable and which are unacceptable or inadequate" (1979, p. 48).

Shepard identified the essence of the problem of arbitrariness in the so-called absolute methods:

"[N]one of the [standard-setting] models provides a scientific means for discovering the 'true' standard. This is not only a deficiency of the current methods but is a permanent and insolvable problem because the underlying competencies measured are [continuous] and not dichotomous" (1980b, p. 67; cf. Shepard, 1978, p. 62).

Even Ebel, whose standard-setting method has remained popular, resigned himself to the fact that a certain amount of subjectivity remains in "absolute" standard-setting methods:

"A second popular belief is that when a test is used to pass or fail someone, the distinction between the two outcomes is clear-cut and unequivocal. This is almost never true. Determination of a minimum acceptable performance always involves some rather arbitrary and not wholly satisfactory decisions" (Ebel, 1972, p. 492).
Hambleton summarized the overwhelming consensus of opinion:

"What is clear is that all of the methods are [arbitrary]; this point has been made or implied by everyone whose work I have had an opportunity to read. The point is not disputed by anyone I am aware of." (1978, p. 281).

However arbitrary and problematic (deGruijter & Hambleton, 1984; Shepard, 1980b), standards are still essential for making certain inferences and, accordingly, credentialing decisions. The need for valid standard-setting is especially apparent in the areas of certification and licensure, where ensuring the public's protection against unsafe practice is the real and necessary charge of the responsible entities (Lerner, 1979; Maslow, 1983; Shepard, 1983). As Levin has remarked: "Unless all forms of certification are eliminated, however, a standard is still needed [to determine] whether the performance is sufficient to receive the certification" (1978, pp. 306-307).

In summary, while ambivalence remains over the degree of arbitrariness inherent in absolute standard-setting methods, their intuitive appeal, ease of implementation, and perceived advantages in terms of both psychometric properties and defensibility over the previously popular norm-referenced approaches have been documented by numerous researchers (Berk, 1986; Cross, Impara, Frary, & Jaeger, 1984; Klein, 1984; Meskauskas, 1986). The use of absolute standard-setting methods continues to become increasingly widespread. Research into the development of new methodologies, particularly compromise approaches, and empirically-based methods of adjusting standards (Hambleton, 1978), and into assessing the validity of the resultant standards (Jaeger, 1979; Kane, 1985), continues.

Inter-methodological Research

Having gained increasing acceptance by the measurement profession generally, absolute methods of establishing passing standards began to realize widespread use in the determination of cut-off scores on educational, licensure, and certification tests (Gross, 1985). A logical second phase of research developed: investigation of the psychometric properties of the various standard-setting procedures. This second phase of research is characterized largely by attempts to compare two or more standard-setting methodologies in terms of their reliability and ability to identify an "acceptable" standard. As late as 1988, Smith and Smith reported that: "Much of the work in the area of standard setting has been concerned with comparisons of different methods for establishing a criterion." (p. 259).

In testament to the proliferation of inter-methodological research, Berk (1986) reports that in the five-year period 1981-1986, 22 studies were conducted to compare standards resulting from the application of different standard-setting methodologies. Extensive descriptions of the various inter-methodological comparison studies are provided elsewhere (Berk, 1986; Jaeger, 1989). A partial listing of inter-methodological studies is also provided in this work as Appendix A. (Because the present research is limited to applications of one absolute standard-setting approach, Appendix A lists only those studies reporting comparisons involving one or more absolute standard-setting methodologies.)

One result of the wealth of inter-methodological research appears certain: Different standard-setting methodologies yield different standards (Andrew & Hecht, 1976; Brennan & Lockwood, 1980; Koffler, 1980; and Skakun & Kling, 1980). Different methods even produce different performance standards when applied to the same tests by the same group of experts (Mills, 1983; Mills & Barr, 1983).
More tentative and method-specific conclusions apply to studies in which different groups of experts apply the same methodology to the same test (Cross, et al., 1984; Fabrey & Raymond, 1987; Jaeger, 1988, 1989; Rock, Davis & Werts, 1980).

A second result of the inter-methodological research effort is also compelling: The Angoff approach seems to be the preferred absolute standard-setting methodology by several criteria. Mills and Melican (1988) report that, "the Angoff method appears to be the most widely used. The method is not difficult to explain and data collection and analysis are simpler than for other methods in this category" (p. 272). Similarly, Klein (1984) noted that the Angoff method is preferable "because it can be explained and implemented relatively easily" (p. 2). Rock, Davis and Werts (1980) concluded that "the Angoff cutting score seems to be somewhat closer to the 'mark'" (p. 15).
For example, in their procedural guide to several popular standard-setting methodologies, Livingston and Zieky (1982) restate the necessity of reducing variation and invalidity of judgments made by SMFs, devoting extensive portions oftheirmamaltodescribirgthepropertraining of judges. Smith, et a1. (1989) state succinctly: "Variability in the judgmental process reeds to be reduced" (p. 7). Aside from admonitions concerning the training of item reviewers, attention to other intra-methodological considerations has been slight, but growing. A beginning, though sophisticated atht to identify other sources of intro-methodological variation was put forth by smith and Smith (1988) who compared sources of information used by 23 ArgoffaniNedelskyitemreviewersintheitemratingprocess. Elbe primarydajectiveoftheirworkwastopinpointdiffereeesbetween the two methodologies; this, it would still be categorized in the second-phase of research efforts. I-Iwever, it also clearly repraelts aeofthefirstatteiptsatideitifyingjgmjmm variation because of its mique investigation into which item diaracteristicsaresalienttoitemreviewerswithintheArgoffand Nedelskymethods. Salmders, Ryan, and Huynh (1981) also investigated two variatias of the Nedelsky approach, differing only in the extent to which item reviewers were permitted to respond "undecided" when considering whether minimally-campetemt examinees muld rule out an item's option as incorrect. 'Ihey found that the two conditions "produoe[d] essentially equivalent results" (p. 209) . Another investigation into the Nedelsky procedure by Gross (1984) 1edtheauthortosugg$tarefinerentinthetestconstructim processthatwalldmaximizetheconsistencyoftheNedelsky methodology. Flake and Melican (1986) found that, with the Nedelsky method, item reviewers for a mathematics test made fairly consistent item ratings, regardless of test length or difficulty. Dillon (1990) found no strong relationship between the position of an item in an examination and the Angoff rating reviewers assigred to the item. Saunders, et al, (1981) and Halpin, Sigmon, and Halpin (1983) found significant within-method differences in item reviewers' ratings due to differences in the reviewers' own levels of achievement in the subject areas, althaxgh Behuniak, Arohambault, and Gable (1982) 24 reported firdirg m such differences. Mills and Malian (1990) reported that little or no differences in passing standards were observed for randanly equivalent panels of iten reviewers. Noreini, Shea, and Kanya (1988) reported fairly high consisteicy inexperts' estimates ofborderlinegruipperfornarcevmenusingflie Angoff method on a medical specialty emiratim. Helicon and Mills (1987) reported increased p-values and higher intercorrelatiors among item reviewers' item ratings when reviewers were provided with knowledge about the other reviewers' ratings. Garrido and Payne (1987) studied two variations of the Angoff method under two equities-with and withcxrt iten performance information provided to the iten reviewers. In this experiment, mitraireditenreviewerswereaskedtoirdepedeitlyprovideratings for 20 itens. The provision of iten performance informatim (p- values) resultedinhigheraveragepassingstardardsandresultedin reduced interjudge variability. War, the authors note that the high correlation between "With-Data" judges' ratings and erpirieal p- values (r =. .98) called into question "the creditability of the judges in their performance of the judging task" (p. 7). 
The authors further wondered: "Did the presentation of such information influence the judges to the extent that they disregarded their own judgments and relied solely on the item difficulty index in determining their probabilities?" (p. 8). (Interestingly, Skakun (1990) also found that the provision of item performance data—even purposefully incorrect item performance data—has the effect of reducing variability in item ratings.)
argued that tentative support had been provided for the notion that two of the three variations of the Angoff method resulted in aweptable passing standards and less interrater 27 variation. 'niebasicconclusimofthereseardrbyNorcini, etal. is stzaightforward: "In conclusion, this work implies that judgments gathered after an initial traditional group-prom session can provideamedianienforsettirgcittingscoresusinga modified Angoff method and make more efficient use of meeting time." (Nomini, et al., 1987, p. 63). Onetraiblingaspectoftherseard'ireportedbyNorcini, etal. is the failure to control for possible training or "practice" effects intheitenrevieoers. Intheirstudy, SME‘swereaskedtoreviedtest itemsineachofthreeconditicns. Inthefirstconditicn, thegroup reviewedmaterials sentthraxghthemaildescribingtheAngoffmethod tobeused. Next, thereviewersatterxiedagrcupneetingmierethe method was flirther described, definitiors of a"minima11y ccmpetent examinee," etc., were discussed, and ten practice itets were reviewed. Following this training, the iten reviewers then received a booklet ccmtaining the actual test items, answer key, normative information consisting of iten performance statistics, and further review of the Angoff procedures necessary for completing their iten ratings. 'Ihese features are characterized by Nomini, et al., as the "Before- Meeting" condition. The secorxi cordition (called "hiring-Meeting") was characterized by the same group of item reviewers participating in another meeting to review the Angoff procedure and definitions. Following this review, a traditional group—process Angoff procedure was conducted, with normative information again provided. The third, and final, condition (called "After—Meeting") was conducted approximately one month following the "airing—Meeting" 28 condition. In the "After-Meeting" conditim, the sane group of item reviewerswereegainse'rtapacketofiismictimalnaterials, aset of iters, answer key, and normative information, and were asked to provide iten retires. Norcini, etal., reportthattleresultirepassingscoreecbtaired in each of the three conditions did vary, though not significantly [F(2,10) = 2.04, p = .181]. Also reported is an unsurprisire reduction in the variation of iten ratings frun the "Before-Meeting". condition to the "After-Meetire" condition. Standard deviations of the iten reviewers' retires were 5.8, 2.4, and 1.7 for the Before-, Dlr'ing-, and After-Meeting conditions, respectively. 'mese results might imply, as Norcini, et al., suggest, that Areoff iten retires collected from iten reviewers performire indeperdent iten reviews are as reliable as those collected usire a traditional group—process format. However, a weaker conclusion also seens tenable: A sirele group of iten reviewers usire the Areoff methcdtendstobecomelessvariableintheiritenretireswhen affordedrepeatedeqaosuretothenethodardpermittedgreater opportunities for practice. Additionally, Ncrcini, et a1. , reported that, for the retires geerated in the Before-Meetire condition, all oftheitenreviewers failedtotakeguessireintoaccotmtwhen providing their retires. 'Ihe reviewers were, however, instructed to account for examiree guessire for retires they subsequently provided in the mrire- and After-Meeting conditions (presumably usire p = .20 or p = .25 as the lowest retire possibility). This factor could well have contributed substantially to the reduction in variation observed across conditions. 
29 Insunnary, aretl'ertestoftrepropositiorsprtforthbyNorcini, eta1.seeiswarrentedarflisofferedinttepreeeitsmdy. III.SIUUIH'SIGI 'mepresentreseardrhasbdopirposes: 1) todeterminewhether iten reviewers, usire the Areoff (1971) method of assigning probabilities to examination iters, produce different retires as a resultcfexposuretcatreditiorelgrcup—prccessconditicnandan isolation condition, and 2) to investigate the effect of knowledge of other iten reviewers' initial Areoff retires on a subsequent retire of tiesaneitexs.'lwoexperinentstoaddressthesequesticrsare presented. Experimrt l 'Ihe design for the first experiment is are which: 1)rerr1an1y assigred iten reviewers to each of the two conditions; 2) obtained the reviewers' retiresmaccmmnsetofiters;and, 3) cmparedthe resultant ratings. 'Ihe design for Experiment 1 is analogous to the "Pcsttest-Only Control Group Design" presented by Campbell and Stanley (1963, p. 25). InthenotetionsuggestedbyCenpoellarriStanley,thistrue experimental design can be symbolized as follows: GROUP 1: R 01 [control group - (group-process condition” GROUP 2: R X 02 [treatment group - (independent condition”, where: 30 31 R indicates randan assignment to a condition, X iniicetes the administration of a treatmert, and 0 indicates an observation or data collection. Inthepresentresearch, itenreviewerswererandanlyassigredto ore of the two conditions—isolation or group-process. The traditional group-process condition is analogous to a "no treatment" or control group, and the isolation condition represents a new treatment. The above design, called "greatly underused in educational and psychological research" (Campbell and Stanley, 1963, p. 26), has the advantage over other design choices of offering strore resistance tofactorsthatwouldweakentheintenialvalidityofthereseard). 'Ihat is, the experimental design—primarily due to the initial randan assigrnnent to the two conditions—offers a strore potential for discoveriretruedifferereesbetweenfletwogmips' retiresafterthe treatment has been administered, if such differences exist. Although, "'knmling for sure' that themgroups were 'equal'" (Campbell & Stanley, 1963, p. 25) before the experimental treatment is administered is impossible due to the lack of pre—rendan assignment comparisons, many of the factors that could weaken the study's internal validity (particularly, selection) are effectively controlled for through randanization. Epirical Imam aibjectsinthepresentresearchweredividedintotwogroupsand were exposed to two differing conditions. For pirposee of clarity, Groipl—thegroupthatwasexposedtothetreditiorelgmip-process condition—will be referred to as the control group; that is, the 32 group—process condition can be conceived of as a "no treatment" caditim.Gru1p2—thegm1pthatwasexposedtotheinieperflent odditim-winherefenedtoasmetreaunmtgmip;theiniepemmt condition represents the amlication of a new treatment. Precise descriptions of the characteristics of the control and treatment graipsareinportantardarepresentedbelow. m Fadisubjectintheoontrolgmipwasmailedadaecriptimofthe Angoff (1971) methodology for establishing passing scores approximately one nonth prior to a meeting at which the actual item ratings were collected. A copy of these irstructional materials is included as Appendix B. 
Control Group

Each subject in the control group was mailed a description of the Angoff (1971) methodology for establishing passing scores approximately one month prior to a meeting at which the actual item ratings were collected. A copy of these instructional materials is included as Appendix B. Approximately two weeks prior to the passing score meeting, each subject in the control group was telephoned by the investigator and questioned concerning his¹ understanding of the mailed materials and feelings of preparedness to undertake application of the Angoff methodology.

¹ All subjects (treatment and control groups) in the present study were male.

A whole-group meeting, including subjects in both the treatment and control groups, was conducted by the investigator on the day of the passing score meeting. At this meeting, the packet of informational materials which was mailed to subjects prior to the meeting served as a foundation for review of important concepts and definitions. Together, both the treatment and control groups then participated in performing practice ratings for 10 non-operational test items. The practice items were drawn from a recently administered test form from the medical specialty program under study and were chosen to be representative of items found in the upcoming, operational test form. Practice items covered a representative range of difficulty, discrimination, and format. Table 1 provides a description of the 10 practice items.

Table 1
Description of Practice Items Used in Passing Score Study
(Columns: Item No.; Difficulty; Discrimination*; Wording**; Item Type***, for each of the 10 practice items)

Notes:
* Discrimination indices reported are point-biserial correlations.
** Key to wording: P = positively worded item; N = negatively worded item.
*** Notation for item types is consistent with that suggested in Hubbard (1978).

Each of the practice items was accompanied by item analysis information corresponding to the item's recent use. Provision of item analysis information is consistent with the suggestions of many researchers in standard-setting (Berk, 1986; Conaway, 1979; Jaeger, 1982; Livingston & Zieky, 1982; and Shepard, 1976; 1979; 1980a; 1980b; 1983) for inclusion of normative information to item reviewers so that more reasonable item probabilities (ratings) are obtained.

After both the treatment and control groups completed rating the 10 practice items, all members of both groups were polled to determine their perceived familiarity and comfort with proceeding in the application of the Angoff methodology to the operational test form. Questions and answers and a brief discussion moderated by the investigator followed.

After questions and clarifications, subjects assigned to the control group (group-process condition) remained in the group setting for the remainder of the meeting time. A booklet containing the operational test items was distributed to each subject in the control group. No additional information except an indicator of each item's key was provided to the control group. The group was, however, encouraged to utilize each other and their packets of mailed informational materials on the Angoff method as needed. The investigator remained with the group-process condition group to monitor the discussion of items in that group, and to observe the frequency of discussion, the content of discussion, and the extent to which discussion was dominated by one or more group members.

Subjects in the control group were then asked to record their ratings for each test item on a rating sheet that was provided. Subjects in the control group proceeded through the test items as a group, pausing frequently to discuss difficult item wordings, to review their conceptualization of the minimally-competent examinee, and to compare item ratings for questionable items. However, no forced consensus for item ratings was required, nor was any item reviewer encouraged to change his item rating.
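The discrimination indices reported for the practice items in Table 1 above are point-biserial correlations. As a reference, a minimal computation sketch follows; the response data are fabricated for illustration.

```python
import numpy as np

# A minimal sketch of the point-biserial discrimination index reported in
# Table 1: the Pearson correlation between a dichotomous (0/1) item score
# and the total test score. The response data here are fabricated.
item_score = np.array([1, 0, 1, 1, 0, 1, 0, 1])           # one item, 8 examinees
total_score = np.array([52, 31, 47, 55, 28, 44, 35, 50])  # total test scores

r_pbis = np.corrcoef(item_score, total_score)[0, 1]
print(f"point-biserial discrimination: {r_pbis:.2f}")
```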
In the present research, item difficulty information (p-values) for the operational items was not provided to subjects in either the treatment or control groups. Although some researchers have argued that item difficulty information (p-values) should be provided to item reviewers when rating test items in order to increase the consistency of ratings (Cross, Impara, et al., 1984; Norcini, Shea, & Kanya, 1988; and Subkoviak & Huff, 1986), such information was not presented to item reviewers in this study because all items in the to-be-administered test form being reviewed were new (previously untested) items for which performance data were not available.

Rating sheets and all materials were collected from each subject in the control group when the group had completed their ratings for each item. Finally, subjects in the control group responded to a brief questionnaire designed to obtain descriptive information on the subjects and indicators of their perceptions concerning the passing score study methodology.

Treatment Group

Each subject in the treatment group (isolation condition) was exposed to experiences identical to those encountered by subjects in the control group until the time the treatment was administered. Specifically, subjects in the treatment group received the same packet of informational materials mailed approximately one month prior to the passing score study meeting, received a follow-up telephone call approximately two weeks prior to the meeting, and participated in the whole-group practice session and discussions on the day of the meeting.

At the conclusion of the practice session, subjects in the treatment group were each provided with the same booklet of test items as subjects in the control group. Subjects in the treatment group were asked to use the booklets and previously mailed informational materials to provide ratings on an accompanying rating sheet for each item in the test form. Subjects in the treatment group were asked not to discuss their ratings with other treatment group members, members of the control group, or other professional colleagues. Rather, subjects in the treatment group were asked to consider and provide their ratings independently and to return their completed rating forms to the investigator. Like the subjects in the control group, subjects in the treatment group completed and returned, along with their ratings, the post-meeting follow-up questionnaire. All materials were returned by the treatment group to the investigator within two days of the whole-group meeting.

Subjects

Subjects for the present research were 10 members of the Written Examination Committee of a national medical specialty certification Board. Members of the Written Examination Committee are charged with establishing performance standards for the Board's examinations. Subjects were recognized content experts in the medical specialty area and represented various areas of subspecialty within the profession; each was also a member of the profession's academy. None of the subjects possessed expertise in criterion-referenced standard setting methodologies. Also, each subject indicated that he had not participated in a previous standard-setting study.

Consent

Each member of the Written Examination Committee agreed to participate in the present research. The Board's permission to conduct the study was granted through execution of a contract with the American College Testing Program, Inc., to perform various assessment services. The contract specifically covered the conduct of a passing score study for the Board. Permission to use data obtained in the conduct of the passing score study for research purposes was obtained by the American College Testing Program, Inc., and by the investigator in correspondence with the Executive Director of the medical specialty Board.
Also, individual subjects were contacted by mail to request their participation in the study, and each subject provided his consent.

Validity Concerns

For the medical specialty board under study, length of service on the Board is long, and changes in the composition of its Written Examination Committee are slight from year to year. Also, all members of the standard setting body (n = 10) were included in the study. Thus, external validity within the medical specialty group is substantial. External validity is weaker when viewed across medical specialty licensure and certification groups. However, similarities in the composition of, experiences of, and roles played by other medical specialty groups suggest that results of the proposed research may be generalizable to other medical specialty groups as well.

A second validity concern also relates to the composition of the groups. In the present study, all subjects were male; the question of whether female item reviewers would respond differentially to the treatment (i.e., the isolation condition) is not answered by the proposed research.

Internal validity concerns (previously discussed) are somewhat ameliorated due to the random assignment of five subjects to each of the two conditions (Treatment Group, n = 5; Control Group, n = 5). Two additional concerns exist, however. First, there is the possible effect of subjects' knowledge about the purposes of the study. Subjects in both the treatment and control groups were made aware of what condition the other group's members were exposed to. It is likely that the subjects were able, with that knowledge, to surmise the intent of the study. It is unknown, however, what effect such knowledge will have on the results of the present research, though no systematic bias in either group's item ratings is expected. Second, differences between treatment and control groups could be magnified (or depressed) as a result of the domination of discussion by one or more individual raters in the control group. For example, a dominant personality in the control group could influence the ratings of others such that ratings appear to be less variable than they would have been in the absence of the dominant personality. However, careful attention was paid to this concern by the investigator, who moderated the group-process condition. Although discussions about the difficulty of individual items and the concept of the minimally-competent candidate were common in the control group, the discussions were participated in by all group members; no dominance of discussion or hegemony of the ideas expressed was observed.

A final internal validity concern is raised by the relatively small sample sizes involved in the study (n = 5 per group). Specifically, the power of the present study to detect true differences between the groups, should such differences exist, is only modest. Thus, if statistically significant differences between the groups are not observed, strong statements concerning the presence or absence of true differences cannot be made; that is, the hypothesis that true differences between the groups do not exist and the hypothesis that true differences were simply not detected (a type II error occurred) would remain equally tenable.
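The power limitation can be made concrete with a quick calculation. The sketch below assumes a two-sample t test and illustrative (hypothetical) effect sizes; no power analysis of this kind is reported in the study itself.

```python
from statsmodels.stats.power import TTestIndPower

# A rough illustration of the power problem with n = 5 reviewers per
# condition, assuming a two-sample t test. The effect sizes (Cohen's d)
# are hypothetical; the study reports no power analysis.
analysis = TTestIndPower()
for d in (0.5, 0.8, 1.2):
    power = analysis.power(effect_size=d, nobs1=5, alpha=0.05, ratio=1.0)
    print(f"d = {d:.1f}: power = {power:.2f}")
# Even a conventionally "large" effect (d = 0.8) yields power far below the
# usual .80 target, so a null result here is weak evidence of no difference.
```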
Instrumentation

Three instruments were used to record observations in the present research. First, a rating form was used to collect item reviewers' estimates of the proportion of minimally-competent examinees who would answer an item correctly. The same item rating collection form was used by both the treatment and control groups. A sample item rating collection form is reproduced in Appendix C.

The second instrument used was a questionnaire designed to elicit certain information from the item reviewers. Information on the following variables was desired:

- length of service on the Written Examination Committee
- type of professional practice setting (e.g., clinic, university, private practice).

Additionally, questions using Likert-type response choices (Likert, 1932) were asked, concerning:

- perceptions of the adequacy of training in the standard-setting methodology;
- perceptions of item reviewers' comprehension of the standard-setting methodology;
- perceptions of the ease of implementation of the standard-setting methodology; and
- confidence that application of the standard-setting methodology would result in acceptable (accurate) separation of minimally-competent and not minimally-competent examinees.

Information from the questionnaire was gathered in order to obtain demographic characteristics of the content expert panel and to identify other variables that might be related to precision and variability in item ratings. The questionnaire was developed by the investigator following recommendations set forth in Babbie (1973) and Schaeffer, Mendenhall, and Ott (1979). The questionnaire is reproduced in Appendix D and was administered to both the treatment and control groups.

The third instrument used in the present research was the medical specialty examination itself. The examination is used by the medical specialty board as a component of its certification process. One form of the examination is administered annually to approximately 750 residency program graduates. The examination consists of 200 previously untested multiple-choice questions (types A and K) with five option choices. The examination is developed by the medical specialty board based on test specifications that include eleven subtest classifications. Previous analyses of the eleven subtest areas have revealed high subtest intercorrelations (some exceeding 1.00 when corrected for unreliability), suggesting a fairly unidimensional examination (Cizek, 1989). However, on this certification examination, examinees pass or fail the test based on their total test score only. Previous administrations of examination forms have revealed the test to be quite reliable: KR-20 indices of internal consistency (Kuder & Richardson, 1937) for the past eight annual administrations of the test (1982-1989) have been .92, .93, .92, .92, .92, .92, .92, and .92, respectively.

Statistical Analyses

The purpose of the statistical analyses employed in Experiment 1 was to identify any differences between the two groups that would be observable as a result of their exposure to the two conditions (group-process and isolation). Of primary interest is whether the conditions result in different passing scores. In each case, an individual item reviewer's passing score is defined as the sum of his ratings for each of the 200 items. The passing score for each condition is defined as the average of the passing scores for each of the reviewers in the condition. These definitions can be represented notationally as:

$$x_{.jc} = \sum_{i=1}^{200} x_{ijc}$$

where:
$x_{.jc}$ is the passing score for a reviewer $j$ in condition $c$;
$x_{ijc}$ is the rating of item $i$ by reviewer $j$ in condition $c$;
$i$ is the index for items ($i = 1 \ldots 200$);
$j$ is the index for item reviewers ($j = 1 \ldots 5$); and
$c$ is the index for conditions ($c = 1, 2$);

and

$$\bar{x}_{..c} = \frac{1}{5} \sum_{j=1}^{5} x_{.jc}$$

where $\bar{x}_{..c}$ is the passing score for a condition, and $x_{.jc}$ is defined as above.

There also exists a mean rating for each item within each group, which is obtained by averaging the individual reviewers' ratings for the item. That is, there exists, for an item $i$, a mean rating across reviewers in a condition, represented by $\bar{x}_{i.c}$, such that:

$$\bar{x}_{i.c} = \frac{1}{5} \sum_{j=1}^{5} x_{ijc}$$

where $x_{ijc}$ again represents the rating of item $i$ by reviewer $j$ in condition $c$.
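These definitions translate directly into code. The sketch below uses a small, randomly generated ratings array in place of the study's data and computes the reviewer passing scores, the condition passing score, and the mean item ratings defined above.

```python
import numpy as np

# A sketch of the definitions above, with simulated ratings standing in for
# the study's data: axis 0 indexes items (i), axis 1 indexes reviewers (j),
# for a single condition c.
rng = np.random.default_rng(0)
ratings = rng.uniform(0.20, 0.95, size=(200, 5))

reviewer_passing = ratings.sum(axis=0)       # x_.jc: sum over the 200 items
condition_passing = reviewer_passing.mean()  # x-bar_..c: mean over reviewers
item_means = ratings.mean(axis=1)            # x-bar_i.c: mean rating per item

print("reviewer passing scores:", np.round(reviewer_passing, 1))
print("condition passing score:", round(float(condition_passing), 1))
print("first five item means:  ", np.round(item_means[:5], 2))
```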
Investigation into possible effects on rating means and variances produced by exposure to the two conditions is of primary importance in the present research. Recall that the passing score for each condition is the sum of the averaged item ratings for each item from each reviewer assigned to that condition. A test for significant difference between the two condition means, $\bar{x}_{..1}$ and $\bar{x}_{..2}$, was performed using procedures for conducting a one-way analysis of variance (ANOVA) as outlined in Glass and Hopkins (1984). The test was conducted to determine if the treatment (isolation) condition resulted in a different passing score than that resulting from the group-process condition.

Although the primary practical interest of the research was in ascertaining whether there were between-group mean differences, the possible existence of within-group mean differences (i.e., variation in the passing scores assigned by individual reviewers) was also of interest. Specifically, do reviewers within a condition vary significantly in the individual passing scores they suggest? The mean passing score of each reviewer within conditions, that is, the five $\bar{x}_{.j1}$ and the five $\bar{x}_{.j2}$, were examined for within-condition mean differences using separate randomized block ANOVAs, with reviewers' (p = 5) ratings blocked by items (n = 200).

The second question of primary interest was: Did assignment to the two conditions result in differential variability in reviewers' item ratings? A review of Appendix E shows that, across judges within conditions, variability in item ratings can be observed. This variability of ratings for an item, $i$, across raters in conditions 1 and 2 can be represented notationally as $s^2_{i.1}$ and $s^2_{i.2}$. Two columns of these item rating variances (one column for each condition) are shown in Appendix E.

The results of the two randomized block ANOVAs (above) were combined for the next analysis. An F-test using the ratio of the two error variances was conducted to assess the likelihood of homogeneity of within-condition variances. The test also provided a means of answering the second question of primary interest: Did assignment to the two conditions affect the within-block variability of reviewers' ratings?

Two correlation coefficients were also calculated on the condition mean item ratings (i.e., on the $\bar{x}_{i.1}$s and $\bar{x}_{i.2}$s) to answer the question: Do the two methods of rating items (independent and group-process) produce similar orderings of item ratings? In this case, the Pearson product-moment correlation coefficient was calculated to assess the extent to which a linear relationship existed between the ratings of reviewers assigned to each condition. Also, the rank order correlation coefficient was calculated to obtain an indication of the extent to which the two conditions produce similar rankings of Angoff values.

An intercorrelation matrix of reviewers' ratings based on the 200-item set was also calculated. The intercorrelation matrix lends itself to 1) visual examination of the row entries, and 2) statistical testing for differences between within-group mean correlations. For example, each row can be visually examined to test the hypothesis that a reviewer's ratings should correlate more highly with other same-group reviewers' ratings than they should with the ratings of reviewers assigned to the other condition. More specifically, it was hypothesized that the mean of the group-process reviewers' intercorrelations should exceed that of the independent condition, primarily due to the sharing of information that occurs during the group-process condition. To address this hypothesis, a test for differences in the mean intra-group correlations was performed.
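A minimal sketch of these condition-level comparisons (a one-way ANOVA on reviewer passing scores, plus the two correlations on condition mean item ratings) follows; simulated ratings again stand in for the study's data.

```python
import numpy as np
from scipy import stats

# A sketch of the condition-level analyses described above, with simulated
# ratings in place of the study's data (items x reviewers per condition).
rng = np.random.default_rng(1)
group_process = rng.uniform(0.20, 0.95, size=(200, 5))
independent = rng.uniform(0.20, 0.95, size=(200, 5))

# One-way ANOVA on reviewers' passing scores (sums over the 200 items).
f, p = stats.f_oneway(group_process.sum(axis=0), independent.sum(axis=0))
print(f"between-condition ANOVA: F = {f:.2f}, p = {p:.3f}")

# Pearson and rank-order correlations between condition mean item ratings.
m1 = group_process.mean(axis=1)
m2 = independent.mean(axis=1)
print("Pearson r: ", round(stats.pearsonr(m1, m2)[0], 3))
print("Spearman r:", round(stats.spearmanr(m1, m2)[0], 3))
```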
Two methods were used to evaluate the comparability of the two conditions using an additional source of data: the empirical item performance statistics from administration of the examination for which items were rated. The first evaluation was based on the extent to which the two conditions resulted in dependable classification (pass/fail) decisions. As the Standards (1985) state, "estimates of the reliability of licensure or certification decisions should be provided" and "the reliability of the decision of whether or not to certify is of primary importance" (p. 65). Two estimates of decision consistency were utilized, $\hat{p}_0$ and $\hat{\kappa}$. These estimates of decision consistency, using randomly parallel tests, are elegantly defined by Millman (1979). Millman has characterized $\hat{p}_0$ as "the proportion of individuals classified the same way on each administration [of a test]" and he defines $\hat{\kappa}$ as "the proportion of the total number of agreements [in classification] above the chance level of agreement" (p. 86). It is also possible to conceive of these two indices as an indicator of classification consistency ($\hat{p}_0$) and an indicator of the relative contribution of the test to that level of classification consistency ($\hat{\kappa}$). Procedures are available for obtaining estimates of decision consistency using only one form of a test, and these procedures were used in the present research. Detailed explications of the procedures have been provided by Huynh (1976) and Subkoviak (1976; 1984; 1988).

The second evaluation consisted of two ways of examining the relationship between reviewers' ratings and item statistics obtained from the actual administration of the examination. For one analysis, individual item reviewers' ratings for each item were compared with empirically-obtained difficulty indices (p-values) derived from the administration of the 200-item test. Modified p-values (symbolized $p_i$) were used for this comparison. The modification consisted of calculating the p-values based upon the performance of "minimally-competent" examinees only, rather than on the total group, following the suggestions of others (see, for example, Kane, 1984; 1986; DeMauro & Powers, 1990; Cramer, 1990). For this analysis, minimally-competent examinees were defined as those scoring within two standard errors of measurement of the operational passing score on the examination². The analysis consisted of obtaining an indication of absolute error, or the extent to which reviewers' item ratings approximated the items' actual performance in the minimally-competent group. Following the conceptual framework suggested by others (van der Linden, 1982; Subkoviak & Huff, 1986; Friedman & Ho, 1990), the variable E was created to reflect error, or misspecification of item performance by the reviewers. Thus, the absolute root mean squared error (RMSE) of specification for a reviewer, $j$, in condition $c$, is represented by:

$$E_{.jc} = \sqrt{\frac{\sum_{i=1}^{200} \left( x_{ijc} - p_i \right)^2}{n - 1}}$$

where $x_{ijc}$ is the rating of item $i$ by reviewer $j$ in condition $c$, and $p_i$ is the modified p-value for item $i$ (described above).

² For p-values to be calculated based only upon the responses of the "minimally-competent" group, an external criterion was needed. That is, the minimally-competent group could not be established with reference to the passing standard based upon the Angoff ratings. For the examination under study, the actual operational passing standard was established using the Beuk (1984) methodology, thereby avoiding a circular definition of competence.

A second analysis was conducted to obtain an indication of relative error, or the extent to which reviewers' ratings approximated group mean item ratings. Thus, the relative RMSE of specification for a reviewer $j$ in condition $c$ is given by:

$$E'_{.jc} = \sqrt{\frac{\sum_{i=1}^{200} \left( x_{ijc} - \bar{x}_{i.c} \right)^2}{n - 1}}$$
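The two error indices can be sketched directly from their definitions. The code below uses simulated ratings and modified p-values in place of the study's data, and follows the (n - 1) divisor shown in the formulas as reconstructed above.

```python
import numpy as np

# A sketch of the absolute (E) and relative (E') errors of specification
# defined above, with simulated data in place of the study's ratings.
rng = np.random.default_rng(2)
n_items = 200
ratings = rng.uniform(0.20, 0.95, size=(n_items, 5))  # items x reviewers
mod_p = rng.uniform(0.20, 0.95, size=n_items)         # modified p-values

item_means = ratings.mean(axis=1)                     # x-bar_i.c

# E: distance from the minimally-competent group's actual item performance.
E = np.sqrt(((ratings - mod_p[:, None]) ** 2).sum(axis=0) / (n_items - 1))

# E': distance from the condition's own mean item ratings.
E_prime = np.sqrt(((ratings - item_means[:, None]) ** 2).sum(axis=0) / (n_items - 1))

print("E  per reviewer:", np.round(E, 3))
print("E' per reviewer:", np.round(E_prime, 3))
```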
. . . .337); however, the difference between the two mean correlations was not statistically significant.

Decision Consistency

The extent to which exposure to the no-information and with-information conditions resulted in differing levels of classification consistency was also examined. Indices of classification consistency $\hat{p}_0$ and $\hat{\kappa}$ were calculated for each condition using the passing scores suggested by each. The results are shown in Table 15. As Table 15 shows, application of the no-information passing score would result in a higher overall index of classification consistency ($\hat{p}_0$ = .934), compared to the with-information condition index ($\hat{p}_0$ = .898). Accordingly, the contribution of the examination itself to the consistency of pass/fail classifications was greater under the with-information condition ($\hat{\kappa}$ = .706) compared to the no-information condition ($\hat{\kappa}$ = .678).

Table 15
Indices of Decision Consistency for No-Information and With-Information Conditions

Condition            Passing Score    p̂₀      κ̂
No Information       54.9% (110)     .934    .678
With Information     60.1% (120)     .898    .706

Relationship of Ratings to Obtained Item Statistics

For item reviewers in the no-information (NOINFO) and with-information (WITHINFO) conditions, overall item ratings for the 100 items were compared to item difficulty indices resulting from the actual administration of the examination. As in Experiment 1, modified p-values (MODP) were used, obtained by calculating each item's difficulty based only on the responses of examinees whose total score was within two standard errors of the passing score. Correlations were calculated between the overall NOINFO and WITHINFO ratings and MODP. Correlations were also calculated between individual item reviewers' ratings and MODP. For both conditions, individual reviewers' ratings were found to be moderately related to MODP. Interestingly, the lowest correlation with MODP (r = .197) was observed for a reviewer in the with-information condition, while the highest correlation with MODP (r = .505) was observed for a reviewer in the no-information condition. Also, surprisingly, the no-information condition produced a higher (though non-significantly so) overall correlation with modified p-values (r = .590) than the with-information condition (r = .573).

The two indices created to reflect the degree of agreement between reviewers' ratings and certain criteria (E and E') were also calculated for each reviewer. Table 16 presents the obtained values of absolute error of specification (E) and relative error of specification (E') for the five reviewers under the no-information and with-information conditions. Comparison of the values displayed in Table 16 indicates that, generally, absolute errors of specification are only slightly reduced through the provision of additional information. The mean absolute error of specification for the with-information condition (24.12) was quite close to the mean for the no-information condition (24.93). However, relative errors of specification were also slightly reduced under the with-information condition (mean = 13.43) compared to the no-information condition (mean = 14.81).
Table 16
Absolute and Relative Errors of Specification for Item Reviewers in No-Information and With-Information Conditions

                 No-Information Condition    With-Information Condition
Reviewer             E         E'                E         E'
1                  23.52     14.09             22.48     12.98
2                  26.27     14.76             22.89     10.99
3                  23.95     13.24             24.95     14.36
4                  25.94     13.77             25.84     12.78
5                  24.99     18.17             24.46     16.04
Mean               24.93     14.81             24.12     13.43
Standard            1.20      1.96              1.41      1.89
Deviation

In evaluating the effect of the provision of additional information, it is again observed that individual item reviewers were more proficient at estimating the overall group rating for the items than they were at predicting how the hypothetical minimally-competent examinee group would perform.

Regression Analyses

In order to further evaluate the effect of providing additional information to item reviewers, five regression analyses were performed. A regression model was developed which reflects the hypothesis that an individual reviewer's second (i.e., with-information) rating can be predicted from knowledge of his original (without-information) rating and knowledge of the group's original mean rating (with the group mean calculated excluding the individual reviewer). These two ratings were used as the independent variables in the regression equations, with the reviewer's revised (with-information) rating used as the dependent variable. Theoretically, the model assumes that reviewers make their judgments about item ratings based upon their own procedure-related knowledge, that is, knowledge regarding the hypothetical minimally-competent examinee group and the difficulty of the items being rated. In addition, reviewers take into account information gleaned from other reviewers: in this case, from the distribution of reviewers' initial ratings that was provided for their use in the second round of ratings.

To assess the likelihood of such an effect, five regression analyses were conducted, one for each reviewer, according to the procedure described above. Results of the analyses are presented in Table 17. Raw (non-standardized) multiple regression equations are presented in the table, along with the correlations between the two independent variables, the multiple R, and R squared. In each case, the correlations between the independent variables are low to moderate, suggesting that the choice of independent variables does not pose a threat of multicollinearity. For each regression performed, analyses of plots of predicted values against residuals revealed no disconcerting patterns; plots were broadly scattered and all residuals had means at or near zero.

Table 17
Regression Analyses for Individual Reviewers in Experiment 2

Reviewer    Regression Equation                     r(x1,x2)    Multiple R    R²
1           y =  8.805 + .535(x1) + .424(x2) + e      .425         .715      .511
2           y =  5.745 + .526(x1) + .524(x2) + e      .461         .827      .683
3           y = -1.073 + .789(x1) + .349(x2) + e      .456         .750      .563
4           y = 29.528 + .290(x1) + .273(x2) + e      .480         .537      .288
5           y = -7.161 + .537(x1) + .555(x2) + e      .476         .807      .652

Notes: x1 = original rating for item i by reviewer j; x2 = group's original mean rating for item i computed with reviewer j excluded.

The hypothesized influence of additional information appeared to be evident in each of the regression analyses. For every reviewer, the values of $b_1$ and $b_2$ were tested for significant difference from zero; in all cases, the t statistics were significant at p < .01. Further, the moderately high values of multiple R and (with the exception of Reviewer 4) the moderate values of R squared suggest that the regression model has accounted for at least half of the variation in reviewers' ratings.
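A minimal version of one such per-reviewer regression can be fit with ordinary least squares; the sketch below uses simulated ratings (the study's data are not reproduced), so its coefficients are not those of Table 17.

```python
import numpy as np

# A sketch of the per-reviewer regression described above: a reviewer's
# second-round rating (y) regressed on his original rating (x1) and the
# group's original mean rating excluding him (x2). All data are simulated.
rng = np.random.default_rng(3)
n = 100
x1 = rng.uniform(20, 95, size=n)                 # original ratings
x2 = rng.uniform(20, 95, size=n)                 # leave-one-out group means
y = 5 + 0.5 * x1 + 0.4 * x2 + rng.normal(0, 8, size=n)

X = np.column_stack([np.ones(n), x1, x2])        # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

r_squared = 1 - ((y - X @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"y = {beta[0]:.3f} + {beta[1]:.3f}(x1) + {beta[2]:.3f}(x2),  R^2 = {r_squared:.3f}")
```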
Combined Results

Selected results from Experiment 1 and Experiment 2 were combined to achieve an overall assessment of the effect of the various standard setting procedures. First, the group-process ratings from Experiment 1 were reanalyzed to obtain the passing standard that would result using ratings for the first 100 items only. This was done so that direct comparisons could be made between the passing standards suggested by the group-process condition, the independent/no-information condition, and the independent/with-information condition, and so that the standards to be compared would be based upon ratings of the same 100 items. Table 18 presents the results of the combined analysis.

Several striking differences between the three procedures are apparent. First, the mean item ratings for the three procedures differ considerably, from a low of 48.88% (for the group-process condition) to a high of 60.08% (for the independent/with-information condition). The dramatic impact that differences of this magnitude would have on examinee classification decisions is also shown in Table 18. For example, the lowest passing rate (77.4%) was observed for the independent/with-information condition, while the highest passing rate (95.0%) was observed for the group-process condition. Accordingly, failure rates also varied dramatically, from 5.0% for the group-process condition to nearly 4 1/2 times as great for the independent/with-information condition (22.6%).

Table 18
Comparison of Experiment 1 and Experiment 2 Suggested Passing Standards

                                          Group-       Independent/      Independent/
                                          Process      No-Information    With-Information
Mean Item Rating (across reviewers)        48.88         54.90             60.08
Standard Deviation of Reviewers'           11.60          2.41              3.86
  Overall Ratings
Standard Error                              5.19          1.08              1.73
Passing Score (rounded)*                   97.76 (98)   109.80 (110)      120.16 (120)
95% Confidence Interval for Passing        88, 108      108, 112          117, 124
  Score
Percent Passing (Failing)**                95.0 (5.0)    86.8 (13.2)       77.4 (22.6)

* = adjusted to reflect the passing standard for a 200-item test.
** = based on the passing score obtained using the Beuk (1984) method.

The issue of variability among individual reviewers' overall ratings (i.e., suggested passing standards) is also highlighted by the results displayed in Table 18. The wide variability across reviewers in the group-process condition is expressed statistically in the large standard deviation of group-process reviewers' ratings (11.60). This fairly large value for the standard deviation of group-process reviewers' ratings is also reflected in a correspondingly large standard error (5.19) and a very wide confidence interval (88, 108). On the other hand, both of the independent conditions (i.e., the no-information and with-information conditions) displayed comparatively smaller standard deviations for reviewers' overall ratings and correspondingly smaller standard errors and confidence intervals. Surprisingly, the smallest standard error (1.08) and narrowest confidence interval (plus or minus 2 raw score units) were observed for the independent/no-information condition.

In summary, it should be emphasized that these fairly large differences may, or may not, be attributable to exposure to the experimental conditions. Because of the small panels of item reviewers utilized, it is possible that the results could be explained by random error. Although the social interaction hypothesis would predict the observed results, the failure to achieve statistical significance for group mean differences does not rule out chance as an explanation for these results.
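The standard errors and confidence intervals in Table 18 follow from the reviewers' overall ratings in the usual way. The sketch below uses the group-process condition's published figures and assumes the normal-based multiplier of 1.96, which reproduces the table's interval.

```python
import math

# A sketch of the standard error and 95% confidence interval reported in
# Table 18, using the group-process condition's published figures; the
# 1.96 multiplier is an assumption that reproduces the table's interval.
passing_score = 97.76   # mean of the five reviewers' suggested standards
sd = 11.60              # SD of reviewers' overall ratings
n_reviewers = 5

se = sd / math.sqrt(n_reviewers)
lo = passing_score - 1.96 * se
hi = passing_score + 1.96 * se
print(f"SE = {se:.2f}; 95% CI = ({lo:.0f}, {hi:.0f})")  # SE = 5.19; CI = (88, 108)
```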
V. DISCUSSION

This study consisted of two experiments. The purpose of the first experiment was to examine one variation of the traditional group-process procedure of establishing passing standards using the Angoff standard setting methodology. The variation studied consisted of having item reviewers generate their Angoff ratings under an "independent" condition in which the usual effects of the group-process procedure (e.g., social comparison, sharing of information, etc.) could be controlled. The purpose of the second experiment was to isolate the effect of one source of information that item reviewers use in generating their ratings: knowledge of the ratings provided by other (peer) reviewers. The results of each experiment are summarized below, and a list of major findings and implications of these results is presented.

Experiment 1 Summary

Mean Ratings and Variability

Ten item reviewers in Experiment 1 provided Angoff ratings for 200 items on a medical specialty certification examination. Before providing their ratings, reviewers were given informational materials and participated in a training session to ensure their familiarity with the methodology. After this, reviewers were randomly assigned to one of two conditions: an independent condition in which reviewers had no interreviewer interactions concerning their ratings, and a group-process condition in which reviewers freely discussed their ratings for items, item difficulty and relevance, and their conceptions of the hypothetical minimally-competent candidate group.

Exposure to the two conditions produced varied results. The primary question of interest was whether exposure to the conditions would yield differing passing standards. In Experiment 1, the passing standards obtained showed that the independent condition resulted in a standard that was approximately nine raw score points higher than the group-process condition. However, that difference was not statistically significant. Although the independent condition standard was higher, overall group item ratings provided by reviewers in each condition were nearly equally variable and fairly highly correlated.

A second variability issue addressed in Experiment 1 was whether the two conditions resulted in differential ratings for individual items. As hypothesized, independent reviewers exhibited, on average, a slightly wider spread of ratings for individual items than did reviewers in the group-process condition. This result complements the earlier observation of the higher standard suggested by the independent group in that the absence of reviewer interaction in the independent group may have contributed to this result. Conversely, the variability of the group-process condition ratings for individual items may have been reduced due to the effect of group interaction.

It is critical at this point, however, to highlight the failure to achieve statistical significance for observed differences between mean ratings for the two conditions in Experiment 1. Although the results would surely entail large practical consequences for the examinee population, the profession, and the certifying board, confident statements regarding the reproducibility of these means cannot be made. Specifically, the failure to reject the null hypothesis for group mean differences means that the results could be explained simply with reference to random error: different groups of item reviewers could be empanelled and arrive at identical passing scores, or even at passing scores that differ in the opposite direction from those observed in this study.

Decision Consistency

Both the independent and group-process conditions yielded high indices of decision consistency, as evidenced by the coefficients $\hat{p}_0$ and $\hat{\kappa}$. However, neither the fact that both indices were high nor the fact that the group-process condition yielded slightly higher coefficients is particularly noteworthy: these findings can be explained by simply noting that the examination itself was highly reliable and that both the independent and group-process passing scores were not very close to the overall mean score on the examination (with the group-process condition passing standard located slightly further from the overall mean score than the standard suggested by the independent group).
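Millman's (1979) two-administration definitions quoted in Chapter 3 lend themselves to a direct sketch. The pass/fail classifications below are fabricated for illustration; the single-administration estimates actually used in this study (Huynh, 1976; Subkoviak, 1976) require a different computation.

```python
import numpy as np

# A sketch of Millman's (1979) two-administration definitions of the
# decision-consistency coefficients; the classifications are fabricated.
form1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # 1 = pass on form 1
form2 = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1])  # 1 = pass on form 2

p0 = float(np.mean(form1 == form2))               # same classification on both

# Chance agreement computed from the two forms' marginal pass rates.
p1, p2 = form1.mean(), form2.mean()
p_chance = p1 * p2 + (1 - p1) * (1 - p2)

kappa = (p0 - p_chance) / (1 - p_chance)          # agreement above chance
print(f"p0 = {p0:.3f}, kappa = {kappa:.3f}")
```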
Relationship of Ratings to Obtained Item Statistics and Reviewer Characteristics

Ratings from item reviewers in the independent and group-process conditions were compared to p-values which were calculated using only the responses of the hypothetical minimally-competent candidate group. Although, for all individual reviewers, correlations between item ratings and modified p-values were significantly different from zero, all of the correlations were uniformly low. Even when combined to form group average item ratings, correlations with modified p-values were moderate at best. Similarly, the magnitude of the variables E and E' (conceptualized as average errors of specification for item ratings) indicated that individual item reviewers, in general, exhibited a fairly large degree of error when attempting to estimate the performance of the minimally-competent group, as evidenced by the large values of E. It is of small consolation that reviewers could more accurately provide estimates of their overall group item ratings, as evidenced by the relatively smaller values of E'. These findings, taken together, all confirm one common criticism of the Angoff standard setting methodology: that item reviewers often experience some difficulty in accurately conceptualizing the minimally-competent examinee group.

Further, precision in estimation of item ratings does not appear to be dependent upon any of the reviewer characteristics measured in this study. For example, one might suspect that greater experience with producing and reviewing test items would lead to more accurate specification in item ratings. This result was not observed. Likewise, neither was a significant relationship observed between the extent to which reviewers reported understanding the Angoff methodology, or their confidence in its results, and the precision of their ratings. These results do not rule out the possibility that other reviewer characteristics do contribute substantially to accuracy in item ratings; perhaps other significant background variables exist that were not measured in this study. On the other hand, it is also somewhat encouraging that the measured variables do not appear to influence reviewer accuracy. If standard setting bodies can be less concerned about these variables when empaneling reviewers, the pool of potential reviewers might be larger, possibly widening to include participation by able reviewers who may have otherwise been excluded.

Generalizability Analyses

Generalizability analyses were conducted to investigate differing sources of variation in item ratings so that potential future applications of either the independent or group-process procedures could be developed to yield increased dependability of measurement (i.e., dependability of item ratings). G-study results indicated that variance components were fairly well estimated (except for the group-process condition raters component) and would be useful for subsequent D-study analyses. D-study results for the independent and group-process conditions were obtained, varying the number of reviewers while holding the number of items constant. The results showed that slightly increased measurement dependability was achieved under the independent condition as compared to the group-process condition, with acceptable results for operational purposes achieved with approximately 11 to 15 reviewers. This finding is contrasted with the suggestions of some that at least six to seven reviewers be empaneled for passing score decisions, although others (cf., Cross, et al., 1984, p. 116) have also suggested that empaneling 15 or more reviewers is desirable.
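A schematic version of a D-study projection of this kind is sketched below. The variance components are hypothetical stand-ins (the study's estimates are not reproduced here), and the coefficient is a common simplification in which rater-linked error variance shrinks as the number of raters grows.

```python
# A schematic D-study projection for an items x raters design. The variance
# components below are hypothetical stand-ins, not the study's estimates.
var_items = 0.030      # objects of measurement: items
var_raters = 0.010     # rater main effect
var_residual = 0.040   # rater x item interaction, confounded with error

for n_raters in (5, 7, 11, 15):
    error = (var_raters + var_residual) / n_raters
    dependability = var_items / (var_items + error)
    print(f"{n_raters:2d} raters: dependability = {dependability:.3f}")
```

Under these assumed components, dependability rises steadily with panel size, mirroring the qualitative pattern of the D-study results described above.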
'Iheresultsstxmedthat slightly increased neasureezm dependability was achieved using under theiniependentcmditimascmparedtouaegruxp-pmcesscuditim, with acceptable results for operational purposes achieved with apprminately 11t015reviewers. 'Ihis firdingiscontrastedwimthe suggestions ofsanethatat least sixtoseven reviewersbealpaneled for passing score decisions, although others (cf., Cross, at al., 116 1984, p. 116) have also smested that eupaneling 15 or more reviewers is desirable. D-studyresults alsosggestedthataddingmoreitanreviewers (or, possibly, more extensive reviewer training) would likely result in irmeasedneasnerentdeperfiabflitymflereitherflueinieperdentor group-process oonditiors, though more so in the group-process condition. In practice, of course, inzreasing the nuuber of test item would also, generally, inprove overall dependability. However, with testlergth forthetestmflersbadyalreadyfairlylong (n=200 itens), more and bettertrained reviewers would likely be a more practical, less costly, and more efficacious method of addressing the issue of increasing the accuracy of item ratings. Cost Analfiis Becausetheindependentitenreviewprocedurewasproposedasan efficient alternative to the gram-process procedure, a cost analysis wasalso conducted. Asexpected, the financial costsassociatedwith inpleentation of an iniepenient/with—neeting rating procedure were lowerthanfliecostsassociatedwiflzcmfluctingthetraditiornlgmxp- process procedure for a 200—item examination. Substantially lower costs yet were estimated for an iniependent/wittnrt—neetn'g procedure. However, itismtedthatsarecontroloverthestaniardsettim process is surely lost when either ixflependent condition is utilized. One potentially inportant element that is excluded from the iJ'dependent/witlnlt-neeting condition is the ability of reviewers, as agroup, toarriveatsateooreensusregardjngmeirconoeptimofme mininally—cmpetent eaminee group—an important aspect of the Angoff 117 “Sundology. And, itismflamnl‘mstmfiardsestablishedusingan independent/without-meeting procedure would compare to the iniqendent/with-meeting or group—process procedures eramined in this research. Prunisingresultswereobservedforthea'evariatimofthe irdependent procedure in which iten reviewers assetble only long enoightoreceivegrulptraiJfim,beconefamiliarwiththe methodology, and develop cannon referents regarding the minimally- cmpetent group. This variation was also less expensive that the traditional group-processmethod, mtwouldrequireagreatertime ccmnitznentmthepartofpotentialitanrevieers. 'misopticn, however, should probably be oorsidered by groups contamlating the need for a standard setting study in light of earlier findings regarding the inportance of reviewer training. Wit 2 Sumary m Ratm‘ and Variabilig Five iten reviewers—the same reviewers who participated as irflepenient item reviewers in Experiment 1—were each provided with the five ratings generated for each of the first 100 items form the ZOO-item examination used in Experiment 1. 'Ihe reviewers were asked to reread the 100 items, to review the distribution of initial ratings for eadu item, and to independently provide a second rating for each item. 'Ihis procedure created two conditions: a "No-Infometion" condition represented by the initial ratings generated irrieperdently before any mrnative information was provided, and a "With- Infonration" condition represented by the subsequent ratings generated 118 With knowledge of the distributions of initial ratings for each item. 
with knowledge of the distributions of initial ratings for each item.

Fairly consistently, ratings generated under the with-information condition were higher than ratings generated by the same reviewers under the no-information condition. Differences between the condition means were of statistical and practical significance. However, overall mean item ratings across reviewers were roughly equally variable for the no-information and with-information conditions, although at the individual item level, a slight reduction in variability for the with-information ratings was observed.

These findings generally complement the findings presented for Experiment 1. For example, the provision of additional information, in the form of the distributions of item ratings, may have had the effect of communicating to reviewers a group "expectation" or conceptualization regarding minimal competence levels which they used in generating their second set of ratings. Accordingly, reviewers whose ratings may have been extreme initially were subtly induced to converge on the standard implied by the distributions of item ratings, making their subsequent ratings for individual items somewhat less variable. This effect is similar to what some have termed the "reality check" aspect of the modified Angoff method, in which item reviewers, after providing an initial set of ratings, are given empirical item difficulty levels and asked to generate a second (revised) set of ratings.

Relationship of Ratings to Obtained Item Statistics

Ratings from item reviewers in the no-information and with-information conditions were compared to p-values which were calculated, again, using only the responses of the hypothetical minimally-competent candidate group. Although, for all individual reviewers, correlations between item ratings and modified p-values were significantly different from zero, all of the correlations were of low to moderate magnitude. Also, average correlations between ratings provided under the two conditions and the modified p-values did not differ significantly. These results likely mean that the provision of additional information did not influence the reviewers to converge on the standard that would be suggested by the actual performance of the minimally-competent group (as operationalized in this study). Rather, reviewers converged on their own, somewhat inaccurate, conception of the level at which an appropriate minimum standard should be set. In fact, reviewers in both the no-information and with-information conditions had similar and fairly large absolute errors of specification. Mean relative errors of specification for the two conditions were also quite close.

Because the same reviewers who provided ratings for Experiment 1 also provided ratings for the second experiment, the results are somewhat dependent; the low degree of accuracy found in Experiment 1 is, to some extent, carried over to Experiment 2.
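The within-reviewer shift from no-information to with-information ratings described above can be examined with a simple paired comparison. The sketch below is an illustration, not necessarily the analysis used in the study, and the ratings are simulated with a small upward shift in the second round.

```python
import numpy as np
from scipy import stats

# A sketch of a within-reviewer comparison of first- and second-round
# ratings; the paired t test shown is an illustration only, and the
# ratings are simulated with an upward shift in round two.
rng = np.random.default_rng(4)
round1 = rng.uniform(30, 90, size=100)            # no-information ratings
round2 = round1 + rng.normal(5, 6, size=100)      # with-information ratings

t, p = stats.ttest_rel(round2, round1)
print(f"mean shift = {np.mean(round2 - round1):.2f}, t = {t:.2f}, p = {p:.4f}")
```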