This is to certify that the dissertation entitled "The Validity of Patient Management Problems for Licensing and Certification of Physicians," presented by Eric Duane Zemper, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Counseling, Educational Psychology and Special Education.

Major Professor                    Date
THE VALIDITY OF PATIENT MANAGEMENT PROBLEMS FOR LICENSING AND CERTIFICATION OF PHYSICIANS

By

Eric Duane Zemper

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling and Educational Psychology

1982

Copyright by ERIC DUANE ZEMPER 1982

ABSTRACT

THE VALIDITY OF PATIENT MANAGEMENT PROBLEMS FOR LICENSING AND CERTIFICATION OF PHYSICIANS

By

Eric Duane Zemper

Patient Management Problems (PMP's) have become an important part of licensing and certification examinations of physicians despite the lack of any evidence of criterion-related validity. This study investigates the criterion-related validity of PMP's using data from 509 physicians who took the specialty certification examination of the American Board of Emergency Medicine. Simulated Clinical Encounters (SCE's), highly structured and reliable oral simulations administered by trained examiners, serve as the performance criterion. Four hypotheses, based on directly observable consequences of the assumptions underlying the use of PMP's, are tested.

The results show that PMP's have little or no correlation with the criterion measure. Regression analyses indicate that PMP scores make no contribution to predicting SCE scores, while scores from a multiple-choice (MCQ) battery account for 19% of the SCE score variance. Stepwise discriminant analyses indicate that, between PMP and MCQ scores, the MCQ's account for all the ability to discriminate residency-trained emergency physicians (who presumably possess enhanced problem-solving skills) from those without residency training. PMP's contribute nothing to this discrimination. When the scores of each section of the PMP's are correlated with the criterion measure, the final management section correlation is significantly higher than any other, indicating that outcome measures may be more important than data-gathering (process) measures.
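The comparisons of section-score correlations summarized above rest on z-tests of the difference between correlations (reported in the Chapter IV tables). A minimal sketch of the standard Fisher r-to-z test for two correlations from independent samples follows; the sample values are hypothetical illustrations, not the study's data, and the dissertation's paired comparisons on a single candidate sample would strictly call for the dependent-correlations variant of this test.

```python
import math

def fisher_z(r):
    """Fisher r-to-z transformation: arctanh(r)."""
    return math.atanh(r)

def z_test_independent_correlations(r1, n1, r2, n2):
    """z statistic for the difference between two Pearson correlations
    computed on independent samples of sizes n1 and n2."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# Hypothetical example: does r = 0.44 differ significantly from r = 0.20
# when each correlation is based on 250 candidates?
z = z_test_independent_correlations(0.44, 250, 0.20, 250)
print(round(z, 2))  # |z| > 1.96 indicates p < .05, two-tailed
```

Under this procedure a difference between a management-section correlation and a data-gathering-section correlation would be declared significant when the resulting |z| exceeds the critical value for the chosen alpha level.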
This study can provide no evidence for criterion-related validity of PMP's in licensing and certification of physicians. These results, combined with the few existing studies of criterion-related validity of PMP's, indicate that the validity of using PMP's in making licensing and certification decisions should be considered highly suspect until proven otherwise by those using this examination format.

ACKNOWLEDGMENTS

At the conclusion of this long effort, special thanks must go to my wife, Jerri, for her great typing skills and even greater patience. My thanks also to Professor John F. Vinsonhaler, who first encouraged me to start this degree program and later became my committee chairman, and to Professor Jack L. Maatsch, the director of this dissertation, who has been my mentor for many years in the field of medical education. For also serving on my committee, and for their consistent help over the years, my deep appreciation to Professor Lee S. Shulman and Professor Sarah A. Sprafka. Finally, I would like to acknowledge Dr. Raywin Huang, who provided unstinting help and advice with the computer programs used for the statistical analyses in this study; and the American Board of Emergency Medicine and Dr.
Ben Munger, Executive Director, without whose cooperation this study would not have been possible.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER

I. THE PROBLEM
    Introduction
    History of Licensing and Certification Exams in Medicine
        The development of Patient Management Problems
        Problems of examinations as a test of competence
    Need for This Study
    The Problem
    Research Hypotheses
    Summary
    Overview of the Dissertation

II. REVIEW OF THE LITERATURE
    Licensing and Certification Examinations
        Validity
        Defining the criterion
    Patient Management Problems
        Validity
        Scoring of PMP's
        PMP's as tests of problem-solving skills
    Summary

III. PROCEDURES AND DESIGN
    Introduction
    Examination Construction
    Field Test of Test Items
    Design
        Subjects
        MCQ format
        PMP format
        SCE format
        Generalizability of results
    Questions Summarizing the Logic Underlying the Testable Hypotheses
    Summary

IV. RESULTS
    Introduction
    Summary of Test Results
    Correlation Summaries for PMP, MCQ, and SCE Scores
    Results Concerning Hypothesis I
    Results Concerning Hypothesis II
    Results Concerning Hypothesis III
    Results Concerning Hypothesis IV
    Summary of Results for Tests of Hypotheses
    Results of Additional Analyses
    Summary of Additional Analyses
    Hypotheses and Analysis Methods

V. SUMMARY AND CONCLUSIONS
    Introduction
    Summary of Findings
    Conclusions
    Discussion of results
    Suggestions for future research

APPENDIX

LIST OF REFERENCES

LIST OF TABLES

TABLE                                                            PAGE
1.1  COMMON SCORING FORMULAE FOR PATIENT MANAGEMENT PROBLEMS . . . 13
3.1  TEST ITEMS ALLOCATED TO MEDICAL CONTENT CATEGORIES . . . 67
4.1  SUMMARY OF NBME SCORE CORRELATIONS WITH MCQ AND SCE SCORES . . . 93
4.2  SUMMARY OF PI SCORE CORRELATIONS WITH MCQ AND SCE SCORES . . . 94
4.3  SUMMARY OF EI SCORE CORRELATIONS WITH MCQ AND SCE SCORES . . . 95
4.4  CORRELATIONS BETWEEN NBME, EI AND PI AVERAGE SCORES . . . 97
4.5  CORRELATIONS BETWEEN NBME, EI AND PI SCORES FOR CANDIDATES PASSING PART I . . . 97
4.6  CORRELATIONS BETWEEN NBME, EI AND PI SCORES FOR CANDIDATES FAILING PART I . . . 98
4.7  PMP SCORE CORRELATIONS WITH MCQ AND SCE SCORES AND Z-TESTS OF SIGNIFICANCE OF DIFFERENCE . . . 100
4.8  SUMMARY TABLE FOR REGRESSION ANALYSIS OF NBME AND MCQ SCORES ON SCE SCORES . . . 103
4.9  SUMMARY TABLE FOR REGRESSION ANALYSIS OF EI AND MCQ SCORES ON SCE SCORES . . . 103
4.10 SUMMARY OF REGRESSION ANALYSIS OF PI AND MCQ SCORES ON SCE SCORES . . . 104
4.11 SUMMARY TABLE OF REGRESSION ANALYSIS FORCING INITIAL ENTRY OF NBME SCORES . . .
. . 104
4.12 SUMMARY OF REGRESSION ANALYSIS FORCING INITIAL ENTRY OF EI SCORES . . . 105
SUMMARY OF REGRESSION ANALYSIS FORCING INITIAL ENTRY OF PI SCORES . . . 105
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES . . . 109
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF PMP SCORES AND MCQ SCORES . . . 109
GROUP DISCRIMINATION BY EI SCORES AND MCQ SCORES . . . 110
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF EI SCORES AND MCQ SCORES . . . 110
GROUP DISCRIMINATION BY PI SCORES AND MCQ SCORES . . . 111
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF PI SCORES AND MCQ SCORES . . . 111
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO PASSED PART I . . . 113
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO PASSED PART I . . . 113
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO FAILED PART I . . . 114
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO FAILED PART I . . . 114
CLASSIFICATION ANALYSIS USING MCQ SCORES AS DISCRIMINANT FUNCTION . . . 115
CORRELATIONS OF PMP FRAME SCORES WITH SCE SCORES FOR NBME, EI AND PI SCORES . . . 117
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED NBME FRAME SCORE CORRELATIONS WITH SCE SCORES . . . 118
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED EI FRAME SCORE CORRELATIONS WITH SCE SCORES . . . 119
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED PI FRAME SCORE CORRELATIONS WITH SCE SCORES . . . 120
CORRELATION OF NBME FRAME SCORES WITH MCQ AND SCE SCORES . . . 121
CORRELATION OF EI FRAME SCORES WITH MCQ AND SCE SCORES . . .
. . 124
CORRELATION OF PI FRAME SCORES WITH MCQ AND SCE SCORES . . . 125
SUMMARY TABLE FOR REGRESSION ANALYSIS OF NBME FRAME SCORES ON SCE SCORES . . . 127
SUMMARY TABLE FOR REGRESSION ANALYSIS OF EI FRAME SCORES ON SCE SCORES . . . 128
SUMMARY TABLE FOR REGRESSION ANALYSIS OF PI FRAME SCORES ON SCE SCORES . . . 129
CALCULATED WEIGHTS FOR PMP FRAME SCORES . . . 130
CORRELATIONS OF WEIGHTED PMP SCORES WITH MCQ SCORES AND SCE SCORES . . . 132

LIST OF FIGURES

FIGURE                                                           PAGE
Distribution of MCQ Scores . . . 84
Distribution of NBME Scores . . . 85
Distribution of EI Scores . . . 86
Distribution of PI Scores . . . 87
Distribution of SCE Scores (Total) . . . 88
Distribution of SCE Scores (May) . . . 89
Distribution of SCE Scores (July) . . . 90

CHAPTER I

THE PROBLEM

Introduction

During the past two decades there have been two important developments related to testing medical competence that affect the licensing and certification of physicians in this country. One development has come from the medical education and testing community, the other receiving most of its impetus from outside the medical community, from the public and governmental sectors. The internal development was the introduction of Patient Management Problems (PMP's) as an objective method of evaluating physician competence. PMP's are a form of paper and pencil simulation of the patient-physician encounter in which the examinee is presented with a series of data gathering and treatment options, from which he must choose the proper path to diagnosis and treatment of the patient. Studies conducted by the National Board of Medical Examiners (NBME) and others (Hubbard et al., 1965; McGuire, 1966) showed that the time-honored bedside oral examination was not a very accurate or reliable means of assessing physician competence.
Beginning in 1961 the PMP rapidly replaced the oral as a major component of licensing examinations in the hope that it would prove to be a more reliable and valid test of a physician's problem-solving skills (Hubbard et al., 1965; Williamson, 1965). Soon afterward, several medical specialty boards incorporated PMP's as a component of their specialty certification examinations, either in conjunction with, or as a replacement for, the bedside oral examination. Within a relatively short time, PMP's have become a major part of the licensing and certification process.

The external development affecting licensing and certification began to be felt during the late 60's and became increasingly important throughout the 70's. Spurred by rising medical costs, Medicare-Medicaid, a perceived shortage of physicians, increasing monetary support from the federal government for medical schools, and several other factors, there came an increasing demand from the government and the consumer for greater accountability from medical schools and licensing authorities (Abrahamson, 1976; Rakel, 1979; Senior, 1976). The public wanted better assurances that licensing and certification of a physician truly indicated a more competent practitioner. At the very least they wanted assurance that a licensed physician was not incompetent, that he or she at least met minimal competence criteria for providing health care. In addition, there has been increasing pressure for recertification and relicensure, to assure continued competence once a physician has completed formal education. The past few years have seen an increased questioning of a common assumption upon which licensing and certification have been based.
An individual who has received a medical degree from an accredited school, has completed a minimum amount of graduate medical education, and has successfully negotiated the hurdle imposed by a licensing examination is therefore assumed to be truly capable of independently providing at least minimally competent health care to the public. Unfortunately, because of difficulties in establishing valid criterion measures of clinical performance, there is little direct evidence that licensing or certification examinations predict what a physician actually does in practice (Abrahamson, 1976; Williamson, 1976). This lack of an explicit link between test performance and practice performance is generally true for all test formats employed in licensing and certification examinations, including PMP's. Nonetheless, PMP's have been felt to be a more valid means of discriminating the competent from the less competent physician than, for example, multiple-choice questions about relevant medical knowledge.

This study will examine four observable consequences of basic assumptions concerning the validity of PMP's. These assumptions, which are becoming increasingly questioned in the literature, form the basis for the use of PMP's in licensing and certification examinations. This study will use data from a national specialty certification examination in Emergency Medicine. In addition, the scoring procedure commonly used for PMP's in licensing examinations will be examined and compared with alternative scoring procedures in an attempt to improve the discrimination and predictive ability of PMP's.

History of Licensing and Certification Exams in Medicine

During the 18th century the education of a physician and licensing in this country were one and the same, embodied in the apprentice system. A young man wishing to become a physician bound himself to a doctor for a period usually exceeding five years.
At the successful completion of his apprenticeship he was given a document signed by his preceptor, outlining the course of his training, which served as both his diploma and his license to practice (Miller, 1976). At about the time this country began moving toward independence, the colonial assemblies, and later the state legislatures, began to see a need to protect the public from the quacks and charlatans who were becoming more numerous as the demand for medical practitioners rose. Accordingly, various means of testing and licensing physicians were set up, in most cases being delegated to the various state medical societies. As medical schools came on the scene, they also became involved in the licensing process. In a move to raise the standards of medical education and licensing, the American Medical Association was founded in 1847.

As a result of each state having its own tests and licensing requirements, it became very difficult for a physician to move his practice from one state to another. The situation remained this way until shortly after the beginning of this century, when a strong movement to standardize licensure requirements resulted in 1915 in the formation of a voluntary, independent examining body, the National Board of Medical Examiners (Hubbard, 1978). The NBME evolved a three-part examination which eventually became accepted by nearly all state licensing boards for purposes of licensure. The NBME exam originally consisted of essay tests in traditional medical sciences (Part I) and in clinical sciences (Part II). An oral "practical" examination comprised Part III, and was taken after a student had finished medical school and had completed at least one year of an internship. The first two parts were taken during progress through medical school.
A physician who had passed all three parts of the NBME exam was considered competent to independently practice medicine, and could do so without further testing for licensure in those states accepting the NBME exam. In the early 1950's, a major change in the NBME exam was introduced. A three-year study, in cooperation with the Educational Testing Service, showed that objectively scored multiple-choice questions could provide a much more comprehensive test which proved to be superior to subjectively scored essay tests. Therefore, the essay tests in Parts I and II were replaced by multiple-choice tests developed by national committees of subject matter experts (Hubbard, 1978).

The most recent major change in the NBME exam occurred in 1961 with the introduction of Patient Management Problems in the Part III exam. At first PMP's were to serve as an objective supplement to the traditional bedside oral examination, but it soon became evident that attempts to improve the reliability of the orals were not successful, and it was becoming increasingly difficult to arrange sufficient examination opportunities for a growing number of candidates. After 1963 the oral exam was dropped, and since then Part III has been composed of multiple-choice questions and a series of PMP's. Though still valuable as a teaching tool, the ancient and revered tradition of the bedside oral as a formal examination method was finally succumbing to increasing evidence of psychometric inadequacies and rapidly expanding difficulties and costs of administration on a national scale.

In a further attempt to standardize state licensing requirements, in 1967 the Federation of State Medical Boards began working with the NBME to develop a licensing exam, based on NBME materials and administered by the NBME, which would be acceptable to all states as evidence of readiness to provide unsupervised health care to the public.
This was called the Federation Licensing Examination, or FLEX, and was composed of three parts, parallel to the NBME exam. By 1973 it had been accepted by all but two state licensing authorities (NBME, 1973). FLEX became the primary means of licensure for a growing number of graduates of foreign medical schools wishing to practice in this country. Since they had not graduated from an accredited U.S. medical school, these individuals were not eligible to take the NBME exam.

Concurrent with the development of the NBME and licensing examinations, there has been development of a variety of medical specialty boards. Beginning with the American Board of Ophthalmology in 1917, the number of recognized specialty boards has risen to twenty-three, with the American Board of Emergency Medicine being the most recent, approved in 1979. To become a board-certified specialist a physician must complete an approved residency training program which is usually at least three to five years in length, practice in the specialty for one or more years, and pass an examination administered by the specialty board. Originally these certification examinations were composed of essay exams and bedside oral exams. The first board to use multiple-choice questions was the American Board of Internal Medicine in 1946, but soon most other boards had incorporated multiple-choice exams, influenced by the success of the ABIM and the NBME. In 1961 the NBME for the first time began directly helping specialty boards develop their examinations, and several boards began trying PMP's. Today most boards employ some combination of objective written exams and oral exams, while a few use only the written, and fewer still retain only the oral.

From this brief overview of the history of licensing and certification examinations, it is evident that PMP's have relatively recently come to play an increasingly important role in evaluating the competence of a physician.
First introduced as an objectively scored replacement for the bedside oral in the NBME Part III, they also soon became a major component of the FLEX exam and several specialty certification exams. As an important part of the examinations used in the two major paths to state licensure (NBME and FLEX), and in several specialty certification examinations, nearly every new physician beginning practice in this country for the past fifteen years has been confronted with PMP's at a critical juncture affecting his or her medical career, and in turn affecting the health care of the public. In spite of a lack of any evidence for criterion-related validity, because of their inherent "face validity" (the subjective opinion of experts that it appears to be a valid measure) and their reliability in comparison with the oral exams they replaced, PMP's have been generally considered a legitimate means of testing a physician's problem-solving skills.

The development of Patient Management Problems

Patient Management Problems had their beginnings in the Tab-Test Technic developed by the U.S. Army (Williamson, 1965) and the "Test of Diagnostic Skills" developed by Rimoldi (1955; 1961). These early forerunners of the PMP presented a student with a problem statement and a series of cards with questions which the student might ask on one side and answers on the other. By selecting the card corresponding to the question he or she wished to ask, and reading the answer or result on the back, the student proceeded through the problem to a solution. The PMP in its present form was developed in the late 50's and early 60's by the NBME (Hubbard et al., 1965) and by McGuire and her associates at the University of Illinois (McGuire, 1963; Williamson, 1965; McGuire and Babbott, 1967). In the last twenty years it has become a very popular teaching as well as testing technique.
The PMP is essentially a paper and pencil simulation of the patient-physician encounter during which the student or examinee must collect history, physical and laboratory data in order to arrive at a diagnosis and treatment plan for the patient. The Appendix contains an example of a PMP. As can be seen, the problem begins with a brief statement outlining the presenting complaint or problem. In this particular example the examinee is then presented with a series of options concerning the history, from which he must select only those he would ask for in the real situation. The answers to the questions, in the right column, are not visible because they are printed using a latent image process. To gain the information from a particular question, the examinee "develops" the answer by wiping over the answer space with a special latent image pen, and the words become visible. In earlier versions of PMP's, the answers were covered by an opaque layer which was erased using a pencil eraser, exposing the printed answer. Using this method, an explicit record is kept of every choice the examinee makes throughout the problem, which is then used for scoring. Obviously, an examinee cannot change his mind and "cancel" a bad choice once the answer has been exposed. In this example the examinee then moves on to collect physical
Frequently encountered as cases and problems in the emergency department. A total of 136 Pictorial Multiple—Choice items were also developed. This item type presented a visual stimulus, such as an EKG rhythm strip, a color photograph of a patient, or an X—ray, and one or more multiple-choice items based on the visual stimulus. These were also of the single best answer type with four or five options. Criteria for selecting visuals and item content for this format included using visuals which (Maatsch et a1., 1976): 1. Test general interpretive skills. 2. Typically require immediate interpretation and use in an emergency department. 3. Present conditions that knowledgeable candidates can clearly see and interpret. A series of condition worksheets covering what were felt to be the most important areas were selected to be further developed into scenarios to serve as the basis for the two types of simulations to be used in the EMSCE, the Patient Management Problems and the Simulated Clinical Encounters (SCE's). The SCE's are highly structured oral examinations based on the patient games described by Maatsch (1974), in which a well-trained examiner presents standardized information about-a patient to a candidate, who must elicit further information to diagnose and manage the patient. Eight Simulated Patient Encounters, requiring the candidate to manage a single patient, were developed by teams from OMERAD and ACEP, along with four Simulated : 69 Situation Encounters, requiring the candidate to manage three patients concurrently, for a total of 12 SCE's involving 20 patients. Ten scenarios were selected for development as PMP's, under the direction of Dr. Sarah Sprafka, from OMERAD, working with ACEP content experts. 
All scenarios selected involved patients with only one problem, the cases were basically linear in nature, the necessary information could easily be presented in written form or with still visual stimuli (EKG tracings or X-ray prints), and completing the problem would not involve more than three or four basic steps (e.g., initial evaluation, data gathering, diagnosis, and preliminary management). All of the latent image PMP's developed from these scenarios were linear, relatively short (easily completed within 20 minutes), and composed of sections in the following order: introduction, history, initial diagnostic hypotheses, physical examination, provisional diagnosis, laboratory, final diagnosis, and management. The sections on initial diagnostic hypotheses and provisional hypotheses are normally not used in PMP's, but were included in these PMP's to conform with the theory of medical problem solving developed during the Medical Inquiry Project by Elstein et al. (1978). A five-point scale (-2 to +2) was used to assign weights to each item in the PMP's, based on the judgement of the scenario author and at least one other ACEP content expert. Further details on the development of these PMP's can be found in Maatsch and Elstein (1979).

Field Test of Test Items

After development and extensive review by OMERAD test developers and ACEP content experts, the entire test library was field tested on October 22-26, 1977, in Lansing, Michigan, as part of a grant from the National Center for Health Services Research (HS 02038) entitled "Model for a Criterion-Referenced Specialty Test." The test library was administered to 94 subjects, consisting of 22 fourth year medical students, 36 second year residents in emergency medicine, and 36 practicing emergency physicians nominated by their peers for their expertise in emergency medicine, fourteen of whom were practice eligible and 22 of whom were residency eligible.
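The five-point item weighting just described implies a straightforward scoring computation once an examinee's selections are recorded. The sketch below is illustrative only: the item names are invented, and the proficiency and efficiency definitions used here are assumptions for demonstration, not the study's actual Table 1.1 formulas.

```python
# Illustrative scoring of one latent-image PMP. The item names and the
# proficiency/efficiency definitions below are assumptions for demonstration
# only; the study's actual index formulas are defined in its Table 1.1.

def score_pmp(weights, selected):
    """weights: item -> weight on the -2..+2 scale; selected: items the examinee chose."""
    raw = sum(weights[item] for item in selected)
    positive_available = sum(w for w in weights.values() if w > 0)
    # Assumed "proficiency": share of the available positively weighted credit earned.
    proficiency = sum(weights[i] for i in selected if weights[i] > 0) / positive_available
    # Assumed "efficiency": fraction of the examinee's choices that carried positive weight.
    efficiency = sum(1 for i in selected if weights[i] > 0) / len(selected)
    return raw, proficiency, efficiency

weights = {"ask_chest_pain": 2, "order_ekg": 2, "ask_allergies": 1,
           "order_skull_film": -1, "delay_treatment": -2}
raw, proficiency, efficiency = score_pmp(
    weights, {"ask_chest_pain", "order_ekg", "order_skull_film"})
# raw score 3, proficiency 0.8, efficiency 2/3 for this pattern of choices
```

Because every developed answer is recorded irrevocably, scores of this kind can be computed exactly from the answer sheet after the fact.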
The results of this field test were used to revise or eliminate items from the test library, and to select items for use in the first administration of the EMSCE by the American Board of Emergency Medicine, after their formal recognition as a new medical specialty, which took place in September, 1979. The field test results were also used to set the passing scores for each part of the examination. Using an eight point scale for the rating of the SCE's, the Board set a score of 5 as indicating minimally acceptable practice of emergency medicine, and set a passing level of a 5.75 average across all cases (or 5.0 or above on all cases) as the criteria for certification. The field test data showed that this corresponded to a score of 75% on the Part I objective portion of the EMSCE, and the Board set this as the minimal score needed to pass Part I and go on to take Part II. Further details of setting cut scores for this criterion-referenced examination can be found in Maatsch and Elstein (1979) and Maatsch (1980).

Design

Subjects

The subjects used in this study were the candidates who sat for the first administration of Part I of the Emergency Medicine Specialty Certification Examination, administered by the American Board of Emergency Medicine. The EMSCE is given in two parts. Part I is composed of the objective formats (MCQ and PMP), and Part II consists of the Simulated Clinical Encounters. Part I must be passed before taking Part II of the examination. Part I was administered on February 20, 1980, at three sites (Cherry Hill, New Jersey; Chicago, Illinois; and Los Angeles, California) to a total of 616 candidates. This group consisted of 136 candidates who had completed an approved residency program in emergency medicine (residency-eligible), and 480 candidates who had been practicing emergency medicine full-time for a minimum of five years (practice-eligible).
The practice-eligible group contained many second career physicians, some of whom have received board certification in other specialties, and foreign trained physicians who are now licensed and practicing in the U.S. Part II was administered on two different occasions in Chicago, with 182 of the 387 candidates who passed Part I taking Part II during the week of May 19, 1980, and 188 being examined during the week of July 21, 1980. The remaining 17 did not take Part II during the first two scheduled administrations. From the original pool of 616 candidates, 107 were eliminated from the analyses of this study because they did not finish the PMP portion of the examination or, as was most frequently the case, because they did not completely follow the instructions for the PMP's and chose more items in one or more sections than permitted by the directions for those sections. Therefore, there remained a total of 509 subjects for analyses involving the PMP's and the MCQ's. Of the 182 who sat for the initial administration of Part II, 25 were among those eliminated, leaving a total of 157 subjects from the May session for analyses involving PMP's and SCE's. Of the 188 who sat for the July administration of Part II, 30 were among those eliminated, leaving 158 subjects from this session and a total of 315 subjects who took the SCE's.

MCQ format

The multiple-choice portion of the examination was composed of 194 standard multiple-choice items selected from the field-tested library, presented in three booklets of approximately equal length. The same schedule of presentation and standard instructions were used at all three sites for this and all other parts of the examination. The 86 pictorial multiple-choice items selected for use were split evenly into two booklets, with each candidate receiving only one of the booklets in random fashion.
(The candidates were actually presented with 197 standard MCQ items, but three were deleted after item analysis and content review by the test committee. Each candidate also saw 49 pictorial MCQ items, but six atypically easy or difficult items were deleted from each booklet in a retrospective process of balancing the difficulty of each booklet.) Thus, each candidate was scored on 194 MCQ items and 43 pictorial MCQ's, for a total of 237 objective items. The standard reliability indices (KR-20 and Cronbach's Alpha) for this portion of the examination were found to be 0.91 and 0.89 for the two forms (using the two different pictorial MCQ booklets).

PMP format

Based on data from the field test, the American Board of Emergency Medicine decided to eliminate the use of PMP's for purposes of certification, but did allow them to be used during the first administration of the EMSCE, for research purposes. The candidates were not aware of this decision by the Board, and took the PMP's assuming they were being used for certification. Five of the most discriminating PMP's, based on field test results, were selected for use out of the original 10 PMP's. The ABEM came to the conclusion that PMP's would not be used for certification after reviewing data showing that the PMP's were not as effective in predicting SCE (Part II) scores as were the carefully developed clinically relevant MCQ items. This result from the field test is not particularly surprising in light of the information reviewed in the previous chapter. From the field test data, reliability (Cronbach's Alpha) of the PMP's was found to be 0.77, using Proficiency Index scores and each PMP score as an item.

SCE format

Part II of the EMSCE consisted of five single patient Simulated Patient Encounters (SPE) and two Simulated Situation Encounters (SSE), each requiring concurrent management of three patients.
A single examiner rated each candidate on each SCE, with no examiner seeing a particular candidate more than once. The examiners received extensive training in scoring the SCE's and went through a calibration process the day prior to the beginning of Part II of the examination. (Most of the examiners for Part II had been examiners for the field test, and were experienced in administering SCE's.) Each candidate was rated, using an eight point scale, on seven aspects of competence for each SPE and eight aspects for each SSE. These aspects were:

1. Data acquisition - Completeness (appropriateness) and efficiency of data gathering. Did the candidate collect the appropriate data required to correctly diagnose and manage the patient?

2. Problem solving - Appropriateness of the organization of data collection activities in relation to management decisions. Did collected data help select among reasonable alternative diagnoses while insuring patient stabilization? Did the candidate efficiently arrive at an informed and appropriate management plan?

3. Patient management - Did the candidate treat or direct the appropriate treatments throughout the encounter, including proper referral at a proper time? Was the patient properly attended when directing attention to other patients?

4. Health care provided (outcome) - The candidate's overall performance as viewed from the patient's perspective. By current medical standards, was the patient's condition stabilized and maximally improved by the medical interventions provided?

5. Doctor-patient relations - Demonstrated concern and skill in dealing with the patient's psychological state. What is the examiner's best estimate of the sensitivity and skill level of the candidate in relating to the psychiatric, psychological and sociological (family) aspects of patient care?
6. Comprehension of pathophysiology - Does the candidate understand the scientific basis for his/her actions or is he/she simply relying on memorized routine procedures usually followed in such cases? The examiner had the option of asking standardized questions at the completion of the simulation to assist in rating the candidate on this aspect.

7. Clinical competence (overall) - Overall assessment of the demonstrated competence of the candidate to provide emergency health care in the specific class of conditions contained in the simulation. The level of combined cognitive and procedural skills employed by the candidate in providing health care in this setting. All things considered, how good was the candidate in handling these types of conditions or problems?

8. Resource utilization (SSE's only) - Evaluation of the capability of the candidate to effectively utilize himself and other supporting personnel and resources under the stress of managing a number of patients concurrently.

The score for each problem was calculated by determining the mean of the seven (or eight) ratings given by the examiner for that problem. Examiners were also periodically scheduled during break periods to sit in as observers and independently score SCE's being administered by other examiners. These "verifier" scores were not used for certification purposes, but did serve as a means of quality control of administration and for determining inter-rater reliability. The inter-rater reliability of these examiner-verifier pairs of scores on a single candidate across all problems was 0.81. This reliability is much higher than usually seen for an oral examination, and can probably be attributed to the careful construction of the SCE's, and particularly to the careful training and calibration of the rating standards of the examiners.
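The inter-rater reliability figure just reported is a correlation between paired examiner and verifier scores. A minimal sketch of that kind of computation, using a hand-rolled Pearson product-moment correlation on hypothetical 8-point-scale rating means (not the study's data):

```python
# Pearson product-moment correlation between paired examiner and verifier
# ratings, as a sketch of the inter-rater reliability computation described
# above. The seven paired 8-point-scale means below are hypothetical.

import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

examiner = [5.9, 6.4, 4.8, 7.1, 5.2, 6.0, 6.8]
verifier = [6.1, 6.2, 5.0, 6.9, 5.5, 5.8, 6.7]
r = pearson_r(examiner, verifier)
```

A coefficient near the reported 0.81 indicates that an independent observer's ratings track the primary examiner's closely across problems.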
Generalizability of results

While it might be safest to say that the results of this study will generalize to all emergency physicians practicing in the U.S., the real areas of concern for this study lie in the widespread use of PMP's as supposedly valid tests of physician competence in many medical specialty and licensing examinations, and the accuracy of some of the assumptions underlying their use. As has been previously discussed, PMP's are currently used in many educational and testing situations. The PMP's used in this study were carefully developed in a manner similar to that commonly used in such situations, and the scoring formulae used are those in common use in many licensing and certification applications. In testing situations, particularly in licensing and certification, the use of PMP's is based on some very fundamental assumptions outlined in Chapter I; i.e., PMP's are a valid measure of clinical performance and are predictive of how an examinee will perform in a real clinical situation. The results of this study, testing observable consequences of those assumptions, can therefore be generalized to all situations where PMP's are used in a testing mode and where those assumptions form the basis for their use; i.e., making decisions about the competence of an examinee, as opposed to a teaching or practice/feedback mode. This specifically includes the use of PMP's in licensing and certification examinations of physicians, and could also include the use of PMP's as an evaluation tool in undergraduate medical education (for instance as an element in making promotional decisions) or in licensing and certification examinations for allied health personnel such as nurses or physician assistants.

Questions Summarizing the Logic Underlying the Testable Hypotheses

The basic question underlying this study can be stated as follows: As presently scored (using the NBME scoring method), are PMP's a useful and valid method of evaluating clinical competence?
An alternative way of asking this question might be: Are PMP's a valid substitute for MCQ batteries, or other more complex and expensive methods of measuring clinical competence such as SCE's or other oral examination methods? In Chapter I, four observable consequences of the assumptions concerning the use of PMP's in licensing and certification examinations were presented. The basic question being asked in this study can be related to these four observable consequences by the following specific questions:

1. Are PMP's better at predicting SCE scores obtained three and five months later than they are at predicting MCQ scores?

2. Do PMP's add anything to the information gained through MCQ's in predicting SCE scores (i.e., do they account for significant additional variance in examiner ratings of performance)?

3. Do PMP scores add anything to the ability to discriminate between residency- and practice-eligible candidates beyond that provided by MCQ scores?

4. Are there specific PMP frames or combinations of frames (i.e., data acquisition frames or diagnosis and management frames) that better predict SCE scores? In other words, are there alternative scoring algorithms which involve weighting frames, rather than items, that will improve the ability of PMP's to predict examiner ratings of performance on SCE's, and improve discrimination of residency- and practice-eligible candidates?

Hypotheses and Analysis Methods

Based upon the assumptions which form the rationale for the use of PMP's in licensing and certification, and the specific questions derived from them, which are listed above, the following four hypotheses will be tested.

I. PMP scores correlate to a greater degree with SCE scores than with clinically relevant MCQ scores.

This hypothesis will be tested by correlating PMP scores (NBME scoring method, as well as Proficiency Index and Efficiency Index) with SCE scores, and PMP scores with MCQ scores, using a Pearson product-moment correlation.
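Hypothesis I compares two correlations computed on the same candidates and sharing a common variable, so an ordinary independent-samples comparison does not apply. A sketch of one standard dependent-correlations test, using Steiger's (1980) Z as an illustrative stand-in; the study itself cites Glass and Stanley (1970), whose statistic may differ in detail, and the input values below are invented:

```python
# Sketch of a Z-test for two dependent correlations sharing one variable
# (e.g., r(PMP, SCE) versus r(PMP, MCQ) on the same candidates). Steiger's
# (1980) formulation is used here as an illustrative stand-in; the statistic
# in Glass and Stanley (1970), which the study cites, may differ in detail.

import math

def steiger_z(r_jk, r_jh, r_kh, n):
    """Test H0: rho_jk == rho_jh, where j is the shared variable."""
    z_jk, z_jh = math.atanh(r_jk), math.atanh(r_jh)
    r_bar = (r_jk + r_jh) / 2.0
    # Pooled estimate of the covariance between the two correlations.
    psi = (r_kh * (1 - 2 * r_bar ** 2)
           - 0.5 * r_bar ** 2 * (1 - 2 * r_bar ** 2 - r_kh ** 2))
    s_bar = psi / (1 - r_bar ** 2) ** 2
    return (z_jk - z_jh) * math.sqrt((n - 3) / (2 - 2 * s_bar))

z = steiger_z(r_jk=0.50, r_jh=0.30, r_kh=0.20, n=100)  # illustrative values
```

A value exceeding the one-tail critical value of 1.65 at α=.05 would lead to rejecting the hypothesis of no difference between the two correlations.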
Z-tests of the significance of the differences will be calculated (Glass and Stanley, 1970).

II. PMP scores account for a portion of variance of SCE scores beyond that contributed by clinically relevant MCQ scores.

A stepwise multiple regression analysis will be performed, with an F-test for the significance of the addition of PMP to the MCQ B-weight. Multiple regression analyses will be done using the computer program Statistical Package for the Social Sciences (Nie et al., 1975).

III. PMP scores add to the ability of MCQ scores to discriminate between residency- and practice-eligible candidates.

A stepwise discriminant analysis will be performed, with an F-test for the significance of the addition of PMP to the MCQ B-weight. Discriminant analyses will be done using the computer program Statistical Package for the Social Sciences (Nie et al., 1975).

IV. All frame scores of PMP's correlate equally with SCE scores.

This hypothesis will be tested by Z-tests of significance of the differences between all possible pairwise correlations between all frame scores of PMP's and the SCE scores (Glass and Stanley, 1970). If differences are observed, a multiple regression analysis will identify which PMP frames contribute most to predicting SCE scores and which frames add little or nothing to this predictive relationship. New scoring algorithms will be developed if warranted by the data.

Summary

The first administration of the Emergency Medicine Specialty Certification Examination of the American Board of Emergency Medicine has provided data that will be used to test four observable consequences of assumptions concerning the use of Patient Management Problems in licensing and certification examinations as valid indicators of physician competence.
Scores from 509 candidates who took Part I (237 MCQ items and five PMP's) and from 315 of those candidates who passed Part I and took Part II (seven Simulated Clinical Encounters) will be used to test four hypotheses derived from the observable consequences of the assumptions upon which the use of PMP's is based. The overall goal of this study is to test the criterion-related validity of PMP's in a certification examination using performance on the SCE's as the criterion, and to explore methods of improving the validity of PMP's through changes in the scoring methods used with PMP's. The results of the analyses performed on these data will be presented in Chapter IV.

CHAPTER IV

RESULTS

Introduction

This chapter presents the results of statistical analyses of data gathered during the first administration of the American Board of Emergency Medicine certification examination, and their interpretation in relation to the four hypotheses of this study. Following a brief description of the test results from the first administration, tables are presented summarizing the correlations between the three PMP scores used in this study and the multiple-choice battery (MCQ) and the Simulated Clinical Encounters (SCE). The three PMP scores are the National Board of Medical Examiners Index (NBME), the Efficiency Index (EI), and the Proficiency Index (PI) as defined in Chapter I (Table 1.1). Correlations between the PMP scores and MCQ scores and SCE scores are then presented, along with the results of Z-tests of the significance of the difference between them. These results are interpreted in relation to Hypothesis I of this study. Then, following a brief description of regression analysis, the results of the analyses for Hypothesis II are presented. These consist of regression analyses using SCE scores as the dependent variable, and PMP and MCQ scores as the independent variables.
Next is a brief description of discriminant analysis, with analyses related to Hypothesis III. For this hypothesis discriminant analyses are performed using residency- and practice-eligible categories as the dependent variables, and PMP and MCQ scores as the independent or predictor variables. Hypothesis IV results are then presented, consisting of Z-tests of the significance of the difference between all pair-wise correlations between PMP frame scores and SCE scores. Finally, additional analyses are presented concerning attempts to improve the ability of PMP scores to predict SCE scores by developing a scoring scheme which weights individual frames of the PMP's.

Summary of Test Results

For Part I (MCQ's), with 509 candidates, the mean score was 76.6% (SD=7.9%). For the PMP's the mean NBME score was 41.8 (SD=2.5), the mean EI score was .836 (SD=.054), and the mean PI score was .740 (SD=.077). The mean score for the 315 candidates taking Part II (SCE's) was 5.89 (SD=.7) on an 8 point scale. For the 157 who took Part II in May, the mean was 6.01 (SD=.70), while for the July administration to 158 candidates the mean was 5.76 (SD=.68). These two means are significantly different (t=3.246), but this has no effect on the correlations of interest, so the two groups are combined for analysis. Figures 4.1-4.7 show the distributions for each of these sets of scores.

[Figures 4.1-4.7: Histograms of the distributions of MCQ, NBME, EI, PI, and SCE (total, May session, and July session) scores; figure bodies not recoverable from the source]
As can be seen in Figure 4.1, the distribution of the MCQ scores approximates a normal curve skewed in the negative direction, with the major portion of the candidates scoring above the 75% pass-fail cut point. This type of negatively skewed distribution of scores would be expected in a specialty certification examination where most of the examinees are more likely to be among the more competent practitioners. This type of distribution would also be expected in a criterion-referenced examination such as this. Figures 4.2 - 4.4 show the distributions for the NBME, EI and PI scores, respectively. The NBME and PI scores both show a very negatively skewed distribution, while the EI score distribution does not show such a pronounced negative skew.
Figures 4.5 - 4.7, for the total SCE distribution, the SCE distribution of the May candidates and the July candidates, respectively, show a nearly normal distribution with only a slight negative skew. The reliabilities for Part I and Part II during the first administration were reported by Maatsch (1980). The KR-20 for Part I (multiple-choice battery) was .90 (SEM=2.5%), while for Part II (Simulated Clinical Encounters) it was .57 (SEM=.46 of a rating point on an 8 point scale). The five PMP problems used in this study had a reliability of .67 (SEM=1.46), calculated using the formula for Cronbach's alpha with NBME scores for each of the five PMP's as items. The following results are based on the observed correlations among these three approaches to testing which were used in this first administration of the ABEM certification examination.

Correlation Summaries for PMP, MCQ and SCE Scores

Tables 4.1 - 4.3 summarize the Pearson product-moment correlations between the three PMP scores used in this study and the scores on the multiple-choice battery (MCQ) of Part I of the ABEM examination and the scores on the Simulated Clinical Encounters (SCE) of Part II. The three PMP scores are the NBME index (NBME), the Efficiency Index (EI) and the Proficiency Index (PI). These scores were defined in Table 1.1. The values used for each of these PMP scores are the average scores the candidate achieved across the five PMP's administered. The MCQ score is the percent correct on the Part I multiple-choice battery, and the SCE score is the average score achieved by the individual across all seven SCE's of Part II. These tables also illustrate the breakdown of the 509 candidates who were the subjects used in this study. The 509 candidates are first separated into two groups, those who passed and those who failed Part I (the MCQ battery) in February 1980. There were 331 who passed with a score of 75% or better, and 178 who failed.
Of the 331 who passed Part I, 315 went on to take Part II (the SCE's). The first session of Part II, in May 1980 was attended by 157 candidates and the July session by 158 candidates. Those who failed Part I did not take Part II. Tables 4.1, 4.2 and 4.3 present the summaries for NBME, PI and EI scores, respectively.

[Tables 4.1-4.3: Summaries of NBME, PI, and EI score correlations with MCQ and SCE scores, broken down by Part I pass/fail status and by May/July Part II sessions; table bodies not recoverable from the source]

As can be seen in these tables, the correlation between the MCQ battery and the SCE's is statistically significant and moderately high (r=0.43). The correlations of the PMP scores with the MCQ battery, while in some cases reaching statistical significance, generally are relatively low, particularly for the group who passed Part I. The correlations between the PMP scores and the SCE scores shown in these three tables generally are quite low. A comparison of Table 4.1 with 4.2 shows that the correlations using NBME scores and PI scores parallel each other closely.
This supports the results shown in Table 4.4, which are the correlations between the average scores on PMP's using NBME, EI and PI scoring. The correlation between NBME and PI is very high (r=0.97). On the other hand there is a very low negative or zero correlation of these two scores with the EI score. This relationship is essentially unchanged when separately considering those who passed Part I and those who failed Part I, as shown in Tables 4.5 and 4.6. Table 4.3 presents the EI score correlations with MCQ and SCE scores. Despite the fact that there is zero or low negative correlation with NBME and PI scores, the EI scores correlate with MCQ and SCE scores within the same low range (r ≈ .1) as the NBME and PI scores.

TABLE 4.4
CORRELATIONS BETWEEN NBME, EI AND PI AVERAGE SCORES
n=509

         NBME      EI        PI
NBME     1.000
EI      -0.068     1.000
PI       0.970a   -0.142a    1.000

a p <.05

TABLE 4.5
CORRELATIONS BETWEEN NBME, EI AND PI SCORES FOR CANDIDATES PASSING PART I
n=331

         NBME      EI        PI
NBME     1.000
EI      -0.188a    1.000
PI       0.968a   -0.293a    1.000

TABLE 4.6
CORRELATIONS BETWEEN NBME, EI AND PI SCORES FOR CANDIDATES FAILING PART I
n=178

         NBME      EI        PI
NBME     1.000
EI       0.000     1.000
PI       0.970a   -0.052     1.000

a p <.05

There also appears to be a difference in correlations between the group which took Part II in May and the group which took it in July. The NBME and PI correlations with MCQ and SCE scores appear to be higher in May than in July, while for the EI scores the relationship is reversed. These differences are more apparent than real when one considers the squares of the correlations (an estimate of the variance in MCQ or SCE scores accounted for by the PMP scores.) Even though some of the correlations are statistically significant, they are quite low and all are on the very low part of the curve which relates correlations of two variables with variance accounted for in one variable from knowledge of the other variable (variance accounted for = r²).
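The r-versus-r² relationship can be made concrete using correlations reported in this chapter:

```python
# Correlations reported in this chapter, squared to give the proportion of
# variance accounted for (values taken from the text and Table 4.7).

correlations = {"MCQ with SCE": 0.43, "NBME with MCQ": 0.141, "EI with SCE": 0.109}
variance_accounted = {name: r * r for name, r in correlations.items()}
# r = 0.43 explains roughly 18% of criterion variance; r near 0.1 explains 1-2%.
```

So even a "statistically significant" PMP correlation near 0.1 explains only one or two percent of score variance, while the MCQ-SCE correlation of 0.43 explains nearly twenty times as much.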
Therefore, there are relatively small differences in the variances accounted for in the May and July data sets.

Results Concerning Hypothesis I

Hypothesis I stated that, since PMP's are designed to measure problem-solving skills rather than knowledge, PMP scores will correlate to a greater degree with SCE scores than with clinically relevant MCQ scores. Table 4.7 presents the correlations of each of the three PMP scores used in this study with the MCQ score and the SCE score, and the calculated Z-test statistic for the difference of two dependent correlation coefficients as presented by Glass and Stanley (1970). These Z-test statistics test alternative hypotheses concerning the difference between two correlation coefficients, e.g., "no difference" versus "significant difference" between the two correlation coefficients. For a one-tail test with an α=.05, the critical value is 1.65. As Table 4.7 shows, none of the Z-test statistics calculated for the three PMP scores reaches this level. The decision rule for this test is to reject the hypothesis of no difference between the correlations if the Z-test statistic is greater than the critical value. Since none of the Z-test statistics reached the critical value, it must be concluded that there is no significant difference between the correlations of any of the PMP scores with the MCQ scores and with the SCE scores. Therefore, it may be concluded that these results do not support Hypothesis I, which stated that PMP scores will correlate to a greater degree with SCE than with MCQ scores.

TABLE 4.7
PMP SCORE CORRELATIONS WITH MCQ AND SCE SCORES AND Z-TESTS OF SIGNIFICANCE OF DIFFERENCE
n=315

         Correlations           Z-test
         MCQ       SCE          Calculated
NBME     0.141a    0.075        1.109
EI       0.067     0.109a       0.704
PI       0.124a    0.084        0.664

a p <.05

Results Concerning Hypothesis II

Hypothesis II stated that PMP scores will account for a portion of variance of SCE scores beyond that contributed by MCQ scores.
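The variance-increment question posed by Hypothesis II corresponds to an R²-change F-test in hierarchical regression. A sketch on synthetic data; the variable names and effect sizes below are invented for illustration, and the study itself used SPSS stepwise regression rather than this hand computation:

```python
# Hierarchical regression sketch: R-squared change and its F-test when a PMP
# score is added after an MCQ score. All data are synthetic; "mcq", "pmp" and
# the effect sizes are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 300
mcq = rng.normal(size=n)
pmp = rng.normal(size=n)
sce = 0.45 * mcq + 0.05 * pmp + rng.normal(size=n)  # PMP adds almost nothing

def r_squared(predictors, y):
    """OLS R-squared for y regressed on the given predictor arrays (with intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_reduced = r_squared([mcq], sce)    # MCQ alone
r2_full = r_squared([mcq, pmp], sce)  # MCQ plus PMP
q, p = 1, 2  # q predictors added, p predictors in the full model
f_change = ((r2_full - r2_reduced) / q) / ((1 - r2_full) / (n - p - 1))
```

When the added predictor carries little unique information, the R² change is near zero and the F ratio fails to reach significance, which is the pattern the analyses below report for the PMP scores.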
Regression analysis was used to analyze the data for this hypothesis, and a brief outline of this technique precedes the presentation of the results. Multiple regression is a statistical technique for analyzing relationships between a dependent or criterion variable (in this study it is the SCE scores) and one or more independent or predictor variables (in this case, the PMP and MCQ scores.) The general form of the regression equation is

Y' = A + B1X1 + B2X2 + ... + BkXk

where Y' is the estimated value of Y (the criterion variable), A is a constant equal to the Y intercept, and the Bi are regression coefficients for the Xi predictor variables. The values for the A and Bi coefficients are selected so that the sum of squared residuals Σ(Y-Y')² is minimized. Besides generating equations to estimate values for the criterion variable from the values of predictor variables, it is also possible to determine the amount of contribution of the predictor variables and their relative importance to the criterion; i.e., the amount of variance of the criterion variable explained by each predictor variable. This latter use of regression analysis is the primary focus for testing Hypothesis II. Several statistics found in the summary tables are important in interpreting a regression analysis, and these are briefly explained below:

1. F to enter - a computed F ratio to determine if the value of Bi for the predictor variable entered at that stage of the analysis is significantly different from zero.

2. Significance - the significance level of the above F ratio. If the F ratio is not significant (in this study α=.05), the Bi for the predictor variable is not essentially different from zero and the predictor is not contributing anything to estimating the value of the criterion variable.

3. Multiple R - this figure gives the relative strength and the direction (positive or negative)
4. R² — indicates the proportion of variation in the criterion explained by all the predictors entered into the equation at that point.

5. R² Change — indicates the proportion of variation in the criterion attributable to the predictor variable entered in that step of the analysis.

The summaries of the results of the regression analyses using SCE scores as the criterion variable and PMP scores and MCQ scores as the predictor variables are presented in Tables 4.8 - 4.10. In each case the computer program selects first the variable which contributes the most to the prediction of the criterion. In all three cases the MCQ score is entered before the PMP score, indicating the MCQ score contributes more to the prediction of the SCE score. The significance of the F to enter for all three PMP scores fails to reach significant levels. As a confirmatory analysis to further explore the relative contribution of PMP scores to the prediction of SCE scores, the regression analyses were repeated, forcing the entry of the PMP scores before the MCQ scores. The results of these analyses are presented in Tables 4.11 - 4.13. Again, in all three cases the F-ratio to enter the PMP scores fails to reach significant levels.

[Tables 4.8 - 4.13 (summary tables for the regression analyses of NBME, EI, and PI scores and MCQ scores on SCE scores, including forced initial entry of the PMP scores) are not legible in this copy.]
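The R² Change logic underlying these analyses can be sketched with synthetic data (not the study's data); the 0.44 coefficient below is an arbitrary illustrative choice that gives MCQ alone an R² in the general vicinity of the ~19% reported, while the simulated PMP score is unrelated to the criterion:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
mcq = rng.normal(size=315)
pmp = rng.normal(size=315)               # unrelated to the criterion
sce = 0.44 * mcq + rng.normal(size=315)  # criterion driven by MCQ only

r2_step1 = r_squared(mcq[:, None], sce)                 # MCQ entered first
r2_step2 = r_squared(np.column_stack([mcq, pmp]), sce)  # then PMP added
r2_change = r2_step2 - r2_step1                         # near zero here
```

Because the models are nested, R² can only increase when a predictor is added; the question the analysis asks is whether the increase is large enough to matter.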
Examination of the change in R² in Tables 4.8 - 4.10 shows that MCQ scores account for about 19% of the variance in SCE scores, while the PMP scores account for less than 1%, and as little as 0.02% in the case of NBME scores. Tables 4.11 - 4.13 show similar results. From these results it can be concluded that PMP scores do not make a significant contribution to predicting SCE scores, and, therefore, PMP scores do not account for a portion of variance of SCE scores beyond that contributed by MCQ scores, as stated in Hypothesis II.

Results Concerning Hypothesis III

According to Hypothesis III, PMP scores will add to the ability of MCQ scores to discriminate between residency- and practice-eligible candidates. Discriminant function analysis was used to analyze these data. Before presenting the results, a brief description of this technique is given. Discriminant analysis is a multivariate statistical technique in which one or more linear equations are developed using a series of "discriminating" variables to statistically distinguish between two or more groups in a sample population. The equations are of the form

Di = di1Z1 + di2Z2 + ... + dipZp

where Di is the score on the discriminant function i, the di's are the derived weighting coefficients, and the Z's are the standardized values of the p discriminating variables chosen for use in the analysis.
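A discriminant score of this form is simply a weighted sum of standardized variables; a minimal sketch follows, where the coefficients are hypothetical placeholders (in practice they are derived by the analysis itself):

```python
def discriminant_score(coeffs, values, means, sds):
    """D = d1*Z1 + ... + dp*Zp, where Zi = (Xi - mean_i) / sd_i."""
    zs = [(v - m) / s for v, m, s in zip(values, means, sds)]
    return sum(d * z for d, z in zip(coeffs, zs))

# Hypothetical coefficients for two discriminating variables (PMP, MCQ).
# A candidate exactly at the sample means scores D = 0 by construction.
d = discriminant_score([0.1, 0.9], [0.80, 0.751], [0.80, 0.751], [0.05, 0.08])
```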
Ideally, variables are chosen in such a manner that their values are relatively high for members of one of the groups to be distinguished, and relatively low for the other group(s). The resulting equation(s) will then provide discriminant scores (D's) which will cluster around a common value for a particular group. The equation(s) are derived in such a way that these group values are maximally separated. The maximum number of equations derived is limited to one less than the number of groups, or equal to the number of discriminating variables, whichever is less. In this study there were two groups to be discriminated (residency-eligible and practice-eligible physicians) using two discriminating variables (PMP scores and MCQ scores), so only one equation was derived. The equations produced by this technique can be used for two purposes: classification and analysis. Classification entails using the equations to decide to which group new cases belong. The focus of this study is using the discriminant equations for analysis, specifically to look at the contribution of each variable to the ability to discriminate between residency- and practice-eligible physicians. There are two statistics which are important in interpreting the following discriminant analyses. The first is the F-ratio, used to determine whether or not the values of the di's are significantly different from zero (in the case of "F to enter"), or to determine whether or not the Wilks' lambda is significantly different from 1.0 (labelled "F1,n-2" in the following tables). The other statistic is Wilks' lambda, which ranges between zero and one, and is an inverse measure of the discriminating power of the variables being analyzed. The lower the value of lambda, the better the discriminating ability of the variable being considered.
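For a single variable and two groups, Wilks' lambda reduces to the within-group sum of squares divided by the total sum of squares, and the associated F statistic has 1 and n-2 degrees of freedom; a sketch of this special case:

```python
import numpy as np

def wilks_lambda_two_groups(x_a, x_b):
    """Wilks' lambda and F(1, n-2) for one variable and two groups."""
    pooled = np.concatenate([x_a, x_b])
    ss_total = ((pooled - pooled.mean()) ** 2).sum()
    ss_within = ((x_a - x_a.mean()) ** 2).sum() + ((x_b - x_b.mean()) ** 2).sum()
    lam = ss_within / ss_total
    f = (1.0 - lam) / lam * (len(pooled) - 2)
    return lam, f

# Well-separated groups give lambda near 0 (strong discrimination);
# identical groups give lambda of 1 (no discrimination) and F of 0.
lam, f = wilks_lambda_two_groups(np.array([1.0, 2.0, 3.0]),
                                 np.array([11.0, 12.0, 13.0]))
```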
Summaries of the results of discriminant analyses using residency-eligible and practice-eligible physicians as the groups to be discriminated, and PMP scores and MCQ scores as the discriminating variables, are presented in Tables 4.14 - 4.19. The discriminating variables were entered into a stepwise discriminant analysis using a selection criterion that minimizes Wilks' lambda. In this procedure the variables are entered one at a time until exhausted or until the change in Wilks' lambda is insignificant, indicating the remaining variables are contributing little or nothing to the discrimination of the groups. Table 4.14 shows the raw score means and standard deviations for each of the criterion groups, as well as the Wilks' lambda and its associated F-ratio for NBME scores and MCQ scores. As can be seen, the Wilks' lambda values for both variables indicate relatively little discriminating power for either NBME or MCQ scores.

TABLE 4.14
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES
n=509

       Practice-eligible     Residency-eligible     Wilks'
       Mean      S.D.        Mean      S.D.         Lambda     F1,507
NBME   41.7      2.65        42.3      2.10         .9918      4.181a
MCQ    .751      .080        .818      .045         .8740      73.11a

a p<.05

TABLE 4.15
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF PMP SCORES AND MCQ SCORES
n=509

Step   Variable Entered   Variable Remaining   Wilks' Lambda   F To Enter
1      MCQ                                     .8740           73.11a
                          NBME                 .8739           .0614
2      (F level insufficient for further computation)

a p<.05

TABLE 4.16
GROUP DISCRIMINATION BY EI SCORES AND MCQ SCORES
n=509

       Practice-eligible     Residency-eligible     Wilks'
       Mean      S.D.        Mean      S.D.         Lambda     F1,507
EI     .833      .056        .846      .047         .9894      5.429a
MCQ    .751      .080        .818      .045         .8740      73.11a

a p<.05

TABLE 4.17
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF EI SCORES AND MCQ SCORES
n=509

Step   Variable Entered   Variable Remaining   Wilks' Lambda   F To Enter
1      MCQ                                     .8740           73.11a
                          EI                   .8737           0.1471
2      (F level insufficient for further computation)

a p<.05

TABLE 4.18
GROUP DISCRIMINATION BY PI SCORES AND MCQ SCORES
n=509

       Practice-eligible     Residency-eligible     Wilks'
       Mean      S.D.        Mean      S.D.         Lambda     F1,507
PI     .736      .080        .755      .061         .9884      5.942a
MCQ    .751      .080        .818      .045         .8740      73.11a

a p<.05

TABLE 4.19
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF PI SCORES AND MCQ SCORES
n=509

Step   Variable Entered   Variable Remaining   Wilks' Lambda   F To Enter
1      MCQ                                     .8740           73.11a
                          PI                   .8740           0.0075
2      (F level insufficient for further computation)

a p<.05

The results of the stepwise discriminant analysis for NBME and MCQ scores are summarized in Table 4.15. The MCQ scores were entered first, indicating they have the greater discriminating ability of the two variables. After Step 1 was completed, the NBME score remained, and the Wilks' lambda value in the second line of the table indicates the value if the NBME score were entered in the next step. As can be seen, it would only decrease by .0001, and the resulting F to enter the NBME scores is clearly not significant. The program was set so that the minimum F ratio to enter a variable was 1.00. Therefore, Step 2 was not done because the F level was insufficient for further computation. From this it can be concluded that NBME scores are not contributing to the discrimination of residency- and practice-eligible physicians. The results illustrated in Tables 4.16 - 4.17 for EI scores, and in Tables 4.18 - 4.19 for PI scores, show exactly the same results. In both cases the PMP scores were not entered into the equation because of a lack of discriminating ability.
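The stopping rule ("F level insufficient for further computation") can be sketched as follows. This is a simplification: a real stepwise procedure recomputes each remaining variable's F-to-enter after every step, whereas here the values are treated as fixed:

```python
def stepwise_entry(f_to_enter, min_f=1.00):
    """Enter variables in descending F-to-enter order until the best
    remaining F falls below the minimum (1.00, as set in this study)."""
    remaining = dict(f_to_enter)
    entered = []
    while remaining:
        best = max(remaining, key=remaining.get)
        if remaining[best] < min_f:
            break  # F level insufficient for further computation
        entered.append(best)
        del remaining[best]
    return entered

# With the step-1 values from Table 4.15, only MCQ is entered.
order = stepwise_entry({"MCQ": 73.11, "NBME": 0.0614})
```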
These results are duplicated when the candidates who passed Part I and who failed Part I are considered separately, as summarized in Tables 4.20 - 4.21 and Tables 4.22 - 4.23, respectively. The results for NBME scores are shown in these tables. The results for EI scores and PI scores were similar: when those who pass Part I and those who fail Part I are considered separately, the PMP scores are not entered into the discriminant equation.

TABLE 4.20
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO PASSED PART I
n=331

       Practice-eligible     Residency-eligible     Wilks'
       Mean      S.D.        Mean      S.D.         Lambda     F1,329
NBME   42.2      2.40        42.3      2.16         .9999      .0315
MCQ    .807      .036        .826      .035         .9389      21.41a

a p<.05

TABLE 4.21
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO PASSED PART I
n=331

Step   Variable Entered   Variable Remaining   Wilks' Lambda   F To Enter
1      MCQ                                     .9389           21.41a
                          NBME                 .9383           .2219
2      (F level insufficient for further computation)

a p<.05

TABLE 4.22
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO FAILED PART I
n=178

       Practice-eligible     Residency-eligible     Wilks'
       Mean      S.D.        Mean      S.D.         Lambda     F1,176
NBME   41.1      2.83        42.4      1.46         .9878      2.168
MCQ    .676      .058        .724      .034         .9647      6.441a

a p<.05

TABLE 4.23
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO FAILED PART I
n=178

Step   Variable Entered   Variable Remaining   Wilks' Lambda   F To Enter
1      MCQ                                     .9647           6.441a
                          NBME                 .9610           .6722
2      (F level insufficient for further computation)

a p<.05

TABLE 4.24
CLASSIFICATION ANALYSIS USING MCQ SCORES AS DISCRIMINANT FUNCTION

Actual Group           n      Predicted Group Membership
Membership                    Residency-        Practice-
Residency-eligible     116    94 (81.0%)        22 (19.0%)
Practice-eligible      393    164 (41.7%)       229 (58.3%)

Total Cases Correctly Classified: 323 (63.5%)
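The classification figures in Table 4.24 can be checked against the majority-class base rate with simple arithmetic:

```python
n_residency, n_practice = 116, 393  # group sizes from Table 4.24
n_total = n_residency + n_practice  # 509 candidates
achieved = (94 + 229) / n_total     # correctly classified cells of Table 4.24
base_rate = n_practice / n_total    # classify everyone as practice-eligible
# achieved is about 63.5%; the no-information base rate is about 77.2%
```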
As mentioned earlier, the high values of Wilks' lambda for both the PMP scores and the MCQ scores indicate a relatively low discriminating ability for these scores. This is further illustrated when considering the results of classifying the candidates using the equation generated by the discriminant analysis. When each candidate is classified as residency- or practice-eligible based on the results of using the discriminant equation (all three equations generated using the three PMP scores are the same, since only the MCQ scores were entered in each analysis), the result is 63.5% correct classifications, as shown in Table 4.24. In many respects this must be considered a poor performance, since knowing there are a majority of practice-eligible candidates in the total population (116 residency-eligible and 393 practice-eligible), one can achieve a correct classification rate of 77.2% by simply classifying everyone as practice-eligible. Since the three PMP scores were never entered into the discriminant equations, it can be concluded that PMP scores do not add to the ability of MCQ scores to discriminate between residency- and practice-eligible candidates.

Results Concerning Hypothesis IV

The fourth hypothesis in this study states that all frame scores of PMP's correlate equally with SCE scores. This hypothesis is tested by calculating Z-test statistics for the difference of two dependent correlation coefficients (Glass and Stanley, 1970) for all pairwise combinations of frame score correlations with SCE scores. Table 4.25 summarizes the correlations of each frame score (totalled across all five problems) with the SCE scores of Part II. Table 4.26 presents the calculated Z-test statistics for significance of differences between all possible pairs of the correlations of NBME frame scores and SCE scores found in the first column of Table 4.25. Tables 4.27 and 4.28 do the same for EI and PI scores, respectively.
If the fourth hypothesis is correct, there should be no significant difference between any pair of frame score correlation coefficients.

TABLE 4.25
CORRELATIONS OF PMP FRAME SCORES WITH SCE SCORES FOR NBME, EI AND PI SCORES
n=315

Frame              NBME, SCE    EI, SCE    PI, SCE
(Hx)         1     .0752        .0427      .0601
(Dx1)        2     -.0197       -.0280     -.0155
(Px)         3     .0423        .0297      .0132
(Dx2)        4     -.0060       .0331      -.0113
(Lab)        5     .0221        .1200a     .0246
(Diff. Dx)   6     .0085        .0196      .0316
(Mgt)        7     .1322a       .1302a     .2244a

a p<.05

TABLE 4.26
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED NBME FRAME SCORE CORRELATIONS WITH SCE SCORES
n=315

Frame              1       2       3       4       5       6       7
(Hx)         1     --
(Dx1)        2     1.34    --
(Px)         3     0.55    0.87    --
(Dx2)        4     1.10    0.17    0.61    --
(Lab)        5     0.77    0.55    0.38    0.35    --
(Diff. Dx)   6     0.88    0.34    0.43    0.18    0.20    --
(Mgt)        7     0.80    1.97a   1.23    1.72    1.67    2.99a   --

a exceeds critical value of 1.96 (α=.05)

TABLE 4.27
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED EI FRAME SCORE CORRELATIONS WITH SCE SCORES
n=315

Frame              1       2       3       4       5       6       7
(Hx)         1     --
(Dx1)        2     0.99    --
(Px)         3     0.26    0.81    --
(Dx2)        4     0.14    1.33    0.05    --
(Lab)        5     1.08    2.06a   1.24    1.25    --
(Diff. Dx)   6     0.32    0.70    0.13    0.23    1.68    --
(Mgt)        7     1.18    2.05a   1.32    0.29    0.17    2.11a   --

a exceeds critical value of 1.96 (α=.05)

TABLE 4.28
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED PI FRAME SCORE CORRELATIONS WITH SCE SCORES
n=315

Frame              1       2       3       4       5       6       7
(Hx)         1     --
(Dx1)        2     1.07    --
(Px)         3     0.83    0.40    --
(Dx2)        4     0.97    0.06    0.31    --
(Lab)        5     0.54    0.52    0.21    0.44    --
(Diff. Dx)   6     0.37    0.57    0.23    0.57    0.10    --
(Mgt)        7     2.31a   3.13a   2.86a   3.04a   2.96a   3.32a   --

a exceeds critical value of 1.96 (α=.05)

The hypothesis that there is no difference between two correlation coefficients is rejected if the calculated statistic is greater than the critical value. For a two-tail test with an α=.05, this critical value is 1.96.
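The Glass and Stanley (1970) statistic itself is not reproduced in the text. As an illustration only, the sketch below uses one classic formulation for comparing two dependent correlations that share a variable (Hotelling's t, with df = n - 3); whether this matches the formula actually used in the study is an assumption, and the inter-frame correlation r23 below is a hypothetical value not reported in this section:

```python
import math

def dependent_r_test(r12, r13, r23, n):
    """Test statistic for H0: rho12 = rho13, where variables 2 and 3
    are both correlated with variable 1 (e.g., two frame scores each
    correlated with the SCE score)."""
    det = 1.0 - r12**2 - r13**2 - r23**2 + 2.0 * r12 * r13 * r23
    return (r12 - r13) * math.sqrt((n - 3) * (1.0 + r23) / (2.0 * det))

# Frame 7 vs. frame 6 correlations with SCE from Table 4.25 (PI scores);
# r23 = 0.30 is an assumed placeholder.
stat = dependent_r_test(0.2244, 0.0316, 0.30, 315)
```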
As can be seen in Table 4.26 for NBME scores, the calculated Z-test statistic exceeds the critical value for frame 7 (management frame) paired with both frame 2 (preliminary diagnosis) and frame 6 (differential diagnosis). For these pairs, the hypothesis that the correlations are equal must be rejected. Table 4.27 for EI scores shows that the calculated statistic for these same pairs exceeds the critical value, as well as for frame 5 (lab) with frame 2. For PI scores (Table 4.28), the Z-test statistic exceeds the critical value for the pairing of frame 7 with all the other frames. From these results it must be concluded that all frame scores do not correlate equally with SCE scores. At the same time it must be noted that nearly all of the correlations of frame scores with SCE scores are essentially zero. Frame 7 in all three scoring methods and frame 5 for EI scoring are the only frames with correlations reaching even a level of 0.1.

Summary of Results for Tests of Hypotheses

The statistical analyses performed to test the four hypotheses of this study can be summarized as follows:

1. Calculated Z-test statistics for significance of differences between PMP score correlations with MCQ scores and SCE scores show no significant differences between the correlations. Since the null hypothesis of no difference could not be rejected, there is no support for Hypothesis I, which stated that PMP scores will correlate to a greater degree with SCE scores than with MCQ scores. The actual correlations show very low or essentially zero correlations of any of the three types of PMP scores with either MCQ or SCE scores.

2. Results of regression analyses indicated that none of the PMP scores using the three PMP scoring methods make any significant contribution to predicting SCE scores. The R² values indicate that MCQ scores account for about 19% of the variance in SCE scores while the PMP scores account for essentially none of the variation in SCE scores.
Therefore, there is no support for Hypothesis II, which stated that PMP scores will account for a portion of variance of SCE scores beyond that contributed by MCQ scores.

3. Stepwise discriminant function analyses performed to assess the relative abilities of PMP scores and MCQ scores to discriminate residency- and practice-eligible candidates show that MCQ scores are the better discriminators of the two, and that for each of the three PMP scores the F level is not sufficient to enter it into the discriminant equation. Therefore, there is no support for Hypothesis III, which stated that PMP scores will add to the ability of MCQ scores to discriminate practice- and residency-eligible candidates.

4. Calculation of Z-test statistics for the significance of differences between all possible pair-wise correlations of PMP frame scores with SCE scores shows that there are two or more pairs of correlations which are significantly different for each of the three PMP scoring methods. These pairs involve frame 7 (management) in all but one case. Therefore, the null hypothesis of no differences in all pair-wise correlations of PMP scores with SCE scores must be rejected in favor of the alternative that there are differences in the correlation of PMP frame scores with SCE scores. This means that Hypothesis IV of this study (all PMP frame scores correlate equally with SCE scores) cannot be supported.

Results of Additional Analyses

The results for Hypothesis IV (showing that all sections or frames of a PMP do not correlate equally with SCE scores) suggest the possibility that weighting those frames with higher SCE score correlations might improve the ability of PMP's to predict the SCE scores being used as the criterion measure. The results of an attempt to define such a revised scoring scheme are presented in this section. As was done for Hypothesis IV, the scores for each frame were totalled across all five problems.
These total frame scores were then used to calculate the correlations between PMP frame scores and MCQ scores, and between PMP frame scores and SCE scores. The results are presented in Tables 4.29 - 4.31 for NBME, EI, and PI scores, respectively.

TABLE 4.29
CORRELATION OF NBME FRAME SCORES WITH MCQ AND SCE SCORES
n=315

Frame              MCQ      SCE
(Hx)         1     .057     .075
(Dx1)        2     .072     -.020
(Px)         3     .013     .042
(Dx2)        4     -.029    .006
(Lab)        5     .005     .022
(Diff. Dx)   6     .214a    .009
(Mgt)        7     .280a    .132a
Total              .141a    .075

a p<.05

TABLE 4.30
CORRELATION OF EI FRAME SCORES WITH MCQ AND SCE SCORES
n=315

Frame              MCQ      SCE
(Hx)         1     -.022    .043
(Dx1)        2     -.027    -.028
(Px)         3     .016     .030
(Dx2)        4     -.009    .033
(Lab)        5     .133a    .120a
(Diff. Dx)   6     .211a    .020
(Mgt)        7     .277a    .130a
Total              .067     .109a

a p<.05

TABLE 4.31
CORRELATION OF PI FRAME SCORES WITH MCQ AND SCE SCORES
n=315

Frame              MCQ      SCE
(Hx)         1     .059     .060
(Dx1)        2     .074     -.016
(Px)         3     -.009    .013
(Dx2)        4     -.022    -.011
(Lab)        5     .030     .025
(Diff. Dx)   6     .250a    .032
(Mgt)        7     .331a    .224a
Total              .124a    .084

a p<.05

Given the results from the four hypotheses presented earlier, it is not surprising that the PMP frame scores show essentially no correlation with SCE scores. The only correlation to reach statistical significance (α=.05) across all three scoring methods is frame 7 (management), but the correlations are so low that they have little practical meaning. For correlations with MCQ scores, only frame 6 (differential diagnosis) and frame 7 reach statistical significance across all three scoring methods, but again are too low to be of any practical value. Three regression analyses were performed using the SCE scores as the dependent or criterion variable and frame scores for each of the three scoring methods as the independent or predictor variables. These results are shown in Tables 4.32 - 4.34. For all three scoring approaches frame 7 (Mgt) is entered first, indicating it contributes most to predicting SCE scores. In fact, except for frame 6 (Diff.
Dx) of the NBME scores, frame 7 is the only frame score in each case with an F to enter which reaches statistical significance. These results are not unexpected when considering the correlations for each frame score presented in Tables 4.29 - 4.31. The results of these regression analyses were then used to calculate weighting schemes for each PMP scoring method, in a manner similar to that used during analysis of the Field Test of the ABEM certification examination (Maatsch and Elstein, 1979). Table 4.35 shows the rounded-off weights calculated for each frame for each scoring method. The process of arriving at these weightings is illustrated by the following description for the PI scores. The regression analysis (Table 4.34) shows that only frame 7 (Mgt) makes any significant contribution to predicting SCE scores, the other six frames adding little or nothing. This indicates that only frame 7 should be scored; but, in fairness to the candidates, all frames should be scored. Therefore, each of the first six frames is given a weight of one, for a total weight of six. From the R² values it can be calculated that frame 7 accounts for about 86% of the variance in the SCE scores which can be attributed to PMP scores, and the

[Tables 4.32 - 4.34 (summary tables for the regression analyses of NBME, EI, and PI frame scores on SCE scores) are not legible in this copy.]
remaining frames account for about 14%. Thus, frame 7 accounts for about six times the variance of the remaining frames. This means that frame 7 should be weighted about six times the total weight of the remaining frames, which have a total weight of six. After rounding off to a total weight of 40, the result is a weight of 34 for frame 7 and 1 for each of the remaining frames, as shown in Table 4.35.

TABLE 4.35
CALCULATED WEIGHTS FOR PMP FRAME SCORES

                      NBME            EI              PI
Frame                 Weight    %     Weight    %     Weight    %
(Hx)         1        1         2.5   1         9     1         2.5
(Dx1)        2        1         2.5   1         9     1         2.5
(Px)         3        1         2.5   1         9     1         2.5
(Dx2)        4        1         2.5   1         9     1         2.5
(Lab)        5        1         2.5   1         9     1         2.5
(Diff. Dx)   6        15        37.5  1         9     1         2.5
(Mgt)        7        20        50    5         46    34        85

Total                 40        100   11        100   40        100

These results are quite similar to those obtained from the Field Test data (Maatsch and Elstein, 1979), which show the first five PI frames accounting for 10% of the variance
For NBME frames, both frame 6 and frame 7 contributed significantly to predicting SCE scores (Table 4.32), and the beta weights from the regression equation indicated they should be weighted in a ratio of approximately 4:5, resulting in the weights shown in Table 4.35 Using these weights, the PMP's were rescored and correlations with MCQ and SCE scores calculated using the weighted PMP scores. The results are presented in Table 4.36, along with the unweighted score correlations from Tables 4.1 - 4.3 for comparison. The correlations of the weighted PMP scores with MCQ scores are increased to moderate levels. The correlations of the weighted NBME and EI scores with SCE scores show a negligible increase, while the weighted PI score shows an increase to a value approaching a moderate level. In order to explore the effect of the unreliability of the tests being used, correcting for attenuation was also done.’ Lord and Novick (l968)present the following formula 132 ' TABLE 4.36 CORRELATIONS OF WEIGHTED PMP SCORES WITH MCQ SCORES AND'SCE SCORES n=315 ESQ ESE NBME .270b .097a (.141b)+ (.075) El .234b .155a (.067) (.109a) PI .331b .221b (.124a) (.084) ap<.05 bp<.01 +Numbers in parentheses are correlations of unweighted scores for such purposes: I: * XY rxy= rxx ryy Where r;y is the correlation corrected for attenuation, r is the observed correlation, and rxx and r are the Observed reliabilities of the two tests. Using the observed correlations of NBME scores with MCQ and SCE scores, and the reliabilities presented at the beginning of this chapter, the corrected correlation between NBME and MCQ scores is .181 133 (compared to .1406 observed), and between NBME and SCE scores it is .121 (compared to .0747 observed). Again, although these values may be of statistical significance, they are too low to be of any practical use. 
Therefore, even if the tests used were perfectly reliable, the correlations of the PMP's with the MCQ's and SCE's would still be very low, and most likely would not change any of the results presented in this chapter concerning the four hypotheses of this study.

Summary of Additional Analyses

The correlations presented in Table 4.36 show that even weighting the PMP frame scores by the method suggested by Maatsch (Maatsch and Elstein, 1979) will not produce correlations of sufficient magnitude to be useful in predicting SCE scores (although there is a slightly stronger relationship to MCQ scores). Correcting for attenuation also failed to produce meaningful correlations. This is to be expected, since the initial correlations of the PMP scores with the performance criterion provided by the SCE scores were essentially zero. No amount of manipulation is likely to overcome this lack of relationship between PMP scores and performance on simulated clinical encounters.

Chapter V discusses the findings presented in this chapter, draws conclusions concerning the results of this study, and makes suggestions for further research.

CHAPTER V

SUMMARY AND CONCLUSIONS

Introduction

This chapter summarizes this research study and draws several conclusions based on the results presented in the previous chapter. These conclusions and their implications are then discussed, followed by some possible lines of future research suggested by the results of this study.

Summary of Findings

This study was designed to investigate four observable consequences of basic assumptions involved in the scoring and use of Patient Management Problems (PMP's) in licensing and certification examinations for physicians.
These basic assumptions are: 1) that PMP's are a valid measure of clinical performance, i.e., they are predictive of how a physician will perform in a real situation; and 2) that all parts of the clinical problem-solving process (and therefore all frames of a PMP) contribute equally to diagnostic and management proficiency. The data for this study were obtained from the first administration of the Emergency Medicine Specialty Certification Examination (EMSCE) of the American Board of Emergency Medicine. The examination consisted of a battery of 194 standard multiple-choice items, 86 pictorial multiple-choice items, and five PMP's which were administered as Part I of the examination in February, 1980. Those who passed Part I with a score of 75% or better on the MCQ battery went on to take Part II in either May or July. Part II consisted of seven Simulated Clinical Encounters (SCE's), which are highly structured oral simulations of emergency medicine cases presented by trained examiners. In this study the examiners' ratings of candidate performance on the SCE's served as the criterion measure. A total of 509 subjects used in this study sat for Part I of the examination, and 315 went on to complete Part II. The results from testing the four empirical hypotheses of this study are summarized below:

Hypothesis I: PMP scores correlate to a greater degree with SCE scores than with clinically relevant MCQ scores.

For this first hypothesis, Z-test statistics were calculated for significance of differences between correlations of PMP scores with SCE scores and with MCQ scores. Results supporting the assumptions on which the use of PMP's in licensing and certification are based should show the PMP scores correlating to a higher degree with SCE scores than with MCQ scores. This was not the result obtained.
The results showed no statistically significant differences between the correlations of PMP scores with SCE scores and with MCQ scores, and therefore the first hypothesis could not be supported.

Hypothesis II: PMP scores account for a portion of variance of SCE scores beyond that contributed by clinically relevant MCQ scores.

Regression analysis was used to test this second hypothesis concerning the relative amounts of variance in SCE scores accounted for by PMP scores and by MCQ scores. Results supporting the use of PMP's in licensing and certification would show PMP scores accounting for a significant proportion of the variance in SCE scores beyond that accounted for by MCQ scores. This was not the case. The results showed that PMP scores account for essentially none of the variance in SCE scores, and therefore make no significant contribution to predicting the criterion (SCE) scores. The second hypothesis could not be supported.

Hypothesis III: PMP scores add to the ability of MCQ scores to discriminate between residency- and practice-eligible candidates.

This third hypothesis was tested using stepwise discriminant function analysis to assess the relative abilities of PMP scores and MCQ scores to discriminate between residency- and practice-eligible candidates.
This final hypothesis in this study dealt with an assumption upon which the scoring of PMP's is based; namely, all parts or frames of a PMP contribute equally to the prediction of a criterion measure of clinical perfor- mance. Results supporting this assumption would show no significant differences between all pairs of correlations of the various PMP frame scores with SCE scores. Once again, this was not the result obtained. The correlations of the final management frame with SCE scores are significantly different from the correlations of the other PMP frames with SCE scores. Therefore, the fourth hypothesis could not be supported. The results of testing the fourth hypothesis suggested the possibility of weighting the final frame in order to improve the ability of PMP's to correlate with, or predict, SCE scores. While such a weighting scheme did result in 139 some improvement in the correlation of PMP and SCE scores, it was not enough to make the frame-weighted scores useful in predicting SCE scores. There is apparently so little correlation between PNP scores and the criterion that manipulations of this type will not tangibly improve measurement or prediction. Conclusions At the end of Chapter III four questions were presented that linked the hypotheses of this study with the assumptions concerning the use fo PMP's in licensing and certification of physicians. All of these questions can now be tentatively answered in the negative. PMP scores are no better at predicting SCE scores obtained three and five months later than they are at predicting MCQ scores. PMP's do not add anything to the information gained through MCQ's in predicting SCE scores. PMP's do not add anything to the ability to discriminate between residency- and practice- (eligible candidates beyond that provided by MCQ scores. 
And finally, there does not appear to be any way of weighting PMP frames that would materially improve the predictive ability of PMP's, at least in the setting utilized in this study.

From the results of this study, the following conclusions may be drawn:

1. The assumptions tested in this study, upon which the use of PMP's in licensing and certification are based, do not appear to be valid.

2. Since the results showed little or no ability of the PMP's to predict a criterion score consisting of ratings provided by reliable expert examiners of the quality of health care given to a sample of simulated patients, this study provides no evidence of the criterion-related validity of PMP's in licensing and certification of physicians.

3. Attempts to substantially strengthen the predictive ability of PMP's were unsuccessful because there is essentially little or no correlation with the criterion measure.

Basically, the results of this study indicate that PMP's are not as useful and valid a method of evaluating clinical competence as has generally been assumed. They do not appear to be a valid substitute for MCQ batteries or