This is to certify that the dissertation entitled "The Validity of Patient Management Problems for Licensing and Certification of Physicians," presented by Eric Duane Zemper, has been accepted towards fulfillment of the requirements for the Ph.D. degree in Counseling, Educational Psychology and Special Education.

Major Professor                    Date
THE VALIDITY OF PATIENT MANAGEMENT PROBLEMS FOR LICENSING AND CERTIFICATION OF PHYSICIANS

By

Eric Duane Zemper

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling and Educational Psychology

1982

Copyright by ERIC DUANE ZEMPER 1982

ABSTRACT

THE VALIDITY OF PATIENT MANAGEMENT PROBLEMS FOR LICENSING AND CERTIFICATION OF PHYSICIANS

By

Eric Duane Zemper

Patient Management Problems (PMP's) have become an important part of licensing and certification examinations of physicians despite the lack of any evidence of criterion-related validity. This study investigates the criterion-related validity of PMP's using data from 509 physicians who took the specialty certification examination of the American Board of Emergency Medicine. Simulated Clinical Encounters (SCE's), highly structured and reliable oral simulations administered by trained examiners, serve as the performance criterion. Four hypotheses, based on directly observable consequences of the assumptions underlying the use of PMP's, are tested.

The results show that PMP's have little or no correlation with the criterion measure. Regression analyses indicate that PMP scores make no contribution to predicting SCE scores, while scores from a multiple-choice (MCQ) battery account for 19% of the SCE score variance. Stepwise discriminant analyses indicate that, between PMP and MCQ scores, the MCQ's account for all the ability to discriminate residency-trained emergency physicians (who presumably possess enhanced problem-solving skills) from those without residency training. PMP's contribute nothing to this discrimination. When the scores of each section of the PMP's are correlated with the criterion measure, the final management section correlation is significantly higher than any other, indicating that outcome measures may be more important than data-gathering (process) measures.
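The comparisons of section-score correlations summarized above rest on z-tests of the difference between correlations (reported in the Chapter IV tables). A minimal sketch of the standard Fisher r-to-z test for two correlations from independent samples follows; the sample values are hypothetical illustrations, not the study's data, and the dissertation's paired comparisons on a single candidate sample would strictly call for the dependent-correlations variant of this test.

```python
import math

def fisher_z(r):
    """Fisher r-to-z transformation: arctanh(r)."""
    return math.atanh(r)

def z_test_independent_correlations(r1, n1, r2, n2):
    """z statistic for the difference between two Pearson correlations
    computed on independent samples of sizes n1 and n2."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# Hypothetical example: does r = 0.44 differ significantly from r = 0.20
# when each correlation is based on 250 candidates?
z = z_test_independent_correlations(0.44, 250, 0.20, 250)
print(round(z, 2))  # |z| > 1.96 indicates p < .05, two-tailed
```

Under this procedure a difference between a management-section correlation and a data-gathering-section correlation would be declared significant when the resulting |z| exceeds the critical value for the chosen alpha level.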
This study can provide no evidence for criterion-related validity of PMP's in licensing and certification of physicians. These results, combined with the few existing studies of criterion-related validity of PMP's, indicate that the validity of using PMP's in making licensing and certification decisions should be considered highly suspect until proven otherwise by those using this examination format.

ACKNOWLEDGMENTS

At the conclusion of this long effort, special thanks must go to my wife, Jerri, for her great typing skills and even greater patience. My thanks also to Professor John F. Vinsonhaler, who first encouraged me to start this degree program and later became my committee chairman, and to Professor Jack L. Maatsch, the director of this dissertation, who has been my mentor for many years in the field of medical education. For also serving on my committee, and for their consistent help over the years, my deep appreciation to Professor Lee S. Shulman and Professor Sarah A. Sprafka. Finally, I would like to acknowledge Dr. Raywin Huang, who provided unstinting help and advice with the computer programs used for the statistical analyses in this study; and the American Board of Emergency Medicine and Dr.
Ben Munger, Executive Director, without whose cooperation this study would not have been possible.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER

I. THE PROBLEM
    Introduction
    History of Licensing and Certification Exams in Medicine
        The development of Patient Management Problems
        Problems of examinations as a test of competence
    Need for This Study
    The Problem
    Research Hypotheses
    Summary
    Overview of the Dissertation

II. REVIEW OF THE LITERATURE
    Licensing and Certification Examinations
        Validity
        Defining the criterion
    Patient Management Problems
        Validity
        Scoring of PMP's
        PMP's as tests of problem-solving skills
    Summary

III. PROCEDURES AND DESIGN
    Introduction
    Examination Construction
    Field Test of Test Items
    Design
        Subjects
        MCQ format
        PMP format
        SCE format
        Generalizability of results
    Questions Summarizing the Logic Underlying the Testable Hypotheses
    Summary

IV. RESULTS
    Introduction
    Summary of Test Results
    Correlation Summaries for PMP, MCQ, and SCE Scores
    Results Concerning Hypothesis I
    Results Concerning Hypothesis II
    Results Concerning Hypothesis III
    Results Concerning Hypothesis IV
    Summary of Results for Tests of Hypotheses
    Results of Additional Analyses
    Summary of Additional Analyses
    Hypotheses and Analysis Methods

V. SUMMARY AND CONCLUSIONS
    Introduction
    Summary of Findings
    Conclusions
    Discussion of results
    Suggestions for future research

APPENDIX

LIST OF REFERENCES

LIST OF TABLES

TABLE                                                            PAGE
1.1  COMMON SCORING FORMULAE FOR PATIENT MANAGEMENT PROBLEMS . . . 13
3.1  TEST ITEMS ALLOCATED TO MEDICAL CONTENT CATEGORIES . . . 67
4.1  SUMMARY OF NBME SCORE CORRELATIONS WITH MCQ AND SCE SCORES . . . 93
4.2  SUMMARY OF PI SCORE CORRELATIONS WITH MCQ AND SCE SCORES . . . 94
4.3  SUMMARY OF EI SCORE CORRELATIONS WITH MCQ AND SCE SCORES . . . 95
4.4  CORRELATIONS BETWEEN NBME, EI AND PI AVERAGE SCORES . . . 97
4.5  CORRELATIONS BETWEEN NBME, EI AND PI SCORES FOR CANDIDATES PASSING PART I . . . 97
4.6  CORRELATIONS BETWEEN NBME, EI AND PI SCORES FOR CANDIDATES FAILING PART I . . . 98
4.7  PMP SCORE CORRELATIONS WITH MCQ AND SCE SCORES AND Z-TESTS OF SIGNIFICANCE OF DIFFERENCE . . . 100
4.8  SUMMARY TABLE FOR REGRESSION ANALYSIS OF NBME AND MCQ SCORES ON SCE SCORES . . . 103
4.9  SUMMARY TABLE FOR REGRESSION ANALYSIS OF EI AND MCQ SCORES ON SCE SCORES . . . 103
4.10 SUMMARY OF REGRESSION ANALYSIS OF PI AND MCQ SCORES ON SCE SCORES . . . 104
4.11 SUMMARY TABLE OF REGRESSION ANALYSIS FORCING INITIAL ENTRY OF NBME SCORES . . .
. . 104
4.12 SUMMARY OF REGRESSION ANALYSIS FORCING INITIAL ENTRY OF EI SCORES . . . 105
SUMMARY OF REGRESSION ANALYSIS FORCING INITIAL ENTRY OF PI SCORES . . . 105
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES . . . 109
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF PMP SCORES AND MCQ SCORES . . . 109
GROUP DISCRIMINATION BY EI SCORES AND MCQ SCORES . . . 110
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF EI SCORES AND MCQ SCORES . . . 110
GROUP DISCRIMINATION BY PI SCORES AND MCQ SCORES . . . 111
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF PI SCORES AND MCQ SCORES . . . 111
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO PASSED PART I . . . 113
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO PASSED PART I . . . 113
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO FAILED PART I . . . 114
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO FAILED PART I . . . 114
CLASSIFICATION ANALYSIS USING MCQ SCORES AS DISCRIMINANT FUNCTION . . . 115
CORRELATIONS OF PMP FRAME SCORES WITH SCE SCORES FOR NBME, EI AND PI SCORES . . . 117
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED NBME FRAME SCORE CORRELATIONS WITH SCE SCORES . . . 118
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED EI FRAME SCORE CORRELATIONS WITH SCE SCORES . . . 119
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED PI FRAME SCORE CORRELATIONS WITH SCE SCORES . . . 120
CORRELATION OF NBME FRAME SCORES WITH MCQ AND SCE SCORES . . . 121
CORRELATION OF EI FRAME SCORES WITH MCQ AND SCE SCORES . . .
. . 124
CORRELATION OF PI FRAME SCORES WITH MCQ AND SCE SCORES . . . 125
SUMMARY TABLE FOR REGRESSION ANALYSIS OF NBME FRAME SCORES ON SCE SCORES . . . 127
SUMMARY TABLE FOR REGRESSION ANALYSIS OF EI FRAME SCORES ON SCE SCORES . . . 128
SUMMARY TABLE FOR REGRESSION ANALYSIS OF PI FRAME SCORES ON SCE SCORES . . . 129
CALCULATED WEIGHTS FOR PMP FRAME SCORES . . . 130
CORRELATIONS OF WEIGHTED PMP SCORES WITH MCQ SCORES AND SCE SCORES . . . 132

LIST OF FIGURES

FIGURE                                                           PAGE
Distribution of MCQ Scores . . . 84
Distribution of NBME Scores . . . 85
Distribution of EI Scores . . . 86
Distribution of PI Scores . . . 87
Distribution of SCE Scores (Total) . . . 88
Distribution of SCE Scores (May) . . . 89
Distribution of SCE Scores (July) . . . 90

CHAPTER I

THE PROBLEM

Introduction

During the past two decades there have been two important developments related to testing medical competence that affect the licensing and certification of physicians in this country. One development has come from the medical education and testing community, the other receiving most of its impetus from outside the medical community, from the public and governmental sectors. The internal development was the introduction of Patient Management Problems (PMP's) as an objective method of evaluating physician competence. PMP's are a form of paper and pencil simulation of the patient-physician encounter in which the examinee is presented with a series of data gathering and treatment options, from which he must choose the proper path to diagnosis and treatment of the patient. Studies conducted by the National Board of Medical Examiners (NBME) and others (Hubbard et al., 1965; McGuire, 1966) showed that the time-honored bedside oral examination was not a very accurate or reliable means of assessing physician competence.
Beginning in 1961 the PMP rapidly replaced the oral as a major component of licensing examinations in the hope that it would prove to be a more reliable and valid test of a physician's problem-solving skills (Hubbard et al., 1965; Williamson, 1965). Soon afterward, several medical specialty boards incorporated PMP's as a component of their specialty certification examinations, either in conjunction with, or as a replacement for, the bedside oral examination. Within a relatively short time, PMP's have become a major part of the licensing and certification process.

The external development affecting licensing and certification began to be felt during the late 60's and became increasingly important throughout the 70's. Spurred by rising medical costs, Medicare-Medicaid, a perceived shortage of physicians, increasing monetary support from the federal government for medical schools, and several other factors, there came an increasing demand from the government and the consumer for greater accountability from medical schools and licensing authorities (Abrahamson, 1976; Rakel, 1979; Senior, 1976). The public wanted better assurances that licensing and certification of a physician truly indicated a more competent practitioner. At the very least they wanted assurance that a licensed physician was not incompetent, that he or she at least met minimal competence criteria for providing health care. In addition, there has been increasing pressure for recertification and relicensure, to assure continued competence once a physician has completed formal education. The past few years have seen an increased questioning of a common assumption upon which licensing and certification have been based.
An individual who has received a medical degree from an accredited school, has completed a minimum amount of graduate medical education, and has successfully negotiated the hurdle imposed by a licensing examination is therefore assumed to be truly capable of independently providing at least minimally competent health care to the public. Unfortunately, because of difficulties in establishing valid criterion measures of clinical performance, there is little direct evidence that licensing or certification examinations predict what a physician actually does in practice (Abrahamson, 1976; Williamson, 1976). This lack of an explicit link between test performance and practice performance is generally true for all test formats employed in licensing and certification examinations, including PMP's. Nonetheless, PMP's have been felt to be a more valid means of discriminating the competent from the less competent physician than, for example, multiple-choice questions about relevant medical knowledge.

This study will examine four observable consequences of basic assumptions concerning the validity of PMP's. These assumptions, which are becoming increasingly questioned in the literature, form the basis for the use of PMP's in licensing and certification examinations. This study will use data from a national specialty certification examination in Emergency Medicine. In addition, the scoring procedure commonly used for PMP's in licensing examinations will be examined and compared with alternative scoring procedures in an attempt to improve the discrimination and predictive ability of PMP's.

History of Licensing and Certification Exams in Medicine

During the 18th century the education of a physician and licensing in this country were one and the same, embodied in the apprentice system. A young man wishing to become a physician bound himself to a doctor for a period usually exceeding five years.
At the successful completion of his apprenticeship he was given a document signed by his preceptor, outlining the course of his training, which served as both his diploma and his license to practice (Miller, 1976). At about the time this country began moving toward independence, the colonial assemblies, and later the state legislatures, began to see a need to protect the public from the quacks and charlatans who were becoming more numerous as the demand for medical practitioners rose. Accordingly, various means of testing and licensing physicians were set up, in most cases being delegated to the various state medical societies. As medical schools came on the scene, they also became involved in the licensing process. In a move to raise the standards of medical education and licensing, the American Medical Association was founded in 1847.

As a result of each state having its own tests and licensing requirements, it became very difficult for a physician to move his practice from one state to another. The situation remained this way until shortly after the beginning of this century, when a strong movement to standardize licensure requirements resulted in 1915 in the formation of a voluntary, independent examining body, the National Board of Medical Examiners (Hubbard, 1978). The NBME evolved a three-part examination which eventually became accepted by nearly all state licensing boards for purposes of licensure. The NBME exam originally consisted of essay tests in traditional medical sciences (Part I) and in clinical sciences (Part II). An oral "practical" examination comprised Part III, and was taken after a student had finished medical school and had completed at least one year of an internship. The first two parts were taken during progress through medical school.
A physician who had passed all three parts of the NBME exam was considered competent to independently practice medicine, and could do so without further testing for licensure in those states accepting the NBME exam. In the early 1950's, a major change in the NBME exam was introduced. A three-year study, in cooperation with the Educational Testing Service, showed that objectively scored multiple-choice questions could provide a much more comprehensive test which proved to be superior to subjectively scored essay tests. Therefore, the essay tests in Parts I and II were replaced by multiple-choice tests developed by national committees of subject matter experts (Hubbard, 1978).

The most recent major change in the NBME exam occurred in 1961 with the introduction of Patient Management Problems in the Part III exam. At first PMP's were to serve as an objective supplement to the traditional bedside oral examination, but it soon became evident that attempts to improve the reliability of the orals were not successful, and it was becoming increasingly difficult to arrange sufficient examination opportunities for a growing number of candidates. After 1963 the oral exam was dropped, and since then Part III has been composed of multiple-choice questions and a series of PMP's. Though still valuable as a teaching tool, the ancient and revered tradition of the bedside oral as a formal examination method was finally succumbing to increasing evidence of psychometric inadequacies and rapidly expanding difficulties and costs of administration on a national scale.

In a further attempt to standardize state licensing requirements, in 1967 the Federation of State Medical Boards began working with the NBME to develop a licensing exam, based on NBME materials and administered by the NBME, which would be acceptable to all states as evidence of readiness to provide unsupervised health care to the public.
This was called the Federation Licensing Examination, or FLEX, and was composed of three parts, parallel to the NBME exam. By 1973 it had been accepted by all but two state licensing authorities (NBME, 1973). FLEX became the primary means of licensure for a growing number of graduates of foreign medical schools wishing to practice in this country. Since they had not graduated from an accredited U.S. medical school, these individuals were not eligible to take the NBME exam.

Concurrent with the development of the NBME and licensing examinations, there has been development of a variety of medical specialty boards. Beginning with the American Board of Ophthalmology in 1917, the number of recognized specialty boards has risen to twenty-three, with the American Board of Emergency Medicine being the most recent, approved in 1979. To become a board-certified specialist a physician must complete an approved residency training program which is usually at least three to five years in length, practice in the specialty for one or more years, and pass an examination administered by the specialty board. Originally these certification examinations were composed of essay exams and bedside oral exams. The first board to use multiple-choice questions was the American Board of Internal Medicine in 1946, but soon most other boards had incorporated multiple-choice exams, influenced by the success of the ABIM and the NBME. In 1961 the NBME for the first time began directly helping specialty boards develop their examinations, and several boards began trying PMP's. Today most boards employ some combination of objective written exams and oral exams, while a few use only the written, and fewer still retain only the oral.

From this brief overview of the history of licensing and certification examinations, it is evident that PMP's have relatively recently come to play an increasingly important role in evaluating the competence of a physician.
First introduced as an objectively scored replacement for the bedside oral in the NBME Part III, they also soon became a major component of the FLEX exam and several specialty certification exams. As an important part of the examinations used in the two major paths to state licensure (NBME and FLEX), and in several specialty certification examinations, nearly every new physician beginning practice in this country for the past fifteen years has been confronted with PMP's at a critical juncture affecting his or her medical career, and in turn affecting the health care of the public. In spite of a lack of any evidence for criterion-related validity, because of their inherent "face validity" (the subjective opinion of experts that it appears to be a valid measure) and their reliability in comparison with the oral exams they replaced, PMP's have been generally considered a legitimate means of testing a physician's problem-solving skills.

The development of Patient Management Problems

Patient Management Problems had their beginnings in the Tab-Test Technic developed by the U.S. Army (Williamson, 1965) and the "Test of Diagnostic Skills" developed by Rimoldi (1955; 1961). These early forerunners of the PMP presented a student with a problem statement and a series of cards with questions which the student might ask on one side and answers on the other. By selecting the card corresponding to the question he or she wished to ask, and reading the answer or result on the back, the student proceeded through the problem to a solution. The PMP in its present form was developed in the late 50's and early 60's by the NBME (Hubbard et al., 1965) and by McGuire and her associates at the University of Illinois (McGuire, 1963; Williamson, 1965; McGuire and Babbott, 1967). In the last twenty years it has become a very popular teaching as well as testing technique.
The PMP is essentially a paper and pencil simulation of the patient-physician encounter during which the student or examinee must collect history, physical and laboratory data in order to arrive at a diagnosis and treatment plan for the patient. The Appendix contains an example of a PMP. As can be seen, the problem begins with a brief statement outlining the presenting complaint or problem. In this particular example the examinee is then presented with a series of options concerning the history, from which he must select only those he would ask for in the real situation. The answers to the questions, in the right column, are not visible because they are printed using a latent image process. To gain the information from a particular question, the examinee "develops" the answer by wiping over the answer space with a special latent image pen, and the words become visible. In earlier versions of PMP's, the answers were covered by an opaque layer which was erased using a pencil eraser, exposing the printed answer. Using this method, an explicit record is kept of every choice the examinee makes throughout the problem, which is then used for scoring. Obviously, an examinee cannot change his mind and "cancel" a bad choice once the answer has been exposed. In this example the examinee then moves on to collect physical
Frequently encountered as cases and problems in the emergency department. A total of 136 Pictorial Multiple—Choice items were also developed. This item type presented a visual stimulus, such as an EKG rhythm strip, a color photograph of a patient, or an X—ray, and one or more multiple-choice items based on the visual stimulus. These were also of the single best answer type with four or five options. Criteria for selecting visuals and item content for this format included using visuals which (Maatsch et a1., 1976): 1. Test general interpretive skills. 2. Typically require immediate interpretation and use in an emergency department. 3. Present conditions that knowledgeable candidates can clearly see and interpret. A series of condition worksheets covering what were felt to be the most important areas were selected to be further developed into scenarios to serve as the basis for the two types of simulations to be used in the EMSCE, the Patient Management Problems and the Simulated Clinical Encounters (SCE's). The SCE's are highly structured oral examinations based on the patient games described by Maatsch (1974), in which a well-trained examiner presents standardized information about-a patient to a candidate, who must elicit further information to diagnose and manage the patient. Eight Simulated Patient Encounters, requiring the candidate to manage a single patient, were developed by teams from OMERAD and ACEP, along with four Simulated : 69 Situation Encounters, requiring the candidate to manage three patients concurrently, for a total of 12 SCE's involving 20 patients. Ten scenarios were selected for development as PMP's, under the direction of Dr. Sarah Sprafka, from OMERAD, working with ACEP content experts. 
All scenarios selected involved patients with only one problem, the cases were basically linear in nature, the necessary information could easily be presented in written form or with still visual stimuli (EKG tracings or X-ray prints), and completing the problem would not involve more than three or four basic steps (e.g., initial evaluation, data gathering, diagnosis, and preliminary management). All of the latent image PMP's developed from these scenarios were linear, relatively short (easily completed within 20 minutes), and composed of sections in the following order: introduction, history, initial diagnostic hypotheses, physical examination, provisional diagnosis, laboratory, final diagnosis, and management. The sections on initial diagnostic hypotheses and provisional hypotheses are normally not used in PMP's, but were included in these PMP's to conform with the theory of medical problem solving developed during the Medical Inquiry Project by Elstein et al. (1978). A five-point scale (-2 to +2) was used to assign weights to each item in the PMP's, based on the judgement of the scenario author and at least one other ACEP content expert. Further details on the development of these PMP's can be found in Maatsch and Elstein (1979).

Field Test of Test Items

After development and extensive review by OMERAD test developers and ACEP content experts, the entire test library was field tested on October 22-26, 1977, in Lansing, Michigan, as part of a grant from the National Center for Health Services Research (HS 02038) entitled "Model for a Criterion-Referenced Specialty Test." The test library was administered to 94 subjects, consisting of 22 fourth year medical students, 36 second year residents in emergency medicine, and 36 practicing emergency physicians nominated by their peers for their expertise in emergency medicine, fourteen of whom were practice eligible and 22 of whom were residency eligible.
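The five-point item weighting just described implies a straightforward scoring computation once an examinee's selections are recorded. The sketch below is illustrative only: the item names are invented, and the proficiency and efficiency definitions used here are assumptions for demonstration, not the study's actual Table 1.1 formulas.

```python
# Illustrative scoring of one latent-image PMP. The item names and the
# proficiency/efficiency definitions below are assumptions for demonstration
# only; the study's actual index formulas are defined in its Table 1.1.

def score_pmp(weights, selected):
    """weights: item -> weight on the -2..+2 scale; selected: items the examinee chose."""
    raw = sum(weights[item] for item in selected)
    positive_available = sum(w for w in weights.values() if w > 0)
    # Assumed "proficiency": share of the available positively weighted credit earned.
    proficiency = sum(weights[i] for i in selected if weights[i] > 0) / positive_available
    # Assumed "efficiency": fraction of the examinee's choices that carried positive weight.
    efficiency = sum(1 for i in selected if weights[i] > 0) / len(selected)
    return raw, proficiency, efficiency

weights = {"ask_chest_pain": 2, "order_ekg": 2, "ask_allergies": 1,
           "order_skull_film": -1, "delay_treatment": -2}
raw, proficiency, efficiency = score_pmp(
    weights, {"ask_chest_pain", "order_ekg", "order_skull_film"})
# raw score 3, proficiency 0.8, efficiency 2/3 for this pattern of choices
```

Because every developed answer is recorded irrevocably, scores of this kind can be computed exactly from the answer sheet after the fact.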
The results of this field test were used to revise or eliminate items from the test library, and to select items for use in the first administration of the EMSCE by the American Board of Emergency Medicine, after their formal recognition as a new medical specialty, which took place in September, 1979. The field test results were also used to set the passing scores for each part of the examination. Using an eight point scale for the rating of the SCE's, the Board set a score of 5 as indicating minimally acceptable practice of emergency medicine, and set a passing level of a 5.75 average across all cases (or 5.0 or above on all cases) as the criteria for certification. The field test data showed that this corresponded to a score of 75% on the Part I objective portion of the EMSCE, and the Board set this as the minimal score needed to pass Part I and go on to take Part II. Further details of setting cut scores for this criterion-referenced examination can be found in Maatsch and Elstein (1979) and Maatsch (1980).

Design

Subjects

The subjects used in this study were the candidates who sat for the first administration of Part I of the Emergency Medicine Specialty Certification Examination, administered by the American Board of Emergency Medicine. The EMSCE is given in two parts. Part I is composed of the objective formats (MCQ and PMP), and Part II consists of the Simulated Clinical Encounters. Part I must be passed before taking Part II of the examination. Part I was administered on February 20, 1980, at three sites (Cherry Hill, New Jersey; Chicago, Illinois; and Los Angeles, California) to a total of 616 candidates. This group consisted of 136 candidates who had completed an approved residency program in emergency medicine (residency-eligible), and 480 candidates who had been practicing emergency medicine full-time for a minimum of five years (practice-eligible).
The practice-eligible group contained many second career physicians, some of whom have received board certification in other specialties, and foreign trained physicians who are now licensed and practicing in the U.S. Part II was administered on two different occasions in Chicago, with 182 of the 387 candidates who passed Part I taking Part II during the week of May 19, 1980, and 188 being examined during the week of July 21, 1980. The remaining 17 did not take Part II during the first two scheduled administrations. From the original pool of 616 candidates, 107 were eliminated from the analyses of this study because they did not finish the PMP portion of the examination or, as was most frequently the case, because they did not completely follow the instructions for the PMP's and chose more items in one or more sections than permitted by the directions for those sections. Therefore, there remained a total of 509 subjects for analyses involving the PMP's and the MCQ's. Of the 182 who sat for the initial administration of Part II, 25 were among those eliminated, leaving a total of 157 subjects from the May session for analyses involving PMP's and SCE's. Of the 188 who sat for the July administration of Part II, 30 were among those eliminated, leaving 158 subjects from this session and a total of 315 subjects who took the SCE's.

MCQ format

The multiple-choice portion of the examination was composed of 194 standard multiple-choice items selected from the field-tested library, presented in three booklets of approximately equal length. The same schedule of presentation and standard instructions were used at all three sites for this and all other parts of the examination. The 86 pictorial multiple-choice items selected for use were split evenly into two booklets, with each candidate receiving only one of the booklets in random fashion.
(The candidates were actually presented with 197 standard MCQ items, but three were deleted after item analysis and content review by the test committee. Each candidate also saw 49 pictorial MCQ items, but six atypically easy or difficult items were deleted from each booklet in a retrospective process of balancing the difficulty of each booklet.) Thus, each candidate was scored on 194 MCQ items and 43 pictorial MCQ's, for a total of 237 objective items. The standard reliability indices (KR-20 and Cronbach's Alpha) for this portion of the examination were found to be 0.91 and 0.89 for the two forms (using the two different pictorial MCQ booklets).

PMP format

Based on data from the field test, the American Board of Emergency Medicine decided to eliminate the use of PMP's for purposes of certification, but did allow them to be used during the first administration of the EMSCE, for research purposes. The candidates were not aware of this decision by the Board, and took the PMP's assuming they were being used for certification. Five of the most discriminating PMP's, based on field test results, were selected for use out of the original 10 PMP's. The ABEM came to the conclusion that PMP's would not be used for certification after reviewing data showing that the PMP's were not as effective in predicting SCE (Part II) scores as were the carefully developed clinically relevant MCQ items. This result from the field test is not particularly surprising in light of the information reviewed in the previous chapter. From the field test data, reliability (Cronbach's Alpha) of the PMP's was found to be 0.77, using Proficiency Index scores and each PMP score as an item.

SCE format

Part II of the EMSCE consisted of five single patient Simulated Patient Encounters (SPE) and two Simulated Situation Encounters (SSE), each requiring concurrent management of three patients.
A single examiner rated each candidate on each SCE, with no examiner seeing a particular candidate more than once. The examiners received extensive training in scoring the SCE's and went through a calibration process the day prior to the beginning of Part II of the examination. (Most of the examiners for Part II had been examiners for the field test, and were experienced in administering SCE's.) Each candidate was rated, using an eight point scale, on seven aspects of competence for each SPE and eight aspects for each SSE. These aspects were:

1. Data acquisition - Completeness (appropriateness) and efficiency of data gathering. Did the candidate collect the appropriate data required to correctly diagnose and manage the patient?

2. Problem solving - Appropriateness of the organization of data collection activities in relation to management decisions. Did collected data help select among reasonable alternative diagnoses while insuring patient stabilization? Did the candidate efficiently arrive at an informed and appropriate management plan?

3. Patient management - Did the candidate treat or direct the appropriate treatments throughout the encounter, including proper referral at a proper time? Was the patient properly attended when directing attention to other patients?

4. Health care provided (outcome) - The candidate's overall performance as viewed from the patient's perspective. By current medical standards, was the patient's condition stabilized and maximally improved by the medical interventions provided?

5. Doctor-patient relations - Demonstrated concern and skill in dealing with the patient's psychological state. What is the examiner's best estimate of the sensitivity and skill level of the candidate in relating to the psychiatric, psychological and sociological (family) aspects of patient care?
6. Comprehension of pathophysiology - Does the candidate understand the scientific basis for his/her actions or is he/she simply relying on memorized routine procedures usually followed in such cases? The examiner had the option of asking standardized questions at the completion of the simulation to assist in rating the candidate on this aspect.

7. Clinical competence (overall) - Overall assessment of the demonstrated competence of the candidate to provide emergency health care in the specific class of conditions contained in the simulation. The level of combined cognitive and procedural skills employed by the candidate in providing health care in this setting. All things considered, how good was the candidate in handling these types of conditions or problems?

8. Resource utilization (SSE's only) - Evaluation of the capability of the candidate to effectively utilize himself and other supporting personnel and resources under the stress of managing a number of patients concurrently.

The score for each problem was calculated by determining the mean of the seven (or eight) ratings given by the examiner for that problem. Examiners were also periodically scheduled during break periods to sit in as observers and independently score SCE's being administered by other examiners. These "verifier" scores were not used for certification purposes, but did serve as a means of quality control of administration and for determining inter-rater reliability. The inter-rater reliability of these examiner-verifier pairs of scores on a single candidate across all problems was 0.81. This reliability is much higher than usually seen for an oral examination, and can probably be attributed to the careful construction of the SCE's, and particularly to the careful training and calibration of the rating standards of the examiners.
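The inter-rater reliability figure just reported is a correlation between paired examiner and verifier scores. A minimal sketch of that kind of computation, using a hand-rolled Pearson product-moment correlation on hypothetical 8-point-scale rating means (not the study's data):

```python
# Pearson product-moment correlation between paired examiner and verifier
# ratings, as a sketch of the inter-rater reliability computation described
# above. The seven paired 8-point-scale means below are hypothetical.

import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

examiner = [5.9, 6.4, 4.8, 7.1, 5.2, 6.0, 6.8]
verifier = [6.1, 6.2, 5.0, 6.9, 5.5, 5.8, 6.7]
r = pearson_r(examiner, verifier)
```

A coefficient near the reported 0.81 indicates that an independent observer's ratings track the primary examiner's closely across problems.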
Generalizability of results

While it might be safest to say that the results of this study will generalize to all emergency physicians practicing in the U.S., the real areas of concern for this study lie in the widespread use of PMP's as supposedly valid tests of physician competence in many medical specialty and licensing examinations, and the accuracy of some of the assumptions underlying their use. As has been previously discussed, PMP's are currently used in many educational and testing situations. The PMP's used in this study were carefully developed in a manner similar to that commonly used in such situations, and the scoring formulae used are those in common use in many licensing and certification applications. In testing situations, particularly in licensing and certification, the use of PMP's is based on some very fundamental assumptions outlined in Chapter I; i.e., PMP's are a valid measure of clinical performance and are predictive of how an examinee will perform in a real clinical situation. The results of this study, testing observable consequences of those assumptions, can therefore be generalized to all situations where PMP's are used in a testing mode and where those assumptions form the basis for their use; i.e., making decisions about the competence of an examinee, as opposed to a teaching or practice/feedback mode. This specifically includes the use of PMP's in licensing and certification examinations of physicians, and could also include the use of PMP's as an evaluation tool in undergraduate medical education (for instance as an element in making promotional decisions) or in licensing and certification examinations for allied health personnel such as nurses or physician assistants.

Questions Summarizing the Logic Underlying the Testable Hypotheses

The basic question underlying this study can be stated as follows: As presently scored (using the NBME scoring method), are PMP's a useful and valid method of evaluating clinical competence?
An alternative way of asking this question might be: Are PMP's a valid substitute for MCQ batteries, or other more complex and expensive methods of measuring clinical competence such as SCE's or other oral examination methods? In Chapter I, four observable consequences of the assumptions concerning the use of PMP's in licensing and certification examinations were presented. The basic question being asked in this study can be related to these four observable consequences by the following specific questions:

1. Are PMP's better at predicting SCE scores obtained three and five months later than they are at predicting MCQ scores?

2. Do PMP's add anything to the information gained through MCQ's in predicting SCE scores (i.e., do they account for significant additional variance in examiner ratings of performance)?

3. Do PMP scores add anything to the ability to discriminate between residency- and practice-eligible candidates beyond that provided by MCQ scores?

4. Are there specific PMP frames or combinations of frames (i.e., data acquisition frames or diagnosis and management frames) that better predict SCE scores? In other words, are there alternative scoring algorithms which involve weighting frames, rather than items, that will improve the ability of PMP's to predict examiner ratings of performance on SCE's, and improve discrimination of residency- and practice-eligible candidates?

Hypotheses and Analysis Methods

Based upon the assumptions which form the rationale for the use of PMP's in licensing and certification, and the specific questions derived from them, which are listed above, the following four hypotheses will be tested.

I. PMP scores correlate to a greater degree with SCE scores than with clinically relevant MCQ scores.

This hypothesis will be tested by correlating PMP scores (NBME scoring method, as well as Proficiency Index and Efficiency Index) with SCE scores, and PMP scores with MCQ scores, using a Pearson product-moment correlation.
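Hypothesis I compares two correlations computed on the same candidates and sharing a common variable, so an ordinary independent-samples comparison does not apply. A sketch of one standard dependent-correlations test, using Steiger's (1980) Z as an illustrative stand-in; the study itself cites Glass and Stanley (1970), whose statistic may differ in detail, and the input values below are invented:

```python
# Sketch of a Z-test for two dependent correlations sharing one variable
# (e.g., r(PMP, SCE) versus r(PMP, MCQ) on the same candidates). Steiger's
# (1980) formulation is used here as an illustrative stand-in; the statistic
# in Glass and Stanley (1970), which the study cites, may differ in detail.

import math

def steiger_z(r_jk, r_jh, r_kh, n):
    """Test H0: rho_jk == rho_jh, where j is the shared variable."""
    z_jk, z_jh = math.atanh(r_jk), math.atanh(r_jh)
    r_bar = (r_jk + r_jh) / 2.0
    # Pooled estimate of the covariance between the two correlations.
    psi = (r_kh * (1 - 2 * r_bar ** 2)
           - 0.5 * r_bar ** 2 * (1 - 2 * r_bar ** 2 - r_kh ** 2))
    s_bar = psi / (1 - r_bar ** 2) ** 2
    return (z_jk - z_jh) * math.sqrt((n - 3) / (2 - 2 * s_bar))

z = steiger_z(r_jk=0.50, r_jh=0.30, r_kh=0.20, n=100)  # illustrative values
```

A value exceeding the one-tail critical value of 1.65 at α=.05 would lead to rejecting the hypothesis of no difference between the two correlations.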
Z-tests of the significance of the differences will be calculated (Glass and Stanley, 1970).

II. PMP scores account for a portion of variance of SCE scores beyond that contributed by clinically relevant MCQ scores.

A stepwise multiple regression analysis will be performed, with an F-test for the significance of the addition of PMP to the MCQ B-weight. Multiple regression analyses will be done using the computer program Statistical Package for the Social Sciences (Nie et al., 1975).

III. PMP scores add to the ability of MCQ scores to discriminate between residency- and practice-eligible candidates.

A stepwise discriminant analysis will be performed, with an F-test for the significance of the addition of PMP to the MCQ B-weight. Discriminant analyses will be done using the computer program Statistical Package for the Social Sciences (Nie et al., 1975).

IV. All frame scores of PMP's correlate equally with SCE scores.

This hypothesis will be tested by Z-tests of significance of the differences between all possible pairwise correlations between all frame scores of PMP's and the SCE scores (Glass and Stanley, 1970). If differences are observed, a multiple regression analysis will identify which PMP frames contribute most to predicting SCE scores and which frames add little or nothing to this predictive relationship. New scoring algorithms will be developed if warranted by the data.

Summary

The first administration of the Emergency Medicine Specialty Certification Examination of the American Board of Emergency Medicine has provided data that will be used to test four observable consequences of assumptions concerning the use of Patient Management Problems in licensing and certification examinations as valid indicators of physician competence.
Scores from 509 candidates who took Part I (237 MCQ items and five PMP's) and from 315 of those candidates who passed Part I and took Part II (seven Simulated Clinical Encounters) will be used to test four hypotheses derived from the observable consequences of the assumptions upon which the use of PMP's is based. The overall goal of this study is to test the criterion-related validity of PMP's in a certification examination using performance on the SCE's as the criterion, and to explore methods of improving the validity of PMP's through changes in the scoring methods used with PMP's. The results of the analyses performed on these data will be presented in Chapter IV.

CHAPTER IV

RESULTS

Introduction

This chapter presents the results of statistical analyses of data gathered during the first administration of the American Board of Emergency Medicine certification examination, and their interpretation in relation to the four hypotheses of this study. Following a brief description of the test results from the first administration, tables are presented summarizing the correlations between the three PMP scores used in this study and the multiple-choice battery (MCQ) and the Simulated Clinical Encounters (SCE). The three PMP scores are the National Board of Medical Examiners Index (NBME), the Efficiency Index (EI), and the Proficiency Index (PI) as defined in Chapter I (Table 1.1). Correlations between the PMP scores and MCQ scores and SCE scores are then presented, along with the results of Z-tests of the significance of the difference between them. These results are interpreted in relation to Hypothesis I of this study. Then, following a brief description of regression analysis, the results of the analyses for Hypothesis II are presented. These consist of regression analyses using SCE scores as the dependent variable, and PMP and MCQ scores as the independent variables.
Next is a brief description of discriminant analysis, with analyses related to Hypothesis III. For this hypothesis discriminant analyses are performed using residency- and practice-eligible categories as the dependent variables, and PMP and MCQ scores as the independent or predictor variables. Hypothesis IV results are then presented, consisting of Z-tests of the significance of the difference between all pair-wise correlations between PMP frame scores and SCE scores. Finally, additional analyses are presented concerning attempts to improve the ability of PMP scores to predict SCE scores by developing a scoring scheme which weights individual frames of the PMP's.

Summary of Test Results

For Part I (MCQ's), with 509 candidates, the mean score was 76.6% (SD=7.9%). For the PMP's the mean NBME score was 41.8 (SD=2.5), the mean EI score was .836 (SD=.054), and the mean PI score was .740 (SD=.077). The mean score for the 315 candidates taking Part II (SCE's) was 5.89 (SD=.7) on an 8 point scale. For the 157 who took Part II in May, the mean was 6.01 (SD=.70), while for the July administration to 158 candidates the mean was 5.76 (SD=.68). These two means are significantly different (t=3.246), but this has no effect on the correlations of interest, so the two groups are combined for analysis. Figures 4.1-4.7 show the distributions for each of these sets of scores.

[Figures 4.1-4.7: Histograms of the distributions of MCQ, NBME, EI, PI, and SCE (total, May session, and July session) scores; figure bodies not recoverable from the source]
As can be seen in Figure 4.1, the distribution of the MCQ scores approximates a normal curve skewed in the negative direction, with the major portion of the candidates scoring above the 75% pass-fail cut point. This type of negatively skewed distribution of scores would be expected in a specialty certification examination where most of the examinees are more likely to be among the more competent practitioners. This type of distribution would also be expected in a criterion-referenced examination such as this. Figures 4.2 - 4.4 show the distributions for the NBME, EI and PI scores, respectively. The NBME and PI scores both show a very negatively skewed distribution, while the EI score distribution does not show such a pronounced negative skew.
Figures 4.5 - 4.7, for the total SCE distribution, the SCE distribution of the May candidates and the July candidates, respectively, show a nearly normal distribution with only a slight negative skew. The reliabilities for Part I and Part II during the first administration were reported by Maatsch (1980). The KR-20 for Part I (multiple-choice battery) was .90 (SEM=2.5%), while for Part II (Simulated Clinical Encounters) it was .57 (SEM=.46 of a rating point on an 8 point scale). The five PMP problems used in this study had a reliability of .67 (SEM=1.46), calculated using the formula for Cronbach's alpha with NBME scores for each of the five PMP's as items. The following results are based on the observed correlations among these three approaches to testing which were used in this first administration of the ABEM certification examination.

Correlation Summaries for PMP, MCQ and SCE Scores

Tables 4.1 - 4.3 summarize the Pearson product-moment correlations between the three PMP scores used in this study and the scores on the multiple-choice battery (MCQ) of Part I of the ABEM examination and the scores on the Simulated Clinical Encounters (SCE) of Part II. The three PMP scores are the NBME index (NBME), the Efficiency Index (EI) and the Proficiency Index (PI). These scores were defined in Table 1.1. The values used for each of these PMP scores are the average scores the candidate achieved across the five PMP's administered. The MCQ score is the percent correct on the Part I multiple-choice battery, and the SCE score is the average score achieved by the individual across all seven SCE's of Part II. These tables also illustrate the breakdown of the 509 candidates who were the subjects used in this study. The 509 candidates are first separated into two groups, those who passed and those who failed Part I (the MCQ battery) in February 1980. There were 331 who passed with a score of 75% or better, and 178 who failed.
Of the 331 who passed Part I, 315 went on to take Part II (the SCE's). The first session of Part II, in May 1980 was attended by 157 candidates and the July session by 158 candidates. Those who failed Part I did not take Part II. Tables 4.1, 4.2 and 4.3 present the summaries for NBME, PI and EI scores, respectively.

[Tables 4.1-4.3: Summaries of NBME, PI, and EI score correlations with MCQ and SCE scores, broken down by Part I pass/fail status and by May/July Part II sessions; table bodies not recoverable from the source]

As can be seen in these tables, the correlation between the MCQ battery and the SCE's is statistically significant and moderately high (r=0.43). The correlations of the PMP scores with the MCQ battery, while in some cases reaching statistical significance, generally are relatively low, particularly for the group who passed Part I. The correlations between the PMP scores and the SCE scores shown in these three tables generally are quite low. A comparison of Table 4.1 with 4.2 shows that the correlations using NBME scores and PI scores parallel each other closely.
This supports the results shown in Table 4.4, which are the correlations between the average scores on PMP's using NBME, EI and PI scoring. The correlation between NBME and PI is very high (r=0.97). On the other hand there is a very low negative or zero correlation of these two scores with the EI score. This relationship is essentially unchanged when separately considering those who passed Part I and those who failed Part I, as shown in Tables 4.5 and 4.6. Table 4.3 presents the EI score correlations with MCQ and SCE scores. Despite the fact that there is zero or low negative correlation with NBME and PI scores, the EI scores correlate with MCQ and SCE scores within the same low range (r ≈ .1) as the NBME and PI scores.

TABLE 4.4
CORRELATIONS BETWEEN NBME, EI AND PI AVERAGE SCORES
n=509

         NBME      EI        PI
NBME     1.000
EI      -0.068     1.000
PI       0.970a   -0.142a    1.000

a p <.05

TABLE 4.5
CORRELATIONS BETWEEN NBME, EI AND PI SCORES FOR CANDIDATES PASSING PART I
n=331

         NBME      EI        PI
NBME     1.000
EI      -0.188a    1.000
PI       0.968a   -0.293a    1.000

TABLE 4.6
CORRELATIONS BETWEEN NBME, EI AND PI SCORES FOR CANDIDATES FAILING PART I
n=178

         NBME      EI        PI
NBME     1.000
EI       0.000     1.000
PI       0.970a   -0.052     1.000

a p <.05

There also appears to be a difference in correlations between the group which took Part II in May and the group which took it in July. The NBME and PI correlations with MCQ and SCE scores appear to be higher in May than in July, while for the EI scores the relationship is reversed. These differences are more apparent than real when one considers the squares of the correlations (an estimate of the variance in MCQ or SCE scores accounted for by the PMP scores.) Even though some of the correlations are statistically significant, they are quite low and all are on the very low part of the curve which relates correlations of two variables with variance accounted for in one variable from knowledge of the other variable (variance accounted for = r²).
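The r-versus-r² relationship can be made concrete using correlations reported in this chapter:

```python
# Correlations reported in this chapter, squared to give the proportion of
# variance accounted for (values taken from the text and Table 4.7).

correlations = {"MCQ with SCE": 0.43, "NBME with MCQ": 0.141, "EI with SCE": 0.109}
variance_accounted = {name: r * r for name, r in correlations.items()}
# r = 0.43 explains roughly 18% of criterion variance; r near 0.1 explains 1-2%.
```

So even a "statistically significant" PMP correlation near 0.1 explains only one or two percent of score variance, while the MCQ-SCE correlation of 0.43 explains nearly twenty times as much.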
Therefore, there are relatively small differences in the variances accounted for in the May and July data sets.

Results Concerning Hypothesis I

Hypothesis I stated that, since PMP's are designed to measure problem-solving skills rather than knowledge, PMP scores will correlate to a greater degree with SCE scores than with clinically relevant MCQ scores. Table 4.7 presents the correlations of each of the three PMP scores used in this study with the MCQ score and the SCE score, and the calculated Z-test statistic for the difference of two dependent correlation coefficients as presented by Glass and Stanley (1970). These Z-test statistics test alternative hypotheses concerning the difference between two correlation coefficients, e.g., "no difference" versus "significant difference" between the two correlation coefficients. For a one-tail test with an α=.05, the critical value is 1.65. As Table 4.7 shows, none of the Z-test statistics calculated for the three PMP scores reaches this level. The decision rule for this test is to reject the hypothesis of no difference between the correlations if the Z-test statistic is greater than the critical value. Since none of the Z-test statistics reached the critical value, it must be concluded that there is no significant difference between the correlations of any of the PMP scores with the MCQ scores and with the SCE scores. Therefore, it may be concluded that these results do not support Hypothesis I, which stated that PMP scores will correlate to a greater degree with SCE than with MCQ scores.

TABLE 4.7
PMP SCORE CORRELATIONS WITH MCQ AND SCE SCORES AND Z-TESTS OF SIGNIFICANCE OF DIFFERENCE
n=315

         Correlations           Z-test
         MCQ       SCE          Calculated
NBME     0.141a    0.075        1.109
EI       0.067     0.109a       0.704
PI       0.124a    0.084        0.664

a p <.05

Results Concerning Hypothesis II

Hypothesis II stated that PMP scores will account for a portion of variance of SCE scores beyond that contributed by MCQ scores.
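The variance-increment question posed by Hypothesis II corresponds to an R²-change F-test in hierarchical regression. A sketch on synthetic data; the variable names and effect sizes below are invented for illustration, and the study itself used SPSS stepwise regression rather than this hand computation:

```python
# Hierarchical regression sketch: R-squared change and its F-test when a PMP
# score is added after an MCQ score. All data are synthetic; "mcq", "pmp" and
# the effect sizes are invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
n = 300
mcq = rng.normal(size=n)
pmp = rng.normal(size=n)
sce = 0.45 * mcq + 0.05 * pmp + rng.normal(size=n)  # PMP adds almost nothing

def r_squared(predictors, y):
    """OLS R-squared for y regressed on the given predictor arrays (with intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_reduced = r_squared([mcq], sce)    # MCQ alone
r2_full = r_squared([mcq, pmp], sce)  # MCQ plus PMP
q, p = 1, 2  # q predictors added, p predictors in the full model
f_change = ((r2_full - r2_reduced) / q) / ((1 - r2_full) / (n - p - 1))
```

When the added predictor carries little unique information, the R² change is near zero and the F ratio fails to reach significance, which is the pattern the analyses below report for the PMP scores.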
Regression analysis was used to analyze the data for this hypothesis, and a brief outline of this technique precedes the presentation of the results. Multiple regression is a statistical technique for analyzing relationships between a dependent or criterion variable (in this study it is the SCE scores) and one or more independent or predictor variables (in this case, the PMP and MCQ scores.) The general form of the regression equation is

Y' = A + B1X1 + B2X2 + ... + BkXk

where Y' is the estimated value of Y (the criterion variable), A is a constant equal to the Y intercept, and the Bi are regression coefficients for the Xi predictor variables. The values for the A and Bi coefficients are selected so that the sum of squared residuals Σ(Y-Y')² is minimized. Besides generating equations to estimate values for the criterion variable from the values of predictor variables, it is also possible to determine the amount of contribution of the predictor variables and their relative importance to the criterion; i.e., the amount of variance of the criterion variable explained by each predictor variable. This latter use of regression analysis is the primary focus for testing Hypothesis II. Several statistics found in the summary tables are important in interpreting a regression analysis, and these are briefly explained below:

1. F to enter - a computed F ratio to determine if the value of Bi for the predictor variable entered at that stage of the analysis is significantly different from zero.

2. Significance - the significance level of the above F ratio. If the F ratio is not significant (in this study α=.05), the Bi for the predictor variable is not essentially different from zero and the predictor is not contributing anything to estimating the value of the criterion variable.

3. Multiple R - this figure gives the relative strength and the direction (positive or negative)
4. R² — indicates the proportion of variation in the criterion explained by all the predictors entered into the equation at that point.

5. R² Change — indicates the proportion of variation in the criterion attributable to the predictor variable entered in that step of the analysis.

The summaries of the results of the regression analyses using SCE scores as the criterion variable and PMP scores and MCQ scores as the predictor variables are presented in Tables 4.8 - 4.10. In each case the computer program selects first the variable which contributes the most to the prediction of the criterion. In all three cases the MCQ score is entered before the PMP score, indicating the MCQ score contributes more to the prediction of the SCE score. The significance of the F to enter for all three PMP scores fails to reach significant levels. As a confirmatory analysis to further explore the relative contribution of PMP scores to the prediction of SCE scores, the regression analyses were repeated, forcing the entry of the PMP scores before the MCQ scores. The results of these analyses are presented in Tables 4.11 - 4.13. Again, in all three cases the F-ratio to enter the PMP scores fails to reach significant levels.

[Tables 4.8 - 4.13 (summary tables for the regression analyses of NBME, EI, and PI scores and MCQ scores on SCE scores, including forced initial entry of the PMP scores) are not legible in this copy.]
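The R² Change logic underlying these analyses can be sketched with synthetic data (not the study's data); the 0.44 coefficient below is an arbitrary illustrative choice that gives MCQ alone an R² in the general vicinity of the ~19% reported, while the simulated PMP score is unrelated to the criterion:

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an ordinary least squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(0)
mcq = rng.normal(size=315)
pmp = rng.normal(size=315)               # unrelated to the criterion
sce = 0.44 * mcq + rng.normal(size=315)  # criterion driven by MCQ only

r2_step1 = r_squared(mcq[:, None], sce)                 # MCQ entered first
r2_step2 = r_squared(np.column_stack([mcq, pmp]), sce)  # then PMP added
r2_change = r2_step2 - r2_step1                         # near zero here
```

Because the models are nested, R² can only increase when a predictor is added; the question the analysis asks is whether the increase is large enough to matter.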
Examination of the change in R² in Tables 4.8 - 4.10 shows that MCQ scores account for about 19% of the variance in SCE scores, while the PMP scores account for less than 1%, and as little as 0.02% in the case of NBME scores. Tables 4.11 - 4.13 show similar results. From these results it can be concluded that PMP scores do not make a significant contribution to predicting SCE scores, and, therefore, PMP scores do not account for a portion of variance of SCE scores beyond that contributed by MCQ scores, as stated in Hypothesis II.

Results Concerning Hypothesis III

According to Hypothesis III, PMP scores will add to the ability of MCQ scores to discriminate between residency- and practice-eligible candidates. Discriminant function analysis was used to analyze these data. Before presenting the results, a brief description of this technique is given. Discriminant analysis is a multivariate statistical technique in which one or more linear equations are developed using a series of "discriminating" variables to statistically distinguish between two or more groups in a sample population. The equations are of the form

Di = di1Z1 + di2Z2 + ... + dipZp

where Di is the score on the discriminant function i, the di's are the derived weighting coefficients, and the Z's are the standardized values of the p discriminating variables chosen for use in the analysis.
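A discriminant score of this form is simply a weighted sum of standardized variables; a minimal sketch follows, where the coefficients are hypothetical placeholders (in practice they are derived by the analysis itself):

```python
def discriminant_score(coeffs, values, means, sds):
    """D = d1*Z1 + ... + dp*Zp, where Zi = (Xi - mean_i) / sd_i."""
    zs = [(v - m) / s for v, m, s in zip(values, means, sds)]
    return sum(d * z for d, z in zip(coeffs, zs))

# Hypothetical coefficients for two discriminating variables (PMP, MCQ).
# A candidate exactly at the sample means scores D = 0 by construction.
d = discriminant_score([0.1, 0.9], [0.80, 0.751], [0.80, 0.751], [0.05, 0.08])
```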
Ideally, variables are chosen in such a manner that their values are relatively high for members of one of the groups to be distinguished, and relatively low for the other group(s). The resulting equation(s) will then provide discriminant scores (D's) which will cluster around a common value for a particular group. The equation(s) are derived in such a way that these group values are maximally separated. The maximum number of equations derived is limited to one less than the number of groups, or equal to the number of discriminating variables, whichever is less. In this study there were two groups to be discriminated (residency-eligible and practice-eligible physicians) using two discriminating variables (PMP scores and MCQ scores), so only one equation was derived. The equations produced by this technique can be used for two purposes: classification and analysis. Classification entails using the equations to decide to which group new cases belong. The focus of this study is using the discriminant equations for analysis, specifically to look at the contribution of each variable to the ability to discriminate between residency- and practice-eligible physicians. There are two statistics which are important in interpreting the following discriminant analyses. The first is the F-ratio, used to determine whether or not the values of the di's are significantly different from zero (in the case of "F to enter"), or to determine whether or not the Wilks' lambda is significantly different from 1.0 (labelled "F1,n-2" in the following tables). The other statistic is Wilks' lambda, which ranges between zero and one, and is an inverse measure of the discriminating power of the variables being analyzed. The lower the value of lambda, the better the discriminating ability of the variable being considered.
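For a single variable and two groups, Wilks' lambda reduces to the within-group sum of squares divided by the total sum of squares, and the associated F statistic has 1 and n-2 degrees of freedom; a sketch of this special case:

```python
import numpy as np

def wilks_lambda_two_groups(x_a, x_b):
    """Wilks' lambda and F(1, n-2) for one variable and two groups."""
    pooled = np.concatenate([x_a, x_b])
    ss_total = ((pooled - pooled.mean()) ** 2).sum()
    ss_within = ((x_a - x_a.mean()) ** 2).sum() + ((x_b - x_b.mean()) ** 2).sum()
    lam = ss_within / ss_total
    f = (1.0 - lam) / lam * (len(pooled) - 2)
    return lam, f

# Well-separated groups give lambda near 0 (strong discrimination);
# identical groups give lambda of 1 (no discrimination) and F of 0.
lam, f = wilks_lambda_two_groups(np.array([1.0, 2.0, 3.0]),
                                 np.array([11.0, 12.0, 13.0]))
```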
Summaries of the results of discriminant analyses using residency-eligible and practice-eligible physicians as the groups to be discriminated, and PMP scores and MCQ scores as the discriminating variables, are presented in Tables 4.14 - 4.19. The discriminating variables were entered into a stepwise discriminant analysis using a selection criterion that minimizes Wilks' lambda. In this procedure the variables are entered one at a time until exhausted or until the change in Wilks' lambda is insignificant, indicating the remaining variables are contributing little or nothing to the discrimination of the groups. Table 4.14 shows the raw score means and standard deviations for each of the criterion groups, as well as the Wilks' lambda and its associated F-ratio for NBME scores and MCQ scores. As can be seen, the Wilks' lambda values for both variables indicate relatively little discriminating power for either NBME or MCQ scores.

TABLE 4.14
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES
n=509

       Practice-eligible     Residency-eligible     Wilks'
       Mean      S.D.        Mean      S.D.         Lambda     F1,507
NBME   41.7      2.65        42.3      2.10         .9918      4.181a
MCQ    .751      .080        .818      .045         .8740      73.11a

a p<.05

TABLE 4.15
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF PMP SCORES AND MCQ SCORES
n=509

Step   Variable Entered   Variable Remaining   Wilks' Lambda   F To Enter
1      MCQ                                     .8740           73.11a
                          NBME                 .8739           .0614
2      (F level insufficient for further computation)

a p<.05

TABLE 4.16
GROUP DISCRIMINATION BY EI SCORES AND MCQ SCORES
n=509

       Practice-eligible     Residency-eligible     Wilks'
       Mean      S.D.        Mean      S.D.         Lambda     F1,507
EI     .833      .056        .846      .047         .9894      5.429a
MCQ    .751      .080        .818      .045         .8740      73.11a

a p<.05

TABLE 4.17
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF EI SCORES AND MCQ SCORES
n=509

Step   Variable Entered   Variable Remaining   Wilks' Lambda   F To Enter
1      MCQ                                     .8740           73.11a
                          EI                   .8737           0.1471
2      (F level insufficient for further computation)

a p<.05

TABLE 4.18
GROUP DISCRIMINATION BY PI SCORES AND MCQ SCORES
n=509

       Practice-eligible     Residency-eligible     Wilks'
       Mean      S.D.        Mean      S.D.         Lambda     F1,507
PI     .736      .080        .755      .061         .9884      5.942a
MCQ    .751      .080        .818      .045         .8740      73.11a

a p<.05

TABLE 4.19
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF PI SCORES AND MCQ SCORES
n=509

Step   Variable Entered   Variable Remaining   Wilks' Lambda   F To Enter
1      MCQ                                     .8740           73.11a
                          PI                   .8740           0.0075
2      (F level insufficient for further computation)

a p<.05

The results of the stepwise discriminant analysis for NBME and MCQ scores are summarized in Table 4.15. The MCQ scores were entered first, indicating they have the greater discriminating ability of the two variables. After Step 1 was completed, the NBME score remained, and the Wilks' lambda value in the second line of the table indicates the value if the NBME score were entered in the next step. As can be seen, it would only decrease by .0001, and the resulting F to enter the NBME scores is clearly not significant. The program was set so that the minimum F ratio to enter a variable was 1.00. Therefore, Step 2 was not done because the F level was insufficient for further computation. From this it can be concluded that NBME scores are not contributing to the discrimination of residency- and practice-eligible physicians. The results illustrated in Tables 4.16 - 4.17 for EI scores, and in Tables 4.18 - 4.19 for PI scores, show exactly the same results. In both cases the PMP scores were not entered into the equation because of a lack of discriminating ability.
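The stopping rule ("F level insufficient for further computation") can be sketched as follows. This is a simplification: a real stepwise procedure recomputes each remaining variable's F-to-enter after every step, whereas here the values are treated as fixed:

```python
def stepwise_entry(f_to_enter, min_f=1.00):
    """Enter variables in descending F-to-enter order until the best
    remaining F falls below the minimum (1.00, as set in this study)."""
    remaining = dict(f_to_enter)
    entered = []
    while remaining:
        best = max(remaining, key=remaining.get)
        if remaining[best] < min_f:
            break  # F level insufficient for further computation
        entered.append(best)
        del remaining[best]
    return entered

# With the step-1 values from Table 4.15, only MCQ is entered.
order = stepwise_entry({"MCQ": 73.11, "NBME": 0.0614})
```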
These results are duplicated when the candidates who passed Part I and who failed Part I are considered separately, as summarized in Tables 4.20 - 4.21 and Tables 4.22 - 4.23, respectively. The results for NBME scores are shown in these tables. The results for EI scores and PI scores were similar: when those who pass Part I and those who fail Part I are considered separately, the PMP scores are not entered into the discriminant equation.

TABLE 4.20
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO PASSED PART I
n=331

       Practice-eligible     Residency-eligible     Wilks'
       Mean      S.D.        Mean      S.D.         Lambda     F1,329
NBME   42.2      2.40        42.3      2.16         .9999      .0315
MCQ    .807      .036        .826      .035         .9389      21.41a

a p<.05

TABLE 4.21
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO PASSED PART I
n=331

Step   Variable Entered   Variable Remaining   Wilks' Lambda   F To Enter
1      MCQ                                     .9389           21.41a
                          NBME                 .9383           .2219
2      (F level insufficient for further computation)

a p<.05

TABLE 4.22
GROUP DISCRIMINATION BY NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO FAILED PART I
n=178

       Practice-eligible     Residency-eligible     Wilks'
       Mean      S.D.        Mean      S.D.         Lambda     F1,176
NBME   41.1      2.83        42.4      1.46         .9878      2.168
MCQ    .676      .058        .724      .034         .9647      6.441a

a p<.05

TABLE 4.23
SUMMARY OF STEPWISE DISCRIMINANT ANALYSIS OF NBME SCORES AND MCQ SCORES FOR CANDIDATES WHO FAILED PART I
n=178

Step   Variable Entered   Variable Remaining   Wilks' Lambda   F To Enter
1      MCQ                                     .9647           6.441a
                          NBME                 .9610           .6722
2      (F level insufficient for further computation)

a p<.05

TABLE 4.24
CLASSIFICATION ANALYSIS USING MCQ SCORES AS DISCRIMINANT FUNCTION

Actual Group           n      Predicted Group Membership
Membership                    Residency-        Practice-
Residency-eligible     116    94 (81.0%)        22 (19.0%)
Practice-eligible      393    164 (41.7%)       229 (58.3%)

Total Cases Correctly Classified: 323 (63.5%)
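The classification figures in Table 4.24 can be checked against the majority-class base rate with simple arithmetic:

```python
n_residency, n_practice = 116, 393  # group sizes from Table 4.24
n_total = n_residency + n_practice  # 509 candidates
achieved = (94 + 229) / n_total     # correctly classified cells of Table 4.24
base_rate = n_practice / n_total    # classify everyone as practice-eligible
# achieved is about 63.5%; the no-information base rate is about 77.2%
```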
As mentioned earlier, the high values of Wilks' lambda for both the PMP scores and the MCQ scores indicate a relatively low discriminating ability for these scores. This is further illustrated when considering the results of classifying the candidates using the equation generated by the discriminant analysis. When each candidate is classified as residency- or practice-eligible based on the results of using the discriminant equation (all three equations generated using the three PMP scores are the same, since only the MCQ scores were entered in each analysis), the result is 63.5% correct classifications, as shown in Table 4.24. In many respects this must be considered a poor performance, since knowing there are a majority of practice-eligible candidates in the total population (116 residency-eligible and 393 practice-eligible), one can achieve a correct classification rate of 77.2% by simply classifying everyone as practice-eligible. Since the three PMP scores were never entered into the discriminant equations, it can be concluded that PMP scores do not add to the ability of MCQ scores to discriminate between residency- and practice-eligible candidates.

Results Concerning Hypothesis IV

The fourth hypothesis in this study states that all frame scores of PMP's correlate equally with SCE scores. This hypothesis is tested by calculating Z-test statistics for the difference of two dependent correlation coefficients (Glass and Stanley, 1970) for all pairwise combinations of frame score correlations with SCE scores. Table 4.25 summarizes the correlations of each frame score (totalled across all five problems) with the SCE scores of Part II. Table 4.26 presents the calculated Z-test statistics for significance of differences between all possible pairs of the correlations of NBME frame scores and SCE scores found in the first column of Table 4.25. Tables 4.27 and 4.28 do the same for EI and PI scores, respectively.
If the fourth hypothesis is correct, there should be no significant difference between any pair of frame score correlation coefficients.

TABLE 4.25
CORRELATIONS OF PMP FRAME SCORES WITH SCE SCORES FOR NBME, EI AND PI SCORES
n=315

Frame              NBME, SCE    EI, SCE    PI, SCE
(Hx)         1     .0752        .0427      .0601
(Dx1)        2     -.0197       -.0280     -.0155
(Px)         3     .0423        .0297      .0132
(Dx2)        4     -.0060       .0331      -.0113
(Lab)        5     .0221        .1200a     .0246
(Diff. Dx)   6     .0085        .0196      .0316
(Mgt)        7     .1322a       .1302a     .2244a

a p<.05

TABLE 4.26
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED NBME FRAME SCORE CORRELATIONS WITH SCE SCORES
n=315

Frame              1       2       3       4       5       6       7
(Hx)         1     --
(Dx1)        2     1.34    --
(Px)         3     0.55    0.87    --
(Dx2)        4     1.10    0.17    0.61    --
(Lab)        5     0.77    0.55    0.38    0.35    --
(Diff. Dx)   6     0.88    0.34    0.43    0.18    0.20    --
(Mgt)        7     0.80    1.97a   1.23    1.72    1.67    2.99a   --

a exceeds critical value of 1.96 (α=.05)

TABLE 4.27
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED EI FRAME SCORE CORRELATIONS WITH SCE SCORES
n=315

Frame              1       2       3       4       5       6       7
(Hx)         1     --
(Dx1)        2     0.99    --
(Px)         3     0.26    0.81    --
(Dx2)        4     0.14    1.33    0.05    --
(Lab)        5     1.08    2.06a   1.24    1.25    --
(Diff. Dx)   6     0.32    0.70    0.13    0.23    1.68    --
(Mgt)        7     1.18    2.05a   1.32    0.29    0.17    2.11a   --

a exceeds critical value of 1.96 (α=.05)

TABLE 4.28
CALCULATED Z-TEST STATISTICS FOR SIGNIFICANCE OF DIFFERENCE OF PAIRED PI FRAME SCORE CORRELATIONS WITH SCE SCORES
n=315

Frame              1       2       3       4       5       6       7
(Hx)         1     --
(Dx1)        2     1.07    --
(Px)         3     0.83    0.40    --
(Dx2)        4     0.97    0.06    0.31    --
(Lab)        5     0.54    0.52    0.21    0.44    --
(Diff. Dx)   6     0.37    0.57    0.23    0.57    0.10    --
(Mgt)        7     2.31a   3.13a   2.86a   3.04a   2.96a   3.32a   --

a exceeds critical value of 1.96 (α=.05)

The hypothesis that there is no difference between two correlation coefficients is rejected if the calculated statistic is greater than the critical value. For a two-tail test with an α=.05, this critical value is 1.96.
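The Glass and Stanley (1970) statistic itself is not reproduced in the text. As an illustration only, the sketch below uses one classic formulation for comparing two dependent correlations that share a variable (Hotelling's t, with df = n - 3); whether this matches the formula actually used in the study is an assumption, and the inter-frame correlation r23 below is a hypothetical value not reported in this section:

```python
import math

def dependent_r_test(r12, r13, r23, n):
    """Test statistic for H0: rho12 = rho13, where variables 2 and 3
    are both correlated with variable 1 (e.g., two frame scores each
    correlated with the SCE score)."""
    det = 1.0 - r12**2 - r13**2 - r23**2 + 2.0 * r12 * r13 * r23
    return (r12 - r13) * math.sqrt((n - 3) * (1.0 + r23) / (2.0 * det))

# Frame 7 vs. frame 6 correlations with SCE from Table 4.25 (PI scores);
# r23 = 0.30 is an assumed placeholder.
stat = dependent_r_test(0.2244, 0.0316, 0.30, 315)
```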
As can be seen in Table 4.26 for NBME scores, the calculated Z-test statistic exceeds the critical value for frame 7 (management frame) paired with both frame 2 (preliminary diagnosis) and frame 6 (differential diagnosis). For these pairs, the hypothesis that the correlations are equal must be rejected. Table 4.27 for EI scores shows that the calculated statistic for these same pairs exceeds the critical value, as well as for frame 5 (lab) with frame 2. For PI scores (Table 4.28), the Z-test statistic exceeds the critical value for the pairing of frame 7 with all the other frames. From these results it must be concluded that all frame scores do not correlate equally with SCE scores. At the same time it must be noted that nearly all of the correlations of frame scores with SCE scores are essentially zero. Frame 7 in all three scoring methods and frame 5 for EI scoring are the only frames with correlations reaching even a level of 0.1.

Summary of Results for Tests of Hypotheses

The statistical analyses performed to test the four hypotheses of this study can be summarized as follows:

1. Calculated Z-test statistics for significance of differences between PMP score correlations with MCQ scores and SCE scores show no significant differences between the correlations. Since the null hypothesis of no difference could not be rejected, there is no support for Hypothesis I, which stated that PMP scores will correlate to a greater degree with SCE scores than with MCQ scores. The actual correlations show very low or essentially zero correlations of any of the three types of PMP scores with either MCQ or SCE scores.

2. Results of regression analyses indicated that none of the PMP scores using the three PMP scoring methods make any significant contribution to predicting SCE scores. The R² values indicate that MCQ scores account for about 19% of the variance in SCE scores while the PMP scores account for essentially none of the variation in SCE scores.
Therefore, there is no support for Hypothesis II, which stated that PMP scores will account for a portion of variance of SCE scores beyond that contributed by MCQ scores.

3. Stepwise discriminant function analyses performed to assess the relative abilities of PMP scores and MCQ scores to discriminate residency- and practice-eligible candidates show that MCQ scores are the better discriminators of the two, and that for each of the three PMP scores the F level is not sufficient to enter it into the discriminant equation. Therefore, there is no support for Hypothesis III, which stated that PMP scores will add to the ability of MCQ scores to discriminate practice- and residency-eligible candidates.

4. Calculation of Z-test statistics for the significance of differences between all possible pair-wise correlations of PMP frame scores with SCE scores shows that there are two or more pairs of correlations which are significantly different for each of the three PMP scoring methods. These pairs involve frame 7 (management) in all but one case. Therefore, the null hypothesis of no differences in all pair-wise correlations of PMP scores with SCE scores must be rejected in favor of the alternative that there are differences in the correlation of PMP frame scores with SCE scores. This means that Hypothesis IV of this study (all PMP frame scores correlate equally with SCE scores) cannot be supported.

Results of Additional Analyses

The results for Hypothesis IV (showing that all sections or frames of a PMP do not correlate equally with SCE scores) suggest the possibility that weighting those frames with higher SCE score correlations might improve the ability of PMP's to predict the SCE scores being used as the criterion measure. The results of an attempt to define such a revised scoring scheme are presented in this section. As was done for Hypothesis IV, the scores for each frame were totalled across all five problems.
These total frame scores were then used to calculate the correlations between PMP frame scores and MCQ scores, and between PMP frame scores and SCE scores. The results are presented in Tables 4.29 - 4.31 for NBME, EI, and PI scores, respectively.

TABLE 4.29
CORRELATION OF NBME FRAME SCORES WITH MCQ AND SCE SCORES
n=315

Frame              MCQ      SCE
(Hx)         1     .057     .075
(Dx1)        2     .072     -.020
(Px)         3     .013     .042
(Dx2)        4     -.029    .006
(Lab)        5     .005     .022
(Diff. Dx)   6     .214a    .009
(Mgt)        7     .280a    .132a
Total              .141a    .075

a p<.05

TABLE 4.30
CORRELATION OF EI FRAME SCORES WITH MCQ AND SCE SCORES
n=315

Frame              MCQ      SCE
(Hx)         1     -.022    .043
(Dx1)        2     -.027    -.028
(Px)         3     .016     .030
(Dx2)        4     -.009    .033
(Lab)        5     .133a    .120a
(Diff. Dx)   6     .211a    .020
(Mgt)        7     .277a    .130a
Total              .067     .109a

a p<.05

TABLE 4.31
CORRELATION OF PI FRAME SCORES WITH MCQ AND SCE SCORES
n=315

Frame              MCQ      SCE
(Hx)         1     .059     .060
(Dx1)        2     .074     -.016
(Px)         3     -.009    .013
(Dx2)        4     -.022    -.011
(Lab)        5     .030     .025
(Diff. Dx)   6     .250a    .032
(Mgt)        7     .331a    .224a
Total              .124a    .084

a p<.05

Given the results from the four hypotheses presented earlier, it is not surprising that the PMP frame scores show essentially no correlation with SCE scores. The only correlation to reach statistical significance (α=.05) across all three scoring methods is frame 7 (management), but the correlations are so low that they have little practical meaning. For correlations with MCQ scores, only frame 6 (differential diagnosis) and frame 7 reach statistical significance across all three scoring methods, but again are too low to be of any practical value. Three regression analyses were performed using the SCE scores as the dependent or criterion variable and frame scores for each of the three scoring methods as the independent or predictor variables. These results are shown in Tables 4.32 - 4.34. For all three scoring approaches frame 7 (Mgt) is entered first, indicating it contributes most to predicting SCE scores. In fact, except for frame 6 (Diff.
Dx) of the NBME scores, frame 7 is the only frame score in each case with an F to enter which reaches statistical significance. These results are not unexpected when considering the correlations for each frame score presented in Tables 4.29 - 4.31. The results of these regression analyses were then used to calculate weighting schemes for each PMP scoring method, in a manner similar to that used during analysis of the Field Test of the ABEM certification examination (Maatsch and Elstein, 1979). Table 4.35 shows the rounded-off weights calculated for each frame for each scoring method. The process of arriving at these weightings is illustrated by the following description for the PI scores. The regression analysis (Table 4.34) shows that only frame 7 (Mgt) makes any significant contribution to predicting SCE scores, the other six frames adding little or nothing. This indicates that only frame 7 should be scored; but, in fairness to the candidates, all frames should be scored. Therefore, each of the first six frames is given a weight of one, for a total weight of six. From the R² values it can be calculated that frame 7 accounts for about 86% of the variance in the SCE scores which can be attributed to PMP scores, and the

[Tables 4.32 - 4.34 (summary tables for the regression analyses of NBME, EI, and PI frame scores on SCE scores) are not legible in this copy.]
remaining frames account for about 14%. Thus, frame 7 accounts for about six times the variance of the remaining frames. This means that frame 7 should be weighted about six times the total weight of the remaining frames, which have a total weight of six. After rounding off to a total weight of 40, the result is a weight of 34 for frame 7 and 1 for each of the remaining frames, as shown in Table 4.35.

TABLE 4.35
CALCULATED WEIGHTS FOR PMP FRAME SCORES

                      NBME            EI              PI
Frame                 Weight    %     Weight    %     Weight    %
(Hx)         1        1         2.5   1         9     1         2.5
(Dx1)        2        1         2.5   1         9     1         2.5
(Px)         3        1         2.5   1         9     1         2.5
(Dx2)        4        1         2.5   1         9     1         2.5
(Lab)        5        1         2.5   1         9     1         2.5
(Diff. Dx)   6        15        37.5  1         9     1         2.5
(Mgt)        7        20        50    5         46    34        85

Total                 40        100   11        100   40        100

These results are quite similar to those obtained from the Field Test data (Maatsch and Elstein, 1979), which show the first five PI frames accounting for 10% of the variance
For NBME frames, both frame 6 and frame 7 contributed significantly to predicting SCE scores (Table 4.32), and the beta weights from the regression equation indicated they should be weighted in a ratio of approximately 4:5, resulting in the weights shown in Table 4.35 Using these weights, the PMP's were rescored and correlations with MCQ and SCE scores calculated using the weighted PMP scores. The results are presented in Table 4.36, along with the unweighted score correlations from Tables 4.1 - 4.3 for comparison. The correlations of the weighted PMP scores with MCQ scores are increased to moderate levels. The correlations of the weighted NBME and EI scores with SCE scores show a negligible increase, while the weighted PI score shows an increase to a value approaching a moderate level. In order to explore the effect of the unreliability of the tests being used, correcting for attenuation was also done.’ Lord and Novick (l968)present the following formula 132 ' TABLE 4.36 CORRELATIONS OF WEIGHTED PMP SCORES WITH MCQ SCORES AND'SCE SCORES n=315 ESQ ESE NBME .270b .097a (.141b)+ (.075) El .234b .155a (.067) (.109a) PI .331b .221b (.124a) (.084) ap<.05 bp<.01 +Numbers in parentheses are correlations of unweighted scores for such purposes: I: * XY rxy= rxx ryy Where r;y is the correlation corrected for attenuation, r is the observed correlation, and rxx and r are the Observed reliabilities of the two tests. Using the observed correlations of NBME scores with MCQ and SCE scores, and the reliabilities presented at the beginning of this chapter, the corrected correlation between NBME and MCQ scores is .181 133 (compared to .1406 observed), and between NBME and SCE scores it is .121 (compared to .0747 observed). Again, although these values may be of statistical significance, they are too low to be of any practical use. 
Therefore, even if the tests used were perfectly reliable, the correlations of the PMP's with the MCQ's and SCE's would still be very low, and most likely would not change any of the results presented in this chapter concerning the four hypotheses of this study.

Summary of Additional Analyses

The correlations presented in Table 4.36 show that even weighting the PMP frame scores by the method suggested by Maatsch (Maatsch and Elstein, 1979) will not produce correlations of sufficient magnitude to be useful in predicting SCE scores (although there is a slightly stronger relationship to MCQ scores). Correcting for attenuation also failed to produce meaningful correlations. This is to be expected, since the initial correlations of the PMP scores with the performance criterion provided by the SCE scores were essentially zero. No amount of manipulation is likely to overcome this lack of relationship between PMP scores and performance on simulated clinical encounters.

Chapter V discusses the findings presented in this chapter, draws conclusions concerning the results of this study, and makes suggestions for further research.

CHAPTER V

SUMMARY AND CONCLUSIONS

Introduction

This chapter summarizes this research study and draws several conclusions based on the results presented in the previous chapter. These conclusions and their implications are then discussed, followed by some possible lines of future research suggested by the results of this study.

Summary of Findings

This study was designed to investigate four observable consequences of basic assumptions involved in the scoring and use of Patient Management Problems (PMP's) in licensing and certification examinations for physicians.
These basic assumptions are: 1) that PMP's are a valid measure of clinical performance, i.e., they are predictive of how a physician will perform in a real situation; and 2) that all parts of the clinical problem-solving process (and therefore all frames of a PMP) contribute equally to diagnostic and management proficiency. The data for this study were obtained from the first administration of the Emergency Medicine Specialty Certification Examination (EMSCE) of the American Board of Emergency Medicine. The examination consisted of a battery of 194 standard multiple-choice items, 86 pictorial multiple-choice items, and five PMP's which were administered as Part I of the examination in February, 1980. Those who passed Part I with a score of 75% or better on the MCQ battery went on to take Part II in either May or July. Part II consisted of seven Simulated Clinical Encounters (SCE's), which are highly structured oral simulations of emergency medicine cases presented by trained examiners. In this study the examiners' ratings of candidate performance on the SCE's served as the criterion measure. A total of 509 subjects used in this study sat for Part I of the examination, and 315 went on to complete Part II. The results from testing the four empirical hypotheses of this study are summarized below:

Hypothesis I: PMP scores correlate to a greater degree with SCE scores than with clinically relevant MCQ scores.

For this first hypothesis, Z-test statistics were calculated for significance of differences between correlations of PMP scores with SCE scores and with MCQ scores. Results supporting the assumptions on which the use of PMP's in licensing and certification are based should show the PMP scores correlating to a higher degree with SCE scores than with MCQ scores. This was not the result obtained.
The results showed no statistically significant differences between the correlations of PMP scores with SCE scores and with MCQ scores, and therefore the first hypothesis could not be supported.

Hypothesis II: PMP scores account for a portion of variance of SCE scores beyond that contributed by clinically relevant MCQ scores.

Regression analysis was used to test this second hypothesis concerning the relative amounts of variance in SCE scores accounted for by PMP scores and by MCQ scores. Results supporting the use of PMP's in licensing and certification would show PMP scores accounting for a significant proportion of the variance in SCE scores beyond that accounted for by MCQ scores. This was not the case. The results showed that PMP scores account for essentially none of the variance in SCE scores, and therefore make no significant contribution to predicting the criterion (SCE) scores. The second hypothesis could not be supported.

Hypothesis III: PMP scores add to the ability of MCQ scores to discriminate between residency- and practice-eligible candidates.

This third hypothesis was tested using stepwise discriminant function analysis to assess the relative abilities of PMP scores and MCQ scores to discriminate between residency- and practice-eligible candidates.
This final hypothesis in this study dealt with an assumption upon which the scoring of PMP's is based; namely, all parts or frames of a PMP contribute equally to the prediction of a criterion measure of clinical perfor- mance. Results supporting this assumption would show no significant differences between all pairs of correlations of the various PMP frame scores with SCE scores. Once again, this was not the result obtained. The correlations of the final management frame with SCE scores are significantly different from the correlations of the other PMP frames with SCE scores. Therefore, the fourth hypothesis could not be supported. The results of testing the fourth hypothesis suggested the possibility of weighting the final frame in order to improve the ability of PMP's to correlate with, or predict, SCE scores. While such a weighting scheme did result in 139 some improvement in the correlation of PMP and SCE scores, it was not enough to make the frame-weighted scores useful in predicting SCE scores. There is apparently so little correlation between PNP scores and the criterion that manipulations of this type will not tangibly improve measurement or prediction. Conclusions At the end of Chapter III four questions were presented that linked the hypotheses of this study with the assumptions concerning the use fo PMP's in licensing and certification of physicians. All of these questions can now be tentatively answered in the negative. PMP scores are no better at predicting SCE scores obtained three and five months later than they are at predicting MCQ scores. PMP's do not add anything to the information gained through MCQ's in predicting SCE scores. PMP's do not add anything to the ability to discriminate between residency- and practice- (eligible candidates beyond that provided by MCQ scores. 
And finally, there does not appear to be any way of weighting PMP frames that would materially improve the predictive ability of PMP's, at least in the setting utilized in this study.

From the results of this study, the following conclusions may be drawn:

1. The assumptions tested in this study, upon which the use of PMP's in licensing and certification are based, do not appear to be valid.

2. Since the results showed little or no ability of the PMP's to predict a criterion score consisting of ratings provided by reliable expert examiners of the quality of health care given to a sample of simulated patients, this study provides no evidence of the criterion-related validity of PMP's in licensing and certification of physicians.

3. Attempts to substantially strengthen the predictive ability of PMP's were unsuccessful because there is essentially little or no correlation with the criterion measure.

Basically, the results of this study indicate that PMP's are not as useful and valid a method of evaluating clinical competence as has generally been assumed. They do not appear to be a valid substitute for MCQ batteries or