THE EFFECT OF HYPOTHESIS GENERATION AND
VERBALIZATION ON CERTAIN ASPECTS OF MEDICAL
PROBLEM SOLVING

Dissertation for the Degree of Ph. D.
MICHIGAN STATE UNIVERSITY
SARAH A. SPRAFKA
1973

  


 

 

 

 

 

 

 

 

ABSTRACT

THE EFFECT OF HYPOTHESIS GENERATION
AND VERBALIZATION ON CERTAIN ASPECTS
OF MEDICAL PROBLEM SOLVING

By
Sarah A. Sprafka

Thirty medical students going into their fourth year of medical
school were asked to solve three modified Patient Management Problems.
Subjects either were or were not constrained to think aloud during
problem solving. They were instructed to generate diagnostic
hypotheses early, to withhold judgment about diagnosis until the end
of the problem, or were given no instructions about hypothesis
generation. Number of hypotheses generated, thoroughness of cue
acquisition, efficiency of cue acquisition, and accuracy of solution
were the dependent variables. Multivariate analysis of variance
revealed that instructions concerning hypothesis generation had no
effect on outcome. Subjects constrained to verbalize generated
significantly more hypotheses for one problem (Problem III) than
subjects without that constraint. Further, there was a significant
interaction effect of instructions concerning hypothesis generation
and constraint to verbalize on number of hypotheses generated for that
same problem.

In the interest of assessing the relation between certain aspects
of performance and outcome, subjects were reassigned to groups using
many (≥ 10) or few (< 10) hypotheses as one independent variable and
early or late hypothesis generation as the other independent variable.
Dependent measures were thoroughness and efficiency of cue
acquisition, and accuracy of outcome. The interaction of early or
late hypothesis generation and many or few hypotheses affected
thoroughness on one problem (Problem I).

Subsequently two of the three problems were analyzed to determine
whether a different problem solving process had been used by subjects
who had high accuracy scores on those problems and subjects who had
low accuracy scores. It was found that on the one problem which had
a complex solution subjects who had low accuracy scores generated
but did not retain the elements of the complex solution. On the
other problem it was found that subjects who received low accuracy
scores either did not generate or generated and dropped the correct
hypothesis. These subjects also generated a larger variety of
inaccurate hypotheses than subjects receiving higher accuracy scores.

The results of the three stages of analysis demonstrate that
instructions about hypothesis generation and constraint to verbalize
have little overall effect on performance. Furthermore, whether
subjects generate the first hypothesis early or late, or generate
many or few hypotheses makes little difference. Lastly, differences
in accuracy of solution can be tentatively attributed to different
causes for different problems.

THE EFFECT OF HYPOTHESIS GENERATION
AND VERBALIZATION ON CERTAIN ASPECTS
OF MEDICAL PROBLEM SOLVING
BY
Sarah A. Sprafka

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling,
Personnel Services and Educational Psychology

1973

ACKNOWLEDGMENTS

I would like to acknowledge the professional direction and
encouragement given me by the members of my guidance committee:
Dr. Stephen Yelon, Dr. Lee Shulman, Dr. Arthur Elstein, Dr. Gordon
Hood and Dr. Howard Teitelbaum.

I owe thanks also to Professor Christine McGuire and Mr. Gordon
Page for the supply of materials, information and advice.

The cooperation of members of the staffs of the University of
Michigan Medical School, Sparrow Hospital in Lansing, Saint Mary's
Hospital in Grand Rapids and Hurley Hospital in Flint who made space
available for interviewing subjects is gratefully acknowledged.


DEDICATION

To Merlin and Falstaff who gave me solace in my weakest moments.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
DEDICATION
TABLE OF CONTENTS
LIST OF TABLES AND FIGURES
LIST OF APPENDICES
LIST OF DISPLAYS

CHAPTER

I. INTRODUCTION

    The Problem
    Questions To Be Investigated

II. REVIEW OF LITERATURE

III. DESIGN OF THE STUDY

    Hypotheses
    Subjects
    Procedures
    Description Of The Patient Management Problems And
        Modifications Made For The Present Study
    Reliability Of The PMP
    Validity Of The PMP
    Modifications To The PMP For The Present Study
    Validity And Reliability Of The Modified Problems

IV. ANALYSIS OF RESULTS, SUMMARY AND DISCUSSION

    Tests Of Hypotheses Concerning Hypotheses
        Generation And Verbalization
    Summary And Discussion Of Verbalization And
        Interaction Effects
    Summary And Discussion Of Early/Late Hypothesis
        Generation
    Results Of Performance Hypotheses
    Summary And Discussion Of Performance Results
    Results And Discussion Of Problem By
        Problem Analysis
    Summary And Discussion Of Process
        Analysis

V. SUMMARY, CONCLUSIONS AND IMPLICATIONS

    Conclusions
    Implications

BIBLIOGRAPHY
APPENDICES
DISPLAYS

LIST OF TABLES AND FIGURES

Means and Standard Deviations of Thoroughness, Efficiency, and Accuracy on Two Problems for Three Types of Simulations
Manova of Effects of Instructions on Number of Hypotheses, Thoroughness, Efficiency, and Accuracy
Hypothesis Generation X Group X Problems
Manova of Effects of Many/Few and Early/Late Hypothesis Generation on Three Outcome Variables, Problem I
Problem-By-Problem Manova on Dependent Measures
Data for Process Analysis, Problem I
Data for Process Analysis, Problem II
Instructions X Verbalization Interaction, Problem III
Instructions X Verbalization Interaction, Problem I
Instructions X Verbalization Interaction, Problem II
Instructions X Verbalization Interaction, Average
Many/Few X Early/Late Interaction, Problem I
Instructions X Verbalization Interactions, Problem I
Instructions X Verbalization Interactions, Problem I

LIST OF APPENDICES

Appendix
A. Instructions Read to Subjects
B. Cue Sheets Used for Modified PMP's
C. Procedures for Scoring Modified PMP's
D. Criteria Used for Process Analyses

LIST OF DISPLAYS

Display
I. Accuracy Scores
II. Representative Hypotheses, Problem II

CHAPTER I
INTRODUCTION

Medical care is made up of many facets, one of which is diagnosis.
Medically defined, diagnosis is "the determination of the nature of the
case of disease" (1). The association of diagnosis exclusively with
disease is a questionable one. Therefore a looser, non-medical
definition seems more appropriate. More loosely defined, diagnosis is
"the investigation or analysis of the cause or nature of a condition ...
or problem" (2). In order to diagnose a patient's condition or problem
a physician must go through an information gathering and processing
procedure, arriving finally at some sort of conclusion about his
patient's condition or problem.

Diagnosis as a cognitive activity may be studied in a number of
ways. Two major approaches are: 1) the study of diagnosis as a
decision-making process, and 2) the study of diagnosis as a problem
solving process. Although the two processes are closely linked in
practice, the terms "decision-making" and "problem solving" refer to
two distinct areas of research. Students of diagnosis as decision
making use a methodology which differs markedly from those who look at
it as problem solving. The emphasis for the decision-makers is more on
outcome than on process. The method of study depends much more on
mathematical models used as predictors or describers of outcome.
Correlation models, regression models and probability models are
among the most popular.

Those who study diagnosis as problem solving, the present writer
included, place more emphasis on the process of arriving at a
diagnostic decision, whether tentative or not, than on the decision
itself. The method of study usually involves conducting experiments
much like the one reported below in the interest of ascertaining what
the essential elements of diagnostic problem solving are and what role
they play in the problem solving process.

The Problem

 

A number of ways of progressing from the beginning to the solution
of a diagnostic problem have been proposed. One way advocated by many
texts and teachers of medical diagnosis is to gather a large amount
of data according to a prescribed format, record this information and
then use it to arrive at a tentative differential diagnosis or list of
problems which will help determine what diagnostic tests should be
ordered and what management steps should be taken. This approach is
advocated by Lawrence Weed, for example (3,4). He feels that a complete
Data Base should be gathered on a patient before any further steps are
taken, except of course in cases of emergency. Weed offers criteria
for the establishment of a Data Base which would apply to almost all
patients who came to a given health care facility. The Data Base can
be elicited by paramedical personnel, thus saving the physician who will
see the patient a good deal of time and trouble. When presented with
the Data Base the physician can then select those cues which fit
together into potential problems and, when he sees the patient, can
perhaps gather more data to enable him to formulate those problems
and make diagnostic and management decisions based on them.

Another way of approaching a diagnostic problem involves the
constant evaluation of information as it comes in. This, to a certain
degree, is the approach advocated by Morgan and Engel (5). Especially
in gathering information about the presenting complaint one should at
times let one's ideas about possible causes guide one's questioning.
Morgan and Engel emphasize the necessity of finding out about the
bodily location, chronology, setting, etc. of a patient's complaints.
Questions about these aspects can be triggered by considering possible
causes. These considerations can, for example, cause the physician
to elicit other symptoms the patient has not mentioned yet. They
can also help direct specific questions about aggravating and
alleviating factors which the patient might not have mentioned. Further,
on-going use of information to suggest possible causes can help direct
the physical examination.

Morgan and Engel do not place great emphasis on the use of
incoming information to guide inquiry. They appear to assume that
physicians do this during interviews as well as physical examinations,
and give examples of where this activity can be put to good use.

Although both of these approaches to gathering and using patient
information yield similar results, i.e. a preliminary differential
diagnosis, they arrive at this end via rather different routes. How,
then, does the physician use the information he gathers to solve a
diagnostic problem? It is felt that the essential reasoning
mechanism used to transform information into a diagnostic solution is
the generation of diagnostic hypotheses and the evaluation of these
hypotheses in the light of the information available. Thus the Weed
procedure and that advocated by Morgan and Engel have an essential
element in common: hypothesis generation. But this element plays
a different role in each approach to problem solving. On the one
hand establishing the Weed Data Base involves gathering information in
a prescribed order. Hypotheses are then generated based on a large
amount of data. The problem solver reasons from facts to hypotheses.
On the other hand, conducting an interview and physical examination
in the manner prescribed by Morgan and Engel may lead to the gathering
of different kinds of information in varying orders depending on
what possible causes the physician entertains and how he goes about
evaluating them. The problem solver in this case may reason either
from facts to hypotheses or vice-versa -- from hypotheses (possible
causes) to facts (gathered via specific questions about those causes).
The importance of the difference between these two approaches to
diagnostic problem solving is borne out by recent investigations (6)
which are finding that despite training in the fact-to-hypothesis
approach to diagnosis, many physicians seem to reason both from facts
to hypotheses and vice-versa as acknowledged by Morgan and Engel.
The result is a combination of the format procedure and an hypothesis
generating and testing procedure. Information is gathered partly
according to a memorized format and partly in the interest of testing
hypotheses. The actual solution to the problem is usually arrived at
by verification of one or more hypotheses generated along the way,
backed up by the elimination of others.

Questions To Be Investigated

Certain questions arise out of these observations. Should a
diagnostician be encouraged to generate hypotheses early in a problem;
should he be encouraged to reserve judgment; or should he simply be
permitted to find his own problem solving style? Would instruction to
use a format versus an early hypothesis generation approach have any
effect on the physician's performance, both as to his approach to the
problem as well as to the quality of his diagnosis? Independent of
instructions, does the early generation of a number of hypotheses lead
to a more or less accurate diagnostic formulation than the late
generation of as few as one hypothesis?

Recent investigations (6) have raised some questions about the
appropriate methodology for investigating hypothesis generation and
testing behavior. Those investigations used a thinking aloud procedure
as well as a stimulated recall to assess the physician's thinking as he
solved each problem. The authors of that study (6) felt that the
influence of thinking aloud on a physician's reasoning should be
investigated. The present study therefore included a thinking aloud
(verbalization) condition.

Specifically, the study investigated the following questions:

1. Do instructions to use early vs. late hypothesis generation have
   any effect on the diagnostician's approach to a diagnostic problem?

2. Do instructions to use one or the other of these approaches
   have any effect on the quality of his diagnosis?

3. Independent of instruction, is there any relationship between
   the number of hypotheses, and the delay of their generation,
   and the quality of a diagnostic workup?

4. Do instructions to verbalize during a problem have any effect
   on the diagnostician's approach to the problem?

5. Do instructions to verbalize have any effect on the quality of
   the diagnosis?

CHAPTER II
REVIEW OF LITERATURE

Two major areas of investigation are relevant. One has to do with
the difference in problem solving strategy being studied here; the
difference between generating and testing hypotheses during problem
solving, and generating hypotheses only after most of the information
relevant to the problem has been gathered. The second area of investigation
deals with the effect of verbalization on problem solving.

The issue of problem solving strategy has its roots in the
philosophy of science [see Kessel, 1969 (7) for a stimulating discussion
of this subject]. In its purest form the difference being studied here
has to do with the nature of so-called factual information and how this
information is treated by scientists. Francis Bacon, an early philosopher
of science (8), believed that facts should be treated as strictly
objective. Scientific investigation should proceed step-wise from
the observation of particulars to the creation of elementary axioms
from those observations. These axioms should then be used as the basis
for further experiments which would lead to the establishment of more
global axioms, thence to further experiments, and so on to the creation
of general principles. Bacon cautioned strongly against using a few
observed particulars as bases for creation of broad principles, and
then using these principles to direct new observations and the creation
of more elementary axioms. This approach to science implied the
interpretation of observations in the light of previously established
principles. In his opinion observations should in no way be colored
by the observer's predispositions. For him observed facts were the
ultimate arbiters of theory.

This strictly empirical approach to scientific investigation has
engendered a good deal of criticism. Fundamental questions arise as
to the relationship between facts which may be observed and the observer.
Is it not true that what is observed and how observations are interpreted
are strongly determined by the observer? Is it not also true that many
theories have been created on the basis of comparatively little
information, and perpetuated in the face of contradictory observable
evidence?

Kessel (7) points out that the autonomy of observable facts is
highly questionable. Facts cannot be separated from the reasons for
gathering them. Scientific investigators' choices of problems to study
and their methods of gathering and interpreting information reflect
their own predispositions as well as the general predisposition of their
field of study. Kuhn notes (9) that the interpretation of facts may
change following a change in paradigm, i.e. a change in the intuitive
conception of the nature of things. It thus appears that scientific
investigation as practiced is not strictly empirical. Beyond that, it
may be that strict empiricism is impracticable.

The alternative to a strict facts-to-theory approach is one that involves
the generation of tentative formulations or hypotheses based on
small amounts of data and the subsequent testing of these hypotheses.
Hypothesis testing leads to the gathering of more data and the possible
generation of more hypotheses. As noted above, this seems to be the way
scientific investigation is practiced. Two strong proponents of this
hypothetico-deductive approach are Popper (10) and Medawar (11).

Popper proposes that "... the work of a scientist consists in
putting forward and testing theories" (10, p. 31). He feels that the
goal of science is not to string facts together into generalizations,
but to justify certain generalizations with experience. Furthermore,
theories are not methodically built around facts. A lot of creativity
and luck goes into the creation of a theory. Only once it is created
can it be tested. Medawar criticizes the empirical approach for its
emphasis on facts without interpretation. A fact is not a fact unless
it is interpreted relevant to something, and that something is a
hypothesis or theory. Therefore, to gather facts without interpreting
them is likely to be impossible. And if possible it is going to lead
to the unsystematic gathering of a lot of irrelevant information. He
proposes that scientific investigation should be a hypothetico-deductive
process. Hypotheses should be generated by whatever means (including
observations of phenomena, luck, and inventiveness) and then further
data should be gathered systematically in the interest of testing those
hypotheses. By testing, a hypothesis can be refuted or temporarily
accepted.

Diagnostic problem solving can never be equated with scientific
investigation. Yet fundamental elements are shared. In both domains
information is gathered and interpreted by humans. Hypotheses are
formed. The hypotheses may be tested and the results may lead to the
generation of new hypotheses as well as to the further confirmation or
disconfirmation of previously entertained hypotheses. Furthermore,
something akin to the strictly empirical approach to scientific
investigation is possible for the diagnostic problem solver. Information
may be gathered and recorded without being interpreted. After
a large amount of information has been gathered, sets of observations
may be grouped together leading to the generation of diagnostic
hypotheses.

The result of either of these approaches to diagnosis is usually
the same -- a correct diagnosis. Is one better than the other? A
better question would be, should one be used to the exclusion of the
other? Both approaches have their advantages and disadvantages as is
demonstrated by studies of problem solving discussed below.

Early investigation into approaches to problem solving was done
by Luchins and Luchins (12) in their work on set or Einstellung.
Subjects were given problems for which the discovery of a mathematical
formula was necessary to avoid a trial and error approach. However once
the appropriate formula had been discovered and found to work on a
series of problems, the problems were modified by the experimenter with
the result that the subject could not apply the formula he had discovered
to the solution of one of the modified problems. Rather than
backing off and reformulating the problem, the subjects tended largely
to stick with their original equation and fail to solve that problem.
That they held to the original equation was evidenced by their using it
to solve two other modified problems which could have been solved more
neatly using a simpler equation. One may find a lesson herein. These
studies were designed specifically to test problem solving set. The set
which was established involved encouraging subjects to generate a
hypothesis about the appropriate formula for solution. Subsequently
they had trouble changing this hypothesis, even though it made it more
difficult, even impossible in one case, to solve subsequent problems.

Similar findings are reported as the result of extensive investigation
by Wason and others (13). It was found that subjects given
the opportunity to generate a rule for a progression of numbers did so
quite readily. Unfortunately if they generated the wrong rule they
were unwilling or unable to change it or even to think of examples of
number progressions which might prove their rule incorrect. A number
of means were used to encourage subjects to invent counterexamples to
their rule or to invent a totally different rule. Some worked to a
certain extent, but none was totally successful. Wason concludes that
people in these and similar circumstances will tend to stick to a
conclusion even though erroneous, and will not entertain alternatives
or attempt to disprove their conclusion.
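
The difference between confirming and disconfirming tests of a rule can be illustrated with a small sketch. It assumes, purely for illustration, a hidden rule ("any strictly ascending progression") and a narrower guess by the subject ("each number is two more than the last"); neither rule, nor any of the function names, is taken from the studies cited. Under those assumptions, progressions chosen to fit the guess can never expose its error, whereas progressions that violate the guess can.

```python
def hidden_rule(seq):
    """Experimenter's rule (assumed for illustration): strictly ascending."""
    return all(a < b for a, b in zip(seq, seq[1:]))


def subject_guess(seq):
    """Subject's narrower hypothesis (assumed): each step adds exactly 2."""
    return all(b - a == 2 for a, b in zip(seq, seq[1:]))


def refuting_cases(cases):
    """Return the cases on which the guess and the hidden rule disagree."""
    return [c for c in cases if subject_guess(c) != hidden_rule(c)]


# Confirming strategy: only try progressions that already fit the guess.
confirming = [(2, 4, 6), (10, 12, 14), (1, 3, 5)]
# Disconfirming strategy: try progressions that violate the guess.
disconfirming = [(1, 2, 3), (5, 10, 20), (6, 4, 2)]

print(refuting_cases(confirming))     # nothing exposes the wrong guess
print(refuting_cases(disconfirming))  # cases the guess rejects but the rule accepts
```

Only the second set of trials yields evidence against the guess, which is the kind of test the subjects described above tended not to attempt.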

Medical diagnosis may be a circumstance similar to this. It is
possible that by generating hypotheses early on, a physician may not
truly test and attempt to falsify his hypotheses, but gather information
only in the interest of confirming them. As a matter of fact, this
phenomenon has been observed in recent investigations of diagnostic
reasoning. Elstein and Shulman (14) state that having formed a
hypothesis that a patient is hysterically ill rather than afflicted
with an organic disease, one physician studied elicited cues which
would disconfirm this hypothesis by asking a set of rather routine
questions, but did not process these cues, i.e. did not apply them
to the disconfirmation of that hypothesis.

Returning once again to non-medical problem solving -- some of the
most extensive research in how subjects go about solving a restricted
set of problems (the discovery of figural concepts) is that done by
Bruner and colleagues (15). Bruner's studies found that although
subjects used a number of strategies to discover a concept, the
strategy which worked best, i.e. produced the correct solution after
the fewest trials was labeled the "conservative focusing strategy" by
Bruner. This strategy involves finding an object which is a positive
instance of the concept in question, identifying all the elements of
that instance, and then picking successive objects which differ from
the positive instance in only one way and establishing which of these
are also positive instances of the concept. When all the necessary
positive instances have been identified (all the necessary information
has been gathered) the concept itself can be formulated. This type
of problem is then best solved by reasoning from individual pieces
of information to a higher order statement. The advantages of using
this approach for solving a restrictive problem of this type are
obvious -- once all of the necessary positive instances are identified,
the formation of the concept itself follows naturally. All elements
of the problem are mutually exclusive and exhaustive. Further, the
problems are content-free. The selective application of previously
learned information is not required to solve them. All information
necessary for solution is contained in the problem. The application
of this approach to the gathering and interpretation of data for a
diagnostic decision however, is inappropriate for three reasons. All
of the relevant symptoms which might serve to incontrovertibly establish
a diagnosis can never be found. The concept problem is amenable to an
exhaustive search; the diagnostic problem is not. Secondly, hypotheses
to which symptoms may be applied are not mutually exclusive. Lastly,
the selective use of previously learned information plays a large
part in diagnostic problem solving.
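
The focusing strategy described above can be sketched in a few lines. This is a minimal illustration, not Bruner's materials: it assumes a concept defined by a conjunction of attribute values, and the card attributes, the oracle function, and all names are hypothetical. Starting from one known positive instance, each attribute is changed alone; if the changed object is no longer a positive instance, that attribute belongs to the concept.

```python
def conservative_focus(focus, alternatives, is_positive):
    """Find which attribute values of a positive instance define the concept.

    focus        -- dict of attribute -> value, known to be a positive instance
    alternatives -- dict of attribute -> one different value to try
    is_positive  -- function reporting whether an object is a positive instance
    """
    concept = {}
    for attribute, other_value in alternatives.items():
        # Build a trial object that differs from the focus in one attribute only.
        trial = dict(focus, **{attribute: other_value})
        if not is_positive(trial):
            # Changing this attribute destroyed membership, so it is relevant.
            concept[attribute] = focus[attribute]
    return concept


# Hypothetical concept: "red circles" (color and shape matter, count does not).
def oracle(card):
    return card["color"] == "red" and card["shape"] == "circle"


focus_card = {"color": "red", "shape": "circle", "count": 2}
changes = {"color": "green", "shape": "square", "count": 3}
print(conservative_focus(focus_card, changes, oracle))
```

The sketch needs only one trial per attribute because every attribute can be varied independently and the search is exhaustive, which are the same properties the text identifies as absent from diagnostic problems.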

Evidence presented thus far would lead one to believe that although
a hypothetico-deductive approach to diagnostic problem solving may be
attractive, it may also be a risky approach to take. Perhaps in
reality a more thorough approach is called for. However, investigations
described below will tend to indicate that human information processors
may have trouble proceeding in that manner. These studies of problem
solving and information processing have led to theories explaining
the organization of and restrictions on the human problem solving
mechanism.

Miller, Galanter and Pribram (16) propose that
although algorithms for problem solving are thorough and always lead
to the correct solution, people tend not to use them. Use of an
algorithm usually involves the recall of large amounts of relevant
information, and the human short term memory tends to become over-
loaded. Instead of algorithms, people employ heuristics, or what the
authors call Plans for problem solving. The Plan used by a given
individual to solve a given problem is also somewhat restricted by
human factors, since it is largely determined by that individual's
Image of the problem. The Image consists of what the problem solver
knows about the problem situation, e.g. the boundaries of its solution,
its essential elements, possible time factors, how good he is at
problems of this type, and so on. Different problem solvers may have
different Images of a problem and for that reason choose different Plans
for its solution. In all events, if problem solving is going to take
place, the problem solver must have an Image of the problem as well
as a heuristic or Plan for solving it.

A more concise expression of the Miller, Galanter and Pribram
idea is given by Simon and Newell (17). Their theory of problem
solving posits that although very few characteristics of the human as
an information processor are constant over task and person, those
few are enough to determine that a task environment (problem) is
represented by the solver as a problem space. Problem solving then,
takes place in a problem space. The structure of the problem space is
determined by the nature of the task. And the set of steps, heuristics,
etc. used to solve the problem are determined by the structure of
the problem space. Once a problem is detected the problem solver
creates a space for that problem which contains some representation
of his goal, the relevant information he has, as well as some
strategies for gathering and using further information that will get
him to his goal without overloading short term memory. The problem
space then is a type of orientation of the problem solver. It is not
as specific as a theory to be tested but it does preclude the random
gathering of information without a referent.

A type of strategy might be that proposed by Bartlett in his
studies of sectional map reading (18). He calls solving sectional map
reading problems (how to get from point A to point B quite a ways away
by the shortest route possible?) problem solving in an open system. He
has found that map readers tend to explore along the line of greater
possibilities since the more possibilities you have to work with the
more probable you are to find the right one, i.e. the shortest route to
your destination. Although sectional map reading does not present a
short term memory problem, still some sort of scheme for search must be
established at the outset. The preferred scheme seems to be a flexible
exploratory one rather than a binary search or some other more rigid
strategy.

Another scheme or strategy is the one proposed by Miller (19)
which we use to avoid overloading short term memory. He proposes that
as more and more information is gathered concerning the solution of a
problem we need to group it together. The grouping is necessary because
the capacity of short term memory is on the order of 7 ± 2 elements.

To avoid exceeding that limit a problem solver needs to "chunk"
(Miller's term) elements together which go together thus reducing the
memory load.
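
The effect of chunking on memory load can be shown with a short sketch. It assumes that each element carries a label naming the group it belongs to; the labels, the element names, and the use of 9 as the upper end of the 7 ± 2 span are illustrative assumptions rather than part of Miller's account.

```python
from collections import defaultdict

SPAN_LIMIT = 9  # assumed upper end of the 7 +/- 2 span cited in the text


def chunk(items):
    """Group (label, element) pairs so that each label is held as one unit."""
    chunks = defaultdict(list)
    for label, element in items:
        chunks[label].append(element)
    return dict(chunks)


# Twelve hypothetical elements, each tagged with the group it belongs to.
elements = [(f"group_{i % 3}", f"element_{i}") for i in range(12)]
chunks = chunk(elements)

print(len(elements) > SPAN_LIMIT)  # True: twelve separate elements exceed the span
print(len(chunks) <= SPAN_LIMIT)   # True: three chunks fit within it
```

Twelve separate elements exceed the assumed span, while the same elements held as three chunks do not.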

It appears that not only is the search for and use of information
for problem solving not random, it is highly organized and can be quite
selective. Facts, or particulars, are selected for observation based
on the problem solver's perception of the task.

The discussion thus far has focused on approaches to non-medical
problems. How do subjects approach problems concerned with medical
diagnosis? In his experiments with physicians, Kleinmuntz (20) found
that his subjects used one very useful strategy for gathering and
storing information. In a study where the subject was required to gather
pieces of information one at a time from a data bank, Kleinmuntz found
that subjects with more medical experience tended to start with general
questions and converge on a diagnosis using progressively more and more
specific questions. The types of questions asked were those which yielded
the greatest amount of information (reduced ambiguity to the greatest
extent). The information which was stored was relevant to a specific
diagnostic hypothesis. Although a definite strategy is being used here,
it is a binary search strategy rather than a more flexible one such as a
hypothetico-deductive approach. If medical diagnosis can be considered
a search in an open system, using a rigid binary search strategy to solve
diagnostic problems tends to contradict Bartlett's findings. The
strategy used was very likely an artifact of the experimental setting.
Subjects were constrained to asking "yes-no" questions, so the most
efficient strategy available was obviously the binary search.

Other authors have found strong evidence for hypothetico-deductive
reasoning in physicians. Price and Vlahcevic (21), both
physicians, speak from their own experience. They make an excellent
case for hypothetico-deductive reasoning in diagnosis. Their claim
is that physicians choose hypotheses and interpret the data they gather
in the light of those hypotheses. A diagnostic decision is arrived at
by combining the elimination of erroneous hypotheses with the tentative
confirmation of one or more appropriate hypotheses. They insist that
both of these elements must be present. An effort should be made to
find the diagnostic formulation which fits the greatest number of
symptoms. Not only must that diagnosis fit that set of symptoms, but
conversely all symptoms necessary to confirm that diagnosis must be
present. Furthermore, all other hypotheses which could fit that set of
symptoms must have been rejected.

In a similar vein Dudley (22) has observed that experienced
physicians are much more selective about their data gathering than
medical students. He sees the data gathered by physicians as being
processed as it is gathered and lumped together into Boolean type nets
or lattices relevant to one or more hypotheses. The establishment
of these nets enables a physician's search to become more and more
specific and the possible hypotheses to be honed down to only one.

Studies to investigate these aspects of diagnostic reasoning
have been done by Sprosty (23) and Elstein et al. (6). Sprosty
studied history taking and accuracy of diagnosis in medical students.
He observed that those students who obtained the correct diagnosis
seemed to ask more and shorter questions. Furthermore, their questions
were more specific and it was apparent that more hypothesis testing
was being done by these students than by those who did not obtain
the correct diagnosis. Although Sprosty does not have a measure for
hypothesis generation, it is obvious that those students whom he found
gave good performances did generate and test hypotheses.

A recent study by Elstein et al. (6) has led to a preliminary
theory of medical inquiry. The focus of this theory is hypothesis
generation. The authors found that physicians tend to generate
specific diagnostic hypotheses early in the workup, usually as the
result of some discrepant finding interpreted by the physician as
problematical. These hypotheses may be systematically tested and/or
further data may be gathered according to some routine the physician
has memorized. The data are then applied to the various hypotheses
in the interest of disconfirming some and tentatively confirming others.

Another recent study by Elstein et al. (30) demonstrates that
hypothetico-deductive thinking occurs in diagnostic problem solving
and that different strategies are used for different problems. This
study is a process analysis of four Patient Management Problems (35)
completed by 15 physicians who participated in a larger study (6).
In each problem subjects were asked to generate diagnostic hypotheses
at various points throughout the problem. All subjects, as instructed,
generated diagnostic hypotheses early in each problem. It was not
possible to assess whether specific items of information were selected
to test out those hypotheses. Thoroughness and efficiency of data
acquisition were evaluated, however, and some inferences may be made
from those scores about the effect of hypotheses on data acquisition
and utilization. In one problem for example, those subjects who
generated the correct hypothesis early in the problem needed less
total data to arrive at the correct solution than did those subjects
who did not generate the correct hypothesis until later. One may
infer from this that the correct hypothesis, generated early, was
guiding an efficient course of data acquisition. This phenomenon is
not uniform across problems, however. On another problem an attractive
but erroneous diagnostic hypothesis was presented at the outset of
the problem. This clearly had an effect on subjects' hypothesis
generation and data acquisition. In this problem all subjects who
arrived at a correct solution generated the correct hypotheses late
in the problem, perhaps due to the influence of the hypothesis suggested
at the outset. Those subjects who did not reach an accurate solution,
did not ever generate the correct solution, and restricted the scope
of their data acquisition. This appears to be an example of Luchins'
Einstellung effect (12) in which the early suggestion of one attractive
but erroneous hypothesis led to restricted data collection, failure to
generate the correct hypothesis, and failure to arrive at the correct
solution.

These findings demonstrate that diagnostic problem solving is not
a straightforward uniform process. Different approaches may be taken
to different problems, and success or failure may be achieved in
different ways.

In conclusion then, it appears that hypothetico-deductive
reasoning is not only justifiable as an approach to solving scientific
problems, it may also be justified as an approach to more practical
problems such as medical diagnosis. As information processors humans
tend to employ heuristics for gathering, storing and using information
relevant to different problems. One of the most productive of these
heuristics is the hypothetico-deductive one. However, use of this
heuristic exclusively may also be risky in that it restricts the amount
and type of data gathered. It therefore should be, and usually is,
supplemented with other data gathering heuristics. The combination of
heuristics used may vary depending on the problem. The present study
did not investigate the other heuristics. They and other elements of
diagnostic problem solving are being studied elsewhere (e.g. 24, 25).

The second major element of the study concerns use of the
"thinking aloud" procedure to gather data and the resultant effects
of verbalization on problem solving. Thinking aloud was used as early
as 1917 when Claparede (26) studied the origin of hypotheses about
problem solutions. He describes the procedure as being useful since
it is neither retrospective nor introspective but gives a running
account of the problem solver's process. He found the drawbacks of
thinking aloud were that first of all it needs training, and even then
some subjects do not talk during the most interesting moments of their
problem solving. One reason for this may be that when one is thinking
very hard he is not prone to verbalize. Furthermore, one thinks much
faster than he talks. We might infer from this that constraints to
think aloud may alter the course of thought as well as perhaps slow it
down. More recently Newell and Simon (27, 28), and Simon and Newell (17)
have relied heavily on the thinking-aloud technique to obtain information
about specific heuristics being used by subjects for solving problems.
Neisser (29) has criticized this procedure on the grounds that any
complex or multiple processing being done by a subject may be made to
appear sequential by the thinking aloud. Worse yet, thinking aloud may
cause a subject to use a sequential process where he would not do so
otherwise.

Elstein et al. (6) in their study of diagnostic reasoning used
thinking aloud to determine what processes physicians were using.

Since then McGuire (32) has expressed concern that thinking aloud may
not only make diagnosticians' thought processes appear more orderly
than they are, but may also cause their processes to be more orderly
than they would otherwise have been.

Little concrete evidence is available on the effects of thinking
aloud on problem solving. Gagne and Smith (33) built a thinking-aloud
condition into their study of children's ability to solve a problem
and to formulate a general rule for its solution. They found that
differences in accuracy of solution were attributable to verbalization.
Moreover those subjects who verbalized were able, after the fact, to
generate acceptable general rules for solution.

It appears then, that verbalization does have an effect on
problem solving, although perhaps not a detrimental one as Neisser fears.
On the contrary the effect may be to enhance problem solving ability.
If this is so, the phenomenon clearly deserves further attention.

CHAPTER III
DESIGN OF THE STUDY

Hypotheses

The hypotheses relevant to this study fall into two major areas:
one concerning the effect of verbalization on problem solving; the
other concerning the effect of instructions on approach to problem, as
well as the effect of instructions and approach to the problem on
solution. A brief explanation of how the relevant hypotheses were
investigated (expanded in Procedures section below) will clarify the
meaning of each specific hypothesis.

Hypotheses relevant to verbalization were investigated by either
constraining subjects to verbalize during problem solution, or omitting
that constraint.

Hypotheses relevant to instructions and approach to problem were
investigated as follows. Subjects were first divided into three groups.
One (Group E) was instructed to generate hypotheses early in the problem;
another (Group L) was instructed to withhold judgment about diagnoses
until all the information was in; a third (Group C) was given no
instructions about hypothesis generation. All subjects were given
identical problems to solve. The amount of information gathered before
generation of the first hypothesis and the number of hypotheses
generated by each subject were tabulated. All subjects were scored on
the efficiency, thoroughness, and accuracy of their solution.

Subsequently subjects were reassigned to groups based on their
performance. Those subjects who asked at least one question before
generating their first hypothesis were assigned to one group, those who
asked no questions before generating the first hypothesis to another
group. Similarly, those subjects who generated comparatively many
hypotheses were assigned to one group, those generating comparatively
few hypotheses to another. The degrees of cross-over from the original
Instructions groups to the after-the-fact performance groups were
assessed, and efficiency, thoroughness, and accuracy were recomputed
for the performance groups.
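
The reassignment rule described above can be sketched in a few lines of Python. This is an illustration only, not the procedure actually used to tabulate the data: the record fields, the function names, and the labels "early"/"late" and "many"/"few" are hypothetical, and the cutoff of 10 hypotheses is taken from the abstract's description of "many" versus "few" and is left as a parameter.

```python
from dataclasses import dataclass


@dataclass
class SubjectRecord:
    # All field names are hypothetical; they mirror the quantities tabulated in the text.
    subject_id: str
    instruction_group: str                  # "E", "L", or "C" (original Instructions group)
    questions_before_first_hypothesis: int  # information gathered before the first hypothesis
    hypotheses_generated: int               # total number of hypotheses generated


def performance_group(record, many_threshold=10):
    """Return the (earliness, number) performance group for one subject.

    Assumption: a subject who asked no questions before the first hypothesis
    is labeled "early"; one who asked at least one question is labeled "late".
    """
    earliness = "early" if record.questions_before_first_hypothesis == 0 else "late"
    number = "many" if record.hypotheses_generated >= many_threshold else "few"
    return earliness, number


def crossover_table(records, many_threshold=10):
    """Count subjects by (Instructions group, earliness group) to show cross-over."""
    table = {}
    for record in records:
        key = (record.instruction_group, performance_group(record, many_threshold)[0])
        table[key] = table.get(key, 0) + 1
    return table
```

A subject in Group L who nevertheless generated a hypothesis before asking any question would appear in the ("L", "early") cell of the cross-over table, which is the kind of departure from instructions the reassignment was meant to reveal.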

The reason for re-assignments was to assess the degree to which
subjects in the various Instructions groups followed the instructions
they were given. As stated earlier, humans as information processors
tend to select information based on their perception of the task, and to
store information temporarily in such a way as to avoid overloading
short term memory.

A non-medical example of this occurred in a well-known study done
by Frase in the area of prose learning (40). He organized sentences
about chessmen and their attributes in three ways: according to names,
according to attributes and randomly. Subjects were asked to read the
group of sentences and try to recall as much as possible in a given
time. Amount and organization of the subjects' written efforts at
recall were evaluated. It was found that although subjects in the group
reading randomly organized sentences recalled less than those in the
other two groups, they tended to use the names of the chessmen as the
basis for organization of their recall. As a matter of fact, 30% of
those subjects whose reading passages were organized around attributes
reorganized the sentences in their recall to a names organization.
The experimenter hypothesized that the tendency to organize recall
around the names of the chessmen might have been due to the fact that
such an organization required less memory than the attribute organization
or no organization at all.

The major heuristic used for medical problem solving which helps
one to avoid a memory overload is the generation of hypotheses to
which cues may be related as they emerge. Use of this heuristic is
shown in the reports of Price and Vlahcevic (21), Sprosty (23), and
Elstein et al. (6). These studies led one to believe that in spite of
instructions to the contrary, subjects might generate hypotheses
before all the information was in. As concerns other hypotheses stated
specifically below, studies cited earlier suggested that instructions
about hypothesis generation would have a selective effect on how early
in the problem hypotheses were generated as well as on how many
hypotheses were generated. Furthermore, these instructions would
have an effect on the thoroughness and efficiency of problem
solution as well as on accuracy. There is no concrete evidence as to
the effect of instructions to verbalize on earliness of hypothesis
generation or number of hypotheses generated. However, the
experimenter felt that concerns that verbalization helped to guide
thinking were legitimate. This was investigated. There is concrete
evidence (33) showing a positive effect of verbalization on accuracy
of solution. The specific hypotheses tested in this study are listed
below.

Effect of instructions on efficiency, thoroughness, and accuracy
of performance

1. Subjects instructed to generate hypotheses early will give a
more efficient, less thorough, and more accurate performance
than those instructed to withhold judgment.

2. Subjects given no instructions about hypothesis generation will
show the same pattern as in #1 above.

3. Subjects constrained to verbalize will give a more accurate
solution than those without that constraint.

4. Instructions to verbalize will have no effect on the efficiency
or thoroughness of performance.

Effect of instructions on earliness of hypothesis generation

5. Subjects instructed to generate hypotheses early will
generate hypotheses earlier than those instructed to withhold
judgment.

6. Subjects given no instructions to generate hypotheses will
show a pattern similar to #5 above.

7. Instructions to verbalize will have no effect on how early
hypotheses are generated.

Effect of instructions on number of hypotheses generated

8. Subjects instructed to generate hypotheses early will generate
more hypotheses than those instructed to withhold judgment.

9. Subjects given no instructions about hypothesis generation
will show a similar pattern to #8 above.

10. Instructions to verbalize will have no effect on the number
of hypotheses generated.

Effect of performance (re: hypothesis generation) on efficiency,
thoroughness and accuracy

11. Subjects who generate hypotheses early will give more
efficient, less thorough, and more accurate performance
than those who generate hypotheses later.

12. Subjects who generate comparatively many hypotheses will give
a more efficient, less thorough, and more accurate performance
than those who generate comparatively few hypotheses.

13. Independence of earliness of hypothesis generation and number
of hypotheses generated: Earliness of hypothesis generation
and number of hypotheses generated will be statistically
independent.

Subjects

Subjects were 30 medical students going into their fourth year of
medical school. Fifteen of these subjects were randomly selected from
the Michigan State University College of Human Medicine, and 15 came
from the University of Michigan Medical School. This sample was
chosen because:

1. The materials used were deemed difficult enough to challenge
this group, but were not perceived to be too difficult for them.

2. These medical students were more accessible than, for example,
a group of physicians would be.

3. This group of students had similar backgrounds. They had
similar amounts of medical knowledge and had had similar
practical experiences. A more sophisticated group such as
physicians would have had a more divergent set of experiences.
Since it has been found that medical students' information
gathering skills vary as they progress through a training
program (34), a group of students at different levels in
medical school would also be inappropriate.

4. This group was judged to be interested in the materials to
be used in the study since the problems resemble Part III of
the examination of the National Board of Medical Examiners
which some of these students would be taking within a year of
the study.

Procedures

All subjects were given a modified version of three Patient
Management Problems (PMP) developed by the Interdepartmental
Appraisal Committee of the University of Illinois College of Medicine.
The experimenter administered each problem individually to each
subject. After reading and checking the common instructions as well
as experimental group-specific instructions (see below), each subject
was asked to request information from that available. The information
was made available to the subject by a cue sheet containing numbered
items. As the subject requested information he recorded the number
identifying each item. The experimenter then handed that information,
printed on a file card, to the subject. All subjects, regardless of
group assignment were requested to record a differential diagnosis at
the end of each problem. All subjects did the three problems in the
same order.

The independent variables manipulated were instructions concerning
hypothesis generation and verbalization. Each subject was assigned
to one of six groups as follows:

Group E-V (instructions to generate hypotheses early and
verbalize) -- Subjects in this group were instructed to generate
diagnostic hypotheses as early in the problem as possible and, if
they wished, to use these hypotheses to guide their data gathering.
They were also stopped periodically during the problem and asked to
write down any diagnostic hypotheses they had at that point as well as
describe how and when those hypotheses were generated.

Group E-NV (instructions to generate hypotheses early but no
verbalization) -- Subjects in this group were instructed to generate
diagnostic hypotheses as early in the problem as possible, and, if
they wished, to use them to guide the course of their data gathering.
At the end of each problem each subject was asked to go back over the
problem orally with the experimenter and indicate what hypotheses were
generated and at what point in the problem generation occurred.

Group L-V (instructions to generate hypotheses only after all the
data are in and to verbalize) -- Subjects in this group were admonished
to withhold judgment about diagnostic hypotheses until most or all of
the data were in. They were stopped periodically during the problem
and asked how they were coming along and what their thoughts were about
the data they had gathered up to that point.

Group L-NV (late generation, no verbalization) -- Subjects in this
group were similarly admonished to reserve judgment about diagnostic
hypotheses until the end of the problem. They were then asked to
review the problem orally and comment on what their thoughts had been
about the progress of the problem at certain points.

Group C-V (no hypothesis generation instructions, verbalize) --
Subjects in this group were given no instructions concerning hypothesis
generation. They were simply stopped periodically and asked if they
had any idea as to where the problem was going or any other comments
on the problem up to that point.

Group C-NV (no hypothesis generation instructions, no
verbalization) -- As in Group C-V above, subjects were given no instructions
about hypothesis generation. At the end of each problem subjects
were asked to review the problem orally with the experimenter. Review
questions emphasized where the subject thought that problem was going
at certain points.

The instructions which were read to each subject are contained
in Appendix A.

Description Of The Patient Management Problems And Modifications Made
For The Present Study

The Patient Management Problems (PMP) were developed over a
number of years by the Interdepartmental Appraisal Committee of the
University of Illinois College of Medicine. A recently published book,
Clinical Simulations (35), contains a large selection of the problems
developed to date. Each problem begins with a brief introduction
containing some information about the "patient", including the chief
complaint. All problems deal with a patient with some kind of ailment
requiring a physician's care. None are of the healthy recruit or
insurance physical variety. Having obtained the initial information,
the task of the examinee (problem solver) is to gather more information
in the interest of diagnosing and/or managing the patient. The
particular subset of problems to be used in modified form for this
experiment consists of a booklet containing four problems (three will
be used). In this booklet are the introductory information and a list
of further types of information available. Accompanying the booklet is
a set of answer sheets and figures. The answer sheets contain the
answers to the "questions" the examinee may choose from the booklet.
The figures (blood smears, x-rays, etc.) are non-verbal answers to
questions in the booklet. To obtain an answer to a question the
examinee must rub out an opaque overlay covering that section of the
answer sheet corresponding to the question he asked. Correspondence is
achieved by numbering each possible question and each response. No
track is kept of the order in which an examinee requests information,
except for a record of the order in which certain sections of a problem
are done. What hypotheses may be governing his search is not determined
in the original version of the booklet. A special version developed
for the MSU Medical Inquiry Project asks the problem solver to list
the hypotheses he is considering at the end of each problem section.

The PMP's were developed as a result of early work done by
Rimoldi (36). Rimoldi's Test of Diagnostic Skills gives the subject
some initial information about the patient and asks the subject to
obtain further information to solve the problem. The solution is a
diagnosis of the patient's illness. Additional information is provided
on cards contained in a problem folder. On one side of each card is
printed the question the subject may wish to ask. History questions,
possible physical exam manipulations, and laboratory and other studies
are included. On the reverse side of each card is printed the answer
to each question. All questions available for any one problem are
displayed for the subject at one time. The sequence of each subject's
choices is recorded by the experimenter. Scores on the Test of
Diagnostic Skills relate to the number and usefulness of items
chosen; the order in which choices were made; and the accuracy of the
final diagnosis. Primary emphasis is placed on the utility of each
choice. Utility is defined in terms of the frequency with which each
item is chosen by the group taking the test. An item chosen by many
subjects has high utility. A high score is gained by choosing as few
high utility items as possible. Rimoldi claims (36) that the reasoning
of the subject as he reaches a diagnosis may be assessed via these
problems.
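
The notion of utility described above, the frequency with which an item is chosen by the group taking the test, can be illustrated with a short Python sketch. It computes only the per-item utility as a proportion of subjects; it is not Rimoldi's full scoring procedure, which the text does not specify. The function name and the item identifiers in the usage note are hypothetical.

```python
from collections import Counter


def item_utilities(selections):
    """Return the proportion of subjects who chose each item.

    `selections` maps a subject identifier to the items that subject chose.
    An item chosen by many subjects receives a utility close to 1.0.
    """
    n_subjects = len(selections)
    counts = Counter()
    for items in selections.values():
        counts.update(set(items))  # count each item at most once per subject
    return {item: count / n_subjects for item, count in counts.items()}
```

For example, with three hypothetical subjects, `item_utilities({"S1": ["hx_1", "lab_3"], "S2": ["hx_1"], "S3": ["hx_1", "pe_2"]})` assigns `hx_1` a utility of 1.0, since every subject chose it, and assigns `lab_3` and `pe_2` a utility of one third each.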

The PMP's differ from the Test of Diagnostic Skills in a number of
ways. First, PMP's are diagnosis and management problems, not just
diagnostic problems. Secondly, the format of the PMP offers the subject
a number of alternative and equally good routes through a problem.
Only one optimal route through a problem will result in a high score
on the Test of Diagnostic Skills. Thirdly, the number of options
available to a subject doing a PMP is usually greater than in the Test of
Diagnostic Skills, thus reducing the effect of cueing. Lastly, and
most important, the PMP's are scored in such a way as to measure
something quite different from what is measured by the Test of
Diagnostic Skills. Items are weighted as to their value to the problem
solver. Strongly positive weights are given to those items which help
the subject to diagnose and manage his patient. In addition positive
weights are given to items which should be included in a thorough workup
and conscientious management plan. Negative weights are assigned to
items which should not be chosen (e.g. because they may be costly to the
patient) and zero weights are assigned to items which are non-
contributory or simply distractors. The decision on assignment of
weights is made by a group of criterion physicians.

Subjects are scored for over-all competence in working up and
managing a patient (see 37). A proficiency score as well as an
efficiency score is calculated. In contrast to the Test of Diagnostic
Skills, strong emphasis is placed on proficiency, a kind of selective
thoroughness, and less emphasis is placed on efficiency, or reaching a
solution in the fewest possible steps. A high score is obtained by
choosing a reasonably large number of positively weighted items or by
doing a thorough workup (diagnosis as well as management) of the
patient. Distortions in performance due to cueing are greatly reduced
by offering a large number of options and making it difficult for the
subject to scan all available options at once.

PMP scores are calculated as follows:

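Definition of terms:

H_s = Positively weighted items selected by S
h_s = Negative and zero weighted items selected by S
MAX = Sum of all positive weights possible

\[ \text{Efficiency} = \frac{\sum H_s}{\sum H_s + \sum h_s} \]

\[ \text{Proficiency} = \frac{\sum \text{weights of } H_s + \sum \text{weights of } h_s}{\text{MAX}} \]

\[ \text{Errors of Omission} = \frac{\text{MAX} - \sum \text{weights of } H_s}{\text{MAX}} \]

\[ \text{Errors of Commission} = \frac{-\sum \text{weights of } h_s}{\text{MAX}} \]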
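
As an illustration, the four scores defined above might be computed as in the following Python sketch. It rests on two assumptions that the source does not state outright: that Efficiency is a ratio of item counts (its definition, unlike the other three, does not mention weights), and that each selected item is counted once. The function name and the item weights in the usage note are hypothetical.

```python
def pmp_scores(selected, weights):
    """Compute PMP-style scores for one subject on one problem.

    `selected` is the collection of item ids the subject chose;
    `weights` maps every item id in the problem to its criterion weight.
    """
    chosen = set(selected)
    positive = [item for item in chosen if weights[item] > 0]   # H_s
    other = [item for item in chosen if weights[item] <= 0]     # h_s (negative and zero)
    max_score = sum(w for w in weights.values() if w > 0)       # MAX
    if max_score <= 0:
        raise ValueError("the problem must contain at least one positively weighted item")

    positive_sum = sum(weights[item] for item in positive)
    other_sum = sum(weights[item] for item in other)
    n_chosen = len(positive) + len(other)

    return {
        # Assumption: efficiency counts items rather than summing weights.
        "efficiency": len(positive) / n_chosen if n_chosen else 0.0,
        "proficiency": (positive_sum + other_sum) / max_score,
        "errors_of_omission": (max_score - positive_sum) / max_score,
        "errors_of_commission": -other_sum / max_score,
    }
```

With hypothetical weights `{"a": 2, "b": 1, "c": 0, "d": -1, "e": 1}` and a subject who selects items a, b, and d, MAX is 4, so Efficiency is 2/3, Proficiency is 0.5, Errors of Omission is 0.25, and Errors of Commission is 0.25. Under these definitions Proficiency, Errors of Omission, and Errors of Commission always sum to 1, which provides a convenient check on any hand calculation.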

 


The subject receives all of the above scores for his diagnostic
performance and his management performance. He also receives a score
for attack strategy by which he is rewarded for following an appropriate
sequence of sections and penalized for doing certain sections out of
order.

Reliability Of The PMP

 

Since the PMP is an unconventional (i.e. not multiple choice or
true-false) achievement test, estimates of its reliability cannot be
made by using standard methods such as the Spearman-Brown or the Kuder-
Richardson formulas. Thus alternative approaches must be developed (38).
For these purposes reliability cannot be considered strictly as the
accuracy with which a test measures something, reflecting a judgment
about how closely a subject's score on the test approximates his
true score. One must instead consider the purpose which estimations of
reliability serve, namely to show with what consistency a test
measures what it purports to measure.

Calculations of measurement consistency for the PMP have been
made in two areas: 1) consistency across different ways of scoring the
test, and 2) consistency across different but similar tests (38). In
the first instance two methods were used. First the standard weights of
items were changed to increase penalties for errors as well as to
increase reward for correctness. Spearman's rho was computed on tests of
the six medical specialties in the battery for the two systems of rating.
Rho ranged between .95 and .97. A more significant change in the scoring
procedure was to ask two groups of judges to independently assign
weights to each item (previous weights had been assigned by consensus
of one group of judges). Subjects' scores were then recomputed using
the two new sets of weights. Correlations between scores using the
two sets of weights ranged from .92 to .95.
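
For readers unfamiliar with the rank correlation used in the first comparison, the following Python sketch shows one way Spearman's rho between two scorings of the same subjects could be computed. It is a generic illustration that assigns tied scores their average rank; it is not the computation reported in (38), and any scores supplied to it here would be hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation between the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when all values in a list are tied")
    return cov / (var_x * var_y) ** 0.5
```

A value of rho near 1 indicates that two scoring systems rank the subjects in nearly the same order, which is the sense in which the values of .95 to .97 reported above indicate consistency across weighting schemes.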

Consistency across different tests was assessed in a number of ways,
two of which were: 1) consistency across subtests, and 2) consistency
across problems in different disciplines. One estimation of consistency
across subtests was obtained by dividing a given problem into two
problems, each of which tested the same competence factors as were
tested in the original whole test. Correlations were then computed
between scores on the subtest and on the whole test for all disciplines
included in the battery. Average correlations ranged between .445 and
.912, a result which does not differ greatly from that obtained on
multiple choice tests. Consistency across problems in different
disciplines was obtained by correlating scores on the whole battery
of twelve problems as they occur in one discipline (e.g. Surgery) with
scores on those problems as they occur in another discipline
(e.g. Medicine). A coefficient alpha of .56 was obtained.

The investigators feel that these estimations of consistency
of measurement show the PMP to be a reliable instrument.

Validity Of The PMP

 

The validity of the PMP, or how well it measures what it purports
to measure, has been considered from four points of view (39).

1. Predictive validity, or how well performance on the test
predicts subsequent performance of a similar type. Thus far
follow-up studies which would evaluate the predictive
validity of the PMP have not been done.

2. Concurrent validity, or how well performance on this test
resembles performance on another test that is considered a
true measure of the competence in question. Performance of
physicians on the PMP was compared to their performance of
data gathering in an actual clinical setting as revealed by
chart audits. It was found that physicians recorded considerably
less data on their charts than they requested on the PMP.
However when six items considered critical to solution of the
PMP were singled out, a rather high correspondence was found
between physicians' recording of that item on the chart and
their requesting that item on the PMP. It can be stated then
that performance on the chart and performance on the PMP are
highly similar for certain specific relevant items. No
correlations were reported (39, p. 9).

3. Content validity, or how closely intellectual processes used to
solve the PMP resemble those used in a clinical setting. The
correspondence has been assessed by asking subjects to comment
on their thought processes while solving the problems and
compare them to the processes used in practice. Anecdotal
data indicate that the process of responding to PMP's closely
simulates the thinking process one goes through in a clinical
setting (39, p. 4). Again, no statistical data are reported.

4. Construct validity, or how closely differences in performance
of different groups on this test correspond to reasonable
hypotheses about how the groups should differ. Two reasonable
hypotheses considered were: 1) As a subject grows older and
more experienced, he will tend to make a decision based on
less information. It was found that experienced physicians
gave a less thorough performance on the PMP than medical
students. No data are reported. 2) As a subject grows older
and more experienced, he will be more willing to take prompt,
even radical action. In a study using residents, candidates
and examiners as subjects and a problem calling for amputation,
36% of the residents, 40% of the candidates, and 50% of the
examiners chose amputation as a course of action.

Since the PMP is an effort to use a paper and pencil simulation
to measure a highly complex kind of competence, a high degree of
correspondence between performance on this simulation and performance
in a clinical situation cannot be expected. The authors (39) feel that
the degree of correspondence that has been found demonstrates that the
PMP is an adequate simulation of clinical problem solving.

Modifications To The PMP For The Present Study

 

General

The present study used the PMP's more as an observation instrument
than an evaluation instrument. Subjects were not scored relative to any
criterion. Judgment was not passed as to the quality of a subject's work-
up, only as to the accuracy of his diagnosis. This use of the PMP
justifies certain changes in the format as well as in the scoring of the

problems.

Format modifications
1. The PMP presents available information in a booklet format.

The booklet also includes instructions for progressing through


the problem, i.e. instructions about what section of the problem
the subject should do next. The booklet format and instructions
may restrict the subject's freedom of choice in requesting
information. In the interest of observing the sequence which

a subject would naturally choose in this situation the
presentation was changed to a set of cue sheets and the
instructions were omitted. The cue sheets for each problem

are contained in Appendix B.

2. The PMP makes available the information requested by having
the subject rub out an opaque overlay, thus revealing the
"answer". This format enables the subject to obtain more than
one item of information at a time and denies the experimenter
the opportunity of keeping track of the order in which infor-
mation was requested. In the present study information was
presented in printed form on cards one item at a time. This
procedure assured that the subject would receive no more infor-
mation than he requested, and it facilitated the recording of

the order in which items were presented.

Scoring modifications

 

1. Thoroughness -- The PMP Proficiency (or selective thoroughness)
score is a calculation of the percentage of positive points (or
total weights of positive items) chosen by the subject (see

p. 32 for formula). This way of calculating thoroughness rewards
subjects as much for choosing a smaller number of heavily
weighted items as for choosing a larger number of less heavily

weighted items. Perhaps this is why it is called a Proficiency


rather than a thoroughness score. In the opinion of this
experimenter a more thorough solution is given by choosing more
positively weighted items regardless of the magnitude of their
weight, as well as by choosing zero-weighted items. For that
reason thoroughness for this experiment was the percentage of
positively and zero-weighted items chosen by the subject (see
formula p. 39). In addition to fulfilling the criteria

stated above, this method of calculating thoroughness yields

a score having the same metric as the efficiency score which
is the percentage of all the items the subject chose which
were positively weighted.

2. In the light of the purpose for which the present version of
the PMP was used, no errors of omission or of commission
were calculated. Furthermore, no separate efficiency and
thoroughness score for diagnosis and management were computed,
nor was an overall competency score determined.

3. Accuracy, which is not calculated for the PMP, was computed on
a five-point scale (0, 1, 2, 3, 4). The accuracy of each
subject's definitive diagnosis was determined by comparing his
diagnostic formulation with a set of often-entertained
diagnostic hypotheses which had been weighted for their
appropriateness by the developers of the PMP.

In summary, problem scores calculated for the study reported here
are:

$$\text{Efficiency} = \frac{\text{Number of positively weighted items chosen by subject}}{\text{Total number of items chosen by subject}}$$

$$\text{Thoroughness} = \frac{\text{Number of positively and zero-weighted items chosen by subject}}{\text{Total number of positive and zero items in problem}}$$

$$\text{Accuracy} = 0,\ 1,\ 2,\ 3,\ \text{or}\ 4$$

Procedures for determining Thoroughness, Efficiency and Accuracy

scores are contained in Appendix C.
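
The two ratio scores defined above can be illustrated with a minimal sketch. It is not the scoring procedure of Appendix C. The item names, weights, and choices below are hypothetical, and the ratios are multiplied by 100 on the assumption that both scores are reported as percentages, as the text describes them.

```python
def efficiency(chosen, weights):
    """Percentage of the subject's chosen items that are positively weighted."""
    if not chosen:
        raise ValueError("the subject chose no items")
    positive = sum(1 for item in chosen if weights[item] > 0)
    return 100.0 * positive / len(chosen)


def thoroughness(chosen, weights):
    """Percentage of the problem's positive and zero-weighted items that were chosen."""
    eligible = {item for item, w in weights.items() if w >= 0}
    if not eligible:
        raise ValueError("the problem has no positive or zero-weighted items")
    picked = sum(1 for item in set(chosen) if item in eligible)
    return 100.0 * picked / len(eligible)


# Hypothetical item weights for one problem (not taken from the actual PMP's).
weights = {"a": 2, "b": 1, "c": 0, "d": -1, "e": 3, "f": 0, "g": -2, "h": 1}
# Hypothetical set of items requested by one subject.
chosen = ["a", "c", "d", "h"]

print(efficiency(chosen, weights))    # 2 of the 4 chosen items are positive
print(thoroughness(chosen, weights))  # 3 of the 6 positive or zero items were chosen
```

Because both scores are percentages of item counts rather than of summed weights, a subject gains as much for a lightly weighted item as for a heavily weighted one, which is the property argued for above.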

Validity And Reliability Of The Modified Problems

To make judgments about content and concurrent validity of the
modified problems, performance on two of these problems by the
subjects in the present study was compared to performance of physicians
on the same two problems in the "original PMP" form as well as to
those physicians' performance on two similar problems presented as
high fidelity simulations. Since the same subjects did not complete
all three types of simulation (modified PMP, original PMP, and high
fidelity simulation) the comparisons are at best crude. Generally,
it was hoped that students' thought processes as well as other aspects
of their performance on the modified PMP's would closely resemble
physicians' performance on the high fidelity simulations. Where
comparisons between all three types of problems were possible, it was
hoped that the comparison of student performance on the modified
PMP's with physician performance on the high fidelity simulations
would be more favorable than comparison of physicians' performance
on the original PMP's with their own performance on the high fidelity

simulations.


The original PMP's used for comparison here are contained in the
booklet of four problems mentioned earlier. The only difference
between this booklet and the booklet originally created by the
University of Illinois group is that the former contains opportunities
for subjects to list the diagnostic hypotheses they were entertaining
at various points in the problem.

The high fidelity simulations used as criterion for judgments
about concurrent validity were two cases presented by programmed
patients in a realistic office setting (for further description of
these simulations see reference 6). These simulations were chosen
for this purpose because they closely resemble the actual clinical
setting, and because measures were available from these simulations
which could be compared to similar measures on the two types of PMP.

Regarding content validity, the intellectual processes used by
students to solve the modified PMP's closely resembled those used by
physicians on the high-fidelity simulations in at least two ways. First,
subjects doing both types of simulation spontaneously generated diagnostic
hypotheses early in the problem after having obtained very few cues.
Secondly, subjects doing both types of simulation used the cues they
obtained to help them evaluate these hypotheses. Furthermore there
were instances where a cue was elicited specifically in the interest
of testing out a hypothesis. Information is not available on what
specific strategies were used to solve the original PMP's so a
comparison cannot be made at present between them and the high-fidelity
simulations.

Certain tentative conclusions can be drawn concerning concurrent

validity by comparing means and standard deviations of Thoroughness,


Efficiency, and Accuracy scores on the three types of simulations.
These means and standard deviations are presented in Table 1.

The GI problem presented in the high fidelity simulations
completed by physicians was a case of ulcerative colitis in a 22-year-
old male. The content of the original PMP completed by physicians was
identical to that of the modified PMP completed by students. The
hematological problem presented in the high fidelity simulation was
a case of hereditary spherocytosis and infectious mononucleosis in a
21-year-old female. The content of the original and modified PMP
was identical.

The thoroughness score calculated for the high fidelity simulation
and the original PMP was the percentage of cues available in the case
which was acquired by the subject. The thoroughness score calculated
for the modified PMP was the percent of positive and zero-weighted
items (using the weighting system described on p. 31 and 32) chosen by
the subject. The efficiency score calculated for the high-fidelity
simulations and the original PMP was the percent of cues acquired
by the subject which were critical for one or more of his diagnostic
hypotheses. Efficiency on the modified PMP was the percent of cues
acquired by the subject which were positively weighted. Accuracy
on the high-fidelity simulations and the original PMP ranged from
zero to two. Accuracy on the modified PMP ranged from zero to four.

Thoroughness and efficiency scores on the three types of
simulation may be compared by inspection of Table 1. For the GI
problem average thoroughness on the modified PMP is more similar to
thoroughness on the high-fidelity simulation than is thoroughness on the

original PMP. However, efficiency on the modified PMP is less


Table 1
Means and Standard Deviations of Thoroughness, Efficiency, and
Accuracy on Two Problems for Three Types of Simulations

Problem         Simulation         Thoroughness     Efficiency      Accuracy

GI              Hi fi Simulation   47.14 ( 7.99)*   32.36 ( 9.09)   1.63 ( .79)
                Original PMP       32.67 (11.44)    44.60 ( 7.77)   1.62 ( .51)
                Modified PMP       40.79 (13.08)    62.79 ( 9.24)   2.92 ( .78)

Hematological   Hi fi Simulation   57.43 ( 6.40)    46.81 ( 7.02)   1.80 ( .40)
                Original PMP       41.07 (22.94)    67.53 (17.65)   1.23 (1.01)
                Modified PMP       40.17 (12.88)    65.50 (13.32)   2.04 (1.73)

* Standard deviations are in parentheses


comparable to efficiency on the high fidelity simulation than is
efficiency on the original PMP. Thus, although average thoroughness on
the modified PMP compares most favorably with that on the high fidelity
simulation, this is not the case for average efficiency. For the
hematological problem thoroughness on the modified PMP compares less
favorably with that on the high fidelity simulation than does that on
the original PMP. The opposite is true of the average efficiency
scores. It should be noted here that the thoroughness and efficiency
scores on the original and modified PMP's are quite similar to each
other for this problem and neither one compares particularly favorably
with its counterpart in the high fidelity simulations.
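
The comparisons by inspection described above amount to asking which paper format has the smaller absolute gap from the high fidelity mean. A minimal sketch of that arithmetic follows, using the Table 1 means. The dictionary layout and the function name are illustrative assumptions, and the sketch ignores the standard deviations and the differences in scoring method noted in the text.

```python
# Means from Table 1 (thoroughness and efficiency only).
TABLE_1_MEANS = {
    "GI": {
        "hi_fi": {"thoroughness": 47.14, "efficiency": 32.36},
        "original": {"thoroughness": 32.67, "efficiency": 44.60},
        "modified": {"thoroughness": 40.79, "efficiency": 62.79},
    },
    "hematological": {
        "hi_fi": {"thoroughness": 57.43, "efficiency": 46.81},
        "original": {"thoroughness": 41.07, "efficiency": 67.53},
        "modified": {"thoroughness": 40.17, "efficiency": 65.50},
    },
}


def closer_to_hi_fi(problem, measure):
    """Return the paper format whose mean lies nearer the high fidelity mean."""
    rows = TABLE_1_MEANS[problem]
    target = rows["hi_fi"][measure]
    gaps = {fmt: abs(rows[fmt][measure] - target) for fmt in ("original", "modified")}
    return min(gaps, key=gaps.get), gaps


for problem in TABLE_1_MEANS:
    for measure in ("thoroughness", "efficiency"):
        winner, gaps = closer_to_hi_fi(problem, measure)
        print(problem, measure, winner, {k: round(v, 2) for k, v in gaps.items()})
```

The modified PMP is nearer on two of the four comparisons and the original PMP on the other two, and on the hematological problem the two gaps differ by only about one to two points.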

In summary, the overall comparison of modified PMP scores with
high fidelity simulation scores is not more favorable than the
comparison of original PMP and high fidelity simulation scores. The
author feels that any comparisons favoring the modified PMP might
have been serendipitous. The data are taken from two different
samples; the scores are obtained by different means; and the content
of the problems is not identical for all three types of simulation;
all of which renders conclusions about the meaning of these
comparisons somewhat doubtful.

It may be appropriate to establish the validity of a paper
simulation by comparing performance on it with that on a high fidelity
simulation. However, the same subjects should participate in both
simulations, the content of both simulations should be identical, and
the scores on outcome variables should be arrived at in the same way
for both types of simulation. Until this is done, no definite
conclusions can be drawn about the validity of these PMP's.


Reliability

 

In the interest of determining whether the three problems could be
considered a three-item test which measured thoroughness, efficiency
and accuracy of diagnostic problem solving independent of experimental
group assignments, these scores were correlated (Pearson r). The

results were as follows:

Thoroughness
   Prob 2     .32
   Prob 3     .77      .55
             Prob 1   Prob 2

Efficiency
   Prob 2     .09
   Prob 3     .10      .32
             Prob 1   Prob 2

Accuracy
   Prob 2     .19
   Prob 3     .09     -.15
             Prob 1   Prob 2

Problem 1: Acute Abdomen

Problem 2: Pale Lethargic Child

Problem 3: Pale Confused Patient

The large amount of variability in these correlations indicates

that there is considerable lack of consistency in Thoroughness,

Efficiency and Accuracy across problems and that therefore each


problem should be considered a separate test. All subsequent
analyses are done on each problem separately.
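
For reference, the Pearson r used for the between-problem correlations above can be computed as in the following minimal sketch. The two score lists are hypothetical and serve only to show the calculation. The subjects' actual scores are not reproduced here.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical thoroughness scores of five subjects on two problems.
problem_1 = [40.0, 55.0, 35.0, 60.0, 50.0]
problem_3 = [42.0, 58.0, 30.0, 65.0, 49.0]
print(round(pearson_r(problem_1, problem_3), 2))
```

The guard for zero variance matters for these data: a score on which every subject obtains the same value cannot be correlated with anything.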
As stated earlier (page 33) the type of reliability estimates
appropriate for these problems are those estimates which reflect
the consistency with which a test measures what it purports to
measure. With this in mind, the internal consistency of each problem
was calculated using one of the procedures described by Lewy and
McGuire (38). Each problem was divided into sections as follows:
Problem I -- Credited introductory items (i.e. items 1a and b,
   2, 3b, 3c, 4a-c, 9a-c on the cue sheets, see Appendix C)
   Physical exam
   Laboratory
   Non-surgical intervention
   Surgical intervention
Problem II -- Diagnosis, Prognosis
Problem III -- Credited introductory items (i.e. items 1a-d,
   2a-d, 3-7 on cue sheets, see Appendix C)
   History
   Physical examination
   Laboratory
   Therapy
A subproblem was then constructed from each problem by picking
every third item, beginning with the first item in each section. Ten
randomly selected subjects were then each given a score on the whole
problem and on the subproblem, each score being the sum of the weights of
the items he chose in the whole problem or the subproblem. The weights used were


those assigned the items by the University of Illinois criterion
group. These scores served as the basis for estimation of the
reliability (internal consistency) of each problem.
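
The construction of the subproblem and the two scores can be sketched as follows. This is an illustration under stated assumptions only: the section contents, item names, weights, and the subject's choices are hypothetical, and the sketch stops at the two scores, without attempting the Angoff formula itself.

```python
def build_subproblem(sections):
    """Pick every third item, beginning with the first item in each section."""
    picked = []
    for items in sections.values():
        picked.extend(items[::3])
    return picked


def weighted_score(chosen, weights, pool):
    """Sum the weights of the chosen items that belong to the given item pool."""
    pool = set(pool)
    return sum(weights[item] for item in chosen if item in pool)


# Hypothetical sections and criterion-group weights (not the real cue sheets).
sections = {
    "Physical exam": ["pe1", "pe2", "pe3", "pe4", "pe5", "pe6", "pe7"],
    "Laboratory": ["lab1", "lab2", "lab3", "lab4"],
}
weights = {
    "pe1": 2, "pe2": 0, "pe3": -1, "pe4": 1, "pe5": 3, "pe6": 0, "pe7": 2,
    "lab1": 1, "lab2": -2, "lab3": 2, "lab4": 0,
}
# Hypothetical items requested by one subject.
chosen = ["pe1", "pe3", "pe4", "lab1", "lab3"]

all_items = [item for items in sections.values() for item in items]
sub_items = build_subproblem(sections)

whole_score = weighted_score(chosen, weights, all_items)
sub_score = weighted_score(chosen, weights, sub_items)
print(sub_items, whole_score, sub_score)
```

Restarting the count within each section keeps every section represented in the subproblem, which a single every-third pass over the whole item list would not guarantee.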

Internal consistency was calculated using the formula derived
by Angoff (42) and his correction for spuriousness (43). The
resulting internal consistencies were as follows:

Problem I: .80

Problem II: .34

Problem III: .87

These coefficients indicate that Problems I (Acute Abdomen) and
III (Pale Confused Patient) were internally consistent. Internal
consistency is interpreted here as meaning that the set of items which
was chosen for the subtest was a representative sample of the items
used in the whole test. This apparently was not the case for
Problem II (Pale Lethargic Child). The items which made up the
subtest were not a representative sample of the frequency distribution
for subjects' choices of items on the whole test. On the whole test
the ratio of number of items chosen by more than half of the subjects
to number chosen by less than half of the subjects was approximately
two to one. On the subtest this ratio was closer to four to one.

In the opinion of this writer, these coefficients do not reflect
the generalizability of the results on this test to other tests
dealing with the same type of patient and containing a similar factor
(defined as sectional divisions) structure. It is felt that to
accomplish this goal a set of parallel tests containing items selected

from a pool representative of the universe of items appropriate for


this type of patient would have to be constructed, and a test-retest
reliability would have to be calculated.

Additional comments are in order concerning the use on these
tests of any reliability coefficient based on part-whole or part-part
correlations. The choice of an item in these problems may depend
in one of two ways on the choice of another item in the problem. First,
some items are redundant as discussed above. If one of a pair is
chosen in any problem then the other member of the pair logically
cannot be chosen. When calculating the reliability of the test one
item cannot be automatically declared the redundant one and its weight
dropped from the calculation. Thus all redundant items are included
in the reliability. It is sometimes possible to create a subtest
that is equivalent in its redundancy to the whole test, but this
possibility is not always available. Secondly, irrespective of
redundancy, the items are interdependent from the point of view of
the problem solver. What he has learned by a certain point in the
problem may determine his choice of subsequent items. Most formulas
for calculating reliability, especially those single-administration
procedures such as the split-half coefficient or the Angoff formula
used here, depend heavily on the assumption that the test items
are independent of one another. This assumption is violated by the

problems discussed here.

CHAPTER IV
ANALYSIS OF RESULTS, SUMMARY AND DISCUSSION

The results of the study were analyzed in two ways. First, the
experimental hypotheses were tested using the multivariate analysis
of variance program developed by Jeremy Finn (44). This procedure
was chosen for the bulk of the analysis for a number of reasons. First,
all dependent variables were interrelated. Not only was there a
close relationship among some independent measures within a problem,
but it could not be assumed that performance on any one problem
was independent of that on any other. Secondly, the hypotheses
were phrased in such a way as to require the analysis of clusters of
variables rather than isolated ones. The multivariate procedure
permits this type of analysis.

The second type of analysis was a process or clinical analysis
of certain subjects' performances. This second analysis was done
in the interest of identifying other elements not included among the
initial set of variables which might be contributing to variations in

performance.

Tests Of Hypotheses Concerning Hypothesis Generation And Verbalization
The first hypothesis to be tested was Hypothesis 13, to
establish the statistical independence of earliness of hypothesis

generation and number of hypotheses generated. The hypothesis of


statistical independence should be rejected if $\chi^2 > \chi^2_{1}(.95) = 3.84$.
A statistic of $\chi^2_{1} = 1.87$ was obtained, thus confirming the independence
of the two variables.
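
The decision rule just stated can be checked with a short sketch. It assumes one degree of freedom, which is what the critical value of 3.84 implies, and it takes the obtained statistic of 1.87 as reported rather than recomputing it from the raw frequencies. The function name is illustrative.

```python
from statistics import NormalDist


def chi2_critical_df1(confidence=0.95):
    """Critical value of chi-square with 1 df.

    A chi-square variable with 1 df is the square of a standard normal
    deviate, so its quantile is the squared two-sided normal quantile.
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return z * z


critical = chi2_critical_df1(0.95)
reported = 1.87  # statistic reported in the text
reject_independence = reported > critical

print(round(critical, 2), reject_independence)
```

Since the reported statistic falls below the critical value, the hypothesis of independence is not rejected.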
The next hypotheses to be tested were 1-4 and 8-10. Specifically
these hypotheses are:
1. Subjects instructed to generate hypotheses early will
   give a more efficient, less thorough, and more accurate
   performance than those instructed to withhold judgment.

2. Subjects given no instructions about hypothesis generation
   will show the same pattern as in #1 above.

3. Subjects constrained to verbalize will give a more accurate
   solution than those without that constraint.

4. Instructions to verbalize will have no effect on the
   efficiency or thoroughness of performance.

8. Subjects instructed to generate hypotheses early will
   generate more hypotheses than those instructed to
   withhold judgment.

9. Subjects given no instructions about hypothesis
   generation will show a similar pattern to #8 above.

10. Instructions to verbalize will have no effect on the
    number of hypotheses generated.

Table 2 shows the results of the multivariate analysis of variance
relevant to those hypotheses. Note that this analysis could not
evaluate the effects of independent variables on Problem III Accuracy
since this variable had no variance. Although this occurrence is
unfortunate from an experimental viewpoint, it points to possible
problem specific differences in the study. Only for this problem did
all subjects reach the correct solution. Based on the results shown

in Table 2 the hypotheses that instructions about hypothesis generation
would have a differential effect on thoroughness, efficiency and

accuracy of performance as well as on the number of hypotheses


Table 2
MANOVA of Effects of Instructions on Number of Hypotheses, Thoroughness,
Efficiency, and Accuracy

[Table body illegible in the source scan. The table gives cell means for
the early, late, and control groups, with and without verbalization, on
number of hypotheses, thoroughness, efficiency, and accuracy for
Problems I, II, and III, followed by the multivariate and univariate
tests of the hypothesis generation instructions, the verbalization
instructions, and their interaction.]


generated must be rejected. Specifically hypotheses 1, 2, 8 and 9
must be rejected.

The multivariate analysis of hypotheses concerning the effect
of verbalization constraints on outcome shows that these hypotheses
probably should not be rejected. An investigation of the univariate
effects of verbalization shows that verbalization probably did not
affect the accuracy of performance, thus leading to the rejection of
hypothesis 3. Further, there was no apparent effect of verbalization
on the thoroughness or efficiency of performance. Hence, hypothesis
4 cannot be rejected. On the other hand verbalization did have an
apparent effect on the number of hypotheses produced, especially
on Problem III. Subjects constrained to verbalize produced more
hypotheses for Problem III than those without that constraint, which
points again to possible problem specific differences.

In addition to the verbalization differences, there were
differences in number of hypotheses generated due to the interaction
of hypothesis generation instructions and verbalization. These
differences were particularly evident again on Problem III. This
interaction is plotted in Figure 1. Both the vertical and horizontal
dimensions of the interaction are of interest. On the vertical
dimension subjects instructed to generate hypotheses early and
subjects given no hypothesis generation instructions (control)
behaved similarly. Both groups generated more hypotheses when
constrained to verbalize than when not under that constraint. Subjects
instructed to generate hypotheses late showed the opposite pattern.
They generated fewer hypotheses when constrained to verbalize than

when not constrained to do so. On the horizontal dimension subjects


Figure 1: Instructions X Verbalization interaction, Problem III
[Plot of number of hypotheses against hypothesis generation instructions
(E, L, C), with separate lines for the Verb. and No Verb. groups.]

Figure 2: Instructions X Verbalization interaction, Problem I
[Plot of number of hypotheses against hypothesis generation instructions
(E, L, C), with separate lines for the Verb. and No Verb. groups.]


who verbalized produced more hypotheses when instructed to generate
hypotheses early or given no instructions, than they did when instructed
to generate hypotheses late. Conversely subjects who were not con-
strained to verbalize produced fewer hypotheses when instructed to
generate hypotheses early or given no instructions than when told to
generate hypotheses late.

In the interest of determining whether this was a problem specific
effect, the relation of instructions and verbalization to number of
hypotheses was plotted for the other two problems as well as for the
average number of hypotheses generated by each subject over the
three problems. As shown in Figures 2, 3 and 4 these are also
interactions. It appears then that this is not a problem specific
phenomenon. Except for Problem I the interactions are disordinal.
The interaction pattern for Problem I is dissimilar in that
1) subjects in the "early" group who verbalized produced fewer
hypotheses than those who did not verbalize, and 2) verbalizers
produced fewer hypotheses in the "early" group than in either the
"late" or control group.

A statistical analysis was not conducted to evaluate the
following hypotheses:

5. Subjects instructed to generate hypotheses early will
   generate hypotheses earlier than those instructed to
   withhold judgment.

6. Subjects given no instructions to generate hypotheses
   will show a pattern similar to #5 above.

7. Instructions to verbalize will have no effect on how
   early hypotheses are generated.


Figure 3: Instructions X Verbalization interaction, Problem II
[Plot of number of hypotheses against hypothesis generation instructions
(E, L, C), with separate lines for the Verb. and No Verb. groups.]

Figure 4: Instructions X Verbalization interaction, Average
[Plot of number of hypotheses against hypothesis generation instructions
(E, L, C), with separate lines for the Verb. and No Verb. groups.]


Inspection of Table 3 will help clarify why this was not deemed
necessary. As stated earlier, early generation was defined as
generation of the first hypothesis occurring directly after the
reading of the problem introduction. As can be seen from Table 3
this occurred in 75 cases (85% of the time). With this great
difference it was felt that a statistical analysis would contribute
little. Instructions had no apparent effect on subjects' early or
late hypothesis generation behavior. Instructions to generate early
produced only one more early generation than instructions to generate
late (withhold judgment). Subjects in control groups produced one
fewer early generation than subjects instructed to withhold judgment.
Subjects constrained to verbalize produced one fewer early generation
than those without that constraint. Instructions to generate late
and verbalize produced the fewest early generations, but only one
fewer than did instructions to generate early and not verbalize.

It was noted that Problem I produced the most late generations,
a total of 10. Problems II and III combined produced only five late
generations. A $\chi^2$ test for homogeneity across problems was conducted
to evaluate the significance of this occurrence. The hypothesis of
homogeneity should be rejected if $\chi^2 > \chi^2_{2}(.95) = 5.99$. A statistic
of $\chi^2_{2} = 10.22$ was obtained, which confirms a lack of homogeneity of
hypothesis generation across problems. Problem I is producing
significantly more late generations than Problems II and III,
further evidence of problem specific differences.
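
The frequencies behind this test can be tallied by problem directly from Table 3, as in the minimal sketch below. The data structure and function names are illustrative assumptions. The sketch assumes two degrees of freedom, which is what the critical value of 5.99 implies, and takes the obtained statistic of 10.22 as reported rather than recomputing it.

```python
from math import log

# (early, late) frequencies from Table 3, keyed by group and then by problem.
TABLE_3 = {
    ("Verbalize", "Generate early"): {"I": (5, 0), "II": (5, 0), "III": (5, 0)},
    ("Verbalize", "Generate late"): {"I": (1, 4), "II": (5, 0), "III": (4, 1)},
    ("Verbalize", "Control"): {"I": (3, 2), "II": (5, 0), "III": (4, 1)},
    ("No Verbalize", "Generate early"): {"I": (3, 2), "II": (3, 2), "III": (5, 0)},
    ("No Verbalize", "Generate late"): {"I": (5, 0), "II": (5, 0), "III": (5, 0)},
    ("No Verbalize", "Control"): {"I": (3, 2), "II": (5, 0), "III": (4, 1)},
}


def totals_by_problem(table):
    """Sum early and late generations over all six groups for each problem."""
    totals = {}
    for cells in table.values():
        for problem, (early, late) in cells.items():
            e, l = totals.get(problem, (0, 0))
            totals[problem] = (e + early, l + late)
    return totals


def chi2_critical_df2(confidence=0.95):
    """Critical value of chi-square with 2 df.

    With 2 df the distribution function is 1 - exp(-x / 2), so the
    quantile has a closed form.
    """
    return -2.0 * log(1.0 - confidence)


totals = totals_by_problem(TABLE_3)
critical = chi2_critical_df2(0.95)
reported = 10.22  # statistic reported in the text
reject_homogeneity = reported > critical

print(totals, round(critical, 2), reject_homogeneity)
```

The tally shows 10 late generations on Problem I against 2 and 3 on Problems II and III.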

Before pursuing the analysis of hypotheses dealing with the effect

of hypothesis generation performance on the dependent variables, a

discussion of significant findings obtained thus far is in order.

 

Table 3
Hypothesis generation X Groups X Problems
(cells contain frequencies)

                                            Early         Late
                               Problem    generation    generation

Verbalize      Generate early    I            5             0
                                 II           5             0
                                 III          5             0
               Generate late     I            1             4
                                 II           5             0
                                 III          4             1
               Control           I            3             2
                                 II           5             0
                                 III          4             1

No Verbalize   Generate early    I            3             2
                                 II           3             2
                                 III          5             0
               Generate late     I            5             0
                                 II           5             0
                                 III          5             0
               Control           I            3             2
                                 II           5             0
                                 III          4             1


Summary And Discussion Of Verbalization And Interaction Effects

Based on results obtained thus far it would appear that instructing
subjects in hypothesis generation and constraining them to verbalize
has a combined effect on the number of hypotheses generated, but not
on the thoroughness, efficiency, or accuracy of performance.

There was a significant effect for verbalization on Problem III
only. This effect got its primary contribution from the "early" group
where more than twice as many hypotheses were produced by verbalizers
as by non-verbalizers (an average of 13.6 vs. 6.0). Control verbalizers
produced more hypotheses than non-verbalizers, but late verbalizers
produced fewer hypotheses than non-verbalizers. For this reason
primary attention must be given to the effect of interactions between
hypothesis generation instructions and verbalization on number of
hypotheses generated.

It should be noted first of all that the interactions for all
but Problem I show a similar pattern in the "early" and control groups.
Combining this observation with the results presented in Table 3, one
might infer that under instructions to generate hypotheses early,
subjects behave as they would naturally. In other words, instructing
a subject to generate hypotheses early is simply telling him to do
what he would do anyway. It should also be noted that the average
performance simply reflects a general trend across the problems. This
general trend demonstrates that on the average subjects in the "early"
group performed similarly to those in the control group. What needs
explanation is why subjects in the "late" group performed differently
and what there is about Problem I that induces a different hypothesis

generation pattern than Problems II and III.


On the "early-late" dimension "early" instructions led to more
hypotheses being produced by verbalizers than by non-verbalizers,
whereas "late" instructions led to more hypotheses being produced
by non-verbalizers than by verbalizers. It is possible that this
occurred due to a combination of the subjects' efforts to follow
instructions and a change in their perception of the problem once
the decision about the problem solution had been reached. Both
Problems II and III were of the type for which a single diagnosis was

appropriate, i.e. all the data fit together under a single disease

 

rubric. However, these two problems presented some ambiguity at the
outset. In each case the patient's presenting complaint did not lead
subjects to think of the solution immediately. Most subjects (20 in
Problem II, 19 in Problem III) had collected more than half of their
data before generating their solution hypothesis. This ambiguity
combined with instructions to generate hypotheses and to verbalize

may have produced more hypotheses by this group on these problems. On
the other hand the non-verbalizers in the "early" group did not
enumerate their hypotheses until they had completed the problem,

i.e. decided on a solution. In Problem III this was always the correct
solution, and subjects were convinced it was correct. The solution
(Pernicious anemia) was not an unusual one and there was a good deal
of confirmatory data for it. These subjects then, in looking back on
the problem may have suppressed many of the hypotheses generated
during the problem once the solution became obvious to them. The
solution to Problem II was not as obvious, so that although a
similar form of suppression probably occurred the difference between

verbalizers and non-verbalizers was not as great.

This explanation is supported in part by Miller, Galanter and
Pribram's (16) concept of the Image of a problem and a Plan for
solving it. Once a problem is solved one's Image of it may change and,
retrospectively, the Plan used for solving it may be more congruent
with the post-solution Image than with the pre-solution one. Similarly
the explanation may be further supported by Newell and Simon's
concept of a problem space. While working through the problem the
subject may be operating in one space, whereas after solving the
problem, the space he constructs for that problem may be different.

One component of the problem spaces -- pre- and post-solution --
is storage of information. Retrospectively it is possible that a
greater number of cues appeared to belong in the Pernicious anemia
category, therefore no new hypothesis had to be generated to accommodate
them. Thus, retrospectively the number of hypotheses needed was fewer
than during solution of the problem.

The solution to Problem I is not as clear cut as those for
Problems II and III. For this reason the subjects' Image of the problem
may not have changed perceptibly after solution. This may have led them
to generate more hypotheses retrospectively than during solution.

This still does not explain the difference in pattern between "early"
and control groups. The interaction of instructions and verbalization
on Problem I is not significant. A more significant interaction should
be obtained before lengthy steps are taken to explain this phenomenon.

The facts that subjects in the "late" group generated fewer
hypotheses during verbalization than they did retrospectively as

well as generating fewer hypotheses during verbalization than the "early"


group may both be due to the effect of the "late" instructions.

This group (see Appendix A, L-V and L-NV instructions) was asked to
gather enough data to do a good workup and group the data together
into formulations. They were discouraged from leaping to conclusions
based on small amounts of data. Subjects may have followed these
instructions, not by generating the first hypothesis later in the
problem, but by generating fewer hypotheses. Further, during
verbalization they gave more evidence of following the instructions
than retrospectively.

It is difficult to conceive of a reason why subjects in the "late"
group generated more hypotheses retrospectively than subjects in the
"early" group. Why, under any circumstances, would the "late" group,
who had been encouraged to be conservative, generate more hypotheses
than the "early" group, who had been encouraged to speculate? This
phenomenon is particularly noticeable on Problem III. It was noted
that subjects in the "early" group on the average generated their
final hypothesis during retrospection earlier than those in the "late"
group. Specifically the "early" non-verbalizers on the average had
produced the solution hypothesis by the time they were almost 50% into
the workup, whereas the "late" non-verbalizers did not produce that
hypothesis until they were almost 63% into the workup. It may be,
then, that the "late" instructions led subjects to do a more thorough
evaluation in retrospect and generate more hypotheses, whereas
subjects in the "early" group allowed their knowledge of the final
solution to influence their retrospective hypothesis generation, causing
them to generate fewer hypotheses.

In summary, the interactions of instructions and verbalization,
noticeable only in Problems II and III, are probably due in part to the
structure of these problems as well as to subjects' interpretations
of the instructions combined with a changed perception of the problem
after it has been solved.

Summary And Discussion Of Early/Late Hypothesis Generation

 

It was noted earlier that except for Problem I, subjects in
the "early" generation group behaved similarly to the control group
regarding the number of hypotheses generated during verbalization or
retrospectively. The statement was made earlier (p. 56) that telling
subjects to generate hypotheses early encourages them to do what
they would do anyway. This conclusion is further borne out by the
results of the effects of instructions on early or late hypothesis
generation. The instructions themselves had no apparent differential
effect on early or late hypothesis generation. Thus not only does
telling a subject to generate hypotheses early encourage him to do
what he would do anyway, but also telling him not to generate hypotheses
early does not discourage him from doing so. Early hypothesis generation
and therefore hypothetico-deductive thinking seem to be an integral
part of any diagnostic problem solving strategy. This conclusion
derived from the data obtained is so obvious as to seem trivial. To
the layman and to modern philosophers of science such as Kessel (7),
Popper (10), and Medawar (11), a hypothetico-deductive approach to
the solution of scientific problems seems essential. In the medical
domain, physicians such as Price and Vlahcevic (21), as well as
investigators such as Elstein et al. (6) and teachers of medical students
such as Morgan and Engel (5), contend that to solve a diagnostic
problem hypotheses are generated and evaluated throughout the problem.
Although this point is blatantly obvious to some, the present investigator
included, it is not always obvious to others. One of those
who does not espouse this viewpoint is Lawrence Weed and, by association,
those who teach the Problem-Oriented Record (POR) using Weed's
approach. This is not a criticism of the POR. As a record keeping
device it is excellent, and when implemented in a health care system
it provides the uniformity and conciseness missing in non-problem-
oriented systems. Further, it allows data to be gathered from
patients by non-physicians thus saving the physician time and effort.

If the present findings and those discussed above are indeed
true, use of the POR as a device for organizing thought about
diagnostic problems goes counter to the problem solver's intuitive
approach, which starts with the generation of diagnostic hypotheses.
Further investigation of the POR and its role in the training of
medical students is necessary before this issue can be resolved.

The finding that Problem I produced significantly more late
generations again points up the possibility of problem-specific
differences. Earlier, in the Instructions X Verbalization interaction
results, Problem I stood out as having a different pattern than
Problems II and III. The reasons behind this may be in a number of
areas. One is the structure of the experiment. Problem I was the
first problem subjects did and without instructions they were
hesitant to generate hypotheses early since many had been trained not
to. Another reason may be the structure of the simulation. The
introduction to Problem I (see Appendix B, Problem I) presented a
comparatively small amount of information proportional to the total
amount of data needed to solve the problem. Some sort of critical
mass concept may be at work here whereby certain subjects needed
more data than was presented at the outset to generate their first
hypothesis. This critical mass may have been exceeded in the other
two problems. Lastly the structure and/or content of the problem
may have induced more subjects to put off generating their first
hypothesis until more data had been obtained.

A combination of the above two explanations is plausible. The
information presented at the outset is non-definitive enough that the
patient's complaints could stem from any number of sources. The
problem is presented as an emergency, yet the emergent nature of the
situation is not immediately obvious. No subset of elements fits
together immediately to suggest a hypothesis that one might pursue.
These aspects of Problem I can be contrasted with the introduction to
Problem II (see Appendix B, Problem II) for example in which a great
deal of information is presented. The subject's primary task is to
obtain laboratory tests to evaluate his hypotheses. Problem III (see
Appendix B) also contrasts with Problem I in that the introduction to
the former presents a female patient with the types of complaints
which not infrequently occur in female patients of her age. Thus
the range of possible initial hypotheses on Problem III is probably
less than that suggested by the introduction to Problem I.

To use Newell and Simon's terminology, it may be more difficult
to establish a problem space based on the introduction to Problem I.
The concept of problem space has not been defined for medical problems
as yet. It is felt that medical problem spaces at the very least
consist of classes of hypotheses, aggregations of cues, and heuristics
for relating cues to hypotheses. Based on the introduction to Problem I
it may be impossible for some subjects to establish this space in
which to operate. Further investigation is needed into the nature of
the problem space for various problems, and the amount of data that is
needed to establish a problem space for different types of problems.

Results Of Performance Hypotheses

 

As could be predicted from the results of the previous hypotheses
discussed (Hypotheses 5, 6 and 7), it was only possible to evaluate
the effects of Early vs. Late and Many vs. Few hypotheses for
Problem I. Attempts to make these evaluations for Problems II and III
resulted in at least one cell with an N of less than 2.
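
The regrouping just described can be sketched in a few lines of Python. This is only an illustration: the subject records below are invented, and the early/late label is treated as an already-coded input because its operational definition is given elsewhere in the thesis. Only the cutoff of 10 hypotheses and the requirement of at least 2 subjects per cell come from the text.

```python
from collections import Counter

MANY_CUTOFF = 10  # from the text: many = 10 or more hypotheses, few = fewer than 10
MIN_CELL_N = 2    # from the text: the analysis was dropped if any cell had N < 2


def cell_counts(subjects):
    """Count subjects in each (timing, number) cell of the 2 x 2 design.

    `subjects` is a list of (timing, n_hypotheses) pairs, where timing is
    "early" or "late" (assumed to be coded already).
    """
    counts = Counter({(t, n): 0 for t in ("early", "late") for n in ("few", "many")})
    for timing, n_hyp in subjects:
        number = "many" if n_hyp >= MANY_CUTOFF else "few"
        counts[(timing, number)] += 1
    return counts


def analyzable(counts):
    """True only if every cell holds at least MIN_CELL_N subjects."""
    return all(n >= MIN_CELL_N for n in counts.values())


# Hypothetical data, not taken from the study.
example = [("early", 12), ("early", 7), ("early", 9), ("early", 15),
           ("late", 11), ("late", 14), ("late", 4)]
counts = cell_counts(example)
print(dict(counts), analyzable(counts))
```

In this invented example the late/few cell holds a single subject, so the design would be set aside, which is the situation reported for Problems II and III.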

Table 4 shows the results of a multivariate analysis of variance
on the performance variables. No overall significant differences were
obtained for early or late hypothesis generation, or for the generation
of many (10 or more) or few (fewer than 10) hypotheses. Thus Hypotheses
11 and 12 must be rejected. Further, there was no interaction effect.
Univariate analysis showed a significant effect for many vs. few
hypotheses with subjects generating many hypotheses being more thorough
than subjects generating few hypotheses. The interaction between
Early/Late and Many/Few was disordinal and significant. This interaction
is plotted in Figure 5. Subjects who generated the first hypothesis
early and used few hypotheses were somewhat less thorough than those
who started early and used many hypotheses. Those who generated the
first hypothesis late and used few hypotheses were considerably less
thorough than those who started late and used many hypotheses. Those
who started early and used few hypotheses were considerably more
thorough than those who started late and used few hypotheses.

Table 4
Manova of Effects of Many/Few and Early/Late Hypothesis Generation
on Three Outcome Variables, Problem I (Acute Abdomen)

                                  Thoroughness   Efficiency   Accuracy
Cell Means
  Few (<10) Hyp
    Early                             37.1          66.2         3.0
    Late                              26.3          66.0         2.5
  Many (≥10) Hyp
    Early                             42.0          63.0         2.9
    Late                              51.3          55.6         3.1

MANOVA
  Effect of Early or Late (F = .98; df = 3, 24; p < .42)
  Effect of Number of Hypotheses (F = 1.99; df = 3, 24; p < .14)
    Univariate
      F (df = 1, 26)                  6.19          3.15         .21
      p less than                      .02           .09         .65
  Interaction (F = 1.8; df = 3, 24; p < .17)
    Univariate
      F (df = 1, 26)                  4.07           .91        1.45
      p less than                      .05           .35         .24

Figure 5: Many/Few X Early/Late interaction, Problem I (vertical axis: Thoroughness; horizontal axis: Early, Late; separate lines for Many and Few)
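
As a check on the description of this interaction, the simple effects can be computed directly from the thoroughness cell means reported in Table 4. The sketch below is plain arithmetic on those four means; the function and variable names are illustrative only, and no significance test is attempted.

```python
# Thoroughness cell means for Problem I, copied from Table 4.
means = {
    ("few", "early"): 37.1,
    ("few", "late"): 26.3,
    ("many", "early"): 42.0,
    ("many", "late"): 51.3,
}


def simple_effect_of_timing(number):
    """Early minus late thoroughness within one level of number of hypotheses."""
    return means[(number, "early")] - means[(number, "late")]


few_effect = simple_effect_of_timing("few")    # positive: early starters more thorough
many_effect = simple_effect_of_timing("many")  # negative: late starters more thorough

# The early/late effect is disordinal when it reverses sign across Many and Few.
is_disordinal = few_effect * many_effect < 0

# Interaction contrast on the cell means (difference of the two simple effects).
contrast = few_effect - many_effect

print(round(few_effect, 1), round(many_effect, 1), is_disordinal, round(contrast, 1))
```

The early/late difference is about +10.8 points among subjects using few hypotheses and about -9.3 points among those using many, so the reversal lies in the timing effect; the Many means exceed the Few means at both timings.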

Summary And Discussion Of Performance Results

The results obtained from the analysis of performance hypotheses
are inconclusive. They show generally that thoroughness correlates
with number of hypotheses and that there is some kind of combined
effect on thoroughness of earliness of hypothesis generation and
number of hypotheses used.

It was hoped that first of all there would be enough subjects who
generated hypotheses late that this analysis could be done for all
three problems. This was not possible. Secondly, it was hoped that the
analysis that was done would show some significant differences on the
Early/Late dimension. Since early hypothesis generation plays such an
important role in diagnostic problem solving, one might think that those
subjects who do not generate hypotheses early would go about problem
solving differently than those who do generate early hypotheses.
Although no effect of early hypothesis generation was observed for any
of the outcome variables there was an interaction effect. The possible
behavioral implications of this are discussed below. The results of
the Many/Few dimension and the interaction were somewhat more
rewarding. Subjects who generated many hypotheses were more
thorough on Problem I. Inspection of the correlation matrix for
these data reveals that this phenomenon is somewhat uniform across
problems. Number of hypotheses correlates with thoroughness .41 on
Problem II and .18 on Problem III.

Generating the first hypothesis early results in a similar
amount of thoroughness whether few or many hypotheses are used. However
generating the first hypotheses late results in considerably more
thoroughness when many hypotheses are used and considerably less
thoroughness when few hypotheses are used. Further evaluation of
this interaction showed that only 3 of the 10 subjects who generated
the first hypothesis late used fewer than 10 hypotheses. Seven subjects
who started late also used many hypotheses. This, it is felt, further
supports the earlier comment about Problem I (p. 63). This seems
to be a problem for which it is difficult to establish a problem space,
which leads to a large number of late first hypothesis generations.
Subjects who do have trouble establishing a problem space seem to
find this problem difficult to work through. For this reason 7 of
the 10 late generators use a comparatively large number of hypotheses
and do comparatively thorough workups.

It would be most interesting to know whether late generation
combined with the generation of a large number of hypotheses and a
thorough workup is a problem specific phenomenon. It would also be
most interesting to know whether the fact that the three subjects
who started late also used few hypotheses and were noticeably less
thorough is a chance occurrence. If not, one might have to investigate
two types of late starters: those who have trouble establishing a
problem space on this type of problem and continue to have trouble
clarifying it, and those who have trouble establishing one, but once it
is established can arrive at a solution with fewer hypotheses and a
less thorough workup than those who were able to establish a problem
space early. Under those circumstances problem type by subject
interactions could be rather involved.

Results And Discussion Of Problem By Problem Analysis

 

Having observed a number of apparent problem-specific differences,
a problem-by-problem multivariate analysis of variance was deemed
appropriate. Although this procedure could not clarify problem-specific
differences in accuracy (Problem III appeared to stand out),
it might clarify problem-specific differences on the other dependent
variables. The results of this analysis reaffirmed some of the
results found earlier but, except in one possible case did not reveal
any overall problem specific differences. As is shown in Table 5,
the interaction between instructions and verbalization helped to
generally differentiate between groups on Problem I.

There were two significant univariate F values:

Effect of verbalization on Problem III hypotheses
Univariate F = 4.61; df = 1, 24; p < .04

Interaction effect on Problem III hypotheses
Univariate F = 3.71; df = 2, 24; p < .04

These results tend to confirm some of those found earlier. The finding
of differences between groups on Problem III hypotheses due to the
effect of verbalization and due to an interaction of verbalization and
instructions has been reported and discussed. It would appear that this
may not be a problem-specific phenomenon since, when all three problems
were analyzed as a group, there was an overall significant effect for
interaction, whereas when Problem III was separated out, the overall
effect disappeared. In the analysis of the combined problems,
Problem II probably made a contribution to the interaction effect on
the hypotheses variable.

Table 5
Problem-by-Problem Manova on Dependent Measures (overall effects of hypothesis generation instructions, verbalization, and their interaction for Problems I, II, and III)

The finding of an interaction effect on Problem I is a new
occurrence. Inspection of a step down F shows that once the variance
contributed by hypotheses is removed, thoroughness is still significant.
Thus the differences in number of hypotheses generated and the
differences in thoroughness are making primary contributions to the
interaction. These two interactions are plotted in Figures 2 and 6.
Figures 7 and 8 show the contributions of efficiency and accuracy to
that interaction. It is felt that before going to great lengths to
explain the interaction, it should be further investigated to assure
that it did not occur by chance and to ascertain exactly what factors
might be at work here.

Results And Discussion Of Process Analysis

 

As was stated earlier, both a statistical and a clinical analysis
were done on the results of this study. This clinical or process
analysis, as will be seen below, is not of the type that tries to
describe the sequential strategy used by certain subjects to solve a
problem, such as Simon and Newell (17), for example, have done for
their cryptarithmetic tasks. The process analysis done here was
simply an effort to extract certain elements from the students'
performances after the fact, and to try to make inferences from these
about the differences between high scorers and low scorers.

Figure 6: Instructions X Verbalization interaction, Problem I (vertical axis: Thoroughness; horizontal axis: hypothesis generation instructions E, L, C; separate lines for Verb. and No Verb.)

Figure 7: Instructions X Verbalization interaction, Problem I (horizontal axis: hypothesis generation instructions E, L, C; separate lines for Verb. and No Verb.)

Figure 8: Instructions X Verbalization interaction, Problem I (horizontal axis: hypothesis generation instructions E, L, C; separate lines for Verb. and No Verb.)

The process analysis was felt to be highly justified since the
statistical analysis revealed certain possible problem-specific
differences but did not clarify them particularly. Further, the
statistical analysis was not able to discriminate groups on the
accuracy of their performance. Since accuracy is an important aspect
of medical problem solving the process analysis tries to differentiate
between high and low scorers on accuracy. The analysis was done on
Problems I and II only since there were no low scorers on Problem III.
This fact points up a potential problem specific difference. Problem I
had five low scorers, Problem II had 20 low scorers and Problem III
had none. Furthermore, Problem II was the only problem on which any
subject received a score of "0". This may be due to the content and/or
structure of the problem, or it may be the result of more stringent
scoring being unknowingly applied. The criteria for assigning scores,
as discussed earlier, were derived from a criterion developed by experts.
Thus the suggestion that it was somehow a more stringent or less fair
scoring scheme is probably unjustified. The imbalanced distribution
of accuracy on the three problems suggests that before simulations are
more widely used -- especially low fidelity simulations such as these --
scoring schemes should be established which in some way assure that
solutions with parallel degrees of accuracy receive equivalent scores.

A number of new elements were found which seem to point up
differences between those subjects who received high and low scores.
As can be seen in Table 6 these elements deal with the acquisition of
cues which were positive for a high scoring (called generally "correct"
in Table 6) solution. The high scoring solutions are shown in Display I.
The elements also deal with the interpretation of these cues. The
criterion used for determining which cues were positive for the correct
solutions is one developed at Michigan State University (30). Briefly,
the criterion consists of a grid which relates cues to hypotheses.
The cells of the grid contain weights from -3 to +3. A positive
weight means that cue tends to confirm that hypothesis; a negative
weight means that cue tends to disconfirm or rule out that hypothesis;
and a 0 weight means that cue is non-contributory to that hypothesis.
The grids used for the process analysis of Problems I and II are
contained in Appendix 0.

Table 6
Data for Process Analysis, Problem I (Surgical Abdomen)

                                          High Scorers   Low Scorers
                                            (n = 25)       (n = 5)
Cue acquisition
  Thoroughness                                43.7           37.4
  Efficiency                                  61.1           66.4
  % of cues + for correct hyp.                41.8           40.1
  *Avg. # of cues per subject
    + for correct hyp.                        24.7           20.5
Hypothesis generation
  # of hypotheses                             10.5           11.0
  *Avg. # of cues + for correct hyp.
    obtained
      before generation                       16.8 (n=10)     9.2
      after generation                          .2 (n=10)     7.6
  *Avg. # of cues + for solution hyp.
    obtained
      before generation
      after generation
  Early:Late
  Many:Few
Cue utilization
  Avg. percent of cues misinterpreted         25             21

*GI hypothesis only

Process analysis of Problem I

The variables used in the process analysis of this problem are
shown in Table 6. Thoroughness and efficiency are as defined earlier
using the weighting system developed at the University of Illinois.
The percent of cues positive for the correct (high scoring) hypothesis
answers the question: overall, of the diagnostic (as opposed to
intervention) cues obtained by subjects, what percentage was
positive for the correct hypothesis? This element gives a feeling
for how closely the subjects' problem solving centered around cues
positive for a high-scoring solution. It should be noted that
the average numbers of cues (per subject) positive for
correct and other outcome solutions were calculated for the GI hypothesis
only. This figure was calculated on a sample of 10 high scorers in the
one case. In the other case four of the five low scorers gave solutions
of pancreatitis and/or cholecystitis rather than solutions involving
ulcer and complications. The latter average is for the number of cues
positive for those outcome solutions obtained by subjects before and
after generation of the hypothesis.
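
To make the percentage measure concrete, here is a minimal Python sketch of how a cue-by-hypothesis grid could yield it. The cue names, hypothesis names, and weights are invented for the example and are not taken from the Michigan State grid; only the -3 to +3 weight range and the meaning of positive, negative, and zero weights follow the description given for the criterion.

```python
# Hypothetical cue-by-hypothesis grid; weights run from -3 to +3 as in the text.
GRID = {
    "epigastric pain":      {"ulcer": 3, "pancreatitis": 2},
    "elevated blood sugar": {"ulcer": 0, "pancreatitis": 1},
    "normal amylase":       {"ulcer": 0, "pancreatitis": -3},
    "history of antacids":  {"ulcer": 2, "pancreatitis": 0},
}


def weight(cue, hypothesis):
    """Grid weight of a cue for a hypothesis; cues absent from the grid count as 0."""
    return GRID.get(cue, {}).get(hypothesis, 0)


def percent_positive(cues_obtained, hypothesis):
    """Percent of the cues a subject obtained that carry a positive weight."""
    if not cues_obtained:
        return 0.0
    positive = sum(1 for cue in cues_obtained if weight(cue, hypothesis) > 0)
    return 100.0 * positive / len(cues_obtained)


# One hypothetical subject's diagnostic cues, in the order obtained.
subject_cues = ["epigastric pain", "elevated blood sugar",
                "normal amylase", "history of antacids"]
print(percent_positive(subject_cues, "ulcer"))
print(percent_positive(subject_cues, "pancreatitis"))
```

Averaging such percentages over the subjects in a group would give a figure of the kind reported in the "% of cues + for correct hyp." row, under the assumption that the measure was computed per subject and then averaged.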

An earlier discussion centered around the different thoroughness
scores obtained by subjects who generated the first hypothesis late and
who used many or few hypotheses. The discussion proposed that those
who generated the first hypothesis late were having trouble establishing
a problem space for this problem and, since 7 of the 10 subjects also
used many hypotheses, perhaps these subjects continued to have trouble.
It was thought that this possible difficulty establishing a problem
space might have an effect on the accuracy of subjects' performance.
As shown in Table 6, apparently this is not the case. High and low
scorers are divided in similarly unequal proportions on the early/late
dimension, and in similarly equal proportions on the many/few dimension.

Table 6 shows that in general subjects who received high scores
were more thorough and less efficient than those who received low
scores. The statistical analysis bears this out to some degree. A
correlation of .29 was obtained between thoroughness and accuracy, and
a correlation of -.13 was obtained between efficiency and accuracy on
this problem. These correlations are not significantly different from
zero and thus do not point to clear differences between the two groups.
The same is true of the percent of cues positive for a correct
hypothesis and the average number of cues per subject positive for a
correct hypothesis. In sum, the differential ability to arrive at a
correct solution is probably not due to any differential cue acquisition
abilities. A similar conclusion may be drawn about cue utilization.
In fact, high scorers made slightly more errors in their cue interpretation
than did low scorers.
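
The statement above that the correlations of .29 and -.13 do not differ significantly from zero can be checked with the usual t test for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below assumes that all 30 subjects entered each correlation and that a two-tailed test at the .05 level is intended; neither assumption is stated explicitly at this point in the text. The critical value is the tabled t for 28 degrees of freedom.

```python
import math

N_SUBJECTS = 30       # assumed: all thirty subjects entered each correlation
T_CRIT_28_DF = 2.048  # tabled two-tailed .05 critical value of t, df = 28


def t_for_r(r, n):
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


for label, r in [("thoroughness vs. accuracy", 0.29),
                 ("efficiency vs. accuracy", -0.13)]:
    t = t_for_r(r, N_SUBJECTS)
    significant = abs(t) > T_CRIT_28_DF
    print(f"{label}: r = {r:+.2f}, t = {t:.2f}, significant = {significant}")
```

Under these assumptions both t values fall well short of the critical value, which agrees with the conclusion drawn in the text.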

To analyze the elements of hypothesis generation the problem was
separated into its two components: diabetic ketoacidosis and ulcer
with complication. Concerning the diabetic ketoacidosis hypothesis,
low scorers either did not generate a diabetes hypothesis at all (there
was only one case of this) or did not include both diabetes and
acidosis or ketoacidosis in their solution. Three of these subjects
generated either diabetic ketoacidosis, or diabetes and acidosis or
ketoacidosis separately, but did not include both in their solution.
This was probably either due to negligence in recording the solution
or because they did not perceive the fact that diabetes and acidosis
were two parts of a separate solution component and should be combined.
One subject generated the diabetes portion but not the acidosis portion
although he obtained three cues that were strongly positive for it.
The one subject who did not generate a diabetic hypothesis at all
obtained four cues which were strongly positive for diabetes and/or
ketoacidosis. One of these cues was correctly interpreted relative
to another hypothesis (elevated blood sugar was seen as positive for
pancreatitis) but was not used to generate a diabetic hypothesis. This
subject may have been unwilling to deal with the ambiguity of a two
solution problem and was able to interpret cues only relevant to one
type of hypothesis, i.e. GI hypotheses.

Concerning the GI hypothesis, it should first be noted that all
low scorers generated a GI hypothesis which would have received a
score of 3 had it been retained as the final solution. These subjects
apparently dropped that hypothesis before the end of the problem. It
has been suggested by another analysis of this problem (31) that
failure to retain this hypothesis could be due to errors in cue
interpretation. However, low scorers on this problem did not do
particularly poorly at cue interpretation. All but one subject
accurately interpreted at least half of the cues they used. Two of
these subjects had no inaccurate interpretations for a correct
hypothesis. One subject accurately interpreted 10 out of 21 cues and
inaccurately interpreted 7 out of 21 cues for a correct hypothesis.
Generally, then it does not seem that inaccurate cue interpretation
underlies rejection of a correct hypothesis combined with acceptance
of an incorrect one.

It should be observed that because of the structure and content
of this problem not only are there two correct solutions (diabetic
ketoacidosis and a GI solution) but a correct GI solution is complex.
As can be seen from Display I (pp. 139, 140), a score of 3 is given to
solutions which involve 1) ulcer + localization, 2) obstruction +
localization, or 3) ulcer + obstruction. Three low scorers generated
an ulcer hypothesis as well as an obstruction hypothesis but did not
ever combine the two. Failure to do this may in some way be due to
an inability to combine individual elements of a complex hypothesis.
This, however, does not explain why the other two subjects who generated
ulcer + localization hypotheses failed to retain them.

The ability to combine elements of a complex hypothesis and to use
an appropriate hypothesis generation and cue utilization pattern may
contribute to being successful on this problem. As is shown in
Table 6 high scorers did not generate the accurate hypothesis which
they used as their solution until they had obtained an average of 16.8
cues which were positive for that hypothesis. Furthermore, they obtained
an average of only .2 cues positive for that hypothesis after its
generation. This means that most high scorers generated the accurate
hypothesis which they used as their solution very near the end of their
data gathering. In contrast, low scorers generated a correct hypothesis
(or the components of a correct hypothesis) much earlier in their
data gathering. To establish a parallel between judgments about the
generation of components of a correct hypothesis, high scorers'
performances were analyzed to determine by what point they had generated
the essential components of the accurate hypothesis they used as their
solution. It was found that 6 of the 10 subjects did not generate
these essential components until they had obtained all of the cues
positive for that hypothesis. Two subjects generated the essential
components before obtaining any positive cues and then used the
positive cues to refine (increase the specificity of) the hypothesis.
Of the remaining two high scorers, one generated the essential
components after obtaining 7 out of 21 positive cues. His final
solution was a refinement of the essential elements. The other subject
behaved much like the low scorers in that he generated the elements
of the solution after obtaining 6 out of 13 positive cues and did
not refine the solution beyond that point. He differed from the low
scorers in that he retained that solution whereas they did not.

These analyses seem to show that high scorers are able to perceive
that this is a complex problem. They perceive that not only are there
two accurate outcomes but one of them is complex in itself. They tend
to respond then, particularly vis-a-vis the GI hypothesis, by either
waiting until they have gathered a large number of cues before making
a decision about a solution or by generating the general elements of
the solution very early and using data subsequently obtained to refine
the solution. What distinguishes these subjects from the low scorers
is primarily the delay in decision about a correct solution. Rather
than dropping any essential elements generated along the way in favor
of a less complex hypothesis as do low scorers, high scorers retain
essential elements and use data to generate a more complex solution.
A simple analysis of accuracy of cue interpretation does not distinguish
these groups from one another. Therefore the ability to generate
and build on a potentially complex hypothesis may stem from an ability
to deal with ambiguity combined with an ability to accurately use
incoming information to generate accurate alternatives and reject
inaccurate ones.

Subjects' behavior on this problem appears to be a classic case
of the use and misuse of the strategy outlined by Price and
Vlahcevic (21). Low scorers are not able to use cues to rule in an
accurate hypothesis, possibly because they are not able to generate,
retain, and relate the elements of a complex hypothesis in a complex
problem; and they are not able to rule out a less complex but
inaccurate hypothesis. High scorers are able to use cues to establish
and/or refine a complex hypothesis whether they decide early or late
on its components. They are also able to use cues to rule out
inaccurate hypotheses. More generally, in Miller, Galanter, and
Pribram's (16) terms, high scorers seem to have an accurate Image of
the problem and are able to select an appropriate Plan for solving it.
They are aware of the essential elements of the problem as well as
of the solution. They are also aware of the boundary conditions for
its solution, i.e. that it is probably a complex problem and a good
deal of information needs to be gathered before deciding on a solution.
In contrast, poor scorers, although apparently aware of the essential
elements of the problem, may not select a good Plan for its solution
due to a lack of awareness of the boundary conditions.

Process analysis of Problem II
The variables used in the process analysis of this problem are
shown in Table 7. Certain of the variables are self-explanatory.
Those deserving further definition are explained below:
Thoroughness and Efficiency, as defined earlier, using the
University of Illinois weighting system
Percent of cues positive for a correct (high scoring)
hypothesis, defined as the average percent of cues obtained
by subjects which were positive for a correct hypothesis
Hi = subjects receiving high scores
Lo = subjects who received low scores and did not generate
a correct hypothesis
Lo & Drop = subjects who generated a correct hypothesis but
dropped it and received low scores
Table 7 shows that subjects who received high scores differed
from those with low scores primarily in the areas of cue acquisition and
hypothesis generation (the number as well as type of hypotheses
generated). Cue utilization did not appear to differentiate among
these subjects.
Subjects who received high scores were generally more efficient
in their data gathering and particularly more so in their gathering of
cues positive for a correct hypothesis. However, they did not gather
any more positive cues than did subjects with low scores. This
seems to indicate that high scoring subjects were better able to
establish a problem space for this problem. Contrary to the findings
for Problem I discussed earlier, subjects did not differ concerning the
time at which the first hypothesis was generated. Only two subjects
generated the first hypothesis late. One received a high score,
the other received a low score. The establishment of a problem
space for this problem appears to hinge in part on the knowledge of
what cues must be collected to solve the problem and what cues are
not necessary for its solution. High-scoring subjects were better
able to focus their cue acquisition on those cues which were helpful
in arriving at a correct solution.

Table 7
Data for Process Analysis, Problem II (Pale Lethargic Child)

                                  High scorers    Low scorers    Low scorers who
                                    (n=10)          (n=16)       dropped correct
                                                                 hypothesis (n=4)
Cue acquisition
  Thoroughness                       47.6            50.9
  Efficiency                         73.9            59.1
  % of cues + for correct hyp.       41.9            29.0
  Avg. # of cues per subject
    + for correct hyp.                6.4             6.1
                                     (4-8)           (3-8)
Hypothesis generation
  # of hypotheses                     6.7             9.2
                                    (3-11)          (3-15)
  % of workup where
    final hyp. was generated         68.1            50.9
  Avg. # of cues + for
    correct hyp. obtained
      before generation               4.3              .8
      after generation                2.2
Cue utilization
  Number of subjects who         4 (out of 10)   9 (out of 16)   3 (out of 4)
    misinterpreted any cues
  Number of cues misinterpreted       0               1
    for correct hypothesis

The ability to use a focused problem solving strategy is further
borne out by a crude analysis of the types of alternative hypotheses
generated by subjects in the various groups. These hypotheses are
listed in Display II (p. 144). The list is accompanied by the frequency
with which each hypothesis was entertained by each group. One notes
that almost all subjects in all three groups entertained the hypothesis
of sickle cell anemia and a large number of these subjects entertained
the hypothesis of G6PD deficiency. These hypotheses were strongly
suggested by the introduction to the problem. However, there is an
apparent difference between the other types of hypotheses entertained
by high and low scorers. High scorers entertained primarily the
alternative hypotheses of hemolytic anemia, autoimmune anemia, and iron
deficiency anemia. Leukemia was considered by five high scorers.
On the other hand, low scorers primarily entertained alternative

hypotheses of blood loss anemia, some sort of hereditary cell problem,


hypersplenism, bone marrow repression, leukemia, and lead poisoning.
Although there was a good deal of overlap -- for example in the
consideration of leukemia and autoimmune anemia -- there was also a
good deal of divergence. For example, blood loss anemia was entertained
by four low scorers and only one high scorer, hypersplenism by five
low scorers and no high scorers, and lead poisoning by 10 low scorers
and only one high scorer. Furthermore, a greater number and a wider
variety of hypotheses were considered by low than by high scorers.
Clearer definition of the role played by the alternative hypotheses
in diagnostic problem solving is definitely needed before any
conclusions can be drawn about the implications of these hypothesis
generation patterns. For the time being, it appears that subjects who
obtained high scores on this problem were able to focus their cue
acquisition as well as their hypothesis generation strategies better
than low scorers.
Concerning the number of hypotheses generated (see Table 7),
one notes that subjects who received high scores generated fewer
hypotheses than low scorers. The statistical analysis bears this
out. A correlation of -.41 was obtained between number of hypotheses
and accuracy on Problem II. This correlation is significantly
different from zero (p < .05). Not only did high scorers generate
fewer hypotheses than low scorers, they generated their final hypothesis
later and based it on more supportive cues. For 8 out of the 10 high
scorers the last hypothesis generated was a correct one and the high
scorers had obtained an average of 4.3 cues positive for that hypothesis
before generating it. They obtained an average of only 2.2 positive

cues after its generation. This may be contrasted with low scoring


subjects who generated but dropped a correct hypothesis. This group
generated the correct hypothesis quite early (an average of 8% into
the workup) and had very little supportive data (an average of .8 cues)
when they did so.
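
The significance claim for the correlation reported above can be checked by hand with the usual t transformation of a Pearson coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2). What follows is only a minimal sketch in Python. It assumes that all thirty subjects entered the correlation (n = 30 is an assumption; the n for this particular analysis is not stated here). The function name `t_from_r` is introduced purely for illustration, and the critical value is the one given in standard two-tailed t tables for 28 degrees of freedom.

```python
import math


def t_from_r(r: float, n: int) -> float:
    """t statistic for testing H0: rho = 0, given a Pearson r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


r = -0.41            # correlation reported in the text
n = 30               # ASSUMPTION: every subject contributed to the correlation
T_CRIT_DF28 = 2.048  # two-tailed critical t at alpha = .05, df = 28 (standard table)

t = t_from_r(r, n)
significant = abs(t) > T_CRIT_DF28
print(f"t({n - 2}) = {t:.2f}; exceeds critical value: {significant}")
```

Under these assumptions the obtained t is larger in absolute size than the tabled value, which is consistent with the reported p < .05.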

These results indicate that for this problem, generation of the
correct hypothesis should come rather late in the problem if one is
going to obtain a correct outcome. Although this indication does not
contradict the findings of Elstein et al. (6), their findings may
need some clarification. They noted that subjects generate diagnostic
hypotheses early, but did not differentiate between the generation
of a correct hypothesis and other incorrect alternatives. For this
particular problem, although hypothesis generation occurs early for
almost all subjects, generation of the last and correct hypothesis
occurred later for high scorers than for low scorers.

An attempt was made to determine whether the discovery of a
particular cue triggered the generation of the correct hypothesis and
whether this cue by chance was obtained late in the problem. Attention
was focused on the osmotic fragility test (cue #16), a test which is
supposed to confirm the presence of spherocytosis. Although it appears
that subjects related this cue to the correct hypothesis, one cannot
conclude that obtaining that cue triggered generation of the hypothesis.
In five instances cue #16 was obtained one, two, or three cues before
hypothesis generation, and in five cases this cue was obtained one,
two, or three cues after hypothesis generation.
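
The position of a cue relative to hypothesis generation, as tallied above for cue #16, could be computed from a coded protocol along the following lines. This is a sketch only. The event encoding, the function name `cue_offset`, and the two example protocols are hypothetical and do not reproduce the coding scheme actually used in the study.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: ("cue", "16") for an acquired cue,
# ("hyp", "spherocytosis") for a hypothesis being generated.
Event = Tuple[str, str]


def cue_offset(protocol: List[Event], cue_id: str, hypothesis: str) -> Optional[int]:
    """Signed distance, counted in cues, between a cue and first generation of a hypothesis.

    -1 means the cue was the last one obtained before the hypothesis was generated,
    +1 means it was the first one obtained afterwards. None if either never occurs.
    """
    cues_seen = 0
    cue_pos = None   # ordinal position of the target cue among all cues
    hyp_pos = None   # number of cues already obtained when the hypothesis appeared
    for kind, label in protocol:
        if kind == "cue":
            cues_seen += 1
            if label == cue_id and cue_pos is None:
                cue_pos = cues_seen
        elif kind == "hyp" and label == hypothesis and hyp_pos is None:
            hyp_pos = cues_seen
    if cue_pos is None or hyp_pos is None:
        return None
    if cue_pos <= hyp_pos:
        return -(hyp_pos - cue_pos + 1)
    return cue_pos - hyp_pos


# Made-up protocols for illustration only.
before_example = [("cue", "3"), ("cue", "16"), ("cue", "7"),
                  ("hyp", "spherocytosis"), ("cue", "9")]
after_example = [("cue", "3"), ("hyp", "spherocytosis"), ("cue", "16")]

print(cue_offset(before_example, "16", "spherocytosis"))
print(cue_offset(after_example, "16", "spherocytosis"))
```

A roughly even split between negative and positive offsets across subjects, as was found here, gives no support to the idea that the cue triggers the hypothesis.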

In conclusion, these results show that for this problem, subjects

who obtain high accuracy scores are able to use a focused problem


solving strategy in the sense that they are apparently able to sense the
problem and use a small number of steps (hypotheses as well as cues) to
arrive at the correct solution. Although they generate hypotheses
early, they need fewer and a smaller variety of hypotheses than low
scorers. In addition the high scorers are able to gather a higher
proportion of cues positive for the correct hypothesis than low

scorers.

An additional somewhat anomalous aspect of the high scorers'
strategy is that they do not generate the final and accurate solution
to the problem until rather late, and until they have obtained a
reasonably large number of cues positive for that hypothesis. But
no one single cue seems to trigger generation of the hypothesis. One
cannot conclude from this, however, that these subjects are using a
strategy which involves the elimination of all alternative hypotheses
before generation of the accurate one. The subjects' patterns of
positive and negative cue associations indicate a strategy which
combines elimination and confirmation of hypotheses.

In contrast, some subjects who receive low scores are not able to
generate an accurate hypothesis even though they 1) generate more and
a greater variety of alternatives than high scorers, 2) are generally
less efficient than high scorers, and 3) obtain the same number of cues
positive for the correct hypothesis as high scorers. Other low scorers
are able to generate a correct hypothesis but do not retain it. The
latter subjects generate the correct hypothesis early in the problem
having obtained, on average, fewer than one cue which is positive for it.

The concept of "focused strategy" is as yet only grossly defined

in the diagnostic problem solving context. It differs from the


conservative focusing strategy discussed by Bruner and colleagues (15)
in that it involves both the generation of alternative hypotheses and
the acquisition of cues. This concept seems to fit more closely
with Simon and Newell's (17) definition of a problem space and with
Miller, Galanter and Pribram's (16) concept of the Image of a problem.
High scorers seem to have a more closely defined or smaller space for
this problem than low scorers. They seem to be using heuristics for
hypothesis generation and for the relation of cues to hypotheses that
differ from those used by low scorers. Possible strategies have been
suggested such as reasoning from general to specific (Kleinmuntz, 20)
or honing down involving the collection of more and more specific

cues (Dudley, 22). The high scorers' hypothesis generation and cue
association patterns do not suggest that they are using these strategies.
Although a careful analysis of these patterns was not done, the
hypotheses generated by high scorers seem to vary in specificity
throughout the problem. Furthermore, the high scorers' hypothesis
generation and cue association patterns do not differ noticeably from
those used by low scorers. A possible strategy used by high scorers
may involve the elimination of certain hypotheses combined with the
confirmation of others as suggested by Price and Vlahcevic. A more
careful analysis of the subjects' cue association patterns would have

to be done to establish this fact.

Summary and Discussion of Process Analyses

The process analyses conducted on Problems I and II were able to
identify aspects of diagnostic problem solving which differentiated

between subjects who received high scores and those who received low


scores. These aspects were at once similar and different for each
problem. Relevant to Problem I were differentiating elements such as the
ability to perceive and deal with a complex problem, and to delay
a final decision about a solution until a comparatively large number
of cues had been gathered. Elements which differentiated high

scorers from low scorers on Problem II were related to the ability

to delineate a closely circumscribed problem space and to put off
deciding on a solution until a large amount of information positive for
that solution had been acquired.

Two brief points should be made relevant to these process
analyses. First, although it is important to know about elements of
problem solving which differentiate high scorers, one cannot infer
that people who possess these abilities will do well on problems
for which they are appropriate. Much investigation is needed before
conclusions about the abilities of subjects who score high on these
problems can be turned into conclusions about the performance of
subjects who possess these abilities.

Secondly, although it has not been mentioned up to now, the
role played by knowledge and experience in solving these problems
cannot be overlooked. Thus far it has been suggested that subjects
did well or poorly on these problems because of certain general, though
ill-defined abilities they possessed. These apparent abilities may
be the result of increased knowledge of that type of problem and/or
of that content area. Similarly, subjects who received high scores may
have had more and/or more recent experience with that type of problem.

Again, before conclusions are made about the performance of subjects


who possess some of the abilities discussed here, the relation of

these abilities to knowledge and experience must be clarified.

CHAPTER V
SUMMARY, CONCLUSIONS, AND IMPLICATIONS

Fourth year medical students completed three modified Patient
Management Problems. Subjects were randomly assigned to six groups
in a two by three design. On one dimension of the design, instructions
concerning hypothesis generation were manipulated. Subjects were
encouraged to generate diagnostic hypotheses early in the problem,
to withhold judgment about diagnostic hypotheses, or were given no
instructions about hypothesis generation. On the other dimension
subjects were either constrained to think aloud during problem
solving or were asked to discuss the problem after they had solved it.
Their performance on each problem was scored for thoroughness and
efficiency of cue acquisition as well as for accuracy of outcome,
and number of hypotheses generated.
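
The random assignment described above can be illustrated with a short sketch. It is not the procedure actually followed in the study. It assumes thirty subjects (as stated in the abstract) divided equally among the six cells, five to a cell. The equal cell size, the level labels, and the function name `assign_subjects` are all assumptions made for the example.

```python
import random
from itertools import product
from typing import Dict, Iterable, List, Tuple

# Labels are illustrative stand-ins for the two dimensions of the design.
INSTRUCTION_LEVELS = ("generate_early", "withhold_judgment", "no_instruction")
VERBALIZATION_LEVELS = ("think_aloud", "discuss_afterward")


def assign_subjects(subject_ids: Iterable[int],
                    seed: int = 0) -> Dict[Tuple[str, str], List[int]]:
    """Shuffle the subjects and deal them evenly into the 2 x 3 cells."""
    cells = list(product(VERBALIZATION_LEVELS, INSTRUCTION_LEVELS))
    shuffled = list(subject_ids)
    if len(shuffled) % len(cells) != 0:
        raise ValueError("subjects cannot be divided equally among the cells")
    random.Random(seed).shuffle(shuffled)  # seeded so the example is repeatable
    per_cell = len(shuffled) // len(cells)
    return {cell: shuffled[i * per_cell:(i + 1) * per_cell]
            for i, cell in enumerate(cells)}


# ASSUMPTION: thirty subjects, five per cell.
groups = assign_subjects(range(1, 31), seed=1973)
for cell, members in groups.items():
    print(cell, sorted(members))
```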

Before analyzing the results of the study, efforts were made to
estimate the internal consistency and concurrent validity of the
modified Patient Management Problems. These efforts revealed that two
of the three problems were internally consistent and one (Pale
Lethargic Child) was not. Furthermore, students' thoroughness and
efficiency scores on one of the problems more closely resembled
physicians' scores on the original PMP than those on the high

fidelity simulation. Students' scores on the other problem showed the


opposite trend, resembling more closely physicians' scores on the
high fidelity simulation than their scores on the original PMP.

After the effect of instructions on number of hypotheses, thoroughness,
efficiency, and accuracy had been determined, each subject was
reassigned in a two by two matrix according to whether he had
generated comparatively many or few hypotheses, and whether he had
generated his first hypothesis comparatively early or late in the
problem. The effect of these two variables on thoroughness, efficiency
and accuracy was analyzed. Lastly, a process analysis, relating certain
aspects of performance to outcome, was done on two problems.
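
The reassignment of subjects to the two by two matrix can be sketched as follows. The cutoff of ten hypotheses for "many" is taken from the abstract. The cutoff separating early from late generation is not given in this section, so `EARLY_CUTOFF` below is a purely hypothetical value, as are the function names and the made-up subject records.

```python
from collections import Counter
from typing import Iterable, Tuple

MANY_THRESHOLD = 10   # abstract: many = 10 or more hypotheses, few = fewer than 10
EARLY_CUTOFF = 0.25   # HYPOTHETICAL: fraction of the workup that still counts as "early"


def cell(n_hypotheses: int, first_hyp_at: float) -> Tuple[str, str]:
    """Classify one subject on both dimensions of the two by two matrix."""
    quantity = "many" if n_hypotheses >= MANY_THRESHOLD else "few"
    timing = "early" if first_hyp_at <= EARLY_CUTOFF else "late"
    return quantity, timing


def tabulate(subjects: Iterable[Tuple[int, float]]) -> Counter:
    """Count how many subjects fall into each cell."""
    return Counter(cell(n, at) for n, at in subjects)


# Made-up records: (number of hypotheses, fraction of workup at first hypothesis).
subjects = [(12, 0.05), (7, 0.10), (10, 0.40), (4, 0.60), (9, 0.00)]
counts = tabulate(subjects)
for key in sorted(counts):
    print(key, counts[key])
```

Because cell membership is determined by the subjects' own behavior rather than by random assignment, cell sizes produced in this way will generally be unequal.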

Results were as follows:

1. Instructions had no effect on thoroughness, efficiency,
   accuracy, or number of hypotheses generated.

2. For one problem the constraint to verbalize prompted subjects
   to produce significantly (p < .04) more hypotheses than
   the absence of that constraint.

3. For that same problem the interaction of instructions and
   constraint to verbalize had a significant effect on the
   number of hypotheses generated (p < .04).

4. Regardless of instructions, early hypothesis generation
   occurred 85% of the time.

5. One problem prompted significantly (p < .05) more late
   hypothesis generations than did the other two.

6. On one problem subjects who generated comparatively many
   hypotheses were significantly more thorough in their
   acquisition of cues than those who generated comparatively
   few hypotheses (p < .02).

7. For this same problem the interaction between generating
   comparatively many or few hypotheses and generating the
   first hypothesis early or late had a significant effect on
   thoroughness of cue acquisition.

8. Obtaining a high score on one problem (Acute Abdomen) is
   associated with the ability to generate and retain the
   elements of a complex solution.

9. Obtaining a high score on another problem (Pale
   Lethargic Child) is associated with the ability to conduct
   a focused inquiry by gathering a small number of high
   yield cues, and generating a relatively small number of
   diagnostic hypotheses.

10. On both the Pale Lethargic Child problem and the Acute
    Abdomen problem obtaining a high score is associated with
    postponing generation of the solution until a relatively
    large number of cues positive for it have been gathered.

 

Conclusions

1. Concerning the Patient Management Problems as an instrument:
Due to the variability of performance across problems
(discussed in conclusion #3), the Patient Management

Problems should not be judged as a general technique. Each
problem presents a different situation to the problem solver.
At this time it appears that success on one Patient Management
Problem (PMP) is determined by different factors than is
success on another PMP. For that reason judgments about

the validity and reliability of PMP's should be made about
individual problems rather than about PMP's as representing
a general measurement technique.

a. Validity of PMP's: The two most recent estimates of the
validity of PMP's are that done for the present study and
that done by Goran et al. (41). The present study found
that physicians performed differently on the two PMP's
which were analyzed than on the two high-fidelity
simulations with which they were compared. The differences
may have occurred because the PMP situations and the
high-fidelity situations were not analogous to each other.
The study done by Goran et al. (41) also found that
physicians and fourth year medical students performed
differently on a real patient (observed via chart audit)
than they did on an analogous PMP. These differences

may be due in part to the lack of thoroughness of the
charts which were audited.

Studies of validity of PMP's have thus far demonstrated
that the validity of these tests is difficult to ascertain
and that based on the results of these experiments the
validity of the PMP's studied is questionable.

b. Reliability of PMP's: The only forms of reliability which
have been estimated for PMP's are the consistency of
scoring systems and the internal consistency of problems.
The latter was estimated in the present study. Although
the problems sampled appear to be internally consistent,
this measure of reliability is weak. A measure of test-retest
reliability or of generalizability of responses is
needed before any conclusions can be drawn about the
reliability of any PMP.

2. Concerning the methodology of the present study:

a. Instructions to the subject can play a number of roles in
an experiment (see Sutcliffe [45] for a discussion

of the role of instructions in psychological experiments).
Instructions are usually used to simply acquaint the subject
with his task. The clearer these instructions are, the
more predictable is the effect they will have on the
outcome of the experiment. The instructions concerning
hypothesis generation used in this study went beyond
subject orientation to exhort the subject to approach

a problem in a certain way. The specific effect of these
instructions is not known. The results of the study do
indicate that the instructions had no measurable effect
on any of the dependent variables. It can be concluded
from these results that an experimenter should never
assume that a certain set of exhortations is going to
have the desired effect on a group of subjects.
b. Verbalization during problem solving provides the
experimenter with varied and reliable information
about a subject's thought processes without apparently
interfering with those processes. On the other hand, at
least for certain problems, verbalization about thought
processes after the problem has been solved can introduce

retrospective distortion into the experiment. For some
problems the information about processes given

retrospectively may be unreliable.

3. Concerning the results of the study:

a. Early hypothesis generation is an integral part of
diagnostic problem solving. Moreover, how early in a
problem hypotheses are generated is probably determined
more by the problem and the way in which it is perceived
by the problem solver than it is by training or

exhortations from his teachers. Although early hypothesis

generation has no apparent relation to the ability to

gather data about a case or to the ability to accurately
solve a problem it is an essential element of clinical
inquiry.

b. One of the strongest influences on how a subject approaches
a diagnostic problem is the nature of the problem itself.
The amount of information presented at the outset, whether
or not the situation is emergent, and how complex the
solution is are only a few of the elements which combine
to influence a subject's performance. There is great
variability across the three problems used in this study.
One presents a large amount of information at the outset
and leaves few options for the subject to choose;

the other two present less initial information and contain
more options. One has a complex solution, the other two
have simple solutions (i.e. the patient has only one

disease). Of the last two, one is apparently quite easy
since all subjects arrive at the correct solution, the
other apparently quite difficult since the fewest subjects
solve it. Lastly, different behaviors seem to characterize
successful solvers of one problem than characterize
successful solvers of another. Diagnostic problems are
often differentiated based on organ system and clinical
specialty area. These two factors are not sufficient to
categorize problems. The other elements discussed above
play an equally if not more important role in the
characterization of clinical problems than do organ
system and clinical specialty.

c. Although no conclusions can be drawn about the kinds of
abilities which lead to success on Patient Management
Problems, tentative conclusions can be made concerning
certain abilities associated with success (arriving at an
accurate solution) on the two problems which were clinically
analyzed. In both cases successful problem solvers did not
generate the hypothesis they used as a solution until
near the end of the problem after a large number of cues
had been acquired. This seems to imply that being able to
delay arriving at a decision about a solution until a
large number of cues are known, and being able to rule
out other hypotheses generated early in the problem may

be helpful in reaching an accurate solution.

Implications

 

1. Concerning the Patient Management Problems as an instrument:


Patient Management Problems, particularly the more complex
variety such as the Acute Abdomen and the Pale Lethargic
Child are an attractive way of simulating the clinical
setting. They involve sequential information gathering
such as is practiced when working up a patient and they
give the problem solver the opportunity to use thought
processes similar to those he would use in working up a
real patient. However, it has been demonstrated on a
number of occasions that problem solvers perform
differently on Patient Management Problems than they do on
real patients or even on higher fidelity clinical
simulations. Until this discrepancy is corrected, use
of Patient Management Problems as a replacement for a
clinical encounter is unacceptable. In this writer's
opinion, widespread use of Patient Management Problems for
the evaluation of general clinical competence is
inappropriate until the validity and reliability of the
problems are more clearly established.

The reliability of the problems could be established

using a test-retest technique or by clearly establishing
the universe of skills to which performance on a PMP can
be generalized. The former approach is straightforward
if done in the appropriate context. The two problems used
must be essentially identical since there are so many
hidden variables which can cause apparently parallel
problems to differ. Time lag between the administration

of the two problems must be great enough so that performance


on the one will minimally influence performance on the other.
Further, the subjects used for the test should be at a
stage in their career where a large amount of learning

is no longer taking place. In this manner the time

lag between tests will not be accompanied by a

noticeable change in the subject caused by new experiences
and acquisition of large amounts of new knowledge. The
establishment of a universe of generalizability is a

more complex task which is being dealt with elsewhere (46)

and will not be pursued here.

2. Concerning the methodology:

a. The ineffectiveness of instructions as they were used in
the present study should serve as a caveat for future
investigators of problem solving process. Instructions
should be clear. If possible they should go no further
than to acquaint the student with his task. It is

very difficult to ascertain a subject's response to
exhortations. Instructions should not require a subject
to do something in a given situation which conflicts with
what he usually does in that situation unless a specific
goal of the experiment is to create such conflict. In
the present study, for example, instructions to withhold
judgment in hypothesis generation conflicted with the
subjects' apparent natural propensity to generate
hypotheses early. Although the conflict did not seem to
hamper the subjects' performance it did render

instructions useless as an experimental variable. Lastly,


subjects should be constrained to give a behavioral
manifestation of the fact that they are following
instructions. In the present study the verbalization
instructions were successful at least in part because
subjects were not only instructed but constrained to
verbalize during or after problem solving.
b. Verbalization, particularly thinking aloud, has a bright
future as an experimental as well as training tool.
Experimentally, thinking aloud can give the experimenter
information about the logic and sequence with which the
problem is progressing for the subject. Thinking aloud
can also help the experimenter to identify factors

which may influence problem solving. In the medical
context what hypotheses a subject is entertaining at any
one time; how new cues are related to those hypotheses;
what goal led the subject to seek a specific item of
information; and what, if any, informal decision rules
the subject is using are all items of information which
can be obtained by asking the subject to verbalize during
or after problem solving. The present writer feels
that verbalization during problem solving, i.e. thinking
aloud, not only yields more reliable information about
the subjects' thought processes but does not noticeably
interfere with those processes.

The use of thinking aloud as a training tool allows the

teacher or other members of a peer instruction group to


monitor the problem solver's progress. Information from
the problem solver about his progress can help the
monitor(s) to give supportive or corrective feedback
where appropriate. This especially helps the problem
solver avoid compounding his errors by constantly

recycling him onto the main track of the problem.

3. Concerning the results of the study:

a. The extremely important role that early hypothesis
generation plays in diagnostic problem solving makes it
an essential component of medical training and evaluation
programs. Traditionally, early hypothesis generation is
not emphasized in medical training. As stated earlier
(pages 4, 5), certain training and record keeping
procedures (specifically Weed's Problem Oriented Record
[3, 4]) discourage that activity. Others (e.g. Morgan
and Engel [5]) do not discourage that activity but
do not advocate that the acquisition of skill in early
hypothesis generation be an important aspect of a
physician's training. It may be that this activity is

so fundamental to any kind of problem solving that it
need not be specifically taught to medical students.
Whether or not specific training in that skill is needed,
a medical student's possession of that skill should be
assured before he takes responsibility for the care of
patients. The possession of the skill should be included

in the objectives of any medical curriculum and all

students should be evaluated on it.


b. The fundamental role of early hypothesis generation in
medical problem solving has implications for use of the
Problem Oriented Record (POR) as a training and record
keeping tool. It will be some time before the technology
is available for the automation of patient data
acquisition and storage. For this reason among others,
data gathering and data interpretation skills will
continue to be an important part of a physician's
training. If early hypothesis generation is indeed as
usual as it appears to be then the POR format which
discourages this activity might create the type of
conflict described earlier if used as a training tool.

The POR has established its value as a record keeping tool
but how useful is it as a training device? A simple

bit of applied research could probably answer this question.
A group of first year medical students might be started
out with training in the use of the POR for record
keeping and training in the generation and evaluation
of hypotheses to guide their thinking. One would want
to keep track of the kinds of conflicts which arose in
both groups, what the sources of confusion were and how
positive the students' attitudes were toward each
training program. At the end of the second year, each
student could be evaluated on how often and in what
manner he used the POR as well as be given a simulated

problem to solve. During the problem he would be asked


to think aloud which would tell the experimenter

whether he was thinking hypothetico-deductively or like a
POR.

The content and structure of any simulated diagnostic
problem a subject solves have a strong influence on his
problem solving behaviors. This observation has strong
implications for anyone wishing to use diagnostic

problem simulations in an experiment, or in a training
and/or evaluation program.

The knowledge that different people solve different problems
differently should discourage an experimenter or teacher/
evaluator from trying to use the results of a performance
on one or several diagnostic problems to make judgments or
predictions about general clinical competence. Even if a
subject could be given a "score" for clinical competence
this score would not be justifiable if based only on
solutions of a few diagnostic problems.

Particularly in a problem oriented curriculum, attempts
should be made to identify aspects of problems used for
training which can be associated with different problem
solving approaches. Which problems tend to converge on a
single solution? Which tend to be ambiguous or inconclusive?
Can it easily be determined which problems are
relatively easy (i.e. the solution is rather obvious from
the outset) and which problems are more difficult (e.g. a

special type of knowledge is needed to solve them)?

 


Which problems are common, which are uncommon? If these
elements are identified, then problems can be found or
created which best address certain problem-solving

objectives in a medical curriculum.
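
The test-retest approach to reliability proposed under the first implication above amounts to correlating scores from two administrations of essentially identical problems. A minimal sketch of that computation follows. The scores are invented for the example and are not data from this study, and the function name `pearson_r` is introduced here only for illustration.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two paired lists with at least two scores each")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary within each administration")
    return sxy / math.sqrt(sxx * syy)


# INVENTED scores for five subjects on a first and second administration.
first_administration = [60.0, 72.0, 55.0, 80.0, 68.0]
second_administration = [62.0, 70.0, 58.0, 79.0, 65.0]

reliability = pearson_r(first_administration, second_administration)
print(f"test-retest coefficient: {reliability:.2f}")
```

A coefficient obtained this way would be interpretable only under the conditions set out earlier: parallel problems, an adequate time lag, and subjects who are no longer learning rapidly.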

BIBLIOGRAPHY

1.  Dorland's Illustrated Medical Dictionary, 24th Edition,
    Philadelphia and London, W.B. Saunders Co., 1965

2.  Webster's Seventh New Collegiate Dictionary, Springfield, Mass.,
    G. and C. Merriam Co. Publishers, 1970

3.  Weed, L.L., Medical Records, Medical Education, and Patient Care,
    Cleveland, The Press of Case Western Univ., 1971

4.  Hurst, J.W. and Walker, H.K., The Problem-Oriented System,
    New York, Medcom Press, 1972

5.  Morgan, W.L. Jr. and Engel, G.L., The Clinical Approach to the
    Patient, Philadelphia, W.B. Saunders Co., 1969

6.  Elstein, A.S., Kagan, N., Shulman, L.S., Jason, H., and Loupe, M.J.,
    Methods and theory in the study of medical inquiry, Journal of
    Medical Education, 47, 1972, p. 85-92

7.  Kessel, F.S., The philosophy of science as proclaimed and science
    as practiced: "identity" or "dualism"?, American Psychologist,
    Nov. 1969, p. 999-1005

8.  Bacon, Francis, Novum Organum (First Part), in The English
    Philosophers from Bacon to Mill, Ed, E.A. Burtt, New York,
    Modern Library, 1931

9.  Kuhn, T.S., The Structure of Scientific Revolutions, Chicago,
    University of Chicago Press, 1962

10. Popper, K.R., The Logic of Scientific Discovery, New York,
    Basic Books, Inc., 1959

11. Medawar, P.B., Induction and Intuition in Scientific Thought,
    London, Methuen and Co., Ltd., 1969

12. Luchins, A.S. and Luchins, E.H., New experimental attempts at
    preventing mechanization in problem solving, Journal of General
    Psychology, 42, 1950, p. 279-297

13. Wason, P.C., 'On the failure to eliminate hypotheses ...' -- a
    second look, in Thinking and Reasoning, Ed, P.C. Wason and
    P.N. Johnson-Laird, Baltimore, Penguin Modern Psychology
    Readings (U.S.), 1968, p. 165-174

14. Elstein, A.S. and Shulman, L.S., Scoring and analysis of medical
    inquiry protocols, Unpublished manuscript, 1971

15. Bruner, J.S., Goodnow, J.J., and Austin, G.A., A Study of
    Thinking, New York, Wiley, 1956

16. Miller, G.A., Galanter, E., and Pribram, K.H., Plans and the
    Structure of Behavior, New York, Holt, 1960

17. Simon, H.A. and Newell, A., Human problem solving: the state
    of the theory in 1970, American Psychologist, December, 1970

18. Bartlett, F.S., Thinking, New York, Basic Books, 1958

19. Miller, G.A., The magical number 7 ± 2: some limits of our
    capacity for processing information, Psychological Review,
    63, 1956, p. 81-97

20. Kleinmuntz, B., The processing of clinical information by man and
    machine, in Formal Representation of Human Judgment, Ed, B.
    Kleinmuntz, New York, Wiley, 1968

21. Price, R. and Vlahcevic, Z.R., Logical principles in differential
    diagnosis, Annals of Internal Medicine, 75, 1971, p. 89-95

22. Dudley, H.A.F., Clinical method, The Lancet, Jan. 2, 1971

23. Sprosty, P.J., The use of questions in the diagnostic problem
    solving process, in the Diagnostic Process, Ed, J.A. Jacquez,
    Ann Arbor, 1964

24. Elstein, A.S., Current status of the medical inquiry project,
    paper presented at AAMC Invitational Workshop, Washington, D.C.,
    June 6-8, 1972

25. Gordon, M.J., Training in the use of heuristics in diagnostic
    problem solving among advanced medical students, Tentative
    title of doctoral dissertation in progress, ETC, May, 1973

26. Claparede, E., La genese de l'hypotheses, Archives de Psychologie,
    24, 1934, p. 1-154

27. Newell, A., Simon, H.A., and Shaw, J.C., Elements of a theory of
    human problem solving, Psychological Review, 65, 1958, p. 151-166

 

28. Newell, A., On the analysis of human problem solving protocols,
Paper given at the International Symposium on Mathematical and
Computational Methods in the Social Sciences, Rome, 1966

29. Neisser, U., The multiplicity of thought, British Journal of
Psychology, 54, 1963, p. 1-14

30. Elstein, A.S., Sprafka, S.A., and Shulman, L.S., Analyzing
Medical Inquiry Processes, Paper read at 1973 Annual Meeting of
American Educational Research Association, New Orleans, La.,
Feb. 25 - Mar. 1, 1973

31. Elstein, A.S., Kagan, N., Shulman, L.S., and Jason, H., Final
Report of Medical Inquiry Project, Chapter 4, Office of Medical
Education Research and Development, Michigan State University,
ETC, July, 1973

32. McGuire, C.H., Personal Communication, Feb. 25, 1972

33. Gagne, R.M. and Smith, E.C., Jr., A study of the effects of
verbalization on problem solving, Journal of Experimental
Psychology, 63, 1962, p. 12-18

34. Helfer, R.E. and Ealy, J.M., Observations of pediatric interviewing
skills, Paper read at the 1971 Annual Meeting of AAMC

35. Interdepartmental Appraisal Committee of University of Illinois
College of Medicine, Clinical Simulations, Christine H. McGuire
and Lawrence M. Solomon, Eds., New York, Appleton, Century,
Crofts, 1971

36. Rimoldi, H.J.A., The Test of Diagnostic Skills, Journal of Medical
Education, 36, 1961, p. 73-79

37. McGuire, C.H. and Babbott, J.M., Simulation techniques in the
measurement of problem solving skills, Journal of Medical
Education, Spring, 1967

38. Lewy, A., and McGuire, C.H., A study of alternative approaches in
estimating the reliability of unconventional tests, Paper read at
Annual Meeting of AERA, February, 1966

39. McGuire, C.H., A summary of the evidence regarding the technical
characteristics of Patient Management Problems, A special report
prepared for the Committee on Examinations of the American
Academy of Orthopedic Surgery, Fall, 1970

40. Frase, L.T., Paragraph organization of written materials: the
influence of conceptual clustering upon the level and organization
of recall, Journal of Educational Psychology, 60, 1969,
p. 394-401

41. Goran, M.J., Williamson, J.W., and Gonnella, J.S., The
validity of Patient Management Problems, Journal of Medical
Education, 48, 1973, p. 171-177

42. Angoff, W.H., Test reliability and effective test length,
Psychometrika, 18, 1, 1953, p. 1-14

43. Angoff, W.H., A note on the estimation of nonspurious
correlations, Psychometrika, 21, 3, 1956, p. 295-297

44. Finn, J., Multivariance -- Univariate and Multivariate Analysis
of Variance and Covariance, a FORTRAN IV Program, State
University of New York at Buffalo, 1970

45. Sutcliffe, J.P., On the role of "instructions to the subject"
in psychological experiments, American Psychologist, 27, 8,
1972, p. 755-758

46. Levine, H., University of Texas at Austin, February, 1973,
Personal Communication

APPENDICES

APPENDIX A

Instructions Read to Subjects

E-V

INSTRUCTIONS

During this session you will be asked to solve three simulated
clinical problems. Your task will be to arrive at a definitive
diagnosis for each problem with the aid of the information available.

The procedure used for solving all three problems is the same. As
you can see from the sheet entitled Problem I which I have given you,
a written introduction is given at the beginning of each problem.
The introduction is followed by numbered options which you may request.
These options give you information which will help you solve the
problem, i.e. reach a definitive diagnosis. The options fall into the
general categories of history, physical, laboratory and x-ray studies,
and management. As you request the options, I will hand you the
appropriate information printed on a card. Please request information
one item at a time by stating the number of the item. As you request
an option please list its number in the column labeled "options" on the
sheet provided.

As you work on the problem, I would like you to use any useful
information you get to help you think of possible problem formulations.
I encourage you to speculate somewhat and base formulations on
relatively small amounts of information. Furthermore, if you wish to,
use these formulations as guides in the selection of more information.

While you are doing the problem, I will interrupt you periodically
and ask you to do two things. First I will ask you to write down any
problem formulations you may have thought of at that point. Secondly,
I will ask you to quickly describe to me what caused you to entertain
those formulations. For example, did a certain item of information
cause you to think of a formulation; had you seen a case before which
presented in this manner, and so on. Each time I ask you to list
problem formulations, please list as many as you are thinking of up to
five (5). If you have no formulations in mind you need not list any.
If you are still entertaining the same formulations you listed
previously, you may simply list those same ones again. At the end of
each problem please state your definitive diagnosis for that problem.

You will be evaluated on the efficiency, thoroughness, and
accuracy of your work. The accuracy score refers only to the
definitive diagnosis. Efficiency and thoroughness refer to how much
information you gather to reach a solution. These two scores are
closely related. They tend to balance each other. A good performance
is given by choosing that amount of information which results in an
adequate workup. You should not try to solve the problem by choosing
as few items as possible, nor should you request information which is
useless to you or may be harmful to your patient.

I will keep this tape recorder going while you are working. Do
not pay any attention to it. It is harmless. Now, are there any
questions before we begin?

If you are ready, please read the introduction to Problem I
aloud.

 

L-V
INSTRUCTIONS

During this session you will be asked to solve three simulated
clinical problems. Your task will be to arrive at a definitive diagnosis
for each problem with the aid of the information available. The procedure
used for solving all three problems is the same. As you can see from
the sheet entitled Problem I which I have given you, a written intro-
duction is given at the beginning of each problem. The introduction
is followed by numbered options which you may request. These options
give you information which will help you solve the problem, i.e. reach
a definitive diagnosis. You may request these options in any order
you wish. The options fall into the general categories of history,
physical, laboratory and x-ray studies, and management. As you request
the options, I will hand you the appropriate information printed on a
card. Please request information one item at a time by stating the
number of the item. As you request an option please list its number
on the sheet provided.

I would like you to gather as much information as you feel you need
to do a generally good workup and to make a definitive diagnosis. When
you have the information you need, please group together those cues
which fit together into problem formulations, and indicate which
formulation may be considered as the definitive diagnosis for that
problem. I caution you against leaping to conclusions about possible
formulations based on relatively small amounts of data. Drawing early
conclusions can adversely bias a workup and may lead you to overlook
some important information. Throughout the problem you will have
available all the cards I have given you. Thus when you have the data
you need, it will be an easy task for you to aggregate groups of
data together into problem formulations.

While you are doing the problem, I will interrupt you periodically
and ask you to "think aloud" for me about how you are coming along. I
may ask you a few questions about how helpful you find the data I
am giving you, and what types of data you still need in order to
arrive at a solution for the problem. My main interest in doing this is
to get an idea from you about how you approach a problem of this type,
and what goes through your mind while you are solving it.

You will be evaluated on the efficiency, thoroughness and
accuracy of your work. The accuracy score refers only to the definitive
diagnosis. Efficiency and thoroughness refer to how much information
you gather to reach a solution. These two scores are closely related.
They tend to balance each other. A good performance is given by
choosing that amount of information which results in an adequate
workup. You should not try to solve the problem by choosing as few
items as possible, nor should you request information which is useless
to you or may be harmful to your patient.

I will keep this tape recorder going while you are working. Do
not pay any attention to it. It is harmless. Now, are there any
questions before we begin?

C-V
INSTRUCTIONS

During this session you will be asked to solve three simulated
clinical problems. Your task will be to arrive at a definitive
diagnosis for each problem with the aid of the information available.
The procedure used for solving all three problems is the same. As
you can see from the sheet entitled Problem I which I have given you,
a written introduction is given at the beginning of each problem.
The introduction is followed by numbered options which you may
request. These options give you information which will help you
solve the problem, i.e. reach a definitive diagnosis. You may request
these options in any order you wish. The options fall into the general
categories of history, physical, laboratory and x-ray studies, and
management. As you request the options, I will hand you the appropriate
information printed on a card. Please request information one item
at a time by stating the number of the item. As you request an option
please list its number in the column labeled "Options" on the sheet
provided.

As you work through the problem you may make notes on that same
sheet about interesting items of information, possible things you
might look for, and so on. You do not have to make notes if you do not
wish to. They are purely for your convenience.

While you are doing the problem I will interrupt you periodically
and ask you to "think aloud" for me about how you are attacking this
problem and how helpful you find the information I am giving you. My
main interest in doing this is to get an idea from you about how you
go about solving a problem like this and what kinds of things you
think about as you work through it.

You will be evaluated on the efficiency, thoroughness, and
accuracy of your work. The accuracy score refers only to the
definitive diagnosis. Efficiency and thoroughness refer to how much
information you gather to reach a solution. These two scores are
closely related. They tend to balance each other. A good performance
is given by choosing that amount of information which results in an
adequate workup. You should not try to solve the problem by choosing
as few items as possible, nor should you request information which
is useless to you or may be harmful to your patient.

I will keep this tape recorder going while you are working. Do
not pay any attention to it. It is harmless. Now are there any
questions before we begin?

E-NV
INSTRUCTIONS

During this session you will be asked to solve three simulated
clinical problems. Your task will be to arrive at a definitive
diagnosis for each problem with the aid of the information available.
The procedure used for solving all three problems is the same. As
you can see from the sheet entitled Problem I which I have given you,
a written introduction is given at the beginning of each problem. The
introduction is followed by numbered options which you may request.
These options give you information which will help you solve the
problem, i.e. reach a definitive diagnosis. You may request these
options in any order you wish. The options fall into the general
categories of history, physical, laboratory and x-ray studies, and
management. As you request the options, I will hand you the appropriate
information printed on a card. Please request information one item
at a time by stating the number of the item. As you request an option
please list its number on the sheet provided.

As you work on the problem, I would like you to use any useful
information you get to help you think of possible problem formulations.
I encourage you to speculate somewhat and base formulations on
relatively small amounts of information. Furthermore, if you wish to,
use these formulations as guides in the selection of more information.
Please try to remember any formulations you entertain along the way.
If you can, also try to recall what events caused you to think of these
formulations. When you have finished the problem, I will go back over
it with you and ask you to review your thoughts as you worked through
the problem.

Although this is a simulation, I would like you to treat me as
much like a patient as possible. Particularly, I request that you not
reveal to me what you are thinking about while you are doing the
problem, by, for example "talking to yourself" about it.

You will be evaluated on the efficiency, thoroughness and accuracy
of your work. The accuracy score refers only to the definitive
diagnosis. Efficiency and thoroughness refer to how much information
you gather to reach a solution. These two scores are closely related.
They tend to balance each other. A good performance is given by choosing
that amount of information which results in an adequate workup. You
should not try to solve the problem by choosing as few items as
possible, nor should you request information which is useless to you
or may be harmful to your patient.

Now are there any questions before we begin?

 

L-NV
INSTRUCTIONS

During this session you will be asked to solve three simulated
clinical problems. Your task will be to arrive at a definitive
diagnosis for each problem with the aid of the information available.
The procedure used for solving all three problems is the same. As
you can see from the sheet entitled Problem I which I have given you,
a written introduction is given at the beginning of each problem.
The introduction is followed by numbered options which you may request.
These options give you information which will help you solve the
problem, i.e. reach a definitive diagnosis. You may request these
options in any order you wish. The options fall into the general
categories of history, physical, laboratory and x-ray studies, and
management. As you request the options, I will hand you the appropriate
information printed on a card. Please request information one item at
a time by stating the number of the item. As you request an option
please list its number on the sheet provided.

I would like you to gather as much information as you feel you
need to do a generally good workup and to make a definitive diagnosis.
When you have the information you need, please group together those
cues which fit together into problem formulations, and indicate which
formulation may be considered as the definitive diagnosis for that
problem. I caution you against leaping to conclusions about possible
formulations based on relatively small amounts of data. Drawing early
conclusions can adversely bias a workup and may lead you to overlook
some important information. Throughout the problem you will have
available all the cards I have given you. Thus when you have the
data you need, it will be an easy task for you to aggregate groups
of data together into problem formulations. I would like to find
out more about how you solve these problems than your written responses
can tell me. For that reason, when you have finished the problem, I
will go back over it with you. At that time I would like you to
review for me any interesting thoughts you may have had while solving
the problem.

Although this is a simulation, I would like you to treat me as
much like a patient as possible. Particularly, I request that you
not reveal to me what you are thinking about while you are doing the
problem, by, for example "talking to yourself" about it.

You will be evaluated on the efficiency, thoroughness and
accuracy of your work. The accuracy score refers only to the definitive
diagnosis. Efficiency and thoroughness refer to how much information
you gather to reach a solution. These two scores are closely related.
They tend to balance each other. A good performance is given by
choosing that amount of information which results in an adequate
workup. You should not try to solve the problem by choosing as few
items as possible, nor should you request information which is useless
to you or may be harmful to your patient.

Now, are there any questions before we begin?

C-NV
INSTRUCTIONS

During this session you will be asked to solve three simulated
clinical problems. Your task will be to arrive at a definitive
diagnosis for each problem with the aid of the information available.
The procedure used for solving all three problems is the same. As
you can see from the sheet entitled Problem I which I have given you,
a written introduction is given at the beginning of each problem.
The introduction is followed by numbered options which you may
request. These options give you information which will help you solve
the problem, i.e. reach a definitive diagnosis. You may request these
options in any order you wish. The options fall into the general
categories of history, physical, laboratory and x-ray studies, and
management. As you request the options, I will hand you the appropriate
information on a card. Please request information one item at a time
by stating the number of the item. As you request an option please
list its number on the sheet provided.

As you work through the problem you may note particularly
interesting bits of information. I am sure that many things will flash
through your mind which you will not write down. I would like to find
out more about how you solve these problems than your written responses
can tell me. For that reason, when you have finished the problem, I
will go back over it with you. At that time I would like you to review
for me any interesting thoughts you may have had while solving the
problem.

Although this is a simulation, I would like you to treat me as
much like a patient as possible. Particularly, I request that you not
reveal to me what you are thinking about while you are doing the
problem, by, for example "talking to yourself" about it.

You will be evaluated on the efficiency, thoroughness and
accuracy of your work. The accuracy score refers only to the definitive
diagnosis. Efficiency and thoroughness refer to how much information
you gather to reach a solution. These two scores are closely related.
They tend to balance each other. A good performance is given by
choosing that amount of information which results in an adequate
workup. You should not try to solve the problem by choosing as few
items as possible, nor should you request information which is useless
to you or may be harmful to your patient.

Now, are there any questions before we begin?

 


Problem I
A SURGICAL ABDOMEN

Assume you are a young general practitioner, a member of the staff
of your modern BOO-bed community hospital. You are called by the intern
at 10:30 p.m. to see a patient in the Emergency Room.

When you arrive, you find a forty-seven year old man who complains
of abdominal pain and vomiting. The pain began 3 weeks ago; the
patient took Bromo-Seltzer® with some relief. He continued to work
until 1 week ago when he stopped working because of pain. After 2 days
at home the pain began to improve, but he began to vomit small amounts.
Similar, though less severe, episodes of pain have occurred off and on
for the past three years.

In working up this patient you would be interested in doing or finding
out which of the following (select as many items as you consider
pertinent in the order you feel is appropriate):

1 Admit patient to hospital
2 Give antispasmodics, analgesics, and antidotes; reassure the patient;
send him home and plan to see him at home early the next morning
3 Call the operating room and schedule the patient for surgery
4 Observe the patient closely for the next few hours
5 Obtain history information
6 Examine the patient
7 Obtain laboratory, x-ray, and other diagnostic information
8 Start appropriate therapy
9 Continue management without surgical intervention

Laboratory, X-ray and other diagnostic tests:

10 Hemoglobin and hematocrit
11 White blood cell count
12 Red cell smear, morphology
13 Differential white count
14 Urinalysis
15 (Report)
16 Urine culture
17 Stool guaiac
18 Erythrocyte sedimentation rate
19 Serum electrolytes
20 Arterial pH and pCO2 determinations
21 Venous pH
22 Blood urea nitrogen
23 Serum creatinine
24 Total protein
25 Albumin/globulin ratio
26 Serum protein electrophoresis
27 Blood ammonia
28 Total and direct bilirubin
29 Cholesterol
30 Bromsulphalein retention
31 Cephalin flocculation
32 Thymol turbidity
33 Serum glutamic oxaloacetic transaminase
34 Alkaline phosphatase
35 Lactic dehydrogenase
36 Acid phosphatase
37 Serum amylase
38 Urine amylase
39 Random blood sugar
40 Fasting blood sugar
41 2 hour postprandial blood sugar
42 Serum calcium and inorganic phosphorus
43 VDRL
44 Urine electrolytes
45 (Report)
46 Purified protein derivative skin test
47 Blood volume
48 Gastric analysis
49 Chest X-ray
50 Upright film of abdomen
51 Flat film of abdomen
52 Barium enema
53 Upper gastrointestinal series
54 Intravenous pyelogram
55 Oral cholecystogram
56 Intravenous cholangiogram
57 Electrocardiogram
58 Venous pressure
59 Circulation time (arm to tongue)
60 Pulmonary scan
61 Pulmonary function studies
62 Hepatic scan
63 Lumbar puncture
64 Echoencephalogram

History:

65 Headache
66 Epistaxis
67 Hemoptysis
68 Chest pain
69 Cough
70 Previous hypertension
71 Appetite
72 Dysphagia
73 Nausea and vomiting
74 Bowel habits
75 Type of pain
76 Location of pain
77 Nature of vomiting
78 Nature of stools
79 Nature of diet
80 Weight loss
81 Alcohol intake
82 Jaundice in past
83 Bleeding tendency
84 Belching
85 Radiation of pain
86 Chills and fever
87 Pruritus
88 Steatorrhea
89 Hematemesis
90 Food intolerance
91 Fatigue
92 Dizziness, vertigo, fainting
93 Character of urine
94 Family history
95 Angina
96 Dyspnea
97 Dysuria
98 Edema
99 Allergies
100 Drug history
101 Smoking history
102 Previous hospitalization
103 Previous operations
104 Previous X-ray studies
105 Trauma history

Physical Examination:

106 Pupils
107 Eyegrounds
108 External ear
109 Scalp
110 Nose
111 Mouth
112 Pharynx
113 Neck
114 Chest and lungs
115 Breasts
116 Peripheral pulses
117 Blood pressure
118 Valsalva maneuver
119 Pulse rate
120 Respiratory rate
121 Temperature
123 Abdominal wall
124 Bowel sounds
125 Liver
126 Spleen
127 Abdominal mass
128 Abdominal tenderness
129 Inguinal area
130 External genitalia
 

131 Right lower quadrant tenderness to palpation
132 Rebound tenderness
133 Referred rebound
134 Costovertebral angle
135 Back tenderness
136 Range of motion of spine
137 Straight leg raising
138 Rectal examination
139 Sigmoidoscopy
140 Skin
141 Heel-to-knee test
142 Serial sevens
143 General appearance of patient
144 Chvostek's sign
145 Axilla
146 Visible peristalsis

Intervention (nonsurgical):

147 Nothing by mouth
148 Clear liquid diet
149 Bland, low residue diet
150 Force fluids
151 Nasogastric suction
152 Long intestinal tube for suction
153 Gastric lavage
154 Tap water enema
155 Magnesium sulfate 15 gm by mouth
156 Catheterized urine for urinalysis
157 Condom drainage
158 Record intake and output
159 Record urine output every 2 hours
160 Maalox®, 30 ml every hour
161 Type and Crossmatch 1500 cc whole blood
162 Oxygen by nasal catheter at 6 liters per minute flow
163 Irrigate nasogastric tube every hour
164 Oxygen tent with high humidity
165 Atropine 0.4 mg intravenously
166 Morphine 5 mg intramuscularly
167 Meperidine 100 mg intramuscularly every 4 hours
168 Neomycin by mouth in therapeutic doses
169 Procaine penicillin G in therapeutic doses
170 Ampicillin in therapeutic doses
171 Kanamycin and penicillin in therapeutic doses
172 Chloramphenicol and penicillin in therapeutic doses
173 Lanatoside C 0.5 mg by mouth
174 Digitoxin 0.4 mg intravenously
175 Hydrocortisone intravenously
176 Lente insulin 40 units subcutaneously
177 Protamine zinc insulin 80 units subcutaneously
178 Tolbutamide
179 Calcium chloride intravenously
180 Packed RBC slowly intravenously
181 5% dextrose in water, 74 ml per hour
182 Hypotonic saline (0.5%) 1000 ml intravenously in next 4 hours
183 1/6 molar sodium lactate, 1000 ml intravenous in next 4 hours
184 Potassium 20 mEq for each hour of intravenous fluids
185 Add potassium chloride, 60 mg to each bottle of intravenous fluid
186 Central venous pressure catheter placed
187 Crystalline insulin 70 units stat intravenously and 50 units
subcutaneously
188 Crystalline insulin every 1 to 2 hours depending on blood sugar
determination
189 Phosphate 14 millimoles for each hour of intravenous therapy
190 Intravenous fluids containing ammonium chloride
191 Intravenous antibiotics and hydrocortisone in appropriate doses
192 Tracheostomy and respirator assisted ventilation

 


193 Order intravenous fluids
for next 24 hours:
1000 ml 5% dextrose in
water, 500 ml Ringer's
lactate, 40 mEq KCl given
slowly; leave nasogastric
tube in place but
clamped

194 Start liquid diet

195 Order intravenous fluids
for next 24 hours:
2500 ml 5% dextrose in
water, 1500 ml normal
saline, 120 mEq KCl;
continue nasogastric
suction

196 Pull nasogastric tube,
start fluids cautiously
by mouth

197 Order intravenous fluids
for next 24 hours:
1500 ml 5% dextrose in
water, 1000 ml normal
saline; continue
nasogastric suction

Intervention (surgical):

198 Emergency laparotomy

199 Cholecystectomy

200 Cholecystostomy

201 Exploration of common
bile duct

202 Vagotomy, antrectomy and
gastroduodenostomy

203 Transverse colostomy

204 Wide gastrotomy and
repair of bleeders

205 Vagotomy and gastro-
jejunostomy

206 50% gastric resection and
gastrojejunostomy

207 Drainage of pancreatic
pseudocyst

208 Pancreaticojejunostomy

209 Appendectomy

210 Lysis of obstructing adhesions

211 Repair of hiatus hernia

 


Problem II

THE PALE, LETHARGIC CHILD

The patient is an eight year old Negro boy who is brought to the
Emergency Room because of fatigue and pallor which has developed over
a one week period. He also complains of occasional abdominal pain,
but denies nausea or vomiting. Prior to this episode the boy had been
in apparent good health, but occasionally was thought to be pale
compared to his sister, especially following infections (colds, sore
throats). Three weeks prior to admission he had tonsillitis and
received an antibiotic for one week. The symptoms and fever associated
with this latter episode improved within 3 to 4 days after starting
the drug.

Physical examination reveals a well developed moderately pale
Negro boy in no acute distress. Oral temperature - 99.6°F, Pulse - 98
per minute, Respiration - 26 per minute, Blood pressure - 108/66 mm Hg.
The spleen is palpable 3 cm below the left costal margin; the liver
is not felt. Slight diffuse abdominal tenderness without rebound is
present. The remainder of the examination is entirely within normal
limits.

The initial blood count was recorded in the Emergency Room as
follows: hemoglobin - 3.0 gm/100 ml; hematocrit - 10%; white blood
cell count - 13,000/cu mm; differential: bands - 2%, segmented
neutrophils - 58%, lymphocytes - 40%.

The patient is admitted to the hospital for further evaluation.

In working up this patient you would be interested in which of
the following (select as many items as you consider pertinent
in the order you feel is appropriate):

1 More detailed family history
2 More detailed dietary history
3 More detailed past history
4 Chest film
5 Tuberculin skin test
6 Urinalysis
7 Serum protein electrophoresis
8 Hemoglobin electrophoresis
9 Blood smear for morphology
10 Platelet count
11 Sickle cell preparation
12 Bone marrow aspiration
13 Serum iron and iron binding capacity
14 Serum B12
15 Serum folic acid
16 Osmotic fragility of red cells
17 Prothrombin time
18 Partial thromboplastin time
19 Bleeding time
20 Reticulocyte count
21 Serum bilirubin
22 Red cell survival study with radioactive chromium (patient's cells)
23 Bone marrow biopsy (surgical)
24 Blood urea nitrogen

 


25 Serum uric acid

26 Urine coproporphyrin

27 Blood lead level

28 Coombs test

29 Lupus erythematosus
cell preparation

30 Erythrocyte glucose-6-phosphate dehydrogenase
screening test

Your prognosis for this
patient is:

31 Excellent with proper
medical therapy

32 Excellent with proper
surgical therapy

33 Should respond to proper
management of acute
episode, but is unlikely
to have normal
longevity

34 May respond to medical
management but is most
likely to die of disease
within months to a
few years

35 Unlikely to survive
present episode


Problem III
A PALE, CONFUSED PATIENT

A young married woman brings her fifty-seven year old gray-haired
mother to your office for a medical checkup. The daughter tells you
that her mother's appetite, weight, strength and well-being have
progressively deteriorated over the last 8 months. Lately she has
become more confused and mildly disoriented and recently has exhibited
slight memory loss. The patient added that for the past 4 weeks she
has tired easily, especially when walking.

In working up this patient you would be interested in doing
or finding out which of the following (select as many items
as you consider pertinent in the order you feel is appropriate):

1 Defer further investigation for the purpose of observation
and ask the patient to return in:
a) 1 week
b) 2 weeks
c) 3 weeks
2 Order chest, cervical spine, and skull films and ask the
patient to return in:
a) 1 week
b) 2 weeks
c) 3 weeks
3 Arrange for an urgent psychiatric examination
4 Request a neurosurgical consultation
5 Hospitalize the patient and arrange for further diagnostic procedures
6 Hospitalize the patient in a mental institution
7 Transfer the patient to a psychiatric ward
8 Obtain history information
9 Examine the patient
10 Arrange for laboratory evaluation on an inpatient basis
11 Arrange for laboratory tests on an outpatient basis
12 Initiate therapy

History:

13 The family history
14 Episodes of febrile illness
15 Nausea and vomiting
16 Frequency or urgency
17 Dietary intake
18 Food fads
19 Soreness in mouth or tongue
20 Alcohol intake
21 History of diabetes mellitus
22 History of hypertension
23 History of heart disease
24 History of arthritis
25 Shortness of breath on exertion
26 Any abnormalities in sensation or changes in perception
27 Motor weakness
28 Pain on urination
29 Diarrhea
30 Involuntary loss of urine
31 Coordination
32 Chest pain
33 Cough
34 Allergies
35 Gait
36 Changes in intellectual capacity
37 Orientation to time, place and people
38 Uncontrollable crying and laughing
39 Difficulty speaking
40 Double vision
41 Headaches

 

42 Insomnia

43 Black stools

44 Vaginal bleeding

45 Leg pains

46 Dyspepsia

47 History of abdominal
surgery

Physical Examination:

48 Extraocular movements

49 Fundi

50 Pupils

51 Mucosa and nail beds

52 Heart and lungs

53 Abdomen

54 Lymph nodes

55 Rectal

56 Pelvis

57 Sensory examination

58 Motor examination

59 Deep tendon reflexes

60 Babinski Sign

61 Mental status

62 Cerebellar system

63 Romberg test

64 Skin

65 Sclera

66 Blood pressure, pulse,
respirations, temperature

67 Tongue

68 Visual fields

69 Muscle atrophy

70 Breasts

71 Tenderness over spine

Laboratory, X-ray, and Other
Diagnostic Tests:

72 Hemoglobin, hematocrit,
white blood cell count,
differential

73 Urinalysis

74 Tri-iodothyronine uptake
(T3) (Resin)

75 Fasting blood sugar and
2 hour post-prandial
blood sugar

76 Blood urea nitrogen

77 Electrocardiogram

78 Chest X-ray

79 Stools for occult blood

80 Gastric analysis

81 Serum sodium, potassium, carbon
dioxide, chloride

82 Skull X-ray

83 Liver function studies

84 Blood smear

85 Red cell indices

86 Sickle cell preparation

87 Lupus erythematosus cell
preparation

88 Sedimentation rate

89 Serum creatinine

90 Serum uric acid

91 Serum total protein

92 VDRL

93 Stool for fat

94 Bone marrow

95 Lumbar puncture

96 Electroencephalogram

97 Serum calcium, phosphorus,
alkaline phosphatase

98 Gastric analysis after subcutaneous
histamine

99 Schilling test

100 Myelogram

101 Liver biopsy

102 Reticulocyte count


Therapy:

103 Multivitamins, 2 tabs three times a day by mouth

104 Prednisone, 5 mg three times a day by mouth

105 Adrenocorticotropic hormone (ACTH) 30 units intravenous
infusion

106 Iron

107 Vitamin B12 1000 mg intramuscularly every 3 days

108 Laminectomy for decompression of the spinal cord

109 Whole blood transfusion (3 units)

110 Multivitamin capsules, one capsule three times a day

111 Aristocort 4 mg twice a day

112 Vitamin B12 100 microgm intramuscularly for two weeks

113 Folic acid 10 mg per day by mouth

114 Physiotherapy

 


APPENDIX C

Procedures for Scoring Modified PMP's

1. Thoroughness

Before any scores could be calculated, the total number of possible
points had to be determined for each problem. Total possible points
was defined as the total number of non-redundant positively and
zero-weighted items in each problem. A redundant choice is one which
logically cannot be ordered if another is ordered. For example, in
one problem the opportunity to hospitalize the patient occurs twice.
Depending on the point in the problem at which the subject chooses to
hospitalize the patient, the response to his choice of that item is
different. Logically the subject can choose that item only once.
Therefore one hospitalization choice is redundant.

To obtain the total possible points for a problem, all the response
cards were counted and certain cards were subtracted from that total.
First, cards for which the subject received no credit were subtracted.
These are not zero-weighted cards, but cards which simply guide the
subject through the problem. They give him no information about the
status of the patient. For example, the student chooses the option
which states "Obtain history information" and is given a card which
says "See History section below". Secondly, all redundant cards were
subtracted. Thirdly, all non-redundant negatively weighted cards were
subtracted. The remaining non-redundant positively and zero-weighted
cards are the total possible points for that problem. One point is
assigned to each card. Although this is a somewhat complex method for
arriving at a possible total, it was thought necessary. The scores
(thoroughness particularly) are proportions, and inclusion of
redundant and no-credit items in a total would artificially and
non-uniformly reduce subjects' scores.

Thoroughness, then, is the proportion of non-redundant positively
and zero-weighted items chosen by the subject. The formula for the
calculation is given in the text on page
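
The counting rules above can be illustrated with a minimal sketch. It is not the scoring procedure actually used in the study: the `Card` structure, the `group` label for mutually exclusive choices, and the six-card example problem are all hypothetical, and the sketch assumes that a subject who picks either card of a redundant pair earns the single point for that pair.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Card:
    item: int                     # option number printed on the problem sheet
    weight: int                   # criterion weight: positive, zero, or negative
    guide_only: bool = False      # no-credit card, e.g. "See History section below"
    group: Optional[str] = None   # hypothetical label shared by mutually exclusive choices


def credit_key(card):
    """Cards in the same redundancy group share one key, so they count once."""
    return ("group", card.group) if card.group is not None else ("item", card.item)


def earns_credit(card):
    """Positively and zero-weighted cards count; guide and negative cards do not."""
    return not card.guide_only and card.weight >= 0


def total_possible_points(cards):
    """One point per non-redundant positively or zero-weighted card."""
    return len({credit_key(c) for c in cards if earns_credit(c)})


def thoroughness(cards, chosen_items):
    """Proportion of the total possible points that the subject's choices earned."""
    chosen = set(chosen_items)
    earned = {credit_key(c) for c in cards if c.item in chosen and earns_credit(c)}
    return len(earned) / total_possible_points(cards)


# Hypothetical problem: items 5 and 6 are the two hospitalization choices.
cards = [
    Card(1, 2), Card(2, 0), Card(3, -1),
    Card(5, 1, group="hospitalize"), Card(6, 1, group="hospitalize"),
    Card(8, 0, guide_only=True),
]
```

With these cards the total possible points is 3, and a subject who chose items 1, 3, 6 and 8 would earn 2 of them.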

2. Efficiency
This score is the most straightforward. It is simply the
proportion of items chosen by the subject which were positively
weighted. The formula is given on page
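
As a rough sketch of this proportion, the function below takes the criterion weights of the items a subject chose. The function name and the input format are assumptions made for illustration. The text does not say whether no-credit guide cards enter the denominator, so this sketch simply counts every weight it is given.

```python
def efficiency(chosen_weights):
    """Proportion of chosen items whose criterion weight is positive."""
    if not chosen_weights:
        raise ValueError("efficiency is undefined when no items were chosen")
    positive = sum(1 for w in chosen_weights if w > 0)
    return positive / len(chosen_weights)
```

For a subject who chose four items weighted 2, 1, 0 and -1, two of the four are positively weighted, so the efficiency would be 0.5.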

3. Accuracy

The accuracy of a subject's final response to a problem was
scored in two ways: maximum accuracy and mean accuracy. These two
scores are explained below. Responses on each problem were scaled
from 0 to 4 using as criterion a set of weights provided by
Christine McGuire. Display I gives a breakdown of final responses
and the accuracy scores assigned them. The responses listed in the
Display are all of those given by subjects in the present study. They
are variants of those found on the criterion provided by McGuire.

The maximum accuracy score a subject could obtain was determined
by the presence or absence in his final diagnosis of one of the
possibilities listed in Display I. If one of these solutions was
present, the subject's maximum accuracy was simply the accuracy score
assigned to that solution. There were two accurate solutions to
Problem I. For that problem, maximum accuracy was the mean of the
scores of those two solutions as listed by the subject. If the
subject listed only one solution, his maximum accuracy was half of
the value assigned that solution.

For example:

Final Solution                                 Max Accuracy

Peptic ulcer with pyloric obstruction and      (4 + 4) / 2 = 4
  Diabetic ketoacidosis

Acute pancreatitis and                         (2 + 4) / 2 = 3
  Diabetic ketoacidosis

Diabetic ketoacidosis                          (4 + 0) / 2 = 2
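
The Problem I rule can be sketched as follows. The two lookup tables hold only a few of the Display I scores, and the function name is hypothetical. The sketch also assumes that when a subject lists several solutions of the same kind, the best-scoring one is used.

```python
# A few Display I scores for Problem I (not the full criterion).
GI_SCORES = {
    "Peptic ulcer with pyloric obstruction": 4,
    "Acute pancreatitis": 2,
    "Dehydration": 1,
}
DIABETES_SCORES = {
    "Diabetic ketoacidosis": 4,
    "Diabetes mellitus": 2,
}


def max_accuracy_problem_one(solutions):
    """Mean of the best GI score and the best diabetes score; a missing kind scores 0."""
    best_gi = max((GI_SCORES[s] for s in solutions if s in GI_SCORES), default=0)
    best_dm = max((DIABETES_SCORES[s] for s in solutions if s in DIABETES_SCORES), default=0)
    return (best_gi + best_dm) / 2
```

Applied to the three final solutions tabulated above, the function gives 4, 3 and 2 respectively.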

The mean accuracy score was the mean of the accuracy scores of
all the solutions the subject listed. Examples of these scores on
Problems I, II and III are shown below.

Problem I

Solutions                                Max Accuracy   Mean Accuracy
Peptic ulcer with pyloric obstruction    4              (4 + 4 + 1) / 3 = 3
Diabetic ketoacidosis
Dehydration

Problem II

Solutions                                Max Accuracy   Mean Accuracy
Familial hemolytic anemia                3              (3 + 1) / 2 = 2
Bone marrow failure

Problem III

Solutions                                Max Accuracy   Mean Accuracy
Pernicious anemia                        4              (4 + 2 + 1) / 3 = 2.3
Peripheral neuropathy
Hypertension
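
The arithmetic in these examples can be checked with a short sketch. The function names are illustrative only. `max_accuracy` applies to Problems II and III, where a single best solution determines the score. Problem I uses the two-solution rule described earlier and is not covered by it.

```python
def mean_accuracy(scores):
    """Mean of the accuracy scores of every solution the subject listed."""
    return sum(scores) / len(scores)


def max_accuracy(scores):
    """Highest accuracy score among the listed solutions (Problems II and III)."""
    return max(scores)


problem_one = [4, 4, 1]    # peptic ulcer with obstruction, ketoacidosis, dehydration
problem_two = [3, 1]       # familial hemolytic anemia, bone marrow failure
problem_three = [4, 2, 1]  # pernicious anemia, peripheral neuropathy, hypertension
```

The Problem III mean is 7/3, which the table reports rounded to 2.3.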

When the two accuracy scores were calculated it was thought that
the mean accuracy score would be used in the analysis. This was later
deemed unwise since it tended to penalize students for being thorough
in their final listing of solutions or problems, as those trained in
the problem-oriented tradition have been taught to do. For this
reason primarily, the maximum accuracy score was used in analysis.

4. Hypothesis Generation and Number of Hypotheses

Before determining when hypotheses were generated and how many
were entertained by a subject, the entity "hypothesis" had to be
defined.

A hypothesis is any disease entity or problem (in the problem-
oriented sense) mentioned by the subject. It must be mentioned in a
positive context. The following are examples of hypotheses:

Ulcer

Diabetes

Anemia mentioned before CBC is obtained. In Problem II, anemia
is not a hypothesis unless qualified, e.g. Hemolytic anemia

Bleeding

Gall bladder problem

GI problem

Depression

Psychological problem
The above hypotheses would be credited if mentioned in the following
types of context:

"The patient may have HYP"

"This makes me think of HYP"

"I'm going to check for HYP"

 

Those hypotheses would not be credited if mentioned in the following
types of context:

"This is probably not HYP"

"This cue does not go along with HYP"

Generation of a hypothesis was credited at the time the hypothesis
was first mentioned. If the hypothesis was first mentioned in
association with a cue (e.g. "When I got CUE it made me think of HYP"),
generation was credited at the time the cue was obtained. For subjects
in the Verbalization condition this procedure was easily applied since
the subject stopped periodically during the workup and commented. His
comments often contained hypothesis generations and cue associations.
Subjects in the Non-Verbalization condition were asked to recall the
problem by going back through it card by card. This procedure was
used to reduce retrospective distortion to a minimum. Hypothesis
generation was credited during the recall using the same rules as
were used for the Verbalization group.
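
The crediting rules can be sketched roughly as below. The `Mention` record and its fields are hypothetical, and steps are assumed to be integers marking positions in the workup. Whether a comment is a positive or a negative context was a judgment made by the scorer, so it is supplied as an input here and is not inferred from the wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    hypothesis: str
    step: int                        # point in the workup at which the comment was made
    positive: bool                   # True for "may have HYP", False for "probably not HYP"
    cue_step: Optional[int] = None   # step at which an associated cue was obtained, if any


def generation_steps(mentions):
    """Map each credited hypothesis to the step at which its generation is credited."""
    credited = {}
    for m in sorted(mentions, key=lambda m: m.step):
        if not m.positive or m.hypothesis in credited:
            continue  # negative contexts and repeat mentions earn no new credit
        # A first mention tied to a cue is credited back to when the cue was obtained.
        credited[m.hypothesis] = m.cue_step if m.cue_step is not None else m.step
    return credited


mentions = [
    Mention("ulcer", 3, True),
    Mention("diabetes", 5, False),
    Mention("anemia", 7, True, cue_step=4),
    Mention("ulcer", 9, True),
]
```

In this made-up record, "ulcer" is credited at step 3, "anemia" at step 4 (when its cue was obtained), and "diabetes" is not credited at all.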

To test one of the research hypotheses (see page ) subjects
were divided on two hypothesis generation dimensions: early or late,
and many or few. Early hypothesis generators were those who generated
the first hypothesis after having read the introduction to the problem,
but before choosing any options. Late generators were subjects who
generated the first hypothesis after choosing one or more options. The
total number of hypotheses generated was plotted in a separate
histogram for each problem. A natural break fell at about ten
hypotheses in each problem. Therefore subjects were divided into two
groups at that point, with up to nine hypotheses being "few" and ten
or more hypotheses being "many".
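
The two grouping rules reduce to simple threshold checks, sketched below. The function names and argument names are assumptions made for illustration. The cut point of ten comes from the histograms described above.

```python
MANY_THRESHOLD = 10  # natural break observed in the histograms


def timing_group(options_chosen_before_first_hypothesis):
    """'early' if the first hypothesis preceded any option choice, else 'late'."""
    return "early" if options_chosen_before_first_hypothesis == 0 else "late"


def quantity_group(number_of_hypotheses):
    """'many' for ten or more hypotheses, 'few' for nine or fewer."""
    return "many" if number_of_hypotheses >= MANY_THRESHOLD else "few"
```

A subject who generated a first hypothesis right after reading the introduction and went on to mention twelve hypotheses would fall in the early, many cell.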

 

APPENDIX D

Criteria Used for Process Analyses

 

[Pages 135-138: rotated tables for Problem I and Problem II, each
continued on a second page. The table contents are not recoverable
from this copy.]

DISPLAYS

DISPLAY I

Accuracy Scores

Problem I -- GI solutions
Score of 4
Peptic ulcer with pyloric obstruction
Duodenal peptic ulcer with pyloric spasm
Peptic ulcer with scarring and narrowing of the proximal duodenum
Duodenal ulcer with obstruction
Duodenal bulb ulcer with gastric outlet obstruction
Pyloric outlet obstruction secondary to old duodenal ulcer
High partial obstruction with or without ulcer
Score of 3
Right sided obstruction of ascending colon
Partial obstruction at duodenum
Active chronic duodenal ulcer
Acute GI bleed with duodenal ulcer
Gastric outlet obstruction
Obstruction secondary to ulcer
Duodenal peptic ulcer
Zollinger-Ellison Syndrome
Score of 2
Acute pancreatitis
Cancer of the head of the pancreas causing obstruction
Obstruction secondary to pancreatitis
Alcoholic pancreatitis

Cholecystitis
Stroke

 


Pancreatic pseudocyst

Chronic pancreatitis

Ulcer

Small bowel infarct secondary to obstruction
Score of 1

Alcoholism

Barium peritonitis

Alcoholic gastritis

Dehydration

GI hemorrhage with question of obstruction
Score of 0

Gall bladder problem

Problem I -- Diabetes solutions

Score of 4

Diabetic ketoacidosis
Score of 2

Diabetes

Diabetes mellitus

Acidosis

 

Problem II

Score of 4
Congenital spherocytosis
Hereditary spherocytosis
Score of 3
Familial hemolytic anemia
Score of 2
Hemolytic anemia
Hemolytic process
Score of 1
Idiopathic hypersplenism
Chronic disease of hematopoietic and immune systems
Aplastic anemia
Hereditary cell defect
Bone marrow failure
Hereditary RBC disease
Score of 0
Iron deficiency anemia
Chronic lymphocytic leukemia
Leukemia
Lead poisoning
Coagulopathy

Autoimmune hemolytic anemia

 

Problem II -- Other identified problems

Score of 1
Porphyrinuria
Decreased production of RBC's

Severe anemia

Problem III

Score of 4
Pernicious anemia with neurologic complications
Pernicious anemia
Decreased production of Intrinsic Factor with decreased
absorption of Vitamin B12
Pernicious anemia with achlorhydria
Vitamin B12 deficiency
Vitamin B12 deficiency with subacute combined degeneration
Score of 3
Subacute combined degeneration
B12 or Folic dependent macrocytic anemia
Megaloblastic anemia
Score of 2
Peripheral neuropathy
Score of 1
Brain stem problem
Score of 0
Gastric carcinoma
Stroke
Plummer-Vinson Syndrome
Organic brain syndrome
Spinal cord compression
Cerebral metastases
Cerebral dementia

Problem III -- Other identified problems

Score of 1

Hypertension

 

DISPLAY II

Representative Hypotheses

Columns: Hypothesis, Hi (n=10), Lo (n=10), Lo & Drop (n=4)

Hypothesis
Sickle cell anemia
G6PD deficiency
Thalassemia
Hemolytic anemia
Autoimmune anemia
Blood loss anemia
Hereditary cell problem
Hypersplenism
Bone marrow repression
Leukemia
Lymphoma
Iron deficiency anemia
Vit. B12 deficiency
Folic acid deficiency
Lead poisoning
Infection

[The group counts for each hypothesis did not survive extraction
intact. Legible values, in reading order: Hi 10, 8, 1; Lo 10, 6;
Lo & Drop 3, 4, 1.]