w

 

 

THE MEASUREMENT OF EMOTIONAL HEALTH
THROUGH THE USE OF ESTAVAN’S MODIFIED
PAIRED COMPARISON TECHNIQUE

Thesis for tho Doom of M. A.
MICHIGAN STATE UNIVERSITY
Ross E. Carter
I966

THESIS

  
     

  

LIBRARY

Michigan State
University

 

ABSTRACT

THE MEASUREMENT OF EMOTIONAL HEALTH THROUGH THE USE
OF ESTAVAN'S MODIFIED PAIRED COMPARISON TECHNIQUE

by Ross E. Carter

The purpose of this paper was to determine if judg-
ments of emotional health could be quantitatively measured
using Estavan's modified paired comparison method to de-
rive a scale value for each stimulus judged, as well as
to assess the reliability of such measurements.

Six protocols of 20 TAT stories each were presented
in ﬂ-§E:£L pairs to two judges who judged the amount of
emotional health of one member of a pair as compared to the
other member. The Estavan method of modified paired com-
parisons was used. This method requires that the member of
a pair judged greater on an attribute be represented by a
20 centimeter line and that the lesser member of the pair
be compared to the greater by placing a point on the 20
centimeter line which indicates how much, in comparison to
the greater member, the lesser member has of the attribute
being judged. This procedure results in a ratio or propor-
tional judgment.

Ratio scale values were derived for each set of

TAT stories for each judge. A measure of inter-judge

Ross E. Carter

reliability resulted in a correlation of .87. Measures of
intra—judge reliability, using a method similar to Gulliksen
and Tukey's for Thurstone's paired comparison data. showed
that the scale values accounted for .79 of the variance

of Judge 1 and .93 of the variance of Judge 2.

Approved : g; ,Zﬂ’4

Date:

 

"‘1—7"

‘“\

THE MEASUREMENT OF EMOTIONAL HEALTH THROUGH THE USE

OF ESTAVAN'S MODIFIED PAIRED COMPARISON TECHNIQUE

BY

Ross E. Carter

A THESIS

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

MASTER OF ARTS

Department of Psychology

1966

0'

ACKNOWLEDGMENTS

The author wishes to express hks appreciation to
Dr. Bertram Karon who was most patient and who gave freely
of his time.

A special word of thanks goes to the author's wife
who did all those good things which often lighten the heavy

load of work.

ii

DEDICATION

To Charl

iii

TABLE

INTRODUCTION . . . . . . .

REVIEW OF THE LITERATURE .

PROBLEM . . . . . . . . .

METHOD . . . . . . . . . .

RESULTS . . . . . . . . .

DISCUSSION . . . . . .

SUMMARY . . . . . . . . .

BIBLIOGRAPHY . . . . . .'.

APPENDICES . . . . . . . .

OF CONTENTS

iv

Page

15
16
27
3O
36
37

43

Table

1.

LIST OF TABLES

Scale values for the stimuli judged by
two judges . . . . . . . . . . . .

Scale values for subjects by class . .

Appendix
A.
B.
C.

D.

LIST OF APPENDICES

Subject Data . . . . . . . . . . . . . .

Order of TAT Card Presentation to Subjects
Instructions to Judges . . . . . . . . . .
Order of Presentation of the Stimuli . . .
Member of Each Pair Judged Healthier . . .
Scale values for Naive Judges . . . . . .
Inter- and Intra-Judge Reliabilities . . .

Questionnaire . . . . . . . . . . . . . .

vi

Page
44
46
48
54
56
58
60

62

INTRODUCTION

The purpose of this paper was to determine if judg-
ments of emotional health could be measured in such a way
that ratio scale values on an unidimensional scale could be
assigned to stimuli judged, and if the method of judging
used was reliable for both inter- and intra-judge comparisons.

AS Thorne (1961) points out. judgments of a clinical
nature have traditionally been thought of as being based in
intuition or other personal, subjective factors. This View
has fostered the belief that clinical judgment could not
be systematically and rigorously investigated due to its
incongruity with objective measures and methods.

Clinical judgment. or rather, judgment in a clinical
situation of clinical material, may however, be thought
of as not necessarily qualitatively different from other
judgments. As Johnson (1955) describes it. judgment
is the decisive or end product of an intellectual problem
solving activity which has the function of evaluating or
settling an uncertain state of affairs. From such a
point of View there seems to be little to indicate that
clinical judgment differs from any other sort of judgment
except in terms of the type and complexity of the material

being judged.

As HDnt and Jones (1962. p. 34) says of the

comparison of psychophysical and clinical judgments:

They are merely the opposite poles of a rough

continuum. a quantitative continuum marked by

the clarity and specificity with which the

stimuli are defined. by the degree to which

the judgmental setting is standardized through

careful control of the known pertinent variables

and the elimination of extraneous cues, and by the

provision of uniform modes of reporting or response

that lend themselves to convenient mathematical

treatment.

Research in the area of clinical judgment has not

reached the refined point of psychophysical judgment.
One difficulty has been the lack of any method for measuring
clinical judgment in an exact manner. The importance of
this paper seems to lie in the fact that it introduces a
method for obtaining a refined measurement of judgments of
clinical material which results in a ratio scale value
so that differences between the scale values can be inter-

preted as reflecting actual differences between the stimuli

measured.

REVIEW OF THE LITERATURE

A review of the literature on clinical judgment
indicates that a major portion of the research in the area
has been concerned with showing how the reliability and
validity of judgments are effected by certain variables
associated with either the materials or the judges.

The results of such research have been contradic—
tory and are confusing.‘ It is suggested that much of this
confusion is due to the fact that a refined and accurate
method of measuring clinical judgment does not yet exist.

In most cases, clinical judgments have been ex—
pressed through either ranking or rating techniques which,
in turn. have been analyzed by the use of correlational
methods. One exception to this is found in a study done
by Albee and Hamlin (1949) where Guilford's (1928. 1931)
method of paired comparison preferences was used to obtain
Scale values for clinical judgments which were then tested
for reliability by comparison to ranked orders of adjust-
ment as made by clinicians. While this is a more sophisti-
cated method of measurement than is found in most studies,
its use can be criticized on the basis that the number of
judges used by Albee and Hamlin was less than Guilford's

method requires to produce reliable results.

Even though rating and ranking methods, which in
turn, can be studied by correlational methods, suffice in
some studies, it should be remembered that rating and
ranking methods are subject to errors of leniency, central
tendency, and halo effects, as well as anchoring and context
effects, and that correlational methods only serve to show
associational relationships. MOreover, it is questionable
whether the assumptions underlying correlation coefficients
are met by such methods let along the assumptions underlying
the analysis of variance, "t" tests, and other common power-
ful statistical tools. It would seem that more exact findings
would result if a better method of measuring clinical judg—
ment could be devised and used.

This review of the literature will deal with those
studies which have shown clinical judgment to be affected
by certain variables such as judges' experience and use
of the materials, as well as kinds and amounts of materials.
This review will also deal with those studies which have
specifically manipulated stimulus properties in order to
show an affect on clinical judgment. In order to reduce
confusion the research has been divided into sections and

will be reported on under those sections.

Experience of Judges

Using schizophrenic responses to vocabulary test
items from an intelligence scale as material in a study

designed to compare the judgments of experienced clinicians

with those of inexperienced judges, Arnhoff (1954) found
that the reliability of the judgments, as expressed on a
rating scale, decreased with increases in experience, so
that experienced clinicians actually produced less reliable
judgments than did the naive or inexperienced judges.

In a follow-up on Arnhoff's study, Hnnt, Jones, and
Hunt (1957), using a set of improved instructions found
that while there was no significant difference between
the mean reliability of judgments made by experienced
judges and naive undergraduates, there was a significantly
smaller degree of variance in the judgments of the experienced
clinicians which indicated that reliability, defined as
inter-judge agreement, was greater for the experienced
judges.

In further investigations on the reliability of
experienced and inexperienced judges using rating scales,
it has been found that while both experienced and naive
judges can make reliable judgments of clinical material
(Luft, 1950; Bialick and Hamlin, 1954; Weitman, 1962;
and Allison, Korner and Zwanziger,l964), naive judges
tend to have difficulty in making judgments which require
finer discriminations (HUnt and Jones, 1958a; Hunt and
Jones, 1958b).

While it is difficult to account for the ability
of naive judges to make as reliable judgments as do experienced
clinicians, one explanation could be found in terms of

there being various levels of ability among judges in the

experienced groups. Even though Grigg (1958) found no
relation between various levels of experience and varying
levels of reliability, there is other evidence which does
support this notion.

HUnt, Arnhoff, and Cotton (1954) investigating the
individual reliability coefficients of experienced judges,
using rating scales, found a range of from +.02 to +.93.
The results of their study agree with those of a study done
by Phelan (1965) who used a matching task to make inter-
judge comparisons among experienced clinicians for relia-
bility. Phelan found that while all interjudge comparisons
in his study were fairly reliable, there was a wide range
of reliability. Further evidence for there being varying
levels of ability among experienced clinical jUdges comes
from research done by HOlsopple and Phelan (1954), Phelan
(1960, 1964), and Gunderson (1965).

From these studies it could be concluded that
experience in and of itself does not result in increases
in ability to make reliable clinical judgments and that
in Spite of some judges having high ability in the experienced
groups, the low ability of some judges operates to equate
the reliability of the experienced judges with that of a

naive group.

Use of Materials

 

While experienced judges may vary in terms of the

amount of ability in making judgments, another factor which

seems to be involved in clinical judgments is the way the
individual clinician uses the materials.

Raines and Rohrer (1955) found that while experienced
judges described personality traits of subjects in a way
which agreed with external criteria, the judges themselves
differed as to what they felt were important traits.’

The authors concluded that these differences were due to
personal factors among the judges which resulted in
selective sensitivity to particular elements in the material.

Further evidence for variation among clinicians in
the use of materials comes from Golfarb (1959), who found
that diagnostic judgments varied with individual clinicians
and from Grosz and Grossman (1964) who found significant
differences among clinicians in the reporting of anamnestic
data which were emotionally charged, as well as from

Mehlman (1952) and Pasamanek (1959).

Types of Material

 

It would seem that, to some extent, judgment should
be related to the types of material used in the judgment
task. Several studies have investigated the reliability
of clinical judgment as it is related to various materials.

While Soskin (1959) found no difference in the
reliability of groups of judges using either objective test
data, projective test protocols, observations of role play—
ing situations or biographical data, used either alone or
in succession, Sines (1959) found that the use of biographi-

cal data added accuracy to judgments made only on the basis

of test data.

Kostlan (1954) also studied the effect of the kinds
of materials used in clinical judgments by varying kinds
of information given to clinical judges and found signifi—
cant differences as information was varied. In particular,
his study showed that predictions from social histories
alone were as reliable as predictions from a combination
of TAT, MMPI, and Rorschach protocols.

Little and Shneidman (1959) investigated congruences
between personality descriptions made by clinicians on the
basis of different sources of information such as anamnestic
data, MAPS, TAT, Rorschach and MMPI protocols, and found
that reliability, defined as agreement between judges,
was greater when judgments were based on anamnestic data
than when based on any other source of material.

Further evidence that test data alone do not lend
themselves to accurate clinical judgments comes from the
work of Mancuso (1961) and HOrwitz (1962).

Such findings as these seem to indicate that test
data do not form an adequate basis from which to make
reliable clinical judgments. This is important since
tests are Often used in actual clinical practice as a
basis for judging personality dynamics. We shall have
occasion to challenge such conclusions about the adequacy
of test data, for, when these resUlts are thought of not
in terms of types of material, but rather in terms of amount
of material, a new and critical variable seems to be more

important.

Amount of Material

Hamlin (1954) reviewed ten studies of clinical
judgment which had used projective tests. Five of the
studies had shown positive results and five had shown
negative results in terms of the reliability and validity
of the judgments.

In comparing the amounts of material used, Hamlin
hypothesized that when global or atomistic units of
information were used, the effect was to produce negative
findings. He concluded that it was not the type, but the
amount of material used which was important, and suggested
that the optimal amount of material to be used was one which
was large enough to allow the judges to formulate patterns
of the subject's personality, but small enough that the
judges were not overwhelmed by the material.

HUnt and Walker (1962) found that valid and reliable
judgments could be made from vocabulary and comprehension
scales of intelligence tests using a global approach.

This would seem to contradict Hamlin's hypothesis except
that the reliability of the global approach used in this
study may have been due to the limited scope and homogeneity
of the stimulus materials.

Jones (1959) investigated the reliability and validity
of judgments made from individual intelligence test items
as well as from global appraisals of the test protocols,

and found that increased amounts of material lowered the

10

reliability of both the experienced and inexperienced
judges, but did not effect validity. In contrast to

Jones, Levine (1954) has reported that validity is decreased
by increases in amounts of material, but not reliability.

Powers and Hamlin (1957) found that judges who made
reliable and valid judgments used several items of infor-
mation more frequently than they used either one item or
all items. Supporting evidence for this finding has been
offered by Martin (1958) and Lee and Tucker (1962).

Miller and Bieri (1963) using an information theory
approach, studied the channel capacity of clinicians by
varying the amounts and the types of information given to
judges. Their study showed that about one bit of infor-
mation was all that could be handled reliably by judges,
with some variations due to the type of information and
type of judgment involved. One may question the findings
of this study on the basis of whether or not these results
would generalize to types of material other than those
used in the study.

In investigating the use of the total Rorschach
protocol, Grant, Ives and Ranzoni (1952) found that relia-
bility of judgments was low when based on the total protocol.

Cummings (1954) found however, that reliability could
be achieved when only one Rorschach card was used. Newton
(1954), in contrast to both Grant et_§1,, and Cummings, found
that reliable judgments could be made using total Rorschach

protocols, but concluded that these results were obtained

11

only because judges were allowed extensive time in which
to analyze the protocols.

Thus, either limiting the amount of material or
giving the judges sufficient time to assimilate the infor-
mation contained in large amounts of material resulted in
increased reliability. It could be concluded from these
studies that the amount of material which is used in research
on clinical judgment does affect the reliability of the
judgments and that in making judgments of clinical material,
there is a limited amount of information which can reliably

be handled at a given time.

Stimulus Variables

Only a few studies have attempted to demonstrate
the affect of the characteristics of the stimulus materials
or methods of presentation on the judgment process itself.

Campbell, Hunt, and Lewis (1957) studied context
effects in judgments of adjustment using rating scales,
by varying the context in which stimuli were presented
and found thatassimilation and contrast effects were pro—
duced and caused distortions of the judgments.

Jones (1957) produced context effects in judgments
about Severity of schizophrenia by presenting a limited
range of stimuli, but allowing judgments to made on a full
range of pathology.

Context effects have also been shown in the studies

of Levy (1960) and King, Ehrman, and Johnson (1952).

12

Jackson (1963) studied the affects of frequency,
extremeness, and order of presentation on clinical judgments
and found that extremeness of conflict material was more
important than frequency of conflict material in effecting
clinical judgments.

Miller and Bieri (1963) in support of Jackson,
found that more reliable judgments expressed on rating
scales, were made when stimuli were in the extreme ranges
of pathology, and that as stimuli decreased in extremeness,
so did reliability decrease.

Hnnt, Schwartz, and Walker (1965) utilized the
results of ratings performed in other studies Of HUnt, and
found that stimuli rated as extreme in pathology showed
smaller deviations and concluded that reliability for
these judgments, defined as agreement among judges, was
higher than for other stimuli judged less severe.

In Jackson‘s study mentioned above it was found
that judgments of adjustment made from test protocols
were affected more by recency of exposure than by primacy.
Sines (1959) found that interviews added more to the total
reliability ofjudgments made on the basis of test protocols
when judges were exposed to interview material before test
data, indicating that primacy effects were greater than
recency.

Miller and Campbell (1959) have supplied a clue to
the resolution of the conflict over primacy and recency

by their finding that neither recency nor primacy effects

13

are constant during clinical judgment, but depend upon the
time at which their measure is taken.

While Arnhoff (1954) was not able to Show anchor
stimuli caused distortions in clinical judgments, Block
(1964) in analyzing his study, found that the judges'
personal frames of reference intruded on judgments and
exerted a strong anchor effect. Block also noted shifts
in frames of reference with changes in the context of the
stimuli which is similar to the findings of Soskin (1954).

Block (1962) has also shown that response sets
may affect clinical judgments. He devised fictitious test
results, and found that deceived clinicians would write
clinical descriptions of fictitious patients based on
these contrived test results. He concluded that clinical
training consists more of indoctrination than of training
in the ability to think critically. In contrast to these
findings and opinions, Gross (1961) found that response
sets had little, if any, affect on clinical judgments. His
study showed highly significant stimulus affects in a
task requiring the judging of subjects by judges, but
little affect due to response sets.

Regarding Block’s study, one might legitimately
ask what should be expected when clinicians are presented
with clinical material and asked to make clinical judg-
ments about the material.

The method of presenting the stimuli to be judged
has been shown to have little or no affect on the reliability

of clinical judgments. Giedt (1955) and Borke and FiSke

14

(1957) were unable to demonstrate any affect when clinical
material to be judged was presented through direct inter—
view, seeing and hearing interviews, hearing interviews, or
reading interviews. Luft (1951) compared the effectiveness
of listening with that of reading clinical material and
found that judgments in the form of making predictions to
responses on objective tests were equal for groups who heard
or read the material to be judged, but that prediction of
reSponses to projective tests were more accurately made

by listeners than by readers.

In short, clinical judgments are complex and are
related to many variables, but the factors affecting
them can be investigated.

Many of the findings of research in this area are
confusing. It is suggested that this confusion results not
so much from the fault of poor research, as it does from
the difficulty of dealing with such a complex subject.

It would seem that the complexity of the material
demands more rigorous investigation if the subtle factors
involved are to be brought to light. One requirement of
rigorous research is an exact method of measurement.

It is this problem to which this paper is directed.

THE PROBLEM

The purpose of this paper was to investigate a
method for measuring judgments of emotional health. The
clinical concept of adjustment would seem to be multifaceted
yet it is useful to think of adjustment as a single
dimension for many purposes. A factor analysis of 14
criteria of adjustment in the Menninger Psychotherapy
Project (Luborsky, 1962) Showed that 60% of the variance
was accounted for by the first principle component, which
suggests that much of the variance can be accounted for by
a single dimension.

It is also suggested that the data may be more uni-
" dimensional than factor analysis suggests because symptom
substitution and interchangeability can only be taken into
account by a human judge. Therefore, if an appropriate
quantitative technique for mapping clinical judgments onto
a numerical scale can be developed, it might be found that
a single dimension accounts for a surprisingly large amount
of the variance. We Shall attempt to find out if this is

SO.

15

THE METHOD

Sets of 20 TAT stories were obtained from each of
six male subjects; two "normal" college students, two
college students receiving psychotherapy on an outpatient
basis, and two hospitalized schizophrenics. As far as
possible, the subjects were equated for age, education,
number of siblings in the family and father and mother's
occupation. Appendix A lists these variables for the
subjects.

Administration of the TAT cards was carried out in
standard fashion except that the complete set of 20 cards
was administered to a subject at one setting. One
examiner was used for all subjects. All subjects were
shown the same cards, but not in the same order due to
examiner error. Order of presentation is shown in the
Appendix.

Stories told by the subjects were first recorded
on tape and then transcribed verbatim so that as little
distortion or fill-in by the examiner as possible would
occur.

The six sets were identified by letter and presented
with information regarding the subjects' age, sex and

number of siblings to two advanced clinical graduate students

16

17

for judgment. Both judges were experienced in the interpre-

tation and evaluation of TAT protocols as well as protocols

from other projective devices and, in addition, were

functioning as psychotherapists in both group and individual

cases .

The attribute to be judged was the emotional health

of each subject as compared to every other subject. For

the purposes of this study emotional health was defined as

being comprised of the following:

(a)
(b)
(C)
(d)
(e)

(f)
(g)
(h)
(i)
(J')
(k)

Ability to take care of self

Ability to work

Sexual adjustment

Social adjustment

Absence of hallucinations, bizzarre delusions,
gross distortions of reality, lack of passivity
Degree of freedom from anxiety and depression,
degree of diffuse hostility

Amount of affect, of feelings

Variety and Spontaneity of affect

Satisfaction with life and with self, absence of
deficiency motivation, i.e., making up for lost
love

Achievement of capabilities, mastery of the
environment

Benign rather than malignant affect on others

Indications of emotional health as found in TAT

stories were defined, in addition, as follows:

(a)
(b)

(C)
(d)

(e)
(f)
(9)

Long protocols

Protocols should show more affect, more varied
affect

Less stereotyped and more varied material

An increase in benign fantasies and more helping
parent figures

Better reality testing

Problems should be directly represented

There should be indications of confidence

Task instructions were given to the judges together

as a pair, in both written and verbal form. It was stressed

18

that while they should use the criteria indicated as guides
in forming their judgments, they should rely on their own
subjective, clinical impressions and not judge strictly

on these signs. The judges were requested to complete

a questionnaire regarding the use of the criteria after
finishing the task. This questionnaire as well as the
written part of the instructions has been included in the
Appendix.

At the time of instructions, both judges were given
examples of TAT stories representing both extremes of
adjustment, in order to Show how the criteria of emotional
health could be applied to the materials of this study as
well as to establish examples of pathology and adjustment
as they might appear in TAT protocols. One extreme of
pathology was represented by three TAT‘S taken from
hOSpitalized schizophrenics, while the other extreme was
represented by a TAT taken from Wessman and Ricks' (1966)
study of college students.

The judges were instructed to judge the TAT stories
in pairs, using Estavan's modified method of paired com-
parisons, so that each protocol was compared to each of the
other five protocols. Both judges judged the same pairs
independently of each other. For each pair of stories, the
judges were asked to judge which member of a pair was
healthier, and in comparison to the healthier member, to
judge how healthy the other member was.

Each judge was presented with a sheet of paper on

19

which a 20 centimeter line had been drawn. In expressing
his judgment, the judge was instructed that fOr each pair
judged, the healthier member should be represented by the
entire length of the line, which should be labeled accord—
ingly. The comparative judgment of the less healthy member
of a pair to the more healthy member of the same pair was
expressed by placing a point on the 20 cm. line which indi—
cated how much health the less healthy member had, using
the emotional health of the healthier member as a standard.
This method of comparison results in a ratio judgment.

Comparison of a stimulus with itself, such as (A,A)
was not used. Recognizing that reciprocal comparisons such
as (B,A) and (A,B) result from the same judgment, there
were Eiglil, or 15 independent comparisons.

The order of comparison was randomized as is shown
in the Appendix, and was carried out so that the protocol
listed first in any pair was read before the second protocol.

Since this paper utilized Estavan's modified method
of paired comparisons to obtain scale values for the
judgments, it may at this point, be useful to describe
the rationale for deriving these scale values.

Estavan's Mbdified Method of
Paired Comparisons

For each pair of stimuli judged, the point on the
20 cm. line was measured. The resulting length was divided
by 20 to produce a proportion.

The judgments represent the ratio of one stimulus to

20

another, i.e., the ratio of stimulus B to stimulus A, or
B divided by A.

Estavan has found that reliable judgments only
occur if the greater stimulus is equated with the fixed
length of the line and the lesser stimulus judged as a pro—
portion of the line. When the lesser stimulus was equated
with the fixed length of the line and the line extended to
indicate the magnitude of the greater stimulus, Estavan
found that the judgments were unreliable. Hence, of the
judgments, B divided by A and A divided by B, only one can
be observed, that one in which the greater stimulus is the
denominator. The other judgment can be determined only
numerically by taking the reciprocal of the observed fraction.

If we take a hypothetical problem involving three
stimuli, A, B, and C, the observations may be arranged in
a 3 x 3 matrix (or n x n matrix, where n equals the number
of stimuli) as is shown below, with there being a row and
a column for each stimulus. The entries in the cells of
the matrix are the column stimulus divided by the row

stimulus.

A B c
A B Q
A A A A
A B c
B B B B

0
(My
(Mm
0k)

21

The diagonal entries are by definition equal to one.
Half of the off-diagonal entries will be determined by the
observations. The other off—diagonal entries are determined
by taking the numerical reciprocals as explained above.

Thus, in the comparison of the pair (A,B), A over
B will be observed where B is the greater stimulus. B
over A is determined from its reciprocal.

It is obvious if we have compared A with B and B
with C, that one ought to be able to predict what one would
observe if one compared A with C. Such redundancy permits
us to observe how well the scaling model fits the data. We
shall describe the systematic procedure for doing this
below. I

To derive the scale values for the observed data,
each entry in the matrix of observations is transformed to

its logarithm of the base 10 as is shown below.

A B C
A Log 13:- Log g- Log g-
B Log % Log g- Log 91;
C Log ‘%- Log ‘g' Log ‘%

The above is equivalent to the following:

22

A B C
A Log A Log B Log C
-Log A -Log A -Log C
Log A Log B Log C
B -Log B —Log B —Log B
C Log A Log B Log C
. -Log C —Log C -Log C

The resulting matrix of differences is at this point, simi-
lar to the matrix of differences in Thurstone's Case V
Method.
Mosteller (1951) has shown that a least squares
solution for the scale values derived from such a matrix
of differences is extraordinarily simple. (In our case, it
is the sum of squares of errors on the logarithm scale which
is being minimized. Although the error term might be
defined in some other fashion, this leads to the simplest
computational procedure.) One need only sum the columns
which yields the following totals:
3 log A - log A - log B - log C
3 log B - log A - log B - log C
3 log C - log A - log B — log C.
If we divide by n, the number of stimuli, we get;
log A -‘L
IogB-E

log C —‘L,

23

'where L is the mean of the logs of the scale values. If
we set L equal to zero (which means that we have chosen the
geometric mean of the scale values as our unit of measure—
ment, i.e., 1), then these column averages are our best
estimate of the logs of the scale values. Transforming
to anti—logs gives us the scale values themselves.
Obviously, any set of judgments, no matter how
meaningless could be entered into the matrices and used to
derive scale values. One needs some way of evaluating
whether the data make any sense, that is, whether the
scaling model fits the empirical observations. We are
presuming ratings of emotional health can be summarized by
a one dimensional ratio scale.
If one has compared stimulus A with stimulus B,
and has compared stimulus B with stimulus C, i.e., has a
rating of A divided by B and B divided by C, then one can
predict what one ought to observe when one compares A with
C. If the prediction is correct, then the scale values
have summarized the data. If the prediction is inaccurate,
the scale values have not summarized the data and the scaling
model is inappropriate to the data under consideration.
Gulliksen and Tukey (1958) have presented a procedure
for performing such an evaluation over the whole matrix
of data. They provide a procedure for dividing the total
variance (T) of the empirical observations into two

components; variance accounted for by the scale values,

24

and discrepancy variance (D), variance not accounted for
by the Scale values. They then define the following index

of reliability, RSS, as:

T D

«E
which summarizes the percent of the variance of the obser—
vations accounted for by the scaling procedure. Included
in the discrepancy variance are all errors of observation,
unreliability of judgment, lack of unidimensionality, and
failure of any of the assumptions of the scaling model.
Therefore, RSS measures the degree to which the scale values
reliably summarizes the data and hence, the degree to which
the scaling model is appropriate and valuable.

Inasmuch as Gulliksen and Tukey derived their
index for difference rather than ratio Observations, i.e.,
Thurstone's Case V model, RSS can be computed most straight—
forwardly from the logs of the observations and logs of
the soale values.

For Estavanfs procedure, RSS measures intra-judge
reliability or scalability. To measure inter-judge relia-
bility, one need only compute product moment coefficients

as between any other two measuring instruments that yield

measurements on a ratio scale.

Analysis of the Data

 

In the discussion above, a 3 x 3 matrix was used
as an example. In the analysis of the data, a 6 x 6 matrix

was necessary.

25

For each judge, a 6 x 6 matrix was determined as
above. The loglO of each cell entry was determined to form
a matrix of the logs of the observations. The columns of
this matrix were summed and divided by 6 to determine the
logs of the scale values. By converting these values to
their anti-logs, the scale values for each stimulus were
arrived at.

, a new 6 x 6 matrix of

In order to determine RSS

the theoretical observations was determined by subtracting
the log10 of the row stimulus from the log10 of the column
stimulus. The entries in this matrix represent what the
logs of the observed values would have been if the scale
values determined were the true scale values and if there
were no error variance. Entries in this matrix were
subtracted from corresponding entries in the matrix of
logarithms of observed values to form the matrix of

errors or discrepancies. The entries in half the matrix
of errors (either those above the diagonal or equivalently,
those below the diagonal) were then squared and added to

determine the discrepancy sum of squares. When this is

(n-l)(n-2)

divided by the degrees of freedom of error, 2

the result is the discrepancy variance (D).

The sum of the squares of the entries in half the
matrix of the logs of observations (either the entries above
the diagonal, or equivalently, those below the diagonal)
determines the total sum of squares. When this is divided
by the total degrees of freedom, 2%E:L)' or 15, the result

is the Total Variance, or (T).

26

R , the intra-judge reliability, as described above,

SS
is E—%—2. For 6 stimuli, RSS has upper and lower bounds

of +1.00 and -.50 respectively.

RESULTS

The scale values obtained for each set of TAT
stories are shown in Table l for both Judge 1 and Judge 2,
and represent the amount of emotional health each subject

was judged to have.

Table l. The scale values for the stimuli judged by two

 

 

 

judges.
Judges
Stimuli . l 2

A 3.0130 1.8150
B 1.0570 .8823
C 1.6730 1.4580
D .7171 .4569
E .2327 .6696
F 1.1250 1.4000

 

Intra-judge reliability, R , for the degree of

ss
internal consistency for each judge, was found to be .79

for Judge 1 and .93 for Judge 2. The judges agreed in the
designation of the healthier member of a pair in 14 out of

a possible 15 cases. The pairs picked are shown in the

Appendix.

27

28

Inter-judge reliability calculated through a
Pearson r correlation coefficient was found to be .87,
significant at the .05 level.

Measures of validity were determined only indirectly
since this was not a major concern of this paper. The
scale values derived for the two "normal" subjects, the
two subjects receiving psychotherapy and the two hOSpitalized

schizophrenics are shown in Table 2.

Table 2. Scale values for subjects by class.

 

 

Judges

Subjects 1 2
NOrmals

A 3.0130 1.8150

D .7171 .4569
Psychotherapy

C 1.6730 1.4580

F 1.1250 1.4000
Hospitalized

B 1.0570 .8823

E .2327 .6696

 

Inspection of Table 2 shows that, with one exception,
the highest scale values began with the normal subjects,
decreased through the subjects receiving psychotherapy, to
the hOSpitalized subjects.

The one exception, Subject D, a "normal" subject,

received the second lowest rating of Judge 1 and the lowest

29

rating of Judge 2. In 3 out of 4 comparisons where Subject
D was a member of the pair, the judges agreed in picking
Subject D as being the lesser adjusted member of the pair.
In the case of the one disagreement, Subject D was picked

as being healthier than a hospitalized subject. Independent
analysis of the TAT stories of this subject by two judges
nOt used in the study Showed that even though this subject
was functioning outside an institution, he was severely
maladjusted. Attempts to obtain further diagnostic material
on this subject were not successful.

The questionnaire given to the judges to be completed
after the task indicated that they used the criteria out-
lined in the instructions more than they used their own
subjective criteria, but that they felt the criteria agreed
with their own conception of emotional health. One judge
repOrted more use of the TAT criteria than the other criteria
while one judge reported using both equally. Both judges
stated that the criteria helped them in making their
judgments. Neither felt that judging adjustments, which
is somewhat contrary to the usual type of judgment involved
in clinical judgment Studies, interfered with their
judgment although both felt that this emphasis was dif—

ferent.

DISCUSSION

The Technique

Intra-judge reliability, Rss' was .73 and .93, which
indicates that emotional health was reliably scaled on a
unidimensional ratio scale, since the discrepancy variance
includes failures of the theoretical model such as departures
from unidimensionality or lack of ratio scale properties,
as well as errors of judgment, fatigue and carelessness.

It is clear that this technique of scaling makes clinical
judgment a quantitative measuring device as least as
reliable as most objective tests. Moreover, the cor-
relation between the judges of .87 is certainly high enough
to consider the two judges parallel forms of the same test.

Even though the method used in this study bears some
similarity to the paired comparison technique of Thurstone
(1926, 1927a, 1927b), it has at least two advantages which
seem to make it more desirable as a method to be used in
clinical judgment studies. Thurstone's method, as Guilford
points out (1929, 1931, 1954), requires a great deal of
computation. Derivation of scale values by the technique
used in this study requires much less computation. Mbre-
over, Thurstone's method, as well as Guilford's modification

of it (1928, 1931), requires that stimuli be judged many

30

31

times, either by one judge, judging many times, or by many
judges, each judging one time.

Disregarding the use of many judgments produced
by one judge, which are frequently found to be in error,
‘the use of many judges presents a hinderance to the use of
either Thurstone's or Guilford's method with clinical
material. Finding large numbers of qualified judges to
participate at one time in a research project is almoSt
impossible. The method used in this study overcomes this
difficulty since scale values can be obtained from single
judgments of paired stimuli by as few as one judge.

Naturally, in practice, more than one judge would be used.

The Attribute

Mbst often clinical training consists of focusing
on pathology so that the clinician is set to see signs of
psychopathology and to make his judgments accordingly.
Insofar as judgments are made on the basis of Signs, there
is the risk of relying on indications which have been
shown to result in judgments of low reliability (Elikins,
1958).'

The use of psychopathology in clinical judgment
studies, as an attribute to be judged, may not be optimal
since it is impossible to establish a base line of illness.
The use of emotional health as the attribute to be judged

avoids this difficulty since it is difficult to conceive

32

of anyone as being completely without health. Thus,
emotional health has at least a conceivable point of
origin.

While it is impossible to say what bounds or limits
there are to emotional health, it is fairly,safe to assume
that no one ever achieves his fullest potential. MOreover,
while any one aspect of emotional health may be taken as
indicating the presence or absence of the attribute, only a
human judge is able to evaluate simultaneously all the
interrelationships of the various components and produce a

single judgment.

Anchors and the Amount of Material

Four naive judges were used in an exploratory study
where the task was to judge emotional disturbance using
TAT stories in pairs. Scale values and reliability coefficients
for the exploratory study are shown in the Appendix. Scale
values for the naive judges had a spread of 5.1 units, while
scale values for the two experienced judges had a spread of
2.8 units. Whether or not the example TAT stores repre-
senting both extremes of pathology which were used in
instructing the experienced judges, but not the naive judges,
acted as anchors for the experienced judges, is a matter of
speculation since no specific test for such effects were
included in this study. HOwever, as HUnt (1941) points
out, the use of anchors serves to extend the rating scale

and results in a greater tendency for judgments to be nearer

33

the middle of the scale. It is possible that anchor effects
were operating in the judgments of the experienced judges
and resulted in less spread of the scale values. If this
were so, it would indicate the importance of having

supplied anchors in clinical judgment studies for both

ends of the continuum rather than leaving it up to the
judges to develop their own anchors as is the case when
anchors are not supplied.

The use of 20 TAT stories in each of the six sets
represents a large amount of material for each judge to
process. The finding that such reliable judgments could
be made by experienced judges contradicts the findings of
several studies, but may be explained by the finding of
Newton (1954) that reliable judgments could be made using
large amounts of material if the judges were allowed time
for exhaustive analysis of the material. The judges in
this study made their judgments over a two-week period

of time at their leisure.

The Use of the Method

One advantage to ratio scale values is, as Torgerson
(1958) points out, that the difference between the ratios
of the scale values can be interpreted as reflecting the
differences of the stimuli, as well as transitivity, so
that if A is judged greater than B, and B is judged greater
than C, then A can be assumed to be greater than C and the

differences between the scale values can be interpreted as

34

reflecting the differences between the properties being
judged.

Given any complex entity composed of inter-related,
identifiable aspects, it would seem to be possible to
use this measurement technique in a series of judgment
studies where each identifiable aspect was isolated and
used as a single criteria for the attribute being judged.
Thus scale values derived for the attribute being judged
on the basis of different aspects of the attribute could
be compared and the relative contribution of each to the
formation of judgments about the attribute could be
evaluated according to the property of ratio scale values
mentioned above. That is, if judgments of A, using aspect
Z, resulted in scale values of 2.00, and judgments of A
using aSpect Y, resulted in scale values of 1.00, it would
be reasonable to assume that judgments of A were
affected more by Z than Y when both were used as criteria.
Obviously, the reliability of judgments made on the
basis of each aspect could be determined in order to see
which aspect afforded the greater reliability.

While this would, in effect, result in a "factoring"
out of the dimensions along which judgments are made, such
a "factoring" would be more closely tied to the subjective
use of the dimensions than would seem to result when
formal methods of factor analysis are used. In this way,
clinical judgment research would come closer to studying

the actual process of forming judgments than has resulted

35

in research which has relied primarily on correlational
analysis.

Since emotional health is, as Johoda (1959), Scott
(1958), and Epstein (1958) point out, comprised of many
components, it can be regarded as a multidimensional
attribute. Since this multidimensional concept was
scaled on a unidimensional scale, it is likely that other
attributes as complex as emotional health may also be
scaled, so that it seems feasible to use this measuring
technique to compare whole entities rather as well as parts

of one.

SUMMARY

The purpose of this paper was to determine if
judgments of emotional health could be measured using
Estavan's modified paired comparison method, and scale
values derived for the stimuli judged.

TAT stories were judged in 21%ZLL pairs by two
experienced clinical graduate students for emotional health.
Following Estavan‘s method, scale values were derived for
each stimulus judged. Inter—judge reliability was found
to be .87. Intra-judge reliability was found to be .79
for one judge and .93 for the others.

The method of developing scale values as used in
this study bears some strong resemblance to Thurstonefs

Case V method, but has definite advantages over the Case

V method.

36

BIBLIOGRAPHY

Albee, G. W. and Hamlin, R. M. An investigation of the
reliability and validity of judgments of adjustment
inferred from drawings. J. clin. Psychol., 1949,
5, 389-392. '

 

Allison, Roger, Jr., Korner, I. N., and Zwanziger, M. D.
Clinical judgments and objective measures. ‘J.
Psychol., 1964. 57, 451-456.

Arnhoff, F. N. Some factors influencing the unreliability
of clinical judgment. J. clin. Psychol., 1954,
10, 272-275.

 

Bialick, J. and Hamlin, R. M. The clinician as judge:
Details of procedure in judging projective
material. J. consult. Psychol., 1954, 230-242.

 

~BloCk, W. E. A study of meaning set in the judgment of
clinical test data. J. clin. Psychol., 1962, 18,
511-512.

 

Block, W. E. Adaptation effects in clinical judgment of
projective test data. J. clin. Psychol., 1964,
20, 448-454.

 

Borke, H. and Fiske, D. W. Factors influencing the
prediction of behavior from a diagnostic interview.
J. consult. Psychol., 1957, 21, 78-80.

Campbell, D. T., Hunt, W. A., and Lewis, N. A. The effects
of assimilation and contrast in judgments of clinical
materials. Amer. J. Psychol., 1957, 70, 347-360.

Cummings, S. T. The clinician as judge: Judgment of
adjustment from Rorschach single card performance.
J. consult Psychol., 1954, 18, 243-247.

 

Elkins, E. Diagnostic validity of the Ames “danger signals".
J. consult Psychol., 1958, 22, 281—287.

Epstein, N. B. Concepts of normality or evaluation of
emotional health. Behv. Sci., 1958, 3, 335-343.

 

37

38

Giedt, F. H. Comparison of visual, content and auditory
cues in interviewing. J. consult. Psychol., 1955,
19, 407-416.

Goldfarb, A. Reliability of diagnostic judgments made by
psychologists. J. clin. Psychol., 1959, 15, 392-396.

Grant, M. Q., Ives, V., and Ranzoni, J. H. Reliability
and validity of judged ratings of adjustment on the
Rorschach. Psychol. Monogr., 1952, 66, No. 2
(Whole Number 334).

Griggs, A. E. Experience of clinicians and speech character-
istics and statements of clients as variables in
clinical judgments. J. consult. Psychol., 1958,

22, 315-319.

Gross, C. F. Intra-judge consistency in ratings of hetero-
geneous persons. J. abnorm. soc. Psychol., 1961,
62, 605-610.

Grosz, H., and Grossman, K. The source of Observer variation
and bias in clinical judgment. I: The item of
psychiatric history. Journal of Nerv. and Ment.
‘Qi§., 1964, 138,.105-113.

Guilford, J. P. The method of paired comparisons as a
psychometric method. Psychol. Rev., 1928, 35, 494-506.

Guilford, J. P. Some empirical tests of the method of
paired comparisons. J. gen. Psychol., 1931, 5,
64-76.

Guilford, J. P. Psychometric Methods. New York: McGraw-
Hill, 1954.

Gulliksen, H. and Tukey, J. W. Reliability for the law of
comparative judgment. Psychometrika, 1958, 23, 95-110.

Gunderson, E. K. E. Determinants of reliability in
personality ratings. J. clin. Psychol., 1965,
21, 164-169.

Hamlin, R. M. The clinician as judge: Implications of a
series of studies. J. consult. Psychol., 1954,
18, 233-238.

 

Helsopple, J. 0., and Phelan, J. G. The skills of
clinicians in analysis of projective tests. J,
clin. Psychol., 1954, 10, 307-320.

Herowitz, M. J. A study of clinicians judgments from pro-
jective test protocols. J. consult. Psychol., 1962,
26, 251-256.

39

Hunt, W. A. Anchoring effects in judgments. Amer. J.
Psychol., 1941, 44, 395-403.

Hunt, W. A., Arnhoff, F., and Cotton, J. Reliability, chance
and fantasy in interjudge agreement among clinicians.
J. clin. Psychol., 1954, 10, 296-299.

 

HUnt, W. A. and Jones, N. F. Clinical judgments of some
aspects of schizophrenic thinking. J. clin. Psychol.,
1958a, 14, 235-239.

 

Hunt, W. A. and Jones, N. F. The reliability of clinical
judgments of asocial tendency. J. clin. Psychol.,
1958b, 14, 233-235.

Hunt, W. A. and Jones, N. F. The experimental investigation
of clinical judgment. In A. J. Bachrach (Ed.),
Experimental foundations of clinical psychology.

New York: Basic Books, 1962.

HUnt, W. A., Jones, N. F., and Hunt, E. B. Reliability of
clinical judgments as a function of clinical experience.
J. clin. Psychol., 1957, 13, 377-378.

Hunt, W. A., Schwartz, M. L., and Walker, R. E. Reliability
of clinical judgments as a function of range of
pathology. J. abnorm. Psychol., 1965, 70, 32-33.

HDnt, W. A. and walker, R. E. A comparison of global and
specific clinical judgments across several diagnostic
categories. J. clin. Psychol., 1962, 18, 188-194.

Jackson, M. A. The effects offrequency, extremeness,
consistency and order of the stimulus on clinical
judgments. Diser. Abst., 1963, 24, 1244.

 

Jahoda, M. Current conceptions of positive mental health.
New York: Basic Books, 1958.

Johnson, D. M. Theypsychology of thought and judgment.
' New YOrk: Harper and Brothers, 1955.

.Jones, N. F. Context effect in judgment as a function of
experience. J. clin. Psychol., 1957, 13, 379-382.

 

.Jones, N. F. The validity of clinical judgments of schizo-
phrenic pathology on verbal responses to intelligence
test items. J. clinc. Psychol., 1959, 396-400.

Icing, G. F., Ehrmann, J. C. and Johnson, D. M; Experimental
analysis of the reliability of observations of
social behavior. J. soc. Psychol., 1952, 35, 151-160.

4o

Kostlan, A. A method for the empirical study of psycho-
diagnosis. J. consult. Psychol., 1954, 18, 83-88.

Lee, J. C. and Tucker, B. An investigation of clinical
judgment: A study in method. J. abnorm. soc.
Psychol., 1962, 64, 272-280.

Levine, H. The influence of fullness of interview on the
reliability, discriminability and the validity of
interview judgments. J. consult. Psychol., 1954,
18, 303-306.

Levy, L. H. Context effects in social perception.
J. abnorm. soc. Psychol., 1960, 61, 295-297.

Little, K. B., and Shneidman, E. S. Congruencies among
interpretations of psychological tests and anamnestic
data. Psychol. Mbnogr., 1959, 73, No. 42 (Whole
No. 476).

Luborsky, L. The patient's personality and psychotherapeutic
change. In Strupp, H. and Luborsky, L. Research
in Psychotherapy. Washington: American Psychological
Assn., 1962. '

 

Luft, J. Implicit hypotheses and clinical prediction.
J. abnorm. soc. Psychol., 1950, 45, 756-760.

Luft, J. Differences in prediction based on hearing vs.
reading verbatim clinical interviews. J. consult.
Psychol., 1951, 15, 115-119.

 

Mancuso, C. J. The role of influences of sociological
variables in clinical judgment. Diser. Abst.,
1961, 21, 2785.

 

Martin, H. T., Jr. The nature of clinical judgments.
Diser. Abst., 1958, 18, 310.

Mehlman, G. The reliability of psychiatric diagnosis.
J. abnorm. soc. Psychol., 1952, 47, 577-578.

Miller, H. and Bieri, J. An informational analysis of
clinical judgment. J. abnorm. soc. Psychol., 1963,
67, 317-325.

Miller, N. and Campbell, D. T. Recency and primacy in
persuasion as a function of the timing of Speeches
and measurement. J. abnorm. soc. Psychol., 1959,
59, 1-9.

41

Mosteller, F. Remarks on the method of paired comparisons:
I. The least squares solution assuming equal standard
deviations and equal correlations. Psychometrika,
1951, 16, 3-9.

 

newton, R. L. The clinician as judge; total Rorschach and
clinical case material. J. consult. Psychol.,
1954, 18, 248-250.

 

Pasamanick, B., Dinitz, S., and Lefton, M. Psychiatric
orientation and its relation to diagnosis and
treatment in a mental hospital.. Amer. J. Psychiat.,
1959, 116, 127-132.

 

Phelan, J. G. The subjective feeling of certainty of
diagnostic judgment of clinical psychologists.
J. clin. Psychol., 1960, 16, 101-104.

Phelan, J. G. iRationale employed by clinical psychologists
in diagnostic judgments. J. clin. Psychol., 1964,
20, 454-458.

Phelan, J. G. Use of matching methods in measuring relia-
bility in individuals. Psychol. Reps., 1965,
16, 490-496.

 

Powers, W. T. and Hamlin, R. M. The validity, basis and
process of clinical judgment using a limited amount
of projective test data. J.,proj. Tech., 1957, 21,
286-293.

Raines, G. N. and Rohrer, J. H. Individual differences in
clinical judgment. Amer. J. Psychiat., 1955, 110,
721-725.

Scott, W. A. Research definitions of mental health and
mental illness. Psychol. Bull., 1958, 55, 29-45.

Sines, L. K. The relative contribution of four kinds of
data to accuracy in personality assessment. ‘J.
consult. Psychol., 1959, 23, 483-492.

Soskin, W. F. Frames of reference in personality assessment.
J. clin. Psychol., 1954, 10, 107-114.

Soskin, W. F. Influence of four types of data on diagnostic
conceptualization in psychological testing. ‘J.
abnorm. soc. Psychol., 1959, 58, 69-78.

Thorne, F. C. Clinical Judgment. Brandon, Vermont: Journal
of Clinical Psychology: 1961.

42

Thurstone, L. L. The method of paired comparisons. .J.
abnorm. soc. Psychol., 1926, 21, 384-400.

 

Thurstone, L. L. A.law of comparative judgment. Psychol.
Rev., 1927a, 34, 273-286.

Thurstone, L. L. Psychophysical analysis. Amer. J. Psychol.,
1927b, 38, 368-389.

Torgerson, W. Theory and Method of Scaling, New York:
Wiley, 1958.

Weitman, M. Some variables related to bias in clinical
judgment. J. clin. Psychol., 1962, 504-506.

 

Wessman, A. E. and Ricks, D. F. Mood and Personality.
New York: Holt, Rinehart and Winston, 1966.

 

APPENDI CES

APPENDIX A

45

 

Table A. Classification and personal data of subjects
giving TAT protocols.
NORMALS SUBJECT A SUBJECT D
Age 19 19
Sex Male Male
Education Sophomore Sophomore
MOther Living Living
Age 45 40
Occupation Office WOrker Hbusewife
Father Living Living
Age 44 47
Occupation Office Manager CPA
Siblings Three None
Sex, Age Male
Occupation 21 College
12 High School
5
HOSPITALIZED SUBJECT B SUBJECT E
Age 20 26
Sex Male Male
Education High School High School
Mether Living Living
Age 39 47
Occupation Housewife Hbusewife
Father Living Living
Age 39 51
Occupation Mechanic Post Office Employee
Siblings Two One
Sex, Age Male Female
Occupation 16 High School Hbusewife
10 Grammar School
Diagnosis Schizophrenic Schizophrenic

Length of Hospitali-
zation

COUNSELING CENTER
Age
Sex
Education
Mother
Age
Occupation
Father
Age
Occupation
Siblings
Sex, Age
Occupations

Three months

SUBJECT C

20

Male

Junior

Living

42

Housewife

Living

40

TV Repair

Two

1 Male, 17, High
School; 1 Female,
11

Four months

SUBJECT F
20

Male
Junior
Living

49
Hbusewife
Living

43
Fireman
One
Female, 21 Office
Work

APPENDIX B

47

 

 

 

 

 

Table B. Order at TAT card presentation to subjects.
Subjects
Normals Counseling Center Hespitalized

A D B E C F

l 1 l l l l

2 2 2 2 2 2
ll 6 BM 4 11 11 11
10 10 6 BM 9 BM 4 9 BM
17 BM 11 5 18 BM 13 MF 17 BM
18 BM 9 BM 9 BM 17 BM 19 12 M
12 M 12 M 7 BM 12 M 7 BM 18 BM
6 BM 18 BM 8 BM 14 17 BM 6 BM
14 17 BM 3 BM 3 BM 18 BM 20

9 BM 4 10 10 9 BM 14

3 BM 13 MF 15 6 BM 12 M 3 BM'
13 MF 5 20 5 l4 5

7 BM 15 19 19 3 BM 4

8 BM 3 BM 18 BM 15 10 15

20 l4 14 20 6 BM 13 MF
15 19 13 MF 7 BM 5 l9

5 20 12 M 8 BM 15 10
19 7 BM 17 BM 4 20 7 BM
4 8 BM 11 13 MF 8 BM 8 BM
16 16 l6 l6 16 16

 

APPENDIX C

49
INSTRUCTIONS TO JUDGES

INTRODUCTION

You are being asked to act as a judge in research
which I am doing for my Master's Thesis. The thesis is
concerned with whether or not clinical judgments can be
quantified and compared over judges. The materials to be
judged are 6 sets of TAT stories, with 20 stories to a set.
The sets are to be judged in pairs. The judgments you will
be making are concerned with the amount of emotional health
one person has when he is compared with another person.

I have outlined below the criteria I wish you to

use in making your judgments as well as a method to use.

CRITERIA

AS you read the TAT stories, keep in mind the
following criteria of emotional health. YOu will find
that the criteria are divided into two sets. One set
describes some components which we believe are involved
in emotional health, while the second set describes indi-
cations or signs of emotional health as it might appear
specifically in TAT stories.

In making your judgments, use both sets of criteria,
but remember that they are not absolute. In the end, rely
upon your own subjective, clinical judgment, and let these

criteria only be guides to that judgment.

50

Components of Emotional Health
Ability to take care of self
Ability to work

Sexual adjustment

Social adjustment

Absence of hallucinations, bizarre delusions,
gross distortion of reality, lack of passivity.

Degree of freedom from anxiety and depression,
degree of diffuse hostility.

Amount of affect, owning of feelings.

Variety and spontaneity of affect

Satisfaction with life and with self, absence of
deficiency motivation, i.e., making up for

lost love

Achievement of capabilities, mastery of environment

Benign rather than malignant effect on others

Indications of Emotional Health in TAT Stories

The protocols Should be longer

There should be more affect, and more varied affect
There should be less stereotyped, and more varied
materials, e.g., the TAT stories should vary more
from card to card indicating an ability to deal

with differing aspects of the world

There Should be more benign fantasies and more
helping parent figures.

There Should be good reality testing
Problems should be directly represented

There will be indications of confidence

51

METHOD

YOu will be judging the protocols in pairs. When
grouped together, there are 36 possible pairs. When you
eliminate pairs because of duplications, such as, (AB, BA),
and (BF, EB), and self-comparisons, such as, (AA) and (BB),
you are left with only 15 pairs. We are concerned with these
15 pairs. I have listed them on the last sheet of these
instructions.

Take one pair at a time, according to the order in
which I have listed them. Read each protocol of each
pair as you judge the pair. The first protocol to be
read is the first one listed in the pair. For example,
of the pair (A,F), read protocol A first, then read F;
of the pair (D,E), read protocol D first, then read E.

After you have completed a pair, judge, according
to the criteria outlined above, which protocol seems to
represent the person who has the most psychological health
of the pair. At the time you are judging, you may wish
to reread parts of one or both protocols. You may do so.
Do not, however, compare them as you are first reading them
through.

Take a sheet of the paper on which I have drawn a
line. Label the sheet with the letter representing which
member of the pair you have judged to be healthier. Let
the line represent the total amount of health the healthier
member has. In comparison to this amount, mark off some

point on the line which indicates how much of this health

52

the second member of the pair has. For example's sake, let's
suppose you are considering the hypothetical pair (Z,X), and
you think that Z is the healthier member of the pair. Label
the sheet Z. Let the line equal the total amount of health

Z has. Suppose you think that in comparison to Z, X has
about half as much health. Place a mark in the middle of

the line. Continue on for each of the other pairs. I

have marked the sheets so that you will be able to identify
easily the pairs as you are marking the Sheets. After you
are finished judging, I would appreciate it if you would

answer the questionnaire I have included with these instructions.

THE PROTOCOLS

The protocols were obtained by administering 20 TAT
cards to six subjects. All subjects were given the same
card, and asked to make up stories to them. Their stories
were first recorded on tape and then transcribed to paper.

The stories are as near as possible to verbatim.

In transcribing the stories, no effort was made to altar

the stories in any way, so that the story as told could

be judged. Pauses, when the subject seemed to be groping
for words, have been indicated by a series of dots (.....).
The numberwmfdots does 32E indicate the length of the pause.
Long silences, when the subject seemed to be searching for
ideas are indicated with the words (Long Silence). Comments
or questions made by the tester during the session have been

enclosed in brackets so as to distinguish them from the

53

story proper.

On the face sheet of each protocol, you will find
information about the subject's age, sex and number of
siblings in the family. This should help you in your

judgments.

APPENDIX D

55

Table C. The order in which protocols were presented to
the judges in pairs for comparison and judgment.

 

(A, F)
(D,E)
(B, F)
(C,A)
(E, F)
(C.D)
(B,E)
(A, D)
(c,E)
(D, F)
(B,C)
(A,B)
(B,D)
(c, F)

(A,B)

 

APPENDIX E

57

Table D. Member of each pair judged to be healthier by

 

 

two judges.
Judged Healthier
Pair Judge 1 Judge 2
(A,F) (A) . (A)
(D,E) (D) (E)
(B,F) (F) (F)
(C.A) (A) (A)
(E,F) (F) (F)
(C1D) (C) (C)
(B,E) (B) (B)
(A,D) (A) (A)
(c,E) (C) (C)
(D,F) (F) (F)
(BaC) (C) (C)
(A,B) (A) (A)
(B,D) (B) (B)
(c,F) (C) (C)

(A, B) (A) (A)

 

APPENDIX F

58

 

 

Table E. Scale values for 6 stimuli judged by 4 naive
judges using an attribute of emotional disturbance.

Stimulus 1 2 3 4
A 3.147 .3910 .7100 1.311
B 2.307 4.124 1.539 .7573
C .8902 2.036 3.959 1.993
D .2650 .3062 .4173 .3268
E .1865 .1875 .2873 .3506
F 3.131 5.269 1.927 4.313

 

APPENDIX G

61

 

 

 

 

 

Table F. Intra-judge reliability for 4 naive judges.
Judge 1 -.24
Judge 2 +.80
Judge 3 -.22
Judge 4 -.16
Table G. Inter-judge reliabilities for 4 Naive judges
(Pearson r).
Judges r
(1,2) .58
(1,3) .09
(1:4) .60
(2,3) .45
(2,4) .72
(3,4) .64
Table H. Inter-judge reliabilities for 4 naive and 2

experienced judges (Pearson r).

 

 

Judges r
(11A) .63
(1,8) .22
(2,A) -.08
(2,B) .21
(3,A) .21
(3,B) .46
(4,A) .20
(4,B) .18

 

APPENDIX H

63

QUESTIONNAIRE

The following questions were asked of the experienced

judges after completion of their task.

1.

10.

11.

In judging the stories did you rely more on your own
clinical judgment or upon the criteria outlined in
the study?

Did you use the criteria outlined to make your
judgments?

In making your judgments, which criteria did you use
most, if either?

Did the criteria help or hinder in any way your making
judgments?

Did the emphasis on emotional health rather than
sickness seem different to you or interfer with your
judgment?

Did the criteria outlined in the study clash with your
own conception of emotional health?

Which of the criteria outlined in the study did you
find to be the most helpful?

Which of the criteria outlined in the study did you
find to be the least helpful?

Was there enough material in the stories so you could
judge on the basis of the criteria?

Could you use the criteria with TAT stories?
Do you think this kind of judgment is made more

accurate due to the comparison of one person with
another?

 

"7:111:11 (1(1)!) (WM) (1 (111)”