MANIPULATING RESPONSE SET IN THE TRUE-FALSE TEST

A thesis for the degree of Ph.D.
Michigan State University
Sarah S. Knight
1972

This is to certify that the thesis entitled MANIPULATING RESPONSE SET IN THE TRUE-FALSE TEST presented by Sarah S. Knight has been accepted towards fulfillment of the requirements for the Ph.D. degree in the Department of Counseling, Personnel Services, and Educational Psychology.

Major professor
Date: July 31, 1972

ABSTRACT

MANIPULATING RESPONSE SET IN THE TRUE-FALSE TEST

By Sarah S. Knight

True-false test items are potentially effective and efficient measures of academic achievement. Numerous criticisms of this item type have been made, however. Most of the criticisms can be dealt with by careful attention to item construction, by the weight of logical argument, and by research evidence. Despite these efforts, the susceptibility of true-false tests to the effects of response set remains a limitation on their academic measurement potential. This research study was designed to assess various methods of manipulating response set in the true-false test.

Response set, in the form of a set to say true, has generally been considered to be a kind of response style which appears consistently across tests for a given subject. Response set might instead be characterized as test-specific and temporary, appearing across tests for a given subject because the various tests have certain influential characteristics in common. This was the thesis of this research study. It was tested experimentally by manipulating the test characteristics of item format (true-false, two-response multiple-choice), response option order (true-false, false-true), and test instructions (emphasis on true, emphasis on false).

The experimental treatments were five combinations of the test characteristics, constituting five levels of emphasis on the true and false response options. One treatment involved two-response multiple-choice items and the other four treatments involved the same set of true-false items, which were systematically derived from the two-response multiple-choice items.

The experiment yielded answers to three questions: 1) Can response set be manipulated by alterations in emphasis in T-F tests? 2) Does alteration of response option order affect response set? 3) Does the two-response multiple-choice item format yield response set?

First, response set was altered as a result of some of the manipulations. Decreasing emphasis on "true" yielded a decreasing set to respond "true." When the emphasis was on "false," the set to respond true increased as the emphasis on "false" increased. It was concluded that response option order reversal might have acted to suppress the subjects' tendency to respond "true," and that emphasis on either response prompted the subject to attend more closely to the entire response continuum, thereby enhancing a response style consisting of response acquiescence.

Second, response option order had no significant effect on response set in the T-F tests.

Third, the two-response multiple-choice items showed only slight evidence of generating any response set. On the basis of mean response bias scores, the data indicated that there was a significant tendency for subjects to prefer response option two. Response set is generally not considered to be present in a test if the subject's bias score is not reliable. On this basis, the two-response multiple-choice items elicited little or no response bias, because the bias scores were of extremely low reliability. In general, the two-response multiple-choice item format appeared to be the most effective manipulation in dealing with response set.
When the response category labels were eliminated, response set decreased to a near-zero level.

MANIPULATING RESPONSE SET IN THE TRUE-FALSE TEST

By
Sarah S. Knight

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Counseling, Personnel Services, and Educational Psychology
1972

ACKNOWLEDGMENTS

The patient counsel and continued aid of Dr. Robert C. Craig, chairman of my guidance committee, are most gratefully acknowledged. Sincere thanks are also extended to my committee members, Dr. Robert L. Ebel, Dr. Donald M. Johnson and Dr. Byron H. vanRoekel for their valuable suggestions and critiques of this research study.

TABLE OF CONTENTS

LIST OF TABLES ... v

Chapter
I. PROBLEM AND RELATED RESEARCH ... 1
   Introduction ... 1
   Response set and true-false tests ... 2
   Manipulating response set ... 6
   Other criticisms of the T-F test ... 9
   The problem ... 12
   Hypotheses ... 13
   Definition of terms ... 14
   Overview ... 14
II. METHOD ... 15
   Research design and analysis ... 15
      Design ... 15
      Analysis ... 17
   Test development ... 19
      Item conversion ... 20
      Item pretesting ... 23
      Final test assembly ... 25
   Subjects ... 26
   Test administration ... 28
   Summary ... 29
III. RESULTS ... 31
   Results concerning response set ... 31
      Hypothesis 1 ... 31
      Hypothesis 2 ... 33
      Hypothesis 3 ... 34
   Bias score reliability ... 34
   Test analysis results ... 35
   Summary ... 37
IV. DISCUSSION ... 40
   T-F tests ... 42
   Multiple-choice test ... 46
   Suggestions for future research ... 47
   Summary ... 49
REFERENCES ... 51
APPENDIX A: Written and oral test instructions ... 55

LIST OF TABLES

Table
1. Description of experimental treatment conditions ... 15
2. Number of true and false items at each difficulty level ... 26
3. Allocation of subjects in the experimental design ... 27
4. Analysis of variance of bias scores ... 32
5. Means and variances of bias scores for each treatment ... 32
6. Summary statistics for each of the five test forms ... 36
7. Distribution of item difficulty indices for each of the five test forms ... 36
8. Distribution of item discrimination indices for each test form ... 38

Chapter I
Problem and Related Research

Introduction

Classroom tests using a true-false item format, while not currently popular, do have proponents (Ebel, 1965, 1970, 1971). True-false items are attractive tools for achievement testing because they tend to be relatively easy to construct and because students can respond quickly to them. If criticisms of true-false items can be obviated, these items can become an effective as well as an efficient achievement testing technique. Most criticisms of the true-false item can be countered on the basis of logic and experimental evidence, but there remains the stubborn and pervasive problem of the item's susceptibility to response set. Numerous studies attest that students display a definite tendency to use the response "true" when they are in doubt about the answer to a test item.
This pattern of behavior is generally considered to be a kind of response style, a response tendency which the student brings to any true-false test, regardless of the specifics of that test.

There is, however, another way in which response set can be characterized. Instead of being a person's general style of responding to any true-false test, it could be a response tendency which is temporary and test-specific. The set to respond "true" may show up consistently across true-false tests because the tests happen to have certain characteristics in common. That is the thesis of this research study. It was tested experimentally by manipulating the test characteristics of item format, response option order and test instructions.

Experimental treatment conditions consisted of various combinations of item format, response option order and test instructions. So combined, these test characteristics constituted several levels of emphasis on the true and the false response options. It was hypothesized that the direction and amount of response set that students displayed on a test would shift according to the emphasis placed on the response options. Successful manipulation of response set from a set to respond "false," through a set to respond "true," would favor the hypothesis that response set is test-specific. Further, such evidence should indicate methods for reducing or eliminating response set as a limiting factor in the utility of true-false tests.

Response set and true-false tests

True-false (T-F) tests are said to yield responses which are influenced by response set. Unlike other criticisms made of the T-F test, there have been no substantial rebuttals of this point. Ebel mentions it (1965), but does not accord it any extensive consideration. Therefore, of all the criticisms made of the T-F test, response set assumes the central position.
Studies by Cronbach (1941, 1942, 1946, 1950) are the major sources of information on response sets in tests in general, and in T-F tests in particular. Response set is defined as "any tendency causing a person consistently to give different responses to test items than he would when the same content is presented in different form" (Cronbach, 1946, p. 476). The specific response set with respect to T-F items is considered to be response acquiescence, defined as the subject's responding "T" more often on the average than "F," and/or the subject's tendency to respond "T" rather than "F" when in doubt (Cronbach, 1941, 1946).

Response acquiescence (RA) had been shown to exist in the responses to T-F tests by many studies prior to those of Cronbach (Arnold, 1927; Fritz, 1927; Cranich, 1931; Krueger, 1932). In 1941, Cronbach compared the performance of students taking multiple-choice tests and T-F tests. His results indicated little significant difference between the two tests. Items keyed "F" had considerably higher reliability and validity than the items keyed "T." The number correct on "T" items was greater than the number correct on "F" ones. Theoretical considerations in the same paper led to the prediction that: response acquiescence would restrict the range of test scores; response acquiescence would have greatest effect on difficult items; the acquiescent subject would achieve a low score on the test if less than 50% of its items were keyed "T," and a high score if more than 50% of the items were keyed "T." Acquiescence was measured by tabulating the total number of "T" responses given by each student.

Cronbach offered the following practical example of the effects of RA on test scores. Suppose a 10-item test has 5 T and 5 F items. An examinee who simply guesses and marks as many F as T could get a score ranging from 10 to 0. An acquiescent examinee, guessing and marking 7 items T, could get a score ranging from 8 to 2. Now suppose the test has 7 T and 3 F items.
The same examinees, following the same response patterns, could achieve scores from 8 to 2 when simply guessing and from 10 to 4 when they acquiesced. High scores would be likely to go to the acquiescent examinee if a test contains more than 50% T items.

It also follows that RA could be expected to inflate scores on the portion of the test containing just T items and similarly deflate the scores on the F items. When this obtains, low reliability and validity for T items and high reliability and validity for F items follow. Item statistics become uninterpretable. It becomes impossible to tell how much of the determined item difficulty and discrimination is due to the effects of RA, and how much to knowledge.

In 1942, Cronbach focused on the questions: 1) Are T items in general less reliable and valid than F items? 2) Is RA a consistent individual difference? 3) Can new test instructions obviate acquiescent behavior? The first two questions were answered affirmatively. Cronbach showed, however, that his new instructions did not eliminate RA effects. The latter result was attributed to the instructions being ill-conceived originally. As predicted in 1941, there were indications that RA did in fact limit score range, as evidenced by test variances.

A review of the existence and extent of response sets was undertaken subsequently (Cronbach, 1946). Evidence was found for several types of response sets in several test types, among them T-F tests. The effect of response sets appeared to increase in ambiguous, unstructured, or difficult situations. It was noted also that response sets could be compared to constant errors in psychophysics. In light of the survey's findings, Cronbach recommended a number of techniques for decreasing response set effects and thus increasing test validity, such as increasing test structure, use of the multiple-choice format, and avoidance of unreasonably difficult items.
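Cronbach's arithmetic in the example above is easy to verify. The sketch below (hypothetical helper name, not part of the thesis) computes the best and worst possible scores for a blind guesser who commits to marking a fixed number of items true: the best case aligns the guesser's marks with the key, and the worst case misaligns them as far as possible.

```python
def score_range(n_items, n_keyed_true, n_marked_true):
    """Best and worst possible scores for a blind guesser who marks
    a fixed number of items "true" on an n_items true-false test."""
    t, k = n_keyed_true, n_marked_true
    f, m = n_items - t, n_items - k          # keyed-false and marked-false counts
    best = min(k, t) + min(m, f)             # marks aligned with the key
    worst = max(0, k - f) + max(0, m - t)    # marks misaligned as far as possible
    return best, worst

# Balanced key (5 T, 5 F): the unbiased guesser can score anywhere from 10 to 0,
# but an acquiescent guesser marking 7 items T is confined to the range 8 to 2.
print(score_range(10, 5, 5))  # (10, 0)
print(score_range(10, 5, 7))  # (8, 2)
# Key with 7 T and 3 F items: the ranges reverse, favoring the acquiescent guesser.
print(score_range(10, 7, 5))  # (8, 2)
print(score_range(10, 7, 7))  # (10, 4)
```

The four printed ranges reproduce the 10-0, 8-2, 8-2, and 10-4 figures of the example.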
Using the general technique of deriving a "bias" (response set) score and measuring its internal consistency to "prove" the set's existence, Cronbach (1950) studied response set with respect to test design. Response set was found to be consistent across as well as within test administrations, leading to the conclusion that it was a stable, personality-like trait. Relative to two-choice test formats, it was noted that altered test instructions yielded altered test biases on the same test (Rubin, 1940). This was not related to T-F tests, however. Apparently it was felt that the demonstration of parallel-form reliability for response set was sufficient to characterize RA as a stable factor.

Following Cronbach's lead, Miklich (1965) investigated the relationship between RA and item importance and ambiguity, using personality/aptitude test items. The results confirmed that ambiguous items did elicit RA. An important interaction was found between importance and ambiguity, however. Important but ambiguous items tended to elicit agreement (RA), while unimportant ambiguous ones tended to yield disagreement.

Miklich (1968) designed a study to demonstrate that the response set in T-F items was not acquiescence, but rather test-taking carefulness (TTC). A set of maximally difficult (realistic nonsense) items was generated. One half of the items contained specific determiners usually associated with true statements (the "pseudo-true" items), and the remaining items contained specific determiners usually found in false statements (the "pseudo-false" items). Difficult items should tend to yield RA. Therefore, if RA was operative, it should appear as an excess of T responses over all items, regardless of specific determiners. If TTC was operative, a negative correlation between the number of T responses given to pseudo-true and pseudo-false items should obtain.
Evidence favored TTC: the more T responses subjects tended to give to pseudo-true items, the fewer they tended to give to pseudo-false ones. It should be noted, however, that this result does not eliminate RA as an explanation for response set in T-F tests because 1) the experimental situation was so extreme as to be highly unrealistic, and 2) careful item construction should eliminate TTC while probably leaving RA to a discriminable degree.

Manipulating response set

Regardless of the exact source of response set in T-F tests, there is abundant evidence of its existence. If T-F items are considered as special cases of a two-choice judgment task, there also exists evidence that response set can be manipulated. Cronbach has noted that alteration of instructions could affect response set, and that response set was related to the concept of constant error in psychophysical judgments, but he failed to develop either notion with respect to T-F tests.

In 1930, Fernberger showed that in a weight-lifting experiment, different instructions yielded different category widths for a three-category scale of heavier, equal or lighter. When the instructions emphasized finding a difference between the standard and variable weights, very few weights were judged equal to the standard. When the instructions allowed free use of the equal category, many weights were judged equal to the standard.

Rubin (1940) altered the instructions on the Seashore Pitch Test (a High, Low two-choice test) to emphasize the second response option, L. He got fewer errors of "L marked H" (56.8%) as compared with the number of such errors (60%) when the option H was emphasized. Altered instructions reduced set. Similarly, Rubin obtained a corresponding significant result when subjects were to imagine a coin toss and record its results. He found more "heads" responses when "heads" was emphasized in the instructions, and the same effect for "tails" responses.
In psychophysical judgments, Goodfellow (1940) and Gault and Goodfellow (1940) reported parallel findings. When instructional emphasis was placed on the "yes" response (stimulus present), the usual response set, reporting the stimulus to be absent, was reversed.

Holland (1961) demonstrated that specific alterations in task instructions led to differences in flicker-fusion thresholds as well as to over-all differences in what should normally be very similar sets of results. The task was a two-choice one; flicker was judged to be either present or absent. Task instructions emphasized the presence of flicker (report as soon as any unsteadiness appears in the light source) or, alternatively, the absence of the steady light (report only when you are certain that flicker is present, i.e., the steady light is definitely absent). Instructional emphasis on flicker presence depressed the threshold, while emphasis on steady-light absence raised the threshold.

LeFurgy (1966) manipulated response set with a conditioning technique. He trained subjects to associate size of circle with the response categories positive, negative and neutral. One group of subjects learned that large circles were positive, medium ones were neutral and small ones were negative. Another group learned the associations in the reverse order. When tested on a second set of circles, subjects who learned that large circles were positive used that response category more frequently, using it for a wide range of larger circles. For subjects who learned that small circles were positive, that response category was used most frequently, including a wide range of smaller circles.

Using T-F tests, Bugelski and Herson (Bugelski and Herson, 1966; Herson, 1966, 1967) demonstrated that a response set could be conditioned on "ambiguous" items.
In effect, they emphasized either the T or F response to ambiguous items by using a training session prior to testing, in which subjects answered test items and then were verbally informed of the correct response by the experimenter. When ambiguous items were called "T" by the experimenter, subjects in subsequent testing continued tending to respond "T" to such items, and vice versa when "F" was conditioned.

Mathews (1927) studied the effects of response option position in two-response items. Subjects were children in grades five and six. He found that alternating the horizontal ordering of a pair of responses definitely affected response selection. The response in the left position was consistently favored. When vertical ordering of the response pairs was altered, the upper position was consistently favored. Further, response position influence was greatest where guessing was greatest.

There are a number of studies which indicate that an item's position in a series might affect the response given to it. Goodfellow (1940) reported that subjects' responses followed some definite patterns in two-choice situations. The subjects tended to avoid symmetric series of responses (ABABA) while they also tended to alternate responses. George (1953) found that subjects could be induced to respond in a predictable way to a paired-comparison task. He constructed a series of comparisons so that the subject gave the same response (heavier or lighter) to the first two sets; then, when the subject encountered a third, much more difficult comparison, he displayed a definite tendency to repeat the previous response.

Other criticisms of the T-F test

It remains to establish that the T-F test is a viable technique for measuring academic achievement. Therefore, the following major criticisms of the T-F test will be considered: 1) T-F tests are very susceptible to chance error introduced by guessing. 2) Such tests lack reliability. 3) The test content concerns trivialities. 4) The test encourages rote learning.
5) "Absolute" truth or falsity in items is difficult or impossible to attain. 6) T-F tests lack validity.

First, it has been charged that excessive error can be introduced by guessing. Ebel (1965, 1970) responds that blind guessing is the major concern. Informed guesses provide valid indications of achievement. Blind guessing has been shown to occur relatively rarely, 3% to 8% of total test responses (Ebel, 1968), and its effects vary inversely with test length. Further, it has been shown that reliable T-F tests can be constructed (Ebel, 1968; Burmester and Olson, 1966). This could not obtain if blind guessing seriously affected the test scores.

Burmester and Olson (1966) showed that unreliability is not a necessary characteristic of the T-F test. Two-choice tests do yield somewhat lower reliabilities than three-choice tests of comparable length, composition, respondents, etc. (Ebel, 1969; Williams and Ebel, 1957), while three-choice items tend to be closer to the ideal than four-choice items (Costin, 1970). Reliability can be raised with increased test length, however, and given that two-choice items can be composed and responded to in much less time than three- or four-choice items, two-choice items become preferable.

T-F items are not limited to potentially trivial specifics, Ebel maintains (1970). It follows that if T-F items are not so limited, they are not likely to encourage rote learning or to measure only very low levels of knowledge. He demonstrated this point by generating a series of good T-F items measuring understanding of event or process, principle applications, knowledge of functional relationships and problem solving. As Ebel says, "Surely there is nothing intrinsically trivial about a statement whose truth is open to question" (Ebel, 1970, p. 382).

The careful, skillful item writer can generate T-F items which are neither trivial nor ambiguous. Emphasis is on the application of care and skill.
Items thus generated will, of course, be affected by ambiguities present in the nature of the language, but can be otherwise free of them. Another source of ambiguity is the basic assumption that items must be absolutely true or false, raising the problem of defining truth. Ebel proposes two steps to deal with the definitional problem. First, students should be instructed to respond T if the item contains more truth than falsity, and the reverse for F responses. Second, the definite truth or falsity should obtain in the opinion of qualified experts (Ebel, 1971). And again, the occurrence of high test reliabilities argues that unambiguous test items have in fact been produced in considerable quantity.

Critics say T-F tests lack validity. What they refer to in part is the proposition that T-F tests cannot measure educational achievement due to the previously discussed faults. That is, apparent concern lies with a kind of logical validity attained by T-F tests. Assuming the following four statements are correct and follow each other logically, then it is clear that T-F tests can attain logical validity.

"1. The essence of educational achievement is the command of useful verbal knowledge.
2. All verbal knowledge can be expressed in propositions.
3. A proposition is simply a sentence that can be said to be true or false.
4. The extent of a student's command of a particular area of knowledge is indicated by his success in judging the truth or falsity of propositions related to it" (Ebel, 1970, pp. 373-374).

Whether T-F tests can attain a respectable level of concurrent or predictive validity remains open to question. Frisbie (1971) compared parallel forms of T-F and multiple-choice items and found that two of eight sets of comparisons were of significantly less than perfect correlation at the .10 level.
These results suggest, with a relatively large chance of error, that T-F tests can be constructed so that they achieve tolerable levels of concurrent validity, using scores on a parallel multiple-choice test form as the criterion variable. Response set operating in the T-F test forms remains a plausible explanation for the two rather low correlations between T-F test scores and multiple-choice test scores which Frisbie found.

Cronbach (1942) calculated the predictive validity of several T-F tests. The criterion was a summation of scores on other, non-T-F tests taken in the same classes. The resulting correlations were low, ranging from 0.30 to 0.67. Again, the plausible explanation is response set, as Cronbach pointed out.

Apart from the criticisms, the T-F test has some intrinsic assets. Students can respond to considerably more T-F (or two-choice) items per unit of time than they can to three-, four-, or five-option multiple-choice items. When four-response multiple-choice items were compared with T-F items, it was found that about three T-F items could be responded to for every two multiple-choice items (Frisbie, 1971). True-false items are also often judged faster and occasionally easier to write than the latter.

The problem

Objective achievement tests, T-F tests included, have certain characteristics, like instructions, response option order and item format, which can and frequently do vary across tests. That is, the nature of such characteristics tends to be specific to each test. If one closely observes a T-F test, it is clear that the instructions emphasize "T," at least to the extent that it is always mentioned first. It is further emphasized by being listed first or assigned the number "1" on machine-scored answer sheets, and whenever such two-choice tests are discussed, they are labeled true-false tests.
This being the case, is response set a student's response style with respect to T-F tests, or is it a temporary phenomenon resulting from consistent but test-specific emphasis on "T" across tests? Note that even the "standard" and altered test instructions used by Cronbach (1942) inadvertently emphasized T by placing it first and mentioning it first, e.g., "This is a T-F test, circle T before a statement if it is always true. Circle F if the statement is false in any way" (p. 409).

Cronbach has mentioned a number of methods aimed at compensating for response set in T-F tests. The basic assumption underlying these methods is that response set is stable and depends on each person's style of responding. If it can be shown that altered emphasis produces altered response set, then it would argue that the response set is momentary, with the test itself as the source, thus casting doubt on Cronbach's recommendations. Therefore, before methods of compensating for response set can be applied, it must be clear whether the phenomenon is intrinsic to the test or is the result of students' response styles. This research study is intended to clarify the point.

Hypotheses

Three hypotheses were tested in this research study.

1. The response set found with T-F tests is a temporary phenomenon, whose source is intrinsic to the test, rather than a response style whose source is the person taking the test and unrelated to specific test characteristics. The corresponding research hypothesis: Systematic variation in the degree and direction of emphasis of response options will result in a corresponding variation in response set found with T-F tests.

2. The order in which the response options are presented will affect response set displayed on T-F tests.

3. Two-response multiple-choice items, which correspond to T-F items, will elicit no response set.
Definition of terms

Specific test characteristics refer to the test's instructions, the form of its items and the order in which its response options are presented. Emphasis is a variable made up of various combinations of test characteristics. Response set is the tendency to use one response option over any others when there is doubt about a test item's keyed response.

Overview

The design and analysis of the research, test development and administration, and the research subjects are discussed in Chapter II. Research results are presented in Chapter III. The final chapter contains a summary of the study, a discussion of the results and recommendations for further research.

Chapter II
Method

Research design and analysis

Design. The experiment involved one treatment dimension which consisted of five levels of emphasis (E) on either T or F. The degree of emphasis for the test in each treatment condition was determined by various combinations of test instructions, item format and response option order. A description of the content and level of emphasis for each treatment condition is presented in Table 1.

Table 1. Description of experimental treatment conditions

Treatment  Emphasis  Item format  Instructions                Response option order
E1         high T    T-F          stress on T                 T = 1, F = 2
E2         low T     T-F          minimum stress on T and F   T = 1, F = 2
E3         none      2RMC         best answer
E4         low F     T-F          minimum stress on T and F   F = 1, T = 2
E5         high F    T-F          stress on F                 F = 1, T = 2

The five levels of emphasis constituted the independent variable in this experiment. The dependent variable was response set, operationally defined as the number of responses in position one on an answer sheet, minus the number of responses occurring in position two on the same answer sheet (for conditions E4 and E5 the definition was the reverse: the number of responses marked two minus the number of responses marked one). Thus derived, this number was called the subject's bias score.
Both the amount and direction of emphasis varied across the five treatment levels, yielding treatment conditions of high emphasis on T (E1), slight T emphasis (E2), no emphasis on either T or F (E3), slight F emphasis (E4), and high emphasis on F (E5). Conditions E1 and E5 obtained maximal emphasis by combining response option order and test instructions so that they emphasized the same response. A condition of no emphasis was achieved in condition E3 by using two-response multiple-choice (2RMC) instead of T-F items. Low emphasis occurred in conditions E2 and E4 through a combination of instructions that placed a minimum of stress on either T or F, and a response option order in which first place went to the response to be emphasized.

The 2RMC item format was chosen for two reasons. First, since it is a "two-choice" item form and hence related to T-F items, its use in a test which directly parallels the T-F test yields a condition in which emphasis is at an irreducible minimum. Neither its instructions nor its responses listed on the answer sheet involve the notions of true or false. Second, Cronbach (1946, 1950) and others (Wevrick, 1962) found multiple-choice items to be virtually free of response set. Whether this attribute would extend to the 2RMC form was of experimental interest.

All levels of the treatment were administered in each of two classrooms. Within each of the classes, subjects were randomly assigned to treatments, and each subject received only one treatment. The experimental design was what is known as a randomized block design, with five levels of the treatment factor (emphasis) and two levels of the block factor (classes). Factors concerned with internal and external validity of the study were controlled. In Campbell and Stanley's (1963; Bracht & Glass, 1968) terms, the design can be considered as an elaboration of the true-experiment "posttest-only control group design."
Threats to internal validity were controlled by random assignment of subjects to treatments, with each subject receiving only one treatment and all treatments being administered to each class of subjects. Treatment administration and the measurement of its effects were carried out at the same time. Research generalizability was limited by the nature of the subject population, the differences between actual classroom testing and the administration of treatments, the specialized type of T-F item which was used, and the subject matter included in the tests.

Analysis. The analysis assumed five levels of treatment, with subjects randomly assigned to treatments. Each subject received only one treatment, and all levels of treatment occurred in each of two classes. Hypothesis one, concerning the effects of varying emphasis on response set, was tested using an analysis of variance for a randomized-block design. Given a significant treatment effect, post hoc multiple comparisons were made using Scheffé's technique.

Hypothesis two, concerning the effect of response option order on response set, was tested with a post hoc analysis. Mean bias scores for treatment conditions E2 and E4 were compared with Scheffé's method for multiple comparisons.

The question of the amount of response set elicited by the ZRMC treatment condition, hypothesis three, was answered with a one-sample t-test. The obtained mean bias score for treatment condition E3 was tested against the score which was equivalent to zero bias.

Theoretically, the bias score as defined earlier could range between +74 and -74. In order to avoid the possibility of negative numbers in the analysis, 74 was added to each bias score. Thus the bias score which was analyzed was:

    bias = x - y + 74

where x = number of 1 responses marked (2, for E4 and E5),
      y = number of 2 responses marked (1, for E4 and E5).

No bias was represented by a score of 74, high negative bias by a score of 0, and high positive bias by a score of 148.
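The bias score computation described above can be sketched in Python. This is an illustration only, not part of the original study; the function name and the list-of-responses representation are assumptions.

```python
def bias_score(responses, reverse=False):
    """Bias score as defined above: number of responses marked 1 minus
    the number marked 2, plus 74 so the score cannot be negative.
    For conditions E4 and E5 (reverse=True) the subtraction runs the
    other way, i.e. responses marked 2 minus responses marked 1.
    `responses` is a list of 1s and 2s, one entry per item answered."""
    ones = responses.count(1)
    twos = responses.count(2)
    diff = (twos - ones) if reverse else (ones - twos)
    return diff + 74

# A subject who marks position one on every one of the 74 items
# receives the maximal positive score of 148; marking position two
# throughout yields 0; an even 37/37 split yields 74 (no bias).
```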
Complete item analyses were performed on the test responses in each treatment condition. The item statistics which were computed were difficulty (proportion correct) and discrimination (r point-biserial). The r point-biserial was selected as the discrimination index instead of the more traditional r biserial because it was most plausible to assume only two distinct positions, right and wrong, on the item continuum, given items which were cast in a true-false format. The discrimination indices which were obtained from the tests in the experimental treatments were therefore somewhat lower than they would have been had r biserials been computed. Summary statistics for the entire test were: mean, variance, reliability (KR20), and standard error of measurement. This detailed information was used for a close examination of the functioning and comparability of the T-F and ZRMC item types.

In order to estimate the reliability of the bias scores, an odd-even split-half reliability coefficient was computed for each of the treatment conditions. The Spearman-Brown prophecy formula was applied to these reliability coefficients in order to estimate their magnitudes on tests twice as long.

Test development

Social psychology was selected as the test's content domain. The items which formed the initial test item pool were drawn from a collection of four-response multiple-choice social psychology items for which item statistics were available. All of these items were constructed by subject-matter experts, and were designed for use with students similar to those in the research sample. The item statistics were based on several hundred responses per item.

Each of the introductory psychology classes in this study involved a unit on social psychology; thus a test involving this content area achieved considerable face validity in terms of being an integral part of the course. The original items were also selected because they were likely to be sufficiently difficult for the research sample.
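The test statistics named above can be sketched as follows. This is a hypothetical illustration (the thesis gives no code, and the function names are invented); item scores are taken to be 0/1, one row per examinee.

```python
def difficulty(item_scores):
    """Item difficulty p: proportion of examinees answering correctly."""
    return sum(item_scores) / len(item_scores)

def kr20(score_matrix):
    """Kuder-Richardson formula 20 internal-consistency reliability.
    `score_matrix` holds one row of 0/1 item scores per examinee."""
    k = len(score_matrix[0])                      # number of items
    totals = [sum(row) for row in score_matrix]   # examinee totals
    n = len(totals)
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n
    sum_pq = sum(p * (1 - p)
                 for p in (difficulty(col) for col in zip(*score_matrix)))
    return k / (k - 1) * (1 - sum_pq / var)

def spearman_brown(r, factor=2):
    """Reliability prophesied for a test `factor` times as long, used
    above to step the split-half coefficients up to full test length."""
    return factor * r / (1 + (factor - 1) * r)
```

For example, `spearman_brown(0.5)` steps a half-test coefficient of .50 up to .67 for the full-length test.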
Item difficulty is an important factor in response set, and this level of difficulty was expected because the items were written for students who had had a more extensive unit on social psychology than the research subjects. The following three items were typical of those in the original multiple-choice item pool:

1. Role determinants of personality rest mainly upon
   1. the socialization process.
   2. idiosyncratic experiences.
   3. biological inheritance.
   4. accidental events which occur in our lives.

2. Victims of violence within lower-class Black ghettos are usually
   1. close acquaintances of the offender.
   2. white merchants.
   3. strangers to the offender.
   4. local political leaders.

3. Which of the following contributes most to the present problem of poverty in the United States?
   1. The decline of the Protestant ethic
   2. Technological innovation
   3. Lack of motivation to work among persons in the lower socioeconomic groups
   4. The progressive increase in the cost of living

Item conversion. From each of the original four-response multiple-choice items, two new items were developed. One item of the pair was a two-response multiple-choice (ZRMC) item, and the other was a special type of T-F item. It was intended that the ZRMC and T-F forms should be as comparable as possible; therefore each T-F item was a statement of comparison between two alternatives. The truth or falsity of these statements depended on the order in which the alternatives were compared, thus making the criterion of truth internal to each item.

The item conversion methods, which were unique to this research, were kept as uniform as possible across items. First, the multiple-choice item was reduced from four to two responses, one of which was the one keyed correct. The second response was the distractor which drew the most responses from students who scored low on the test, i.e., the distractor which was the most discriminating.
Occasionally there was no most discriminating distractor, or if there was it was not of the same character as the correct response. In these special instances, either the second most discriminating distractor became the second response or one of the three distractors was suitably altered to become the second response.

True-false items were developed from the ZRMC items. Their general form was the ZRMC stem combined with the two ZRMC response options so that a statement of comparison between the ZRMC response options resulted. Thus the T-F item was the ZRMC item recast as a single statement. There were some instances where two statements were required for clarity of communication. However, the general item conversion rule of thumb was to form the T-F item using a single statement, sacrificing as little of the item's original wording as possible.

Whether a T-F item was formed as a true or a false statement was decided on a random basis within each of several specified levels of item difficulty. This was done to ensure a balance of items keyed T and F within as well as across levels of item difficulty.

The following items demonstrate the item conversion process. The pairs of items were developed from the three items presented above as representative of the original multiple-choice item pool. The keyed response for each ZRMC item is indicated by an asterisk beside that response. The keyed response for each T-F item follows that item in parentheses.

The first example item became the following when only the keyed answer and the most discriminating distractor were retained:

ZRMC1  Role determinants of personality rest mainly upon
        1. idiosyncratic experiences.
       *2. the socialization process.

From this new item the T-F mate was then formed:

T-F1  Role determinants of personality rest more on idiosyncratic experiences than on the socialization process.
(F)

The ZRMC/T-F pair corresponding to the second example item became:

ZRMC2  Victims of violence within lower-class Black ghettos are usually
       *1. close acquaintances of the offender.
        2. strangers to the offender.

T-F2  Victims of violence within lower-class Black ghettos are more often close acquaintances than strangers to the offender. (T)

The third pair became:

ZRMC3  Which of the following contributes most to the present problem of poverty in the United States?
       *1. Technological innovation
        2. The progressive increase in the cost of living

T-F3  The progressive increase in the cost of living contributes more to the current United States poverty problem than does technological innovation. (F)

Based on pretest data, the T-F3 item was changed from a false to a true statement. The following T-F3' item, which was included in the final test form, demonstrates the determination of truth or falsity on the basis of alternative order in the comparative statement.

T-F3'  Technological innovation contributes more to the current United States poverty problem than does the progressive increase in the cost of living. (T)

Item T-F2 could similarly be changed to form a false statement:

T-F2'  Victims of violence within lower-class Black ghettos are more often strangers than close acquaintances of the offender. (F)

Item pretesting. There were 120 pairs of T-F/ZRMC items available for pretesting. For the 120 T-F items, an attempt was made to balance the number of items keyed T and F within each of the quartiles of the distribution of item difficulties. Over all, there were 60 items keyed T and 60 items keyed F in the item pool. For pretesting, the 120 pairs of items were first split into two sets of 60 items, again with an attempt to balance the number of items keyed T and F within and across the item difficulty distribution. The 60 pairs of items were then separated to form a 60-item ZRMC test and a 60-item T-F test.
Thus, pretesting was accomplished by breaking the item pool into four separate tests, two ZRMC and two T-F.

Two introductory psychology classes participated in the pretesting. Each class was randomly divided in half, with one half taking a ZRMC test and the other half taking the matching T-F test. This technique yielded an average of 24 responses per item and allowed a comparison between the T-F/ZRMC pairs.

Since an unusual form of T-F item was used, response rates on the ZRMC and T-F items were compared. About five minutes into the testing period, the subjects were asked to circle the number of the item on which they were currently working. Five minutes later they were asked to do this again. The resulting data indicated that the rate of responding was similar for the two item types: 12.5 T-F compared to 11.6 ZRMC items in the first class; 9.3 T-F items compared to 10.7 ZRMC items in the second class.

The preliminary test forms were introduced to the subjects as pretests for their unit on social psychology. The class instructors presented the tests, and stressed that while the subjects' course grades would be unaffected by the test, they were very interested in how much their students knew of social psychology prior to formal instruction in the subject. There was no mention of the presence of two different test forms. The students marked their answers on machine score sheets which had the item response options lettered, with T and F response options also indicated.

The item responses were machine scored and analyzed. The resultant item statistics were item difficulty, in terms of the proportion of respondents answering the item correctly (p), and item discrimination, in terms of r biserial coefficients.
Summary statistics for the test included an internal consistency reliability estimate, KR20. Of these statistics, item difficulty indices were of especial interest, while the others served as rough indicators of the way the final forms of the test could be expected to function. Item difficulty indices were used to determine which ZRMC/T-F pairs were to be retained for use in the final test forms, and to assure that the T-F test would have items keyed T and F balanced within and across item difficulty levels.

Final test assembly. Based on pretest data, 74 item pairs were selected for use in the final test form. Items which were too hard (p < .20) or too easy (p > .90) were excluded, as well as item pairs whose T-F and ZRMC halves were not of reasonably comparable difficulty. Item pairs were rejected if the difference between the item difficulties of the T-F and ZRMC halves was greater than .20. Several item pairs were rejected because of awkward, uncorrectable T-F statements.

The total number of items on the final test was set at 74 for two reasons. First, the pretest response rate data indicated that everyone could reasonably be expected to complete a test of this length within the available 60 minutes of testing time. Second, this even number of items could conveniently be split into an equal number of items keyed T and F so that the subjects' bias scores could be derived.

At some difficulty levels the pretest data indicated that the number of true and false items was out of balance. To reestablish that balance, the excess items within each difficulty level were randomly selected and rewritten so that their keyed response was altered. The distribution of true and false items across item difficulty levels for the final test can be seen in Table 2. The entries in Table 2 were derived from the pretest data.

The items in the final test form were arranged according to increasing item difficulty (or decreasing p values).
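The selection rules just described can be expressed as a small filter. This is an illustrative sketch only; the function name and the treatment of boundary values are assumptions.

```python
def keep_pair(p_tf, p_zrmc):
    """Final-test selection rule described above: reject a ZRMC/T-F
    pair if either half is too hard (p < .20) or too easy (p > .90),
    or if the two halves' difficulties differ by more than .20."""
    if not (.20 <= p_tf <= .90 and .20 <= p_zrmc <= .90):
        return False
    return abs(p_tf - p_zrmc) <= .20

# keep_pair(.55, .60) -> kept; keep_pair(.15, .30) -> too hard;
# keep_pair(.35, .60) -> halves differ by .25, rejected.
```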
The T-F and ZRMC test items were arranged in the same order, the difficulty levels of the T-F items being used to determine the order.

Table 2. Number of true and false items at each difficulty level

Difficulty (p)    T     F
.30-.39           5     5
.40-.59          13    13
.60-.79          18    17
.80-.89           1     2

Subjects

The subjects for this study were members of four introductory psychology classes at a Michigan university. Both the classes and their members participated in the research on a volunteer basis. The classes used were those in which the instructors agreed to incorporate the experimental treatments in their usual teaching procedures. The subjects were not required to respond to the experimental treatment, although they were urged to do so.

About 75% of the subjects were college freshmen, 20% sophomores, and 5% juniors. Approximately 45% of the subjects were male, 55% were female. They represented a broad range of both academic achievement and academic majors. Due to the specific academic requirements of this university, the students who enroll in introductory psychology can be reasonably expected to be representative of the freshman-sophomore student body.

The subjects who participated in the test development phase of this research were members of the two smallest of the four classes, with enrollments of about 60 students each. The subjects in the final experimental phase of the research were members of the two largest classes, with enrollments between 200 and 300. A total of 99 students took part in test development; 315 students were in the final experiment. There were at least 60 students from the latter group in each treatment condition.

The number of subjects allocated to each treatment X class combination in the research design is presented in Table 3. The number in parentheses within each cell indicates the number of subjects whose bias scores were analyzed in the randomized blocks analysis of variance.
In cells where some subjects were eliminated, the elimination was done at random. When the number of subjects was reduced as specified, the design became an orthogonal one with proportional cell frequencies.

Table 3. Allocation of subjects in the experimental design

                 Levels of emphasis
Classes     E1     E2     E3     E4     E5
1           40     43     44     45     42
           (40)   (40)   (40)   (40)   (40)
2           20     23     19     20     19
           (19)   (19)   (19)   (19)   (19)

Test administration

Five sets of test instructions were developed for the five levels of emphasis. The crucial paragraphs from the instructions for each level of emphasis were as follows. The test booklet cover sheets with the complete test instructions for each treatment condition are in Appendix A.

E1  The test consists of statements which are either true or false. You might think a statement is more true than false; then you should mark the number on your answer sheet which corresponds with true. If you think a statement is more true than false, mark 1 on the answer sheet for that statement. If you think it is more false than true, mark 2 for that statement. Remember, 1 = true, 2 = false.

E2  The test consists of statements which are either true or false. If you think a statement is more true than false, mark 1 on the answer sheet for that statement. If you think it is more false than true, mark 2 for that statement. Remember, 1 = true, 2 = false.

E3  The test consists of multiple choice items. For each item there are 2 choices. Select the one best answer for each item, and mark its number in the appropriate space on the answer sheet.

E4  The test consists of statements which are either true or false. If you think a statement is more false than true, mark 1 on the answer sheet for that statement. If you think it is more true than false, mark 2 for that statement. Remember, 1 = false, 2 = true.

E5  The test consists of statements which are either false or true.
You might think a statement is more false than true; then you should mark the number on your answer sheet which corresponds with false. If you think a statement is more false than true, mark 1 on the answer sheet for that statement. If you think it is more true than false, mark 2 for that statement. Remember, 1 = false, 2 = true.

One of the classes used in the experimental phase of this research had just completed their unit on social psychology, while the other class was just beginning their unit at the time the experimental treatments were administered. This necessitated slightly different verbal instructions for the treatments in these classes. The scripts which were used by the two instructors as a basis for introducing the treatments are in Appendix A.

One major intent of the instructors' verbal introduction of the treatments was to establish the experimental tests as being of especial interest to the instructor. That is, it was intended to establish the experimental tests as quasi classroom tests, even though course grades would be unaffected by the results. Verbal instruction was also intended to ensure that the subjects' attention would be focused on the instructions written on the cover sheet of the test booklet.

Summary

Experimental treatments were administered within the framework of a randomized blocks design, which had five levels of the treatment factor, emphasis, and two levels of the blocking factor, classes. The independent variable was emphasis and the dependent variable was response set, represented by bias score. An analysis of variance for randomized blocks was computed for the bias scores of 295 subjects, followed by post hoc comparisons using Scheffé's method. A one-sample t-test was computed for the ZRMC mean bias score.

Two sets of 74 items, ZRMC and T-F, were systematically developed from a common set of four-response multiple-choice items. These items formed five treatment condition tests, which were administered to 315 subjects.
Each subject received either the ZRMC or the T-F items. Subjects responding to the T-F items did so under one of four instructional/response-option-order variations.

The experimental subjects were representative of the freshman student body of a Michigan university, and they received the experimental treatments under reasonably standardized conditions. The testing conditions were structured so that they appeared to be a natural part of the subject's class.

Chapter III

Results

The results are presented in two sections. Results concerned with response set are presented in the first section. The second section is descriptive, containing the test analysis data from the five experimental test forms.

Response set

Hypothesis 1. The major research hypothesis was that systematic variation of the degree and direction of emphasis would result in a corresponding variation in response set. It was tested with an analysis of variance for randomized blocks. The hypothesis that the five treatment mean bias scores were not different was tested against the alternative hypothesis that at least two mean bias scores were different. The dependent variable of bias was calculated as follows:

    x - y + 74 = bias

where x = number of 1 responses (2, for E4 and E5),
      y = number of 2 responses (1, for E4 and E5).

The resultant analysis of variance appears in Table 4. As the levels of significance for the F statistic show, there is clear evidence in favor of the alternative hypothesis: at least two of the treatment means were different. Further, there is no evidence to indicate that the two different classes performed differently either over-all or at any particular treatment X class combination.

Table 4. Analysis of variance of bias scores

Source                df      MS        F        p
Replications         285    184.93
Classes                1     45.22     .240    <.6214
Emphasis               4   1884.70   10.188    <.0001
Emphasis X Classes     4     97.24     .526    <.7170

Given that treatment differences existed in the data, it was important to locate the source of the differences.
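The F ratio for the emphasis factor can be illustrated with a simplified one-way computation. This is a sketch only: the thesis's analysis also partitioned the class and interaction terms shown in Table 4, which are omitted here, and the function name is an assumption.

```python
def emphasis_F(groups):
    """F = MS(between) / MS(within) for the emphasis factor, with
    `groups` a list of lists of bias scores, one list per level.
    This is the one-way simplification of the randomized-blocks
    analysis reported in Table 4."""
    scores = [x for g in groups for x in g]
    n = len(scores)
    grand_mean = sum(scores) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```

With five groups of bias scores, `df_between` is 4, matching the Emphasis row of Table 4.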
The mean bias values on which the analysis of variance was computed, and their variances, appear in Table 5.

Table 5. Means and variances of bias scores for each treatment

                          Levels of emphasis
Classes              E1       E2       E3       E4       E5
Class 1             85.20    83.10    71.20    79.45    79.45
Class 2             86.00    83.79    70.21    73.89    80.32
Pooled means        85.46    83.32    70.88    77.66    79.73
Pooled variances   191.75   175.28    67.48   206.12   150.63

Inspection of the means suggested that the bias scores obtained by the subjects in the E3 treatment condition were a source of the highly significant treatment effect found in the analysis of variance. To test this, the following contrasts were tested with the Scheffé method of multiple comparisons:

(1) |E1 - E3| >= S * sqrt(Var L)     (4) |E1 - E2| >= S * sqrt(Var L)
(2) |E3 - E5| >= S * sqrt(Var L)     (5) |E1 - E4| >= S * sqrt(Var L)
(3) |E2 - E3| >= S * sqrt(Var L)     (6) |E1 - E5| >= S * sqrt(Var L)

where E1 ... E5 are pooled treatment means,
      Var L = estimated contrast variance,
      S = sqrt(4 F(.05; 4, 285)).

Contrasts 1, 2, 3, and 5 were significant at α = .05. As observation of the obtained cell means suggested, condition E3 was responsible for three of the significant treatment differences pointed up by the analysis of variance.

In summary, the evidence partially supported hypothesis one. Systematic variations in emphasis did result in differences in response set, although the differences did not directly correspond with the direction of variation of emphasis. The no-emphasis condition (E3) yielded bias scores which were, as intended, significantly less than those from the two true-emphasis conditions. However, condition E3 also yielded bias scores which were significantly less than those of E5, where the reverse was intended. Finally, there was no difference between conditions E1 and E5 bias scores, where differences were anticipated.

Hypothesis 2. Response option order will affect the response set found in true-false tests.
The test of this hypothesis was a post hoc Scheffé multiple comparison between treatment conditions E2 and E4, as follows:

|E2 - E4| >= S * sqrt(Var L)

where E2 and E4 are pooled treatment means,
      Var L = estimated contrast variance,
      S = sqrt(4 F(.05; 4, 285)).

The contrast failed to achieve significance with α = .05. Therefore, hypothesis two was not supported by the data. There was no evidence that response option order affected the subjects' bias scores.

Hypothesis 3. Multiple-choice items with two response options will yield no response set. Reference to Table 5 indicates that treatment condition E3 did yield the lowest mean bias scores. Further, the post hoc multiple comparisons above showed that the E3 means were significantly different than the means of conditions E1, E2, and E5. To ascertain whether the E3 pooled mean was significantly different than the value indicating no bias, a one-sample t-test was performed.

For the t test, the hypothesis of no difference between the E3 mean and a value of 74 (0 bias) was tested against the alternative that the E3 mean was different than 74. Based on the data from all 63 subjects in E3, the statistic t was significant (α < .002, df = 62). It was concluded that the E3 pooled mean represented a bias score which was significantly less than 74, and thus subjects in E3 displayed a significant tendency to prefer response option two. Hypothesis three was therefore not supported by the evidence.

Bias score reliability. Cronbach maintained that response set could be said to exist if it could be shown to be reliable. Reliability coefficients were therefore computed for the bias scores in each of the treatment conditions. The odd-even reliability coefficients, corrected with the Spearman-Brown prophecy formula, were .68, .72, -.15, .79, and .46 for treatment conditions one through five respectively. Treatment condition five gave bias scores of relatively low reliability.
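The Scheffé criterion and the one-sample t statistic used above can be sketched as follows. This is illustrative Python only; the critical value is supplied by the caller (F(.05; 4, 285) is roughly 2.4, a value quoted approximately here), and the function names are assumptions.

```python
import math

def scheffe_significant(mean_i, mean_j, var_contrast, f_crit, a=5):
    """Scheffé test for a pairwise contrast among `a` treatment means:
    significant when |Ei - Ej| >= S * sqrt(Var L), with
    S = sqrt((a - 1) * f_crit)."""
    s = math.sqrt((a - 1) * f_crit)
    return abs(mean_i - mean_j) >= s * math.sqrt(var_contrast)

def one_sample_t(scores, mu=74.0):
    """t statistic for testing a mean bias score against 74 (no bias)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / (n - 1)  # unbiased
    return (mean - mu) / math.sqrt(var / n)
```

For example, with the pooled means of Table 5 and a hypothetical contrast variance, `scheffe_significant(85.46, 70.88, ...)` tests the E1 - E3 contrast.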
Condition three, the ZRMC condition, yielded a coefficient so low that it is doubtful whether its bias scores did in fact represent the existence of any response set.

Test analysis data

The validity of the experimental results with respect to response bias depends on the character of the tests from which the results were derived. These tests should contain comparable and sufficiently difficult items. They should be generally comparable in reliability and item discrimination to each other and to the classroom test in order to enhance the generalizability of the results.

In order to obtain good descriptions of the subjects' responses to the five test forms, the data from the two classes participating in the research were pooled. The test analyses were therefore based on at least 60 responses per item per test form.

Summary statistics for each of the five tests are presented in Table 6. It can be seen that, with the exception of condition E3, the mean test scores appear similar across conditions. Condition E3 had the highest mean test score. Test forms E4 and E5 displayed the highest reliability coefficients, as well as the greatest variance in test scores.

Table 6. Summary statistics for each of the five test forms

Statistic              E1      E2      E3      E4      E5
No. examinees          60      66      63      65      61
Mean no. correct     43.1    42.4    45.4    40.9    41.4
SD                    6.1     5.6     5.9     8.0     6.8
Reliability (KR20)            .47     .55             .63
SEM                  4.05    4.06    3.94    4.12    4.11

Table 7. Distribution of item difficulty indices for each of the five test forms

Difficulty*        E1      E2      E3      E4      E5
91 - 100                    1       1
81 - 90             3       1       8       1
71 - 80            14      13      15       3       4
61 - 70            14      15      17      23      27
51 - 60            20      18      12      30      14
41 - 50            14      16      13      10      20
31 - 40             8       8       7       7       8
21 - 30             1       2       1               1
Mdn. p           57.5    56.6    62.9    56.8    53.6

* p = proportion of subjects correctly responding to an item (decimals deleted)

The distributions of item difficulties for each of the test forms are shown in Table 7.
Test forms E4 and E5 had a slightly restricted range of difficulty levels relative to the other three forms. With the exception of form E3, the median item difficulty was similar for all test forms. The multiple-choice items in the E3 condition appear to have been somewhat easier for the research subjects than were the comparable true-false items.

The over-all level and distribution of item difficulties for all of the test forms suggest that the items were appropriate for studying response set. As Cronbach has indicated, difficult items increase the likelihood of the appearance of response set. The present items can be considered to have been very difficult for the subjects, when the obtained difficulty levels are compared with the value shown by Lord (1952) to be ideal for two-choice items, p = .85, assuming there is no correction for guessing and assuming that all those who do not know the answer to an item respond randomly.

The obtained item discrimination indices are shown in Table 8. Because the indices are r point-biserial coefficients, they are somewhat lower than the more traditional r biserial coefficients generally would have been.

Summary

Hypothesis one, that systematic variation in emphasis would result in a corresponding variation in response set, was partially supported. Conditions E1, E2, and E3 yielded bias scores as predicted, with E1 having the highest scores and E3 the lowest. Conditions E4 and E5, however, had associated bias scores which were the reverse of the predicted negative bias scores.

Table 8. Distribution of item discrimination indices for each test form

Discrimination*     E1      E2      E3      E4      E5
 51 - 60                            2       4
 41 - 50             1       1      2       8       3
 31 - 40            14       5     11      11      17
 21 - 30            16      23     16      21      13
 11 - 20            19      21     15      13      20
 00 - 10            18      17     20       7      13
-01 - -10            5       5      5       7       5
-11 - -20            1       1      2       2       3
-21 - -30                    1      1       1

* r point-biserial (decimals deleted)

Hypotheses two and three were not supported by the evidence.
Differences in response set did not obtain when response option order was reversed. Moreover, although the ZRMC condition (E3) did yield the lowest bias scores, these scores were significantly less than a score corresponding to zero bias. Reliability coefficients corrected with the Spearman-Brown prophecy formula showed that a very low coefficient, -.15, came from the E3 treatment condition bias scores.

Detailed test analyses of the tests used in each of the five treatment conditions indicated that they were appropriate for research on response set in two-choice classroom tests.

Chapter IV

Discussion

As Cronbach (1941, 1942, 1946, 1950) and his predecessors noted, response set, in the form of RA (response acquiescence), tended to appear when students responded to T-F tests. The present results essentially reaffirmed this point. When two-choice items were cast in a form requiring a value-laden response of either true or false, subjects tended to say T when in doubt. When the two-choice item was cast instead in a ZRMC format whose responses involved neither the concept of true nor false, only very weak evidence for the presence of any response set appeared. Item format thus appeared to be the most potent and effective factor in manipulating response set.

Is RA a general response style? The research results point to the conclusion that it might be when the subject responds to items whose response options include the words true and false. These responses have connotations which extend beyond the test items and may well shade all such responses to all T-F items.

Cronbach (1942, 1946) had suggested that appropriate test instructions could aid in reducing RA in T-F tests, although he had not successfully used them for that purpose. As with Cronbach, the instructional manipulation used in this experiment was not effective in altering response set.
The instructions were not effective in the sense that they appeared to increase RA where an increase in the tendency to respond F was predicted.

Individual differences in response set are important considerations. Cronbach (1942) concluded that because there was a variation in response between subjects which remained relatively stable across tests, RA scores represented real individual differences. The present experiment clearly elicited varying amounts of response set for subjects within each treatment condition. The central question, however, was whether or not different treatment conditions yielded different degrees of variation in response set. A Cochran's test for homogeneity of variance (C) was computed, and found nonsignificant at α = .05. Thus there was no evidence to conclude that individual variation in response set was different in any of the treatment conditions, although the bias scores in E3 had notably less variation than those in the other conditions.

Another indication of individual differences in response set is the reliability of the bias scores within and across treatment conditions. Within treatment conditions, except for E3, subjects in general tended to be moderately consistent in displaying response set throughout the test. Across treatment conditions subjects clearly responded differently. Assuming that the bias score reliability estimates are reasonable Pearson product-moment coefficient estimates, the five coefficients were compared with a variant of the Fisher r to z transformation. The resulting statistic was significant (α < .01, df = 4). Subjects displayed response set more consistently in some conditions than in others. Examination of the data suggested that the source of the difference was condition E3. The ZRMC items appeared to elicit a much less consistent response set from the subjects.
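The pairwise form of this comparison can be sketched with the Fisher r-to-z transformation. This is an illustration only; the thesis used a multi-coefficient variant of the test whose exact form is not given here, and the function names are assumptions.

```python
import math

def fisher_z(r):
    """Fisher's r-to-z transformation, z = atanh(r)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def z_two_correlations(r1, n1, r2, n2):
    """z statistic for the difference between two independent
    correlation coefficients based on samples of size n1 and n2."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# For example, comparing the bias-score reliabilities of conditions
# E4 (.79) and E3 (-.15), each based on roughly 60 subjects:
#     z_two_correlations(0.79, 63, -0.15, 63)
```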
Frisbie (1971) found that four-response multiple-choice items were either as reliable as or more reliable than their T-F counterparts. Although subjects in the different treatment conditions of this experiment were equated on the basis of random assignment of subjects to treatments, a comparison similar to Frisbie's can be made between the total number correct on the ZRMC and T-F tests. If the Kuder-Richardson formula 20 reliability coefficients are considered to be reasonable estimates of Pearson product-moment coefficients, then the coefficients for the five test forms can be compared in the same manner as the bias score reliability coefficients, with a variant of the Fisher r to Z transformation. The resultant statistic was not significant (α > .05, df = 4). There were no significant differences among the five reliabilities; the ZRMC test appeared no more, or less, reliable than the T-F counterparts. The conclusion that each T-F test was generally as reliable as each other T-F test and as reliable as the ZRMC test must be tempered somewhat because its basis is weak. Each reliability coefficient was based on only about 60 cases, and the scores for the tests were quite positively skewed.

When ordered on the basis of decreasing mean bias scores, the treatments assumed the order: E1, E2, E5, E4, E3. Only one significant difference occurred between the T-F treatment mean bias scores: E1 was significantly larger than E4. The sizes of the obtained treatment means are suggestive, however. If response set was completely unaffected by manipulations of response option order and instructions, the difference between E1 and E4 should not have occurred, and further, there would be no reason to anticipate any pattern in the results. Logically, all mean bias scores should have been close to those of E2. This is because the E2 treatment could be considered to be an example of the typical T-F test, with both instructions and response option order conforming to the typical case.
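The Kuder-Richardson formula 20 coefficients invoked in this comparison can be computed directly from the 0/1 item responses. A minimal sketch, assuming conventional right/wrong scoring; the toy data are illustrative only, not from the experiment.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 from 0/1 item scores.
    responses: one list per subject, each a list of 0/1 item scores."""
    k = len(responses[0])             # number of items
    n = len(responses)                # number of subjects
    # item difficulties p_i: proportion of subjects answering item i correctly
    ps = [sum(subj[i] for subj in responses) / n for i in range(k)]
    pq = sum(p * (1 - p) for p in ps)
    totals = [sum(subj) for subj in responses]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n   # population variance
    return (k / (k - 1)) * (1 - pq / var)

# Toy data: three subjects, two items (illustrative only)
r = kr20([[1, 1], [1, 0], [0, 0]])
```

In practice each coefficient here would rest on the roughly 60 cases per condition noted above, which is why the comparison is tentative.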
The results appear anomalous in light of these considerations. In fact, high emphasis on either response option appeared to boost bias scores above those of their treatment counterparts which had the same response option order but different instructions. With the E5 condition, the increased tendency to mark T was enough that while E1 and E4 had significantly different bias scores, E1 and E5 did not.

Based in part on the non-significant trends in the experimental results, the following might be hypothesized. The altered response option order depressed the tendency to mark T in E4, and added stress on the response option F partially counteracted these effects. That is, an unusual response option order decreased the subjects' response set (true RA in Cronbach's terms), and any attention otherwise specifically drawn to the responses functioned only to alert the subject to the importance of the entire response continuum, not to just one portion of that continuum.

To investigate this hypothesis, an experiment could be conducted involving four treatment groups: 1) high T emphasis, response option order T-F; 2) high T, F-T; 3) high F, T-F; 4) high F, F-T. If response option order functioned to depress response set, then conditions 1 and 3 should yield high, similar bias scores while conditions 2 and 4 should have similar, lower bias scores. Given these results it could also be concluded that emphasis on one response option draws attention to both options. If it was the case that response option order did lower RA and that the effect was one which lasted across tests, then it would become one method for coping with response set in T-F tests.

Other indications of the source of the experimental results might come from an examination of subjects' item-by-item test responses. Did subjects in the different conditions, especially E1 compared with E5, tend to answer correctly or guess on different items?
Did either true or false items contain unanticipated specific determiners? For each test, did the mean number correct vary across tests in a pattern which paralleled the experimental results? Unfortunately, neither close examination of each test item nor of the tests' over-all performance gave much indication of why such experimental results occurred.

When the item difficulty indices, which gave the proportion of subjects answering each item correctly, were correlated for all pairs of tests, all coefficients were positive and reasonably high (.62 to .84) for the T-F tests. In particular, the correlation between E1 and E5 yielded a coefficient of .72. Thus it appeared that for a large proportion of the test items, subjects responded similarly in both groups. When the T and the F items were correlated separately across tests, generally high positive coefficients again resulted. There were two exceptions. The correlation coefficient for T items between E1 and E4 was positive but rather low (.42), and for T items between E4 and E5 it was moderate and positive (.53). The T items in E4 appeared to function differently than they did in E1 and E5. Response order reversal without extra stress on the response appeared to yield T items which had a tendency to be more difficult. If the T items tended to have higher difficulty indices, then fewer subjects than in E1 and E5 were answering the item as it was keyed, and hence there was apparently less "guessing T when in doubt." This might be taken as evidence that response option reversal does in fact have the potential for depressing the tendency to respond T.

It is improbable that an entire class of items tended to contain specific determiners in the usual sense. The type of T-F item used argues against the presence of specific determiners. Truth or falsity depended only on the order of the comparison between alternatives, not on wording changes.
Many of the T-F items did use positive comparisons (more than, greater than) rather than negative ones (less than). These items were about equally distributed between items keyed T and F. While this type of specific determiner might help to account for the response set favoring T that occurred across tests, research evidence suggests that it is unlikely. Whipple (1957) studied the effects of positive and negative phrasing in T-F items, and found only a very slight tendency for subjects to say T to positively stated items.

The summary statistics for each test as a whole revealed little. The mean number correct within each treatment condition was similar. However, condition E4 yielded the highest reliability estimate and its scores had the largest variance. The higher reliability in the E4 test might have had two sources. First, the subjects' test scores were most variable in this condition, due possibly to the unfamiliarity of the response option order. Second, the T items apparently functioned better, eliciting fewer guesses, thus removing their generally depressive effects on the over-all test reliability (Cronbach, 1941, 1942). When these points are considered apart from questions of test validity, they add weight to the case for the potential usefulness of response option order reversal in dealing with RA.

No test, no matter how reliable, is of much value as an achievement test if it is not also valid. The validity of T-F and ZRMC tests under the present experimental manipulations remains to be shown. Assuming that appropriate criteria were available, the role of RA in them would have to be assessed. Logically, the behavior of responding T when in doubt would play little or no role in strictly academic achievement. When this is the case, as Cronbach suggested (1946), the presence of any response set would reduce test validity.
Thus while E4 might point the way to dealing with RA, it would have to reduce RA to a near-zero level while maintaining a reasonably high reliability to achieve the necessary validity to put it into contention with the ZRMC test.

Multiple-choice test

Did, or did not, ZRMC items elicit response set? Yes, they yielded a significant tendency to use response option two, if the conclusion is based solely on the test of whether or not the mean bias score was different from zero. No, if the conclusion is based on whether or not the bias scores were reliable. Cronbach maintained that they must achieve reliability in order to be considered to exist, and the reliability estimate for the ZRMC bias scores was negative and very low (-.15). Unless and until the results can be replicated, and they have a reasonably high reliability estimate, it seems most conservative to conclude that the experimental data gave only weak evidence that the ZRMC test yielded response set, a conclusion which is in agreement with those of Cronbach (1946, 1950) and Wevrick (1962) concerning tests with four- and five-response multiple-choice items.

Comparing the T-F and ZRMC item pairs, the ZRMC items tended to be easier across all test pairs. This could be attributed to the higher apparent verbal difficulty of the T-F items. The two item forms compare like the outline of a paragraph (ZRMC) and the fully written paragraph (T-F).

A test with ZRMC items would be recommended as the best method of dealing with response set appearing in a two-choice format, if the test could achieve adequate reliability in addition to remaining unaffected by response set. Such a test would be most likely to achieve adequate validity as well. The ZRMC test would be recommended over the T-F test form E4 on the assumption that improved reliability is likely to be achieved sooner and with a more lasting effect than is a reduction in RA through response option order reversal.
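The test of whether the mean bias score differed from zero can be sketched as a one-sample t test on the bias scores. The dissertation does not specify its exact statistic, so this is an assumed, standard form, and the scores shown are hypothetical.

```python
def one_sample_t(scores, mu0=0.0):
    """t statistic for H0: population mean bias score equals mu0 (df = n - 1)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)   # sample variance
    se = (var / n) ** 0.5                                  # standard error of mean
    return (mean - mu0) / se

# Hypothetical bias scores centered near zero
t = one_sample_t([-1, 0, 2, 1, -2, 3, 0, 1])
```

A significant t indicates a mean bias different from zero, but, as argued above, that alone does not establish a reliable response set.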
Suggestions for research

Two-response multiple-choice items may prove to be an excellent compromise between T-F and four- and five-response multiple-choice items. They should require a minimum of composition time to obtain a maximum of clarity. Students have been shown to be able to respond to them nearly as quickly as they can to T-F items (Ruch & Stoddard, 1925). Therefore the following suggestions for research with ZRMC items are made.

1. Replicate this experiment with more, and different, subjects to find out if response set does reliably appear on ZRMC tests.
2. If response set does appear, find out if it can be manipulated with instructions.
3. Investigate the optimum number of items for the desired level of reliability.
4. Investigate the concurrent and/or predictive validity of an achievement test with ZRMC items.
5. Compare the validity of ZRMC tests and matching T-F tests, where the T-F test is administered with reversed response option order.

Concerning T-F tests, there still remain unanswered questions generated by this experiment. The following studies are suggested as methods of answering some of the questions.

1. Compare the standard T-F item with the comparative type used in this study to see if the latter yields more or less response set.
2. As suggested in the discussion, investigate whether response option order reversal (F-T, instead of T-F) depresses subjects' bias scores.
3. If response option order reversal does depress subjects' bias scores, find out if this is a stable phenomenon or if it is the result of novelty.

Summary

This research study was an investigation of the effects of manipulating test instructions, item format and response option order on response set in T-F tests. Subjects were each given one of five tests with various combinations of these variables. The questions to be answered by the experiment were: 1) Can response set be manipulated by alterations in emphasis in T-F tests?
2) Does alteration of response option order affect response set? 3) Does the ZRMC item format yield response set?

Response set was altered in some instances, corresponding to the degree of emphasis placed on one (T) response. However, response set did not parallel the direction of emphasis used in each treatment. The first three treatments with decreasing emphasis on T showed a decreasing set to respond T, but when the emphasis was on F, set to respond T increased again with increasing F emphasis. It was concluded that response option order reversal might have acted to suppress the subjects' tendency to respond T, and that emphasis on either response prompted the subject to attend more closely to the entire response continuum, thereby enhancing a response style consisting of response acquiescence.

Response option order had no significant effect on response set in the T-F tests.

The ZRMC items showed only slight evidence of generating any response set. On the basis of mean response bias scores, the data indicated that there was a significant tendency for subjects to prefer response option two. Response set is generally not considered to be present in a test if the subject's bias score is not reliable. On this basis, the ZRMC items elicited little or no response bias, because the bias scores were of extremely low reliability.

In general, the ZRMC item format appeared to be the most effective manipulation in dealing with response set. When the response category labels were eliminated, response set decreased to a near-zero level.

References

Arnold, H. L. An analysis of discrepancies between true-false and simple recall examinations. Journal of Educational Psychology, 1927, 18, 414-420.

Bracht, G. H. & Glass, G. V. The external validity of experiments. American Educational Research Journal, 1968, 5, 437-474.

Bugelski, B. R. & Hersen, M. Conditioning acceptance or rejection of information.
Journal of Experimental Psychology, 1966, 71, 619-623.

Burmester, M. A. & Olson, L. A. Comparison of item statistics for items in multiple-choice and alternative response form. Science Education, 1966, 50, 467-470.

Campbell, D. T. & Stanley, J. C. Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally, 1963.

Costin, F. The optimal number of alternatives in multiple-choice achievement tests: some empirical evidence for a mathematical proof. Educational and Psychological Measurement, 1970, 30, 353-358.

Cronbach, L. J. An experimental comparison of the multiple true-false and multiple multiple-choice tests. Journal of Educational Psychology, 1941, 32, 533-543.

Cronbach, L. J. Studies of acquiescence as a factor in the true-false test. Journal of Educational Psychology, 1942, 33, 401-415.

Cronbach, L. J. Response sets and test validity. Educational and Psychological Measurement, 1946, 6, 475-494.

Cronbach, L. J. Further evidence on response sets and test design. Educational and Psychological Measurement, 1950, 10, 3-31.

Ebel, R. L. Measuring educational achievement. Englewood Cliffs, N. J.: Prentice-Hall, 1965.

Ebel, R. L. Blind guessing on objective tests. Journal of Educational Measurement, 1968, 5, 321-325.

Ebel, R. L. Expected reliability as a function of choices per item. Educational and Psychological Measurement, 1969, 29, 565-570.

Ebel, R. L. The case for true-false test items. School Review, 1970, 78, 373-389.

Ebel, R. L. How to write true-false items. Educational and Psychological Measurement, 1971, 31, 417-426.

Fernberger, S. W. The use of equality judgments in psychophysical procedures. Psychological Review, 1930, 37, 107-112.

Frisbie, D. A. Comparative reliabilities and validities of true-false and multiple-choice tests. Unpublished Ph.D. dissertation, Michigan State University, 1971.

Fritz, M. F. Guessing in a true-false test.
Educational Research Bulletin, 1942, 21, 9-12.

Gault, R. H. & Goodfellow, L. D. Sources of error in psychophysical measurements. Journal of General Psychology, 1940, 23, 197-200.

George, F. H. 'Either-or' questions in series. British Journal

Goodfellow, L. D. The human element in probability. Journal of General Psychology, 1940, 23, 201-205.

Granich, L. A technique for experimentation on guessing in objective tests. Journal of Educational Psychology, 1931, 22, 81-91.

Hersen, M. Generalization of positive and negative response biases. Journal of Experimental Psychology, 1966, 72, 834-840.

Hersen, M. Experimentally induced response biases as a function of positive and negative wording. Journal of Experimental Psychology, 1967, 73, 588-590.

Holland, H. C. Judgments and the effects of instructions. Acta Psychologica, 1961, 18, 445-457.

Kreuger, W. C. F. An experimental study of certain phases of a true-false test. Journal of Educational Psychology, 1932, 23, 81-91.

LeFurgy, W. G. The induction of anchoring effects in absolute judgments through differential reinforcement. Journal of Psychology, 1966, 62, 73-81.

Lord, F. M. The relation of the reliability of multiple-choice tests to the distribution of item difficulty. Psychometrika, 1952, 17, 181-194.

Mathews, C. O. The effect of position of printed response words upon children's answers to questions in two-response types of tests. Journal of Educational Psychology, 1927, 18, 445-457.

Miklich, D. R. Item characteristics and agreement-disagreement response set. (Doctoral dissertation, University of Colorado.) Ann Arbor, Mich.: University Microfilms, 1966. No. 66-3259.

Miklich, D. R. & Gordon, G. P. Test-taking carefulness vs. response set on true-false examinations. Educational and Psychological Measurement, 1968, 28, 545-548.

Rubin, H. K. A constant error in the Seashore Test of Pitch Discrimination. Unpublished master's thesis, University of Wisconsin, 1940.

Ruch, G. M. & Stoddard, G. D.
The comparative reliabilities of five types of objective examinations. Journal of Educational Psychology, 1925, 16, 89-103.

Smith, K. An investigation of the use of "double-choice" items in testing achievement. Journal of Educational Research, 1958, 51, 387-389.

Wesman, A. G. Writing the test item. In R. L. Thorndike (Ed.), Educational measurement. Washington, D. C.: American Council on Education, 1971.

Wevrick, L. Response set in a multiple-choice test. Educational and Psychological Measurement, 1962, 22, 533-538.

Whipple, J. W. A study of the extent to which positive or negative phrasing affects answers in a true-false test. Journal of Educational Research, 1957, 51, 56-63.

Williams, B. J. & Ebel, R. L. The effect of varying the number of alternatives per item on multiple-choice vocabulary test items. Fourteenth Yearbook of the National Council on Measurements in Education, 1957, 63-65.

General References

Glass, G. V. & Stanley, J. C. Statistical methods in education and psychology. Englewood Cliffs, N. J.: Prentice-Hall, 1970.

Hays, W. L. Statistics. New York: Holt, Rinehart and Winston, 1963.

Henryssen, S. Gathering, analyzing, and using data on test items. In R. L. Thorndike (Ed.), Educational measurement. Washington, D. C.: American Council on Education, 1971.

Johnson, D. M. Systematic introduction to the psychology of thinking. New York: Harper and Row, 1972.

Rorer, L. G. The great response-style myth. Psychological Bulletin, 1965, 63, 129-156.

APPENDIX A

Written and oral test instructions

Written instructions for treatment condition E1

INSTRUCTIONS:

1. Put your name and student number on the answer sheet.
2. NOTE: on this answer sheet, the answer spaces go ACROSS the WIDTH of the page.
3. The test consists of statements which are either true or false. You might think a statement is more true than false; then you should mark the number on your answer sheet which corresponds with true.
If you think a statement is more true than false, mark 1 on the answer sheet for that statement. If you think it is more false than true, mark 2 for that statement. Remember, 1 = true, 2 = false.
4. There is no penalty for guessing, so respond to EVERY statement.

Written instructions for treatment condition E2

INSTRUCTIONS:

1. Put your name and student number on the answer sheet.
2. NOTE: on this answer sheet, the answer spaces go ACROSS the WIDTH of the page.
3. The test consists of statements which are either true or false. If you think a statement is more true than false, mark 1 on the answer sheet for that statement. If you think it is more false than true, mark 2 for that statement. Remember, 1 = true, 2 = false.
4. There is no penalty for guessing, so respond to EVERY statement.

Written instructions for treatment condition E3

INSTRUCTIONS:

1. Put your name and student number on the answer sheet.
2. NOTE: on this answer sheet, the answer spaces go ACROSS the WIDTH of the page.
3. The test consists of multiple choice items. For each item there are 2 choices. Select the one best answer for each item, and mark its number in the appropriate space on the answer sheet.
4. There is no penalty for guessing, so respond to EVERY item.

Written instructions for treatment condition E4

INSTRUCTIONS:

1. Put your name and student number on the answer sheet.
2. NOTE: on this answer sheet, the answer spaces go ACROSS the WIDTH of the page.
3. The test consists of statements which are either true or false. If you think a statement is more false than true, mark 1 on the answer sheet for that statement. If you think it is more true than false, mark 2 for that statement. Remember, 1 = false, 2 = true.
4. There is no penalty for guessing, so respond to EVERY statement.

Written instructions for treatment condition E5

INSTRUCTIONS:

1. Put your name and student number on the answer sheet.
2. NOTE: on this answer sheet, the answer spaces go ACROSS the WIDTH of the page.
3.
The test consists of statements which are either false or true. You might think a statement is more false than true; then you should mark the number on your answer sheet which corresponds with false. If you think a statement is more false than true, mark 1 on the answer sheet for that statement. If you think it is more true than false, mark 2 for that statement. Remember, 1 = false, 2 = true.
4. There is no penalty for guessing, so respond to EVERY statement.

Oral instructions for Class 1

I'm interested in how aware you are of some elements of social psychology, since you've nearly finished introductory psychology and have had some exposure to social psychology. This test will help me to get this information. Its results will help me with some future course planning. The results won't affect your grade in this course.

This test is different than the kind that you are used to. It is essential for you to read the instructions very carefully. Notice that the answer sheets are slightly different than the ones you usually use. (Hold up the answer sheet and demonstrate the following, making sure that the ENTIRE class sees it.) The answer spaces go in order across the width of the answer sheet, starting with 1 here . . .

Answer all of the test items. You might find some of them a little difficult, because they were originally written for people who had had an entire course in social psychology, but still, do your best to give me an indication of what you know about social psychology and the social issues in this test. Remember, the test is different than the ones you're used to, so you must read the instructions very carefully. (Go ahead. . . . When you're through, put the answer sheet back in the test booklet and drop it in the box at the door when you leave.)

Oral instructions for Class 2

I'm interested in how aware you are of some elements of social psychology.
Everyone comes into psychology with some information about social psychology, and you know about many issues. I would like to know how much you do know about the subject before you have any formal instruction in it. The test results will help me with some future course planning. They won't affect your grade in this course.

This test is different than the kind that you are used to. It is essential for you to read the instructions very carefully. Notice that the answer sheets are slightly different than the ones you usually use. (Hold up the answer sheet and demonstrate the following, making sure that the ENTIRE class sees it.) The answer spaces go in order across the width of the answer sheet, starting with 1 here . . .

Answer all of the test items. You might find some of them a little difficult, because they were originally written for people who had had an entire course in social psychology, but still, do your best to give me an indication of what you know about social psychology and the social issues in this test. Remember, the test is different than the ones you're used to, so you must read the instructions very carefully. (Go ahead. . . . When you're through, put the answer sheet back in the test booklet and drop it in the box at the door when you leave.)