INVESTIGATING THE EFFECTS OF ITEM WORDING ON RATING RESPONSES

By

Annie Woo

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology and Special Education

1999

ABSTRACT

INVESTIGATING THE EFFECTS OF ITEM WORDING ON RATING RESPONSES

By Annie Woo

The purpose of the study was to investigate how item wording affected rating responses. Semantically negative and positive items measuring the same construct, in the form of a Likert-type scale, were tested on a sample of students enrolled in middle schools in Taipei, the capital of Taiwan. The psychometric properties of each item (mean, standard deviation, skewness, and kurtosis) were examined as a function of four modes of item wording. The four modes were: Mode 1 (regular), "I like myself"; Mode 2 (negated), "I do not like myself"; Mode 3 (polar opposite), "I dislike myself"; and Mode 4 (negated polar opposite), "I do not dislike myself."

A hierarchical measurement model of the relationship between the modes of item wording and the responses was constructed. A hierarchical confirmatory analysis estimated the correlations between the item scores and the four modes of item wording. The correlations between four versions of a shame scale and a measure of anxiety (DOSC Anxiety Factor) and of life satisfaction (Satisfaction with Life Scale) were also computed. Pearson correlation coefficients were obtained for each of these subscales (Mode 1 to Mode 4), the DOSC Anxiety Factor, the Satisfaction with Life Scale, and gender. The Pearson correlation coefficients ranged from -0.133 to 0.898. Gender had low correlations with the four modes, the DOSC Anxiety Factor, and the Satisfaction with Life Scale, ranging from -0.133 to 0.128, suggesting little relationship between gender and these scales.

The MANOVA results showed that responses to the modes of item wording differed significantly between males and females. The F-ratios for Mode 1, Mode 2, and Mode 3 were all significant at the 0.05 level; Mode 4 was not significant at the 0.05 level. These results were consistent with those of the correlational analyses. It appears that Mode 4, with its double-negative semantics, introduced some ambiguity into the items.

To determine whether the four modes of semantics measured the same construct, five models were tested in a confirmatory factor analysis. The two-factor model (Modes 1, 2, and 3 vs. Mode 4) fit the data statistically and showed an overwhelming superiority over the other models. These results gave strong indications of the inequivalence between double negatives (Mode 4) and the rest of the items (Modes 1, 2, and 3).
Though it may be useful to include some negative items to reduce a response bias, the findings from the present study suggest that special caution should be exercised in the use of double-negative item phrasing. Despite the conventional wisdom so often found in measurement textbooks, recent findings by researchers in the area of item phrasing have suggested that negatively phrased items, especially double negatives, may reduce the validity of a questionnaire. The present study clearly corroborates this position.

Copyright by
ANNIE WOO
1999

Dedicated to Paul and Nathan

ACKNOWLEDGMENTS

This dissertation has been completed with the help of many people. First of all, I wish to express my gratitude to the members of my Dissertation Committee. I wish to thank Dr. William Mehrens, who served as the chairman of my committee, for his concern and guidance in the completion of this study. His friendship and willingness to offer suggestions are most appreciated. I also wish to thank Dr. Betsy Becker for her statistical expertise and suggestions in the data analysis, Dr. Teresa Tatto for her assistance during all stages of my study, and Dr. Margot Kurtz for her constructive suggestions as a member of the committee.

I am ever grateful to Margaret Gunn for her assistance in the editing and proofreading of this dissertation. I wish to express my appreciation also to my best friend, Christine Lau, for her encouragement and assistance at some of the crucial stages of my study. Without her help, this study would have taken much longer to complete.

I especially appreciate the support and cooperation of all the students, teachers, and school administrators who participated in this study. I will always remember their contributions of time and effort.

Finally and most of all, I wish to thank my husband, Paul, for his love, patience, understanding, and support during the completion of my dissertation. I appreciate the many sacrifices he made to make it all possible.

TABLE OF CONTENTS

LIST OF TABLES .......... ix
LIST OF FIGURES .......... xi

CHAPTER
I. STATEMENT OF THE PROBLEM
   Introduction .......... 1
   Purpose of the Study .......... 4
   Significance of the Study .......... 4
   Research Questions .......... 7
   Overview .......... 8

II. REVIEW OF LITERATURE
   Introduction .......... 9
   Response Style .......... 9
   Item Wording .......... 12
      Counterbalance of Positive and Negative Items .......... 12
      Item Reversal .......... 16
   Summary .......... 24

III. RESEARCH DESIGN AND PROCEDURE
   Introduction .......... 29
   Development of the Test Instrument .......... 29
      General Shame Scale .......... 29
      Dimensions of Self-Concept (DOSC) .......... 31
      Satisfaction with Life Scale .......... 32
      Questionnaire Development .......... 33
   Selection of the Sample .......... 37
   Procedures of Test Administration .......... 37
   Quality Control Screening .......... 38
   Data Analyses .......... 38
      Confirmatory Factor Analysis .......... 40
      Exploratory Factor Analysis .......... 41
      Parallelism .......... 42
   Summary .......... 42

IV. ANALYSIS AND INTERPRETATION OF THE DATA
   Introduction .......... 44
   Characteristics of the Sample .......... 44
   Analyses of the Questionnaire Data .......... 46
   Answers to the Research Questions .......... 48
      Research Question 1 .......... 48
         Histogram Analyses .......... 48
         Reliability Analyses .......... 52
         Correlational Analyses .......... 53
      Research Question 2 .......... 57
         MANOVA Analyses .......... 57
         Exploratory Analyses .......... 60
         Confirmatory Analyses .......... 63
   Summary .......... 66

V. SUMMARY, CONCLUSIONS, IMPLICATIONS, AND RECOMMENDATIONS
   Summary of the Purpose and Procedures of the Study .......... 67
   Discussion and Conclusion .......... 68
   Implications .......... 70
   Cross Validation Using a New Sample .......... 74
   Cultural Specificity of the Questionnaire .......... 76
   Issues Concerning the Use of Tests .......... 78
   Limitations .......... 79
   Recommendations .......... 80

APPENDICES
   A. Four Modes of Item Wording .......... 83
   B. Sample Questionnaire in English .......... 86
   C. Sample Questionnaire in Chinese .......... 94
   D. Descriptive Statistics of the Items .......... 102
   E. Descriptive Statistics of Four Modes of Item Wording .......... 106
   F. Item-Total Statistics .......... 109

BIBLIOGRAPHY .......... 112

LIST OF TABLES

Table 1. Summary of Findings Related to the Study .......... 23-24
Table 2. The General Shame Scale (Chang & Hunter, 1988) .......... 35
Table 3. Dimensions of Self-Concept (DOSC) (Anxiety Factor) (Michael & Smith, 1976) .......... 36
Table 4. Satisfaction with Life Scale (Diener et al., 1985) .......... 36
Table 5. Gender of the Participants .......... 45
Table 6. Age of the Participants .......... 45
Table 7. Grade Level of the Participants .......... 45
Table 8. Participants' Age and Gender by Grade .......... 46
Table 9. Coding Format of the Subscales in the Questionnaire .......... 47
Table 10. Tests of Normality .......... 49
Table 11. Results of Reliability Analysis on Four Modes of the General Shame Scale (88 Items) in the Questionnaire .......... 52
Table 12. Mean, Standard Deviation, and Cronbach's Alpha Coefficients for the Subscales .......... 53
Table 13. Correlations between the 4 Modes, DOSC-Anxiety Factor, Satisfaction with Life Scale, and Gender .......... 54
Table 14. Items with Corrected Item-Total Correlation Less Than 0.35 .......... 56
Table 15. Descriptive Statistics of Four Semantic Modes by Gender .......... 58
Table 16. MANOVA Tables by Gender for the Four Semantic Mode Subtotal Scores .......... 58-59
Table 17. Factor Structure of the Questionnaire (19 Factors) .......... 62
Table 18. Factor Structure of the Questionnaire (6 Factors) .......... 63
Table 19. Goodness-of-Fit Indices of Five Models .......... 66
Table 20. Four Modes of Item Wording .......... 84-85
Table 21. Descriptive Statistics of the Items .......... 103-105
Table 22. Descriptive Statistics of Four Modes of Item Wording .......... 107-108
Table 23. Item-Total Statistics .......... 110-111
LIST OF FIGURES

Figure 1. Histogram of Totals of 107 Items on the Questionnaire .......... 49
Figure 2. Histogram of Mode 1 Scores on the Questionnaire .......... 50
Figure 3. Histogram of Mode 2 Scores on the Questionnaire .......... 50
Figure 4. Histogram of Mode 3 Scores on the Questionnaire .......... 51
Figure 5. Histogram of Mode 4 Scores on the Questionnaire .......... 51
Figure 6. Scree Plot from the Exploratory Factor Analysis (19 Factors) .......... 61

CHAPTER I

STATEMENT OF THE PROBLEM

Introduction

Conventional wisdom has suggested that psychological measures should be constructed to contain an even balance of positively and negatively worded items, so as to counteract response biases such as agreement response tendencies. The practice of using a balance of positively and negatively phrased items in an affective instrument stems from recommendations found in the literature. Most textbooks and publications listing recommendations concerning attitude scale construction suggest that questionnaire items should include both positively and negatively worded item stems (Anastasi, 1982; Wiggins, 1973). Nunnally (1978) specifically advocated the reduction of response styles by having an item pool "...divided evenly between positive and negative statements" (p. 605), and asserted that "stylistic variance ...can be mostly eliminated by ensuring that an instrument is constructed so that there is a balance of items keyed 'agree' and 'disagree'" (p. 669). Other psychometricians have made similar statements. The general consensus in the literature has been that measures should have both positive and negative items (Scott, 1968; Anastasi, 1982).

The early data reported in support of response styles are not without challenge. Samelson (1972) stated that the researchers (Bentler, Jackson, & Messick, 1971, 1972; Couch & Keniston, 1960; Jackson, 1967a, 1967b; Jackson & Messick, 1958, 1965) failed to clarify the conceptual meaning of response styles and used an incorrect logical model in interpretation. The root of the problem seemed to be that all discrepant responses were defined as acquiescence, a mistake of the sort called "the psychologist's fallacy" (Samelson, 1972), which refers to confusing the researcher's own standpoint with that of the mental fact about which he or she is reporting. Block (1967, 1971, 1972) argued that acceptance (the tendency to ascribe characteristics to oneself, regardless of the direction of item keying) was not likely to be of appreciable import for understanding the nature of responses to structured personality inventories.
Rorer (1965) stated that response styles must be distinguished from response sets. He defined a response set as "the criteria according to which a respondent evaluates item content when selecting his answer," whereas a response style was defined as "a way or a manner of responding, such as the tendency to select some particular response option independently of the item content" (Rorer, 1965, p. 151). Rorer also felt it necessary to distinguish between achievement examinations and personality, attitude, and interest inventories when assessing the extent to which styles affect answers to items. On examinations, but not on inventories, there are right answers, and there are items whose answers the respondent must guess. Inferences concerning an individual's response style might be made on the basis of his or her responses to a number of examinations by comparing the proportion of answers in any given category with the proportion keyed for that category, and by considering the proportion of wrong answers in each of the response categories. Rorer concluded that response styles were of no more than trivial importance in determining rating responses.

Despite the challenges, the practice of using positive and negative item phrasing continues to receive widespread endorsement. Psychometricians have suggested the use of an equal number of positively and negatively worded items as a way of reducing the possibility of a response style influencing the responses to affective instruments (Anderson, 1981; Mehrens & Lehmann, 1984). The commonly referenced, and often followed, recommendation to use an equal number of positive and negative items rests on two assumptions. The first is that the items, whether positively or negatively phrased, measure the same construct. The second is that balancing positive and negative item phrasing yields a more valid index. Although these assumptions are widely accepted, there is little research on their tenability. In fact, the assumption that negative and positive items measure the same construct is so prevalent among test developers that it seems to be accepted without question.

The problem arises from test constructors' common practice of using negative items based on unverified assumptions. Generally, in verbal self-report measures of latent traits, it is assumed that, given standard testing conditions, an examinee's responses are determined by item content, examinee characteristics, and, to some extent, instrument artifacts. Most Likert-type scales include a balance of semantically negative and positive valence items, with the intent of ridding the instrument of the effects of certain response styles. It is further assumed that both negative and positive items measure the same trait. However, the empirical verification of this assumption seems to have received little attention.

Over the past several years, numerous questions have arisen pertaining to the impact of item wording on rating responses. Specifically, which item wording format is to be preferred? Is one format superior to the other, and under what constraints? How do different modes of item wording affect rating responses? Are we measuring the same construct if we use positive and negative items? If subjects respond differently to the same item stem when the item wording format varies, could the items be regarded as nonequivalent in the same sense as content-parallel achievement items that vary in difficulty level?
More research on the effects of item wording on rating responses is needed.

Purpose of the Study

The purpose of the study was to investigate how item wording affected rating responses. Four modes of item wording were considered: Mode 1 (regular), "I like myself"; Mode 2 (negated), "I do not like myself"; Mode 3 (polar opposite), "I dislike myself"; and Mode 4 (negated polar opposite), "I do not dislike myself." Specifically, the effects of the mode of item wording on scale and item score means, distributions, and reliabilities were examined. The correlations between the modes of item wording and student responses were also studied.

Significance of the Study

Anderson (1981) has argued that affective characteristics facilitate desired cognitive goals of the schooling process and are, in themselves, desired goals of the schooling process. Similarly, Bloom (1978) stated that schools should produce "independent learners" who are able to engage in higher-level thinking, develop confidence in their abilities, and possess a degree of social responsibility. In education, increasing emphasis is being placed on the need for valid and reliable means of assessing affective outcomes.

Most affective outcomes are currently assessed through attitude surveys. Surveys have been used widely in the measurement of attitudes and opinions. They are also a popular method for evaluating student achievement in performance-based or constructed-response assessment. Furthermore, surveys are the predominant method for eliciting judgments from students on course and instructor effectiveness. Given that surveys are so widely used in the social sciences, both as research tools and in practical applications, item development becomes an important consideration in their construction.

The literature dealing with item construction is voluminous. Hundreds of articles have been published concerning issues such as the use of ratings, their reliability and validity, and potential biasing factors. Because of conflicting findings in this literature, however, it is difficult for reviewers to identify general trends. The possible effect of item wording on overall ratings is particularly relevant to many of today's available rating scales. Yet the current body of literature leaves numerous questions unanswered. What has yet to be determined is the possible effect of positively and negatively worded items on raters' evaluations. Do negatively worded items "encourage" a more critical evaluation than positively worded items? Existing studies on response schemes have not directly addressed these questions or the issues of general validity they raise. Further research is needed to determine whether one format is more susceptible than another to rating errors of leniency or other sources of invalidity.

In survey and evaluation research, much emphasis has been placed on the development of the item stems of the questionnaire being used. Negatively worded items may highlight the negative aspects or faults of the object or person, or may unconsciously suggest to the respondent particular problem areas anticipated by the researcher. If so, rating scale evaluations may be affected as much by the wording of the items as by the quality of the object or person being evaluated.
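To make the four wording modes defined above concrete, the following sketch generates all four phrasings of a simple item stem. This is a minimal illustration in Python; the function name make_modes and its inputs are assumptions introduced here for exposition, not materials from the study (the study's actual items appear in its Appendix A).

```python
# Illustrative sketch: constructing the four semantic modes of a rating item.
# The helper name and inputs are hypothetical; "do not" assumes a
# first-person stem like the study's examples.

def make_modes(subject: str, verb: str, antonym: str, obj: str) -> dict:
    """Return the four semantic modes of a simple item stem."""
    return {
        "mode_1_regular":                f"{subject} {verb} {obj}",
        "mode_2_negated":                f"{subject} do not {verb} {obj}",
        "mode_3_polar_opposite":         f"{subject} {antonym} {obj}",
        "mode_4_negated_polar_opposite": f"{subject} do not {antonym} {obj}",
    }

if __name__ == "__main__":
    for name, stem in make_modes("I", "like", "dislike", "myself").items():
        print(f"{name}: {stem}")
    # mode_1_regular: I like myself
    # mode_2_negated: I do not like myself
    # mode_3_polar_opposite: I dislike myself
    # mode_4_negated_polar_opposite: I do not dislike myself
```

Note that Modes 2 and 3 each apply one reversal (grammatical negation or antonym substitution), while Mode 4 applies both, which is what makes it a double negative.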
The possible effect of different modes of item wording on overall ratings is particularly relevant to many of today's available student rating instruments. It has yet to be determined whether different modes of item wording influence rating responses. Both positive and negative items are commonly used in educational and psychological measurement. Over the past several years, numerous questions have arisen pertaining to item wording. Despite the large amount of research on rating methodology, however, there have been relatively few conclusions concerning this measurement issue. In addition, the factors that bring some degree of control over the distributional parameters of rating scales have received relatively little attention, and there is little research on the factors that influence the meaning that subjects apply to response options when responding to rating prompts. Investigating this problem will bring some understanding of how the different modes of item wording influence rating responses.

The underlying premise of this research is that item wording influences scale and item score means, distributions, and reliabilities. This study makes a start toward illustrating this premise by analyzing the results of applying four modes of item wording to a survey. The study will also provide valuable information on an essential dimension of instrument development, namely, knowledge and understanding of the effects of item wording on rating responses. The resulting information should be valuable for educators and researchers whose focus is developing effective rating scales.

This research study intends to enhance our understanding of measurement issues in item development in two ways. First, the study will lead to general conclusions on the overall relationship between item wording and rating responses. Second, the study will provide some insight into the equivalence of items across different modes of item wording. The results of the study should prove useful to administrators and faculty members who use surveys to assess affective outcomes, and to educational researchers who are looking for state-of-the-art research on survey item construction.

Research Questions

The present study was designed to provide answers to the following questions:

1. What is the influence of item wording on scale and item score means, distributions, reliabilities, and correlations between scales?

2. Are items equivalent among four different modes of item wording? The four modes are: Mode 1 (regular), "I like myself"; Mode 2 (negated), "I do not like myself"; Mode 3 (polar opposite), "I dislike myself"; and Mode 4 (negated polar opposite), "I do not dislike myself."

Overview

This chapter has presented the problem, purpose, significance, and research questions of the study. In Chapter II, a review of the literature related to the study is presented. Chapter III describes the procedures and design of the study. The analysis and interpretation of the data are presented in Chapter IV. In Chapter V, the conclusions and implications of the study are presented.

CHAPTER II

REVIEW OF LITERATURE

Introduction

The purpose of this study was to investigate the effects of item wording on responses to a rating scale. In developing a new survey or utilizing existing surveys, the researcher needs to examine how the wording affects subjects' responses. A number of studies have focused on various aspects of item characteristics and their possible influence on reliability and variance.
Numerous researchers have also investigated the impact of item wording on rating responses. Research studies were reviewed to determine current professional opinion regarding the impact of item wording on rating scale responses. This chapter is divided into two major sections. The first section deals with response style; the second focuses on item wording. Some overall conclusions follow.

Response Style

The development of scales to assess attitude poses complex methodological problems for the test constructor. Self-report, paper-and-pencil verbal measures are most commonly used in behavioral research and assessment. Among the various types of verbal self-report measures, Likert-type scales are the most popular, mainly because the Likert method is conceptually simple and practically straightforward. However, one of the major sources of criticism of self-report data centers on the susceptibility of self-report measures to various response sets that pose a continuing threat to construct validity.

A good deal of research in this area was conducted in the 1950s and 1960s. Cronbach (1950) examined the effects of selected response sets on the validity of cognitive instruments, and some corrective procedures were suggested. He also identified acquiescence as a response tendency that favors affirmative responses over negative responses. Couch and Keniston (1960) called this tendency "yea-or-nay-saying," wherein respondents consistently select in one direction, either positive or negative. The hypothesis was that some individuals have a general disposition on the positive/negative continuum regardless of the content of the items. Various types of response sets were identified and their effects investigated. Jacobs and Barron (1968), Green (1951), Radcliffe (1966), Stricker (1969), Wesman (1952), and Wiggins (1973) investigated the influence of social desirability in personality measurement. Couch and Keniston (1960) examined the impact of an acquiescence response set. Berg (1961) identified the deviant response set and hypothesized that it was an important dimension of personality. In fact, the literature on response styles accumulated to the point that by 1970 there had been several major reviews of the literature and even reviews of the reviews (Nunnally, 1978). Bentler, Jackson, and Messick (1972), Jackson (1967a, 1967b), Jackson and Paunonen (1980), Rorer (1965), and Samelson (1972) are some of the researchers who expressed their views regarding response sets.
Most of this research has utilized measures of the California F -scale (Adorno, Frenkel-Brunswik, Levinson, and Sanford, 1950), the Minnesota Multiphasic Personality Inventory (MMPI) (Hathaway and McKinley, 1967), and the Personality Research Form (PRF) (Jackson, 1967a). Although this research has been valuable in questioning the interpretation of subject responses to these measures, it has not resolved the issue of response style relevance. The argument has been heated over whether or not response styles exist, and, if so, whether they impact upon research results in a meaningful way. Rorer (1965), for example, concluded that “the inference that response styles are an important variable in personality inventories is not warranted on the basis of the evidence now available” (p.150). Jackson and Messick (1965) responded to Rorer with extensive criticism of both his data and his conclusions. The results of their study suggested that the inclusion of negatively worded items can result in less accurate responses and therefore impair the validity of obtained results. Thus, although the inclusion of negatively stated items may theoretically control or offset agreement response tendencies, their actual effect is to reduce response validity. This situation suggests that the current recommendation concerning the desirability of including both positive and negative items on a questionnaire may be premature and apparently warrants much further investigation. Item Wording Surveys are widely used in education and psychology. Because of both their widespread use and their importance, the construction of survey items has been heavily researched; yet, concerns remain about what factors influence rating responses. One of the often expressed concerns is how the item wording affects students’ responses, and numerous research studies have investigated the extent to which item wording affects the rating response. By no means, though, is there total agreement on the extent of the relationship. In fact, while some investigators have found a strong relationship between item wording and rating responses, others have found no relationship at all. CounteLbaiance of positive and negative items Psychometricians recommended counterbalancing the questions which were asked, so that a positive response to one question and a negative response to another both contributed towards increasing the score on the measure as a whole (Lemon, 1973; Likert, 1932; Edwards, 1957a). This consensus has found its way to many specialty areas in educational and psychological testing. Likert (1932) suggested that those “two kinds of 12 statements ought to be distributed throughout the attitude test in a chance or haphazard manner.” (p. 91) Most Likert-type scales include a balance of semantically negative and positive valence items with the intent to rid the instrument of the effects of certain response styles. It is further assumed that both negative and positive items measure the same trait. For example, Schriesheim and Kerr (1974) indicated that subject agreement response tendencies can usually be controlled by having an adequate number of negatively-worded items, and, based upon their investigation, the major existing instruments measuring perceived leader behavior were inadequate in terms of not having sufficient negative items. The authors concluded in their review that revised scales were needed, and that these measures should have a larger number of negatively worded items to offset acquiescence response biases. 
A statement of caution concerning the use of negatively worded items is appropriate. Because their use may introduce covariance, some researchers have begun to question the utility of negatively worded items (e.g., Thacker, Fields, & Tetrick, 1989). Nonetheless, Schmitt and Stults (1985) suggested that covariance introduced by the direction of item wording does not necessarily result in a methodological confound that distorts conceptual interpretation. If negatively worded items areito be used, however, it would be wise to ensure that their inclusion does not present a methodological confound. It is preferable that constructs are not exclusively defined by negatively worded items during scale development. Instead, scales should have equal numbers of positively and negatively worded items. 13 A number of studies have focused on various aspects of scale characteristics and their possible influence on reliability, variance, and correlations. In Simpson, Rentz, and Shrum’s (1976) study, questions concerning six socially-significant topics were designed. Items representing each concept were written with two criteria: strong wording versus mild wording, and positive stance versus negative stance. Each item was consequently written in four forms so as to fit in one of each possible category: “mild-positive,” “mild- 99 “ negative, strong-positive,” and “strong-negative.” The authors found that wording influenced responses more than the content of the items - an outcome suggesting the influence of an “agree-disagree” response set. They also found that the positive versus negative item structure influenced students’ responses more than mild versus strong wording. Moreover, the authors found that the students tended to agree more with mildly worded positive items than strongly stated positive items. When reacting to the same concepts worded in the negative, students disagreed more with the strongly worded counterparts of each pair. Within some of the topical areas, students' responses were influenced more by an “agree-disagree” set than by the intended meaning of the items. Items written at a higher “emotional level” tended to elicit stronger responses when they were stated in 3 disagree format than when written in an agree format. The extremity of attitude conveyed by the wording of the item also affects the mean response and variability of response. J aroslovsky (1988) examined the effects of wording on poll results. In his study, the respondents provided different answers when the same question was asked in two different ways: “Do you think the United States should allow public speeches against democracy?” and “Do you think the United States 14 should forbid public speech against democracy?” The author noted that wording and context are two of the principal and closely related sources that public opinion experts refer to as “response biases” in polls. He found that even a small change in how a question was asked could trigger connotations or interpretations in the respondents’ minds that could have a major effect on how the question was answered. Moreover, he found that answers to even identical questions could vary from poll to poll, depending upon how the question is juxtaposed with others in the same survey. The effect of wording on polls may not be obvious to many people. However, words convey tones that can have a substantial effect on the answers. In 1940, researcher Donald Rugg had pollster Elmo Roper ask similar questions of two separate national samples. 
The researchers found that support for free speech was 21 percentage points larger among those asked whether speeches “should be forbidden” than those who were asked whether speeches “should not be allowed.” The experiment was replicated 35 years later; researchers asked the same set of questions. The results showed a substantial increase in respondents’ willingness to tolerate free speech. People remained more willing to “not allow” speeches against democracy than to “forbid” them. Winkler, Kanouse, and Ware (1981) examined a technique of control for acquiescence-response sets. Logical or polar opposites were used to reverse regular items. Each concept was defined by a set of matched, contradictory statements. No adverse effects of using contradictory statements were found. The authors recommended using a balance of regular and polar items for controlling acquiescence effects. 15 Item reversal Researchers (Anastasi, 1982; Nunnally, 1978) suggested using an equal number of positive and negative valence statements in scales to minimize the influence of the response sets triggered by item content, the tendency to agree, and the tendency to mark in the left or right columns. However, it was difficult to determine whether the meaning of an item had been virtually reversed. Rorer (1965) investigated this aspect of item wording and concluded that many reversal pairs of items did not turn out to be reversals. Inspection of the items indicated that in many instances there was nothing at all inconsistent or contradictory about rejecting both the original and the reversed item. He indicated that while more extreme reversals result in lower correlations than less extreme reversals, the more extreme reversals are simply permitting a greater number of consistent rejections of both forms of the item. Many studies have been conducted on the impact of reverse-scored items on survey results. In general, measurement specialists recommend a mixture of both regular and reverse-scored items in order to guard against various response biases such as acquiescence and agreement response tendency (Anastasi, 1982; Nunnally, 1978). However, there is agreement, based on intuition, that negative statements are more difficult to understand, and there is a study that suggests that negatively phrased items are less valid. Schriesheirn and Hill (1981) used undergraduates to investigate the effects of positive and negative item phrasing on the validity of the responses to three forms of the same questionnaire. The authors studied the effect of item wording on questionnaire reliability and validity with data from 280 undergraduates who read a scenario describing 16 a hypothetical leader’s behavior, and then completed one of four different questionnaires to describe that leader. The authors examined the effects of item wording on the accuracy of responses to Form XII of the Leadership Behavior Description Questionnaire (LBDQ- XII) (Stogdill, 1963). Three forms of the questionnaires (all regular items, all negated items, and a mix of the two forms of items) were randomly distributed. The responses were compared to the LBDQ-XII descriptions of a fictitious supervisor to the known levels of behavior actually shown to each subject. The fictitious supervisor was portrayed on a one-page script given to each subject. The authors found that polar opposite and negated polar opposite items had significantly lower coefficient alpha internal- consistency reliabilities as compared with those for the regular and negated regular items. 
Chang (1995) examined the psychometric equivalence of negative and positive items. Some researchers call negative items "semantically negative," meaning they have a negative meaning. Such a definition lacks accuracy. "Semantic" refers to the formal meaning, or the nature of a statement free from value judgment or sentiment; the value judgment or feeling of a word or statement is represented by its connotation. For example, the words frugal and miserly have the same formal meaning, whereas one has a positive or neutral connotation and the other has a negative connotation. Similarly, a test item can be connoted as positive or negative, whereas its semantics are free from positive or negative sentiment. Chang suggested that defining items as semantically negative is incorrect. Items can be defined not in terms of their manifest syntax or semantics but in terms of their underlying connotation. The opposite connotations of items represent two directions of a latent construct continuum of which the items, or their semantics, are indicators. Chang defined a test item as connotatively consistent or connotatively inconsistent when the connotation of the item either agrees with (conforms to) or disagrees with (contradicts) that shared by the majority of the items making up a test or a subscale of a test. He examined the equivalence between connotatively consistent and connotatively inconsistent items on a 4-point Likert-type scale using confirmatory factor analysis. His study concluded that these items measured correlated but distinct traits. He also suggested that connotatively inconsistent items should not be used: items on a test or questionnaire should be connotatively consistent with the construct being measured.

Ahlawat (1985) concluded that semantically negative and positive item contents do not measure essentially the same construct. His study was based on a sample of Jordanian middle school students using an Arabic translation of the State-Trait Anxiety Inventory (Spielberger et al., 1970). Four sets of items were constructed with distinct modes of semantic presentation. He suggested that double-negative items create cognitive complexity for students, which may end in confusion. On the basis of correlational and variance-related analyses, he concluded that semantically negative and positive item contents do not measure essentially the same construct. Furthermore, the author suggested that more cognitive steps are required to decode or unravel a negatively worded statement, making the task more difficult than responding to a positively worded statement. The findings of this study question the common practice of using both negative and positive valence items in a scale and making decisions on the basis of students' total scores. This problem deserves closer inspection through more specifically designed studies.

Benson and Hocevar (1985) examined the effect of item phrasing on the validity of Likert-type attitude scales. Three content-parallel forms of the same questionnaire were developed to assess student attitudes toward integration. The forms differed only in terms of item phrasing. The first contained 15 positively phrased items. The second form contained the same 15 items phrased negatively.
The third form contained the same 15 items, eight phrased positively and seven phrased negatively. The words "was not" were used to create each negative statement from its positive counterpart. The study reported strong evidence that the insertion of the word "not" has a profound influence on student responses. Items that induced a favorable response on the positive form induced a less favorable response on the negative form. Respondents were less likely to indicate agreement by disagreeing with a negatively phrased item than by agreeing with a positively phrased item. Moreover, items that induced an unfavorable response on the positive form were less likely to induce an unfavorable response on the negative form. The analyses provided evidence that changing positive statements into negative statements may affect the psychometric characteristics of an item, though they did not provide conclusive evidence that positive and negative items measure the same construct in different ways. The results indicated that subjects had difficulty expressing agreement by disagreeing with the negated items. To examine whether a different construct was being measured, two models were contrasted: one composed of two factors whose correlation was fixed (the undifferentiated model), and a second in which the correlation between the two factors was estimated (the differentiated model). The authors found that the differentiated two-factor model provided a better fit to the data than the undifferentiated two-factor model. It was suggested that positive-to-negative transformations change not only an item's psychometric characteristics but also the construct that the item is intended to measure.

Campbell and Grissom (1979) examined the effect of item phrasing on attitude scale items. Two forms were developed, the first containing all regularly scored items and the second consisting of items designed to be their logical opposites (polar). The results of factor analyses suggested that the two formats measured different constructs. The authors also indicated that negated or polar attitude scale items were not equivalent to reversals of the regular items.

Schmitt and Stults (1985) suspected that a small number of respondents who were careless in their responses might be responsible for the appearance of negative factors composed only of reverse-scored items. The objective of their study was to show how a "negative factor" can be produced by a relatively small number of careless respondents. A frequently occurring phenomenon in the analysis of personality or attitude scale items is that all or nearly all negatively keyed questionnaire items define a single factor. Although substantive interpretations of these negative factors are usually attempted, this study demonstrated that the negative factor could be produced by a relatively small portion of respondents who failed to attend to the negative-positive wording of the items and did not notice that some items were opposite in meaning to the majority. In a series of simulations, the proportion of "careless" respondents and the proportion of negatively keyed items were varied for data generated from three different correlation matrices reflecting different levels of item intercorrelation. The results indicated that when only a small portion of the respondents were careless (ten percent), a clearly definable negative factor was generated. The authors cautioned about the use of item reversal and the interpretation of reverse-scored items.
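The mechanism Schmitt and Stults describe is easy to reproduce. The sketch below is an illustrative NumPy simulation under assumed parameters, not their actual design: all items measure one trait, ten percent of respondents ignore the reverse wording, and the reverse-scored items end up sharing surplus covariance with one another, which a factor analysis would pick up as a "negative factor."

```python
# Illustrative simulation of the "careless respondent" effect: 10% of
# respondents answer negatively keyed items as if they were positive,
# and the reverse-scored items develop surplus within-set covariance.
import numpy as np

rng = np.random.default_rng(0)
n, k_pos, k_neg, p_careless = 2000, 6, 4, 0.10

trait = rng.normal(size=(n, 1))
careless = rng.random(n) < p_careless

# Positive items: response = trait + noise.
pos = trait + rng.normal(scale=0.8, size=(n, k_pos))

# Negative items: attentive respondents reverse direction (-trait);
# careless respondents answer as if the items were positively worded.
direction = np.where(careless, 1.0, -1.0)[:, None]
neg_raw = direction * trait + rng.normal(scale=0.8, size=(n, k_neg))
neg = -neg_raw  # reverse-score (a sign flip leaves correlations intact)

r = np.corrcoef(np.hstack([pos, neg]), rowvar=False)

def mean_offdiag(block):
    """Mean correlation in a block, excluding the diagonal if square."""
    if block.shape[0] == block.shape[1]:
        return block[~np.eye(block.shape[0], dtype=bool)].mean()
    return block.mean()

print("mean r within positive items:", round(mean_offdiag(r[:k_pos, :k_pos]), 3))
print("mean r within reversed items:", round(mean_offdiag(r[k_pos:, k_pos:]), 3))
print("mean r across the two sets: ", round(mean_offdiag(r[:k_pos, k_pos:]), 3))
```

With even a small careless fraction, the cross-set correlations fall below the within-set correlations, so the reverse-scored items separate into their own factor despite the data being generated from a single trait.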
In summary, it was concluded that scales with negatively keyed items frequently lead to the identification of a factor defined wholly or mostly by those negatively keyed items. Other literature cited above indicates that this finding is relatively widespread, occurring in a variety of research areas. Marsh (1984), who found a negative item factor in a self-concept measure for elementary school children, reported similar results. Other researchers have also reported negative factors (Campbell & Grissom, 1979; Simpson et al., 1976), adding support to the notion that negative phrasing may actually change the construct that an item is intended to measure.

Harasym (1992) reported, from a study with approximately 200 first-year nursing students, evidence that the use of negation (e.g., not, except) should be limited in the stems of multiple-choice test items, and that a single-response, negatively worded item should be converted to a multiple-response, positively worded item.

Several other researchers (Andrich, 1983; Campbell & Grissom, 1979; Simpson, Rentz, & Shrum, 1976) have investigated whether phrasing can influence overall attitude levels on different attitudinal questionnaires. These researchers all concluded that item phrasing makes a difference. However, their results cannot easily be corroborated with one another. At issue is that the word "not" was used in Andrich's study (1983) to create parallel negative statements, whereas the other researchers created negative statements by item reversal (Campbell & Grissom, 1979; Simpson, Rentz, & Shrum, 1976). Rorer (1965) suggested that this latter procedure often leads to negative statements that reflect different content or ideas; consequently, such statements are not direct opposites of the original positive statements. It is because of this problem that many affective scales use the word "not" to create negative statements (Coopersmith, 1967; Marsh, Smith, Barnes, & Butler, 1983; Piers, 1969).

The findings that relate to the present study are summarized in Table 1. For each study, the table shows the investigator, year of publication, and a summary of findings.

Table 1. Summary of Findings Related to the Study

1. Likert (1932); Edwards (1957); Lemon (1973): Recommended counterbalancing survey items so that a positive response to one question and a negative response to another both contribute toward the score on the measure.
2. Cronbach (1946, 1950): There is a tendency for people to favor affirmative responses over negative responses.
3. Couch & Keniston (1960): Labeled the tendency of respondents to consistently select in one direction, either positive or negative, "yea-or-nay-saying."
4. Rorer (1965): Many reversed pairs of items do not turn out to be reversals.
5. Simpson, Rentz, & Shrum (1976): Wording influenced responses more than the content of the items.
6. Campbell & Grissom (1979): Negated or polar attitude scale items are not equivalent to reversals of regular items.
7. Schriesheim & Hill (1981): The regular wording format yielded more accurate responses than the negated or mixed formats; the lowest respondent accuracy was found in the mixed format.
8. Winkler, Kanouse, & Ware (1981): No adverse effects of using contradictory statements were found; a balance of regular and polar items was recommended for controlling acquiescence effects.
9. Ahlawat (1985): Semantically negative and positive item contents do not measure essentially the same construct.
10. Benson & Hocevar (1985): A detrimental effect would occur if the number of regular and reverse-scored statements were balanced.
11. Schmitt & Stults (1985): All or nearly all negatively keyed items in a scale often define a single factor; the study demonstrated that this negative factor could be produced by a portion of respondents who fail to attend to the negative-positive wording of the items.
12. Jaroslovsky (1988): The wording of an item affects the mean response and the variability of responses.
13. Harasym (1992): A single-response, negatively worded item should be converted to a multiple-response, positively worded item.
14. Chang (1995): Connotatively consistent and connotatively inconsistent items measured correlated but distinct traits; items on a test or questionnaire should be connotatively consistent with the construct being measured.
Summary

Surveys are used extensively in a wide range of assessments. The increase in popularity of surveys as measures of affective outcomes has consequently focused a great deal of attention on their validity. It is important to understand the ways in which people use survey items and, especially, the factors that can influence the responses given. Many researchers have provided extensive reviews of research in this area. Previous research has not addressed the defensibility or accuracy of the assumption of construct equivalence with regard to using both positive and negative items in a survey.

The literature review has provided insights into the effects of response format on rating scales. The various research studies suggest that item wording does account for a certain portion of test score variance. If we are interested in the construct validity of an instrument, then measures should be taken to account for this stylistic variance (Nunnally, 1978, p. 660). However, the cited studies have shown inconsistent findings with respect to the effects of item wording on rating scales. The desirability of including a mixture of regular and reverse-scored items on attitude and questionnaire measuring instruments has yet to be determined conclusively. Research studies have yielded inconsistent and ambiguous support for balancing regular and reverse-scored items (Bentler, Jackson, & Messick, 1972; Jackson & Messick, 1965; Rorer, 1965). This circumstance raises serious concern about whether both regular and reverse-scored items should be included on a measuring instrument. The results of these studies indicate that further investigation is needed.

Although the recommendations of some authors conform with conventional psychometric advice, the experience of those authors suggests that the use of negative items may have at least some dysfunctional consequences. In their experience, negatively worded items often reduce scale reliability, and they may elicit response biases or measure unintended aspects of the constructs under investigation. In any event, since measurement validity requires instrument reliability, these impressions suggest that the use of negative items may not be cost-free.
In any event, as measurement validity requires instrument reliability, these impressions suggest that the use of negative items may not be cost-free.

The investigators have raised serious questions about the influence of item wording on student responses. Because of conflicting findings, however, it is difficult to draw firm conclusions. Furthermore, several complexities in this research literature add to the difficulty of drawing overall generalizations concerning the impact of item wording. The research to date suggests that positive-to-negative transformations change an item's psychometric characteristics and, more importantly, change the construct that an item is intended to measure. However, the studies that have been reviewed do not show that positively phrased items are necessarily better indicators of attitude. Nevertheless, there is some indication that negatively phrased items are less valid. There is the plausible argument that respondents may not understand that they can indicate agreement with a concept by disagreeing with a negative statement. Marsh (1984) provided support for the above contention. Despite the conventional wisdom so often found in measurement textbooks, pronouncements by researchers in the study of item phrasing have been unanimous that negatively phrased items reduce the validity of a questionnaire (Andrich, 1983; Campbell & Grissom, 1979; Marsh, 1984; Schriesheim & Hill, 1981; Simpson et al., 1976).

One measure almost universally adopted in Likert-type scales to minimize the influence of the response set triggered by item content, the tendency to agree, and the tendency to mark in the left or right columns is to include an equal number of positively and negatively valenced statements in the scales. An actual reversal of the meaning of an item, however, may be hard to achieve. To complicate matters still further, it is not easy to determine whether the meaning of an item has been virtually reversed. Rorer (1965) has provided many examples of reversed pairs of items that on close scrutiny turn out not to be reversals.

The assumption that negative and positive items measure the same construct is so widely prevalent among test developers that it seems to be accepted almost without question. Test constructors' conviction in this assumption is further fortified by empirically obtained high indices of homogeneity, contrary to psychometricians' warning that homogeneity neither implies nor guarantees the unidimensionality of the trait being measured by the test. The problem arises out of the test constructors' common practice of using negative items based on unverified assumptions. Generally, in verbal self-report measures of latent traits, it is assumed that, given standard testing conditions, an examinee's response is determined by item contents, examinee characteristics, and, to some extent, instrument artifacts. Most Likert-type scales include a balance of semantically negative- and positive-valence items with the intent of ridding the instrument of the effects of certain response styles. It is further assumed that both negative and positive items measure the same trait. The empirical verification of this assumption seems to have received little attention from researchers. The purpose of this study was to investigate the effects of different modes of item presentation on the responses of various groups of high school students.
Taking a simple case of semantic change, a sentence with a positive import, for example, "I like myself," can be reversed in meaning either by replacing the word "like" with one of its antonyms (e.g., dislike) or by using a grammatically negative mode (e.g., "I do not like myself"). Similarly, a negatively valenced sentence can be reversed in meaning by using the same transformations. Following this scheme, four distinct modes of semantic presentation can easily be defined: Mode 1 (regular), "I like myself"; Mode 2 (negated), "I do not like myself"; Mode 3 (polar opposite), "I dislike myself"; and Mode 4 (negated polar opposite), "I do not dislike myself."

More specifically, the study looks into the following questions: What is the influence of the wording of rating scale items on rating scales? What is the influence of item wording on scale and item score means, distributions, reliabilities, and correlations between scales? What is the item equivalence between different modes of item wording? Do different modes of presenting the seemingly same content measure the same trait? Does the grammatically negative mode incorporate ambiguity into the items? How much variance is due to format factors? Does stating a concept in a positive manner affect reactions differently than stating the same concept in a negative manner? Do students agree with positively stated items to the same extent that they disagree with the concept when stated negatively? These questions warrant further investigation of the effects of item wording on rating responses.

CHAPTER III
RESEARCH DESIGN AND PROCEDURE

Introduction

The purpose of this study was to investigate the effects of item wording on rating responses. This chapter includes a description of the development of the test instrument, the selection of the sample, the procedures of test administration, the quality control screening, and the statistical procedures for examining the research questions of this study.

Development of the Test Instrument

General Shame Scale

The General Shame Scale was developed by Chang & Hunter (1988). On the basis of an examination of major writings in the shame literature, everyday language, and clients' verbal reports, the authors developed and tested scales measuring each of six shame themes: disappointment with oneself, feelings of inferiority, feelings of defectiveness, feelings of worthlessness, feelings of unimportance, and feelings of falling short of one's own standards and ideals. Items were written to avoid reference to other affects such as embarrassment or self-consciousness. These shame themes were highly correlated and were shown by confirmatory factor analysis to measure one underlying factor. The specific factors were shown to be uncorrelated with measures of emotional and social function once general shame was partialed out. The authors then combined the shame theme scales to form a general shame scale. Initially, 65 items were written altogether for the six shame theme scales. Mainly on the basis of content meanings, the 22 best items were retained. The reliability of the 22-item general shame scale was .95. The shame theme scales are presented in Table 2.

For this study, four versions of the items in four distinct modes of semantics were prepared (Appendix A).
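Before the four versions are described in detail, the sketch below shows one way the four wording modes defined above could be generated programmatically from a base statement and an antonym for its key word. The helper function and the antonym pair are illustrative assumptions, not part of the study's materials; the actual items were written and reviewed by hand.

```python
# A minimal sketch of the four-mode wording scheme described above.
# The helper name and the antonym lookup are illustrative assumptions;
# the study's items were constructed and reviewed manually.

def four_modes(subject: str, verb: str, antonym: str, obj: str) -> dict:
    """Generate the four semantic modes for a statement like 'I like myself'."""
    return {
        "Mode 1 (regular)": f"{subject} {verb} {obj}.",
        "Mode 2 (negated)": f"{subject} do not {verb} {obj}.",
        "Mode 3 (polar opposite)": f"{subject} {antonym} {obj}.",
        "Mode 4 (negated polar opposite)": f"{subject} do not {antonym} {obj}.",
    }

for mode, sentence in four_modes("I", "like", "dislike", "myself").items():
    print(f"{mode}: {sentence}")
# Mode 1 (regular): I like myself.
# Mode 2 (negated): I do not like myself.
# Mode 3 (polar opposite): I dislike myself.
# Mode 4 (negated polar opposite): I do not dislike myself.
```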
In the first version, the statements were presented by semantically positive words or phrases, for example, "I like myself." In the second version, the statements were presented by semantically positive words or phrases structured in grammatically negative sentences: the 22 sentences of the first version were transformed into "do not" sentences, for example, "I do not like myself." In the third version, the statements were reversed in meaning by replacing the key word with one of its antonyms (e.g., "like" became "dislike," as in "I dislike myself"). Similarly, in the fourth version, the items in the third version were transformed into "do not" sentences, for example, "I do not dislike myself." Since English is not the first language of the author of this study, the items were carefully reviewed by a native speaker to ensure the accuracy of the conversion across the four modes of item wording.

Chang & Hunter (1988) have shown that shame relates very substantially to emotional and social problems. In their study, people high in shame were high in anxiety (r = 0.86) and low in life satisfaction (r = -0.73). Hence, for the sake of validation, two other scales, which measure anxiety and life satisfaction, were used in this study.

Dimensions of Self-Concept (DOSC)

The Dimensions of Self-Concept (DOSC) (Michael & Smith, 1976) was developed to measure non-cognitive factors associated with self-concept in a school setting. There are two main purposes for the DOSC: a) to identify those students who might experience difficulty in their schoolwork because of their perceptions of a low degree of self-esteem; and b) to diagnose, for purposes of counseling or guidance by the teacher, professional counselor, or administrator, those dimensions as well as the specific activities associated with them that might contribute to low self-esteem and might impair learning capabilities relative to negative affectivity. The DOSC is a self-report instrument that reflects the perceptions that students have regarding each statement of the five main dimensions: aspiration, anxiety, academic interest and satisfaction, leadership and initiative, and identification vs. alienation. The five factor dimensions measured by the DOSC scales are described as follows:

1. Level of Aspiration. This factor is a manifestation of patterns of behavior that portray the degree to which achievement levels and academic activities of students are consistent with their perceptions of their potentials in terms of scholastic aptitude or past and current attainments.

2. Anxiety. This factor reflects behavior patterns and perceptions associated with emotional instability, a lack of objectivity, and a heightened concern about tests and the preservation of self-esteem in relation to academic performance.

3. Academic Interest and Satisfaction. This factor portrays the love of learning and the pleasure gained by students in doing academic work and in studying new subject matter.

4. Leadership and Initiative. This factor appears to represent those behavior patterns and perceptions that are associated with star-like qualities, in which a student has an opportunity to demonstrate his mastery of knowledge, to help others, to give direction to group activities, to become a respected expert whom others consult, and to put forth sound suggestions for classroom activities.

5. Identification vs. Alienation. This factor is intended to represent the extent to which a student feels that he has been accepted as part of the academic community and has been regarded by his teachers and peers as a significant person.

The 14-item Anxiety Factor was used in this study (Table 3).
A 0.82 coefficient alpha was reported for the 14-item scale (Michael & Smith, 1976).

Satisfaction with Life Scale

The Satisfaction with Life Scale (Diener et al., 1985) is a five-item scale that measures global life satisfaction. The scale is designed to assess global life satisfaction and does not tap related constructs such as positive affect or loneliness. The purpose of the scale is to obtain an overall judgment of the respondent's life in order to measure the concept of life satisfaction. In the initial phase of item construction, 48 self-report items were generated. A five-item scale was formed after a factor analysis and an examination of the semantics. Diener et al. (1985) reported a high reliability (0.87 coefficient alpha) for the scale. It also correlates moderately to highly with other measures of subjective well-being. The scale is well suited for use with different age groups. The high correlations with personality indicators of well-being suggested that the scale might also be useful in clinical settings (Table 4).

Questionnaire Development

The 107 items were constructed and translated into equivalent Chinese structures. The preservation of the connotation of meaning and the essence of the original sentences was the top priority. Furthermore, the items (in Chinese) were edited for clarity and readability by school teachers, who were requested to edit each item with respect to its clarity and suitability for middle and high school students. The Chinese version of the questionnaire was refined on the basis of the teachers' comments. When the English version of the questionnaire was given to the Chinese teacher who assisted with the translation, she raised a concern regarding the rating scale. In the English version, the anchors of the rating scale are "Never," "Seldom," "Sometimes," "Often," and "Almost Always." She pointed out that the last scale anchor, "Almost Always," might not be an appropriate word choice; in the Chinese version, "Almost Always" was changed to "Always."

In order to ensure the appropriate wording of the survey and the students' understanding of it, a pilot test was conducted in two classrooms totaling about 80 to 90 students. The subsequent item analysis served as a basis for further refinement of the translation and wording of the survey. Based on the recommendations of the teachers and the students, the researcher made some very slight changes to the wording of some of the survey items. The items were then assembled in random order into a questionnaire with instructions. A copy of the survey can be found in Appendix B; the survey in Chinese can be found in Appendix C. Subjects also completed items on their gender and age.

Table 2. The General Shame Scale (Chang & Hunter, 1988)

(1) Disappointment with Oneself
1. I don't like myself.
2. I am pleased with myself.
3. I am disappointed in myself.
4. I feel ashamed of myself.

(2) Feelings of Inferiority
5. I feel like I am just not quite good enough.
6. I feel that I am inferior to most of my friends.
7. Compared to others, I feel like I don't measure up.
8. I am just as good as my friends.

(3) Feelings of Defectiveness
9. I feel inadequate as a person.
10. I feel there is something defective in my character.
11. I look down on myself because of my flawed character.
12. I see myself as intact and without personal defects.

(4) Feelings of Worthlessness
13. I feel worthless as a person.
14. I feel like a useless person.
15. I feel like I am good for nothing.
16. I feel I am a complete failure as a person.
17. I feel like a failure.
18. I am a worthwhile person.

(5) Feelings of Unimportance
19. I feel unimportant.
20. I feel so insignificant to others, as if I were invisible.

(6) Falling Short of Own Standards and Ideals
21. I always seem to fall short of my aspirations.
22. I find that I don't live up to my own standards and ideals.

Table 3. Dimensions of Self-Concept (DOSC) (Anxiety Factor) (Michael & Smith, 1976)

1. Statements that some teachers make about my schoolwork hurt my feelings.
2. I feel so nervous about some of my classes that it is hard for me to attend.
3. I become tense and nervous when I am studying.
4. I am upset about so many things that I cannot concentrate on or do my schoolwork.
5. I worry about how well I am doing in my classes.
6. I am afraid to ask teachers to explain a difficult concept a second or third time.
7. I avoid talking to my classmates about schoolwork because they might make fun of me.
8. I become frightened when a teacher calls on me in class.
9. Talking in front of class makes me feel nervous.
10. I feel upset when I have to take a test.
11. I would be afraid to tell a teacher that he or she made a mistake in explaining an assignment or in working a problem.
12. I have trouble sleeping well the night before an important examination.
13. I am embarrassed to face my friends or family if I have made a low grade on a test or assignment.
14. I worry that my score on a test will not be one of the highest in class.

Table 4. Satisfaction with Life Scale (Diener et al., 1985)

1. I am satisfied with my life.
2. The conditions of my life are excellent.
3. In most ways, my life is close to my ideal.
4. So far, I have gotten the important things I want in life.
5. If I could live my life over, I would change almost nothing.

Selection of the Sample

Participation in the study was solicited from students enrolled in middle schools. The subjects in this study were students studying in the sixth and seventh grades in two schools located in Taipei, the capital of Taiwan. Although the schools were not randomly selected, they were considered typical of other schools within the district in terms of achievement, socioeconomic status, and ethnic background. In Taiwan, there are usually 35 to 40 students in one classroom. Twenty-two classrooms were sampled.

Procedures of Test Administration

The questionnaires were administered during class sessions. No titles were printed on the questionnaires. The questionnaire required approximately 20 minutes to complete. Before responding, students were told only that the survey dealt with opinions about themselves. Subjects were instructed to use all five points on the rating scale to provide an accurate reflection of their opinions and were encouraged to respond to all items. The teacher read the survey instructions aloud as the students followed along. Students were told that their responses would be completely anonymous; no names or other identifying information were collected. Students were told that the survey was completely voluntary, that they did not have to participate, and that they could leave unanswered any questions they thought were too personal. The teachers left their classrooms while the questionnaires were administered.

Quality Control Screening

Quality control procedures were developed with prior surveys to screen for incomplete or otherwise unusable responses. Students were instructed to respond to each item of the questionnaire.
After the questionnaires were completed, the students' responses were entered into an SPSS data file. A printout of this file was obtained, and each of the entries was checked for mis-entry against the students' surveys. All missing responses were coded "9". Each questionnaire was carefully inspected for any detectable abnormalities, and aberrant cases were discarded from the sample. A screening procedure was applied to exclude students who were not taking the survey seriously: if a student marked one particular option constantly throughout the survey, that survey was discarded. A questionnaire was also considered unusable if the student did not answer any items.

Data Analyses

Items representing each mode were summed to obtain a subtotal score for each mode. The total test score was obtained by adding the four subtotal scores. It was reasoned that if subjects' responses were mainly determined by item contents, as hypothesized, then all four categories of items would tap the same source trait (construct), and consequently, the correlation coefficients among the pairs of mode subtotal scores would be high; otherwise, they would be low. The Mode 1 (regular, "I like myself") and Mode 4 (negated polar opposite, "I do not dislike myself") variables were assumed to be positive aspects, and the Mode 2 (negated, "I do not like myself") and Mode 3 (polar opposite, "I dislike myself") variables were assumed to be negative aspects of the same construct. The intermode Pearson product-moment correlation coefficients were computed. Scale responses were scored by reversing all negatively worded questionnaire statements. Items representing each version were summed to obtain a subtotal score for each mode. A higher score on any of the subscales or on the total scale indicated a more positive attitude toward a certain aspect. The means, standard deviations, and reliabilities of the six shame theme scales and the general shame scale were computed. The intercorrelations between the six shame themes were also computed.

Hunter and Gerbing (1982) noted that if the right research design is used, then confirmatory factor analysis can be used to assess items using two criteria: "internal consistency" and "parallelism" (or "external equivalence"). If item responses differ from each other only by random error of measurement, then the item errors will not correlate with each other. The correlations between items within a scale should then satisfy a mathematical product rule discovered by Spearman (1904, cited in Hunter & Gerbing, 1982), his "one factor model." If the correlations between items within a scale satisfy the Spearman product rule, then the scale is said to be "internally consistent." This is a weak criterion for item equivalence. Hunter and Gerbing (1982) also noted that there is a stronger criterion for item equivalence: parallelism in the pattern of correlations between the items and important "outside" variables, such as the measures of emotional and social functioning used in the present study. If all items measure the same construct, then the item errors will not correlate with outside variables. The correlations between the items in a unidimensional scale and any outside variable should satisfy a condition called "parallelism" (Tryon, 1939; Tryon & Bailey, 1970; Hunter, 1973; Hunter & Gerbing, 1982). This is a strong test for item equivalence.
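As an illustration of the screening and scoring steps described above, the following sketch removes straight-line responders, reverse-scores the negatively worded modes, and computes the mode subtotals. The column-naming convention (m1_1 through m4_22) and the use of pandas are assumptions made for illustration; the original data handling was carried out in SPSS.

```python
# A minimal sketch of the quality-control screening and mode scoring,
# assuming responses sit in a pandas DataFrame with columns m1_1..m1_22,
# m2_1..m2_22, m3_1..m3_22, m4_1..m4_22 on a 1-5 scale (9 = missing).
# These names are illustrative; the study's data were coded in SPSS.
import pandas as pd

def score_modes(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    item_cols = [c for c in df.columns if c.startswith(("m1_", "m2_", "m3_", "m4_"))]
    df[item_cols] = df[item_cols].mask(df[item_cols] == 9)   # 9 was coded missing
    # Discard respondents who marked one option constantly (straight-lining)
    # and respondents who answered no items at all.
    constant = df[item_cols].nunique(axis=1) == 1
    empty = df[item_cols].isna().all(axis=1)
    df = df[~(constant | empty)].copy()
    # Reverse-score the negatively worded modes (Modes 2 and 3) so that a
    # higher score indicates a more positive attitude.
    neg_cols = [c for c in item_cols if c.startswith(("m2_", "m3_"))]
    df[neg_cols] = 6 - df[neg_cols]                           # 1<->5, 2<->4
    # Mode subtotals and the total score.
    for mode in ("m1", "m2", "m3", "m4"):
        cols = [c for c in item_cols if c.startswith(mode + "_")]
        df[mode.upper()] = df[cols].sum(axis=1)
    df["TOTAL"] = df[["M1", "M2", "M3", "M4"]].sum(axis=1)
    return df

# scored = score_modes(raw)
# print(scored[["M1", "M2", "M3", "M4"]].corr())   # intermode Pearson correlations
```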
Confirmatory Factor Analysis

A confirmatory factor analysis was run on the items organized into the predicted four clusters (modes of item wording). The analysis was used to examine the quality of each scale. Confirmatory factor analysis is a method for computing the correlations for constructs from the correlations for observed measures, so long as the measures obey the assumed measurement model. Each scale was checked for homogeneity of content, for internal consistency, and for external consistency or parallelism. The first analysis was done to assess each of the four scales separately. If there was no significant departure in fit, each of the four scales would prove internally consistent. The items within scales would be parallel in relationship to the other scales and to that extent would appear to be equivalent to each other.

A hierarchical measurement model was constructed with regard to the relationship between the modes of item wording and the responses. A hierarchical confirmatory analysis estimated the correlations between the item scores and the four modes of item wording. The correlations between the four scales and each of the measures of anxiety and life satisfaction were also computed. A confirmatory factor analysis of the four modes of item wording together fits the data only if each scale is parallel to each other scale in its pattern of correlations. A stronger test of parallelism was obtained by testing the items in each scale for parallelism in terms of how the items related to outside variables. To do this, each mode of item wording was tested separately for parallelism in relationship to the measures of anxiety and life satisfaction. If the first stage of analysis shows that each of the scales is unidimensional, then a second stage of analysis can test the hypothesis that the four modes of item wording are each measures of one underlying trait. This analysis first checks to see whether the separate scales define specific factors or whether they are identical to each other. If the scales are not highly correlated with each other, then there are specific factors that differentiate the shame constructs from one another.

Exploratory Factor Analysis

To see if there might be some completely unanticipated dimension in the data, an exploratory factor analysis of the items was also run. The communalities were estimated as the largest correlation. The principal axis factors were followed by VARIMAX rotation. The eigenvalue cutoff for the number of factors was set at 1.00. An examination of the resulting factors would show whether the clusters matched the a priori clusters. In other words, when the items were blindly grouped together using the highest loading from the varimax factors, the clusters thus formed should be similar to the original clusters (modes of item wording).
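A rough modern equivalent of this exploratory step is sketched below: the eigenvalue-greater-than-one rule is applied to the item correlation matrix, a varimax-rotated solution is extracted, and items are blindly grouped by their highest loading. The use of scikit-learn and the variable names are assumptions for illustration; the dissertation's analysis used SPSS with principal axis factoring, which this sketch only approximates.

```python
# A sketch of the exploratory step: retain factors with eigenvalue >= 1.00,
# then extract a varimax-rotated solution and group items by highest loading.
# scikit-learn's FactorAnalysis approximates, but does not reproduce, the
# principal axis factoring used in the study.
import numpy as np
from sklearn.decomposition import FactorAnalysis

def explore_factors(items: np.ndarray) -> dict:
    """items: respondents x items matrix of scored responses."""
    corr = np.corrcoef(items, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]       # descending order
    n_factors = int((eigenvalues >= 1.0).sum())        # latent root criterion
    fa = FactorAnalysis(n_components=n_factors, rotation="varimax")
    fa.fit(items)
    loadings = fa.components_.T                        # items x factors
    # Blindly assign each item to the factor with its highest absolute loading.
    assignment = np.abs(loadings).argmax(axis=1)
    return {"n_factors": n_factors, "loadings": loadings, "assignment": assignment}

# result = explore_factors(scored_items)
# print(result["n_factors"])   # 19 factors were retained in the study's first run
```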
Parallelism

Parallelism is the basis for the use of the correction for attenuation to eliminate the bias in correlations produced by error of measurement. Thus the test for parallelism is the test that directly justifies the use of the correction formulas. Since the correction formulas are implicit in confirmatory factor analysis, it is the test for parallelism that is the heart of the assumptions of confirmatory factor analysis. Items in each of the four scales were examined in terms of how they correlated with outside variables. If the items in a cluster all measure the same affect, then the items in that cluster should correlate in a parallel way with each of the outside variables. Parallelism was tested by computing the correlations between the items and the outside variables (anxiety and life satisfaction), and by examining that correlation matrix for parallelism (as noted by Hunter, 1973, and Tryon and Bailey, 1970).

The most direct way to check for equivalence is to correlate scales measuring the same thing (though it is important to correct for the attenuation produced by random error of measurement). If two measures are equivalent, then they correlate in exactly the same way with other variables. The correlations between all four scales, anxiety (DOSC Anxiety Factor), and life satisfaction (Satisfaction with Life Scale) were computed and were corrected for attenuation due to error of measurement.
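The classical correction for attenuation divides an observed correlation by the geometric mean of the two reliabilities. A minimal sketch, with purely hypothetical numbers, follows; the function name is an illustrative assumption.

```python
# Classical correction for attenuation: the correlation between true scores
# is estimated as r_xy / sqrt(r_xx * r_yy), where r_xx and r_yy are the
# reliabilities of the two measures. The example numbers are hypothetical.
from math import sqrt

def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Correct an observed correlation for attenuation due to measurement error."""
    return r_xy / sqrt(r_xx * r_yy)

# e.g., an observed scale correlation of 0.50 with reliabilities 0.90 and 0.75:
print(round(disattenuate(0.50, 0.90, 0.75), 3))   # 0.609
```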
Summary

The research design of this study was aimed at addressing two main issues concerning the effect of item wording on rating responses. The first issue was the influence of item wording on scale and item score means, distributions, reliabilities, and correlations between scales. The second issue was the item equivalence between different modes of item wording. Two Taiwanese schools were selected for this purpose, both in an urban setting in Taipei. A questionnaire was constructed and administered to a total of 861 sixth and seventh grade students from the two schools. Demographic information was also obtained regarding the students' grade, gender, and age. All this information was coded into an SPSS file, cleaned, and subjected to data analysis. Descriptive statistics, correlational analyses, and reliability analyses were obtained. A confirmatory factor analysis was also conducted to further analyze the data, and various other statistical tests were performed on the results of the survey data.

In summary, the findings of several other researchers clearly suggest that item wording can make a difference. They have not shown, however, that a different construct was being measured. This study provides a second set of analyses to test whether altering item wording had an effect on the nature of the construct being measured. The results will have important implications for measurement instrument design by addressing the question: just what are the effects of item wording on item responses? The purpose of this current research was, therefore, to explore the item wording issue further and to clarify its implications for the validity of questionnaires.

CHAPTER IV
ANALYSES AND INTERPRETATION OF THE DATA

Introduction

This chapter presents the data analyses. First, a general description of the characteristics of the sample will be presented. This will be followed by an account of the factor analyses of the questionnaire data. The manner in which the analyses were conducted will be explained, and the implications of the results for survey development will be discussed. Finally, the results for the two research questions of the study, together with their interpretations, will be reported.

Characteristics of the Sample

A total of 861 surveys were returned. All the surveys were carefully inspected before data entry. Eleven surveys were discarded because the participants marked the same score on each item, and one student did not fill out the survey. A total of 849 usable surveys were included in this study. They were completed by students from two middle schools in Taipei. There were 443 (52.2%) male and 406 (47.8%) female students. All students were from an urban setting. Their ages ranged from 12 to 17: eleven (1.3%) students were twelve years old; 152 (17.9%) were thirteen; 451 (53.1%) were fourteen; 228 (26.9%) were fifteen; 5 (0.6%) were sixteen; and 2 (0.2%) were seventeen. Two hundred and fifty-six students (30.3%) were in the sixth grade and 593 students (69.8%) were in the seventh grade. Tables 5, 6, 7, and 8 show the distribution of students by gender, age, and grade.

Table 5. Gender of the Participants

Gender    Frequency    Percent
Male      443          52.2%
Female    406          47.8%
Total     849          100.0%

Table 6. Age of the Participants

Age      Frequency    Percent
12       11           1.3%
13       152          17.9%
14       451          53.1%
15       228          26.9%
16       5            0.6%
17       2            0.2%
Total    849          100.0%

Table 7. Grade Level of the Participants

Grade        Frequency    Percent
6th Grade    256          30.3%
7th Grade    593          69.8%

Table 8. Participants' Gender by Grade

Grade        Male    Female    Total
6th Grade    131     125       256
7th Grade    312     281       593
Total        443     406       849

Analyses of the Questionnaire Data

The 107 items were coded into six subscales, as shown in Table 9. The four modes of the shame scale represented four different modes of semantics, for a total of 88 items (as described in Chapter III). In Mode 1, the statements were presented by semantically positive words or phrases. In Mode 2, the statements were presented by semantically positive words or phrases structured in grammatically negative sentences; the 22 sentences of Mode 1 were transformed into "do not" sentences. In Mode 3, the sentences were reversed in meaning by replacing each adjective with one of its antonyms. Similarly, in Mode 4, the items in Mode 3 were transformed into "do not" sentences. DOSC-Anxiety is a 14-item scale that measured student academic anxiety. The Satisfaction with Life Scale is a 5-item scale that measured global life satisfaction.

Table 9. Coding Format of the Subscales in the Questionnaire

Mode 1: 2, 3, 9, 13, 24, 25, 36, 41, 48, 49, 57, 61, 67, 81, 83, 85, 87, 88, 95, 100, 102, 104
Mode 2: 4, 10, 20, 23, 31, 34, 43, 46, 51, 59, 60, 64, 72, 78, 82, 86, 89, 93, 96, 99, 101, 105
Mode 3: 7, 8, 12, 18, 30, 32, 35, 37, 39, 50, 52, 54, 55, 66, 73, 75, 77, 80, 84, 92, 94, 97
Mode 4: 1, 5, 6, 14, 16, 17, 21, 26, 27, 40, 42, 44, 47, 53, 56, 63, 68, 71, 79, 91, 98, 107
DOSC-Anxiety Factor: 11, 15, 19, 29, 38, 45, 62, 65, 69, 70, 74, 90, 103, 106
Satisfaction with Life Scale: 22, 28, 33, 58, 76

Answers to the Research Questions

Research Question 1: What is the influence of item wording on scale and item score means, distributions, reliabilities, and correlations between scales?

Histogram Analyses

To determine whether the sample chosen was representative of a normal population, a histogram of the students' scores on the questionnaire was plotted (see Figure 1). The mean and standard deviation of the 107-item questionnaire were 298.2 and 53.52, respectively (Appendix D). The item descriptive statistics for the four modes can be found in Appendix E. The plot showed that the distribution of the total scores was not normal; the test of normality was significant at the p < 0.0001 level. For purposes of comparison, the points of a normal curve based on all valid values of the scores were superimposed on the histogram. Four other histograms were plotted for the four modes of items in the questionnaire (see Figures 2, 3, 4, and 5). These distributions were also found to be non-normal, except for Mode 1 (Table 10). The means of the four modes were 68.5 (Mode 1), 78.7 (Mode 2), 83.1 (Mode 3), and 67.9 (Mode 4). The means of Mode 2 and Mode 3 were higher than those of Mode 1 and Mode 4.
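The normality tests reported in Table 10 below use the Kolmogorov-Smirnov statistic with estimated parameters (the Lilliefors variant). A sketch of an equivalent check follows; the use of statsmodels and the column names are assumptions for illustration, since the original tests were run in SPSS.

```python
# A sketch of the normality check reported in Table 10: a Kolmogorov-Smirnov
# test against the normal distribution with estimated mean and variance
# (the Lilliefors variant). Column names are illustrative assumptions.
from statsmodels.stats.diagnostic import lilliefors

def normality_table(scored) -> None:
    """scored: DataFrame with TOTAL and M1-M4 subtotal columns."""
    for col in ("TOTAL", "M1", "M2", "M3", "M4"):
        stat, pval = lilliefors(scored[col].dropna(), dist="norm")
        print(f"{col}: statistic = {stat:.3f}, p = {pval:.4f}")

# In the study, only Mode 1 did not depart significantly from normality.
```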
Table 10. Tests of Normality (Kolmogorov-Smirnov)

Scale               Statistic    Sig.
Total of 4 Modes    .047         .000
Mode 1              .028         .142
Mode 2              .053         .0001
Mode 3              .084         .0001
Mode 4              .043         .001

[Figure 1. Histogram of the totals of the 107 items on the questionnaire (Mean = 298.2, Std. Dev = 53.52).]

[Figure 2. Histogram of Mode 1 scores on the questionnaire (Mean = 68.5, Std. Dev = 17.37).]

[Figure 3. Histogram of Mode 2 scores on the questionnaire (Mean = 78.7, Std. Dev = 14.84).]

[Figure 4. Histogram of Mode 3 scores on the questionnaire (Mean = 83.1, Std. Dev = 16.15).]

[Figure 5. Histogram of Mode 4 scores on the questionnaire (Mean = 67.9, Std. Dev = 15.80).]

Reliability Analysis

Based on the sample of 849 students, mean scores and standard deviations were obtained, and test reliability for the total scores of the four modes of the General Shame Scale (88 items) was calculated. The internal consistency reliability was estimated using coefficient alpha (Cronbach, 1951). The results are shown in Table 11. In the analysis of variance, the F statistic for the variation between items was significant (F = 152.86, p < 0.001), indicating that the items had significantly different means. This finding was confirmed by the large Hotelling's T-squared statistic (T-squared = 4084.47), a test for the equality of means: its F statistic (F = 42.19, p < 0.001) was significant and indicated that the hypothesis that the items had equal means in the population could be rejected. The 88-item test was reliable, with Cronbach's alpha at 0.9670.

Table 11. Results of Reliability Analysis on the Four Modes of the General Shame Scale (88 Items) in the Questionnaire

Analysis of Variance

Source of Variation    Sum of Sq.     DF       Mean Square    F           Prob.
Between People         27606.2646     848      32.5546
Within People          93563.2159     73863    1.2667
Between Measures       14289.9870     87       164.2527       152.8626    .0000
Residual               79273.2289     73776    1.0745
Total                  121169.4806    74711    1.6218

Grand Mean = 3.3890
Hotelling's T-Squared = 4084.4675, F = 42.1867, Prob. = .0000
Degrees of Freedom: Numerator = 87, Denominator = 762
Reliability Coefficients (88 items): Alpha = .9670; Standardized item alpha = .9682

Another reliability analysis was conducted on each of the four modes of the questionnaire, the anxiety factor, and the life satisfaction factor. The results are shown in Table 12. The Cronbach's alpha coefficients for the four modes were 0.9392 (Mode 1), 0.9059 (Mode 2), 0.9383 (Mode 3), and 0.8891 (Mode 4), showing that the four modes had about the same reliability. The Cronbach's alpha coefficients for the anxiety factor and the life satisfaction factor were 0.7524 and 0.6268, respectively.

Table 12. Means, Standard Deviations, and Cronbach's Alpha Coefficients for the Subscales

Subscale               Mean     Standard Deviation    Cronbach's Alpha
Mode 1                 68.50    17.37                 0.9392
Mode 2                 78.70    14.84                 0.9059
Mode 3                 83.10    16.15                 0.9383
Mode 4                 67.93    15.80                 0.8891
DOSC-Anxiety Factor    34.51    8.51                  0.7524
Life Satisfaction      14.50    3.89                  0.6268
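Coefficient alpha can be computed directly from the item variances and the total-score variance. A minimal sketch is given below; the function name is an illustrative assumption, and the formula is the standard Cronbach (1951) definition.

```python
# Cronbach's alpha from the standard definition (Cronbach, 1951):
# alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score)).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scored responses."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# e.g., alpha for the 88 shame items, or for each 22-item mode separately:
# print(cronbach_alpha(shame_items))   # reported as 0.9670 in Table 11
```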
Correlational Analyses

Items representing each mode were summed to obtain a subtotal score for each mode, and total test scores were obtained by adding the four subtotal scores. It was reasoned that if subjects' responses were mainly determined by item contents, then all four categories of items would tap the same source trait (construct), and, consequently, the correlation coefficients among the pairs of mode subtotal scores would be high; otherwise, they would be low. The Mode 1 and Mode 4 variables were assumed to be positive, and Modes 2 and 3 were assumed to be negative, aspects of the same construct.

Pearson correlation coefficients were obtained for each of the subscales (Mode 1 to Mode 4), the DOSC-Anxiety Factor, the Satisfaction with Life Scale, and gender. Inspection of these correlations made clear that the subscales correlated with each other. The results of the correlational analyses are presented in Table 13. The Pearson correlation coefficients ranged from -0.133 to 0.898. All subscales were significantly correlated with each other, with the exception of the correlation between the Satisfaction with Life Scale and gender. Gender had low correlations with the four modes, the DOSC-Anxiety Factor, and the Satisfaction with Life Scale, ranging from -0.133 to 0.128; there appears to be little relationship between gender and these scales.

Table 13. Correlations between the Four Modes, DOSC-Anxiety Factor, Satisfaction with Life Scale, and Gender

          Mode 1    Mode 2    Mode 3    Mode 4    DOSC     Life      Gender
Mode 1    1.000
Mode 2    0.660     1.000
Mode 3    0.672     0.898     1.000
Mode 4    0.656     0.315     0.346     1.000
DOSC      0.325     0.465     0.491     0.166     1.000
Life      0.616     0.452     0.442     0.345     0.219    1.000
Gender    -0.110    -0.133    -0.129    -0.055    0.128    -0.002    1.000

The major problem this study set out to investigate concerned the construct validity of the differential semantic modes of item presentation. Among the six correlation coefficients between the pairs of the four semantic mode scores, all are significantly different from zero (p < 0.05). However, the double-negative Mode 4 item score showed a smaller relationship with both the Mode 2 and Mode 3 item scores. Mode 2 and Mode 3 contained, overall, negative-valence items, whereas Modes 1 and 4 had positive-valence items. The largest correlation coefficient in the set (0.898) was between Mode 2 and Mode 3. On the basis of the correlational evidence, it is conspicuous that, by and large, Mode 4 seems to measure a different construct from Mode 2 and Mode 3 in spite of their deceptive content similarity. On the other hand, the value (0.97) of coefficient alpha, which is generally taken as evidence of the homogeneous nature of the test components, is quite impressive.

An inspection of the item-remainder correlation coefficients revealed that 8 of the 88 items had corrected item-total correlations of less than .35 (Appendix F). Of these eight items, two belong to Mode 2 and six to Mode 4 (Table 14). The salient common feature of these modes is that they are characterized by the grammatically negative form. This produces evidence in favor of the argument that the "do not" form of sentences, even with an otherwise simple structure, creates ambiguity and confusion, and that double negatives, as in Mode 4, add to this confusion.

Table 14. Items with Corrected Item-Total Correlations Less Than 0.35

Item                                                                         Corrected Item-Total Correlation

Mode 2
Q64. I do not feel significant enough to others that people notice me.      0.2249
Q89. I don't see myself as intact and without personal defects.             0.1127

Mode 4
Q1. I do not see myself as flawed and with personal defects.                0.2599
Q5. I am not unhappy with myself.                                           -0.4999
Q6. I do not always fall short of my aspirations.                           0.1564
Q27. I do not feel unimportant.                                             0.3383
Q56. Compared to others, I do not feel like I am less of a person.          0.2797
Q71. I don't feel inferior to most of my friends.                           0.3006

Research Question 2: What is the item equivalence between different modes of item wording?

MANOVA Analyses

One may hypothesize that if all four modes of presentation measured the same construct in the same way in all the groups of students, then one should be able to reach the same conclusions on the basis of any of the previously defined five dependent variables (the four mode subtotal scores and the total score). With male and female as the group determiners, a multivariate analysis of variance (MANOVA) was conducted. The results of the MANOVA are presented in Tables 15 and 16. Inspection of each row in Table 16, showing the F-ratios calculated from the four different sets of scores (the four item modes), reveals that the four F-ratio values were far from equivalent. The MANOVA results showed that the responses to the modes of item wording differ significantly between males and females. The F-ratios of Mode 1, Mode 2, and Mode 3 were all significant at the 0.05 level; Mode 4 was not significant at the 0.05 level. These results were similar to those of the correlational analyses. It seems that Mode 4, which has double-negative semantics, introduced some ambiguity into the items.

Table 15. Descriptive Statistics of the Four Semantic Modes by Gender

Mode      Gender    Mean       Std. Deviation
Mode 1    Male      70.3273    16.6181
          Female    66.5025    17.9645
          Total     68.4982    17.3704
Mode 2    Male      80.5553    14.1062
          Female    76.6010    15.3556
          Total     78.6643    14.8404
Mode 3    Male      85.1400    15.1455
          Female    80.9606    16.9337
          Total     83.1413    16.1517
Mode 4    Male      68.7585    16.0670
          Female    67.0222    15.4791
          Total     67.9282    15.8031

Table 16. MANOVA Tables by Gender for the Four Semantic Mode Subtotal Scores

Multivariate Tests

Effect       Test                  Value     F           Hypothesis df    Error df    Sig.    Eta Squared
Intercept    Pillai's Trace        .975      8215.575    4.000            844.000     .000    .975
             Wilks' Lambda         .025      8215.575    4.000            844.000     .000    .975
             Hotelling's Trace     38.936    8215.575    4.000            844.000     .000    .975
             Roy's Largest Root    38.936    8215.575    4.000            844.000     .000    .975
Gender       Pillai's Trace        .019      4.062       4.000            844.000     .003    .019
             Wilks' Lambda         .981      4.062       4.000            844.000     .003    .019
             Hotelling's Trace     .019      4.062       4.000            844.000     .003    .019
             Roy's Largest Root    .019      4.062       4.000            844.000     .003    .019

Tests of Between-Subjects Effects

Source             Dependent Variable    Type III Sum of Squares    df     Mean Square    F            Sig.    Eta Squared
Corrected Model    Mode 1                3099.210                   1      3099.210       10.385       .001    .012
                   Mode 2                3312.574                   1      3312.574       15.294       .000    .018
                   Mode 3                3700.347                   1      3700.347       14.409       .000    .017
                   Mode 4                638.661                    1      638.661        2.562        .110    .003
Intercept          Mode 1                3966279.422                1      3966279.422    13290.652    .000    .940
                   Mode 2                5232215.283                1      5232215.283    24157.626    .000    .966
                   Mode 3                5844726.448                1      5844726.448    22758.468    .000    .964
                   Mode 4                3905689.591                1      3905689.591    15667.897    .000    .949
Gender             Mode 1                3099.210                   1      3099.210       10.385       .001    .012
                   Mode 2                3312.574                   1      3312.574       15.294       .000    .018
                   Mode 3                3700.347                   1      3700.347       14.409       .000    .017
                   Mode 4                638.661                    1      638.661        2.562        .110    .003
Error              Mode 1                252767.037                 847    298.426
                   Mode 2                183448.755                 847    216.586
                   Mode 3                217522.692                 847    256.815
                   Mode 4                211139.956                 847    249.280
Total              Mode 1                4239381.000                849
                   Mode 2                5440436.000                849
                   Mode 3                6089921.000                849
                   Mode 4                4129263.000                849
Corrected Total    Mode 1                255866.247                 848
                   Mode 2                186761.329                 848
                   Mode 3                221223.039                 848
                   Mode 4                211778.617                 848
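A sketch of the gender MANOVA on the four mode subtotals is shown below. The use of statsmodels and the column names are assumptions for illustration; the original analysis was run in SPSS.

```python
# A sketch of the MANOVA on the four mode subtotal scores with gender as
# the grouping factor. Column names (M1-M4, gender) are illustrative
# assumptions; the study's analysis was carried out in SPSS.
from statsmodels.multivariate.manova import MANOVA

def gender_manova(scored):
    """scored: DataFrame with numeric M1-M4 subtotals and a gender column."""
    mv = MANOVA.from_formula("M1 + M2 + M3 + M4 ~ gender", data=scored)
    # Reports Pillai's trace, Wilks' lambda, Hotelling-Lawley trace, and
    # Roy's greatest root for the intercept and the gender effect.
    print(mv.mv_test())

# Follow-up univariate F-tests per mode (as in Table 16) could use, e.g.,
# statsmodels.formula.api.ols("M1 ~ gender", scored) with anova_lm.
```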
Exploratory Factor Analyses

The survey was subjected to a principal components analysis as the initial method of factor extraction. The two criteria used to select the model were obtained from suggestions offered by Rummel (1970) and Hair, Anderson, and Tatham (1987): a) statistical indicators for selecting the 'best' number of factors, and b) an analysis of the content of the factors. Selection rules for the best number of factors have been developed for eigenvalues, scree plots, percent of variance accounted for by the model, overlap in factor loadings, and loading values. The following selection rules applied. First, a factor must have an eigenvalue of one or greater to be considered significant. Second, a scree plot indicates the maximum number of factors to extract at the point where the plot becomes horizontal, that is, where the curve first begins to straighten out; as a general rule, the scree tail test will result in at least one more factor being considered significant than will the latent root criterion (Hair et al., 1987). Third, the percent of variance accounted for should be as great as possible in considering the best number of factors. Fourth, the choice of the best number of factors should have the least amount of overlap in the total item factor loadings. Finally, an item is assigned on the basis of its largest absolute factor loading. Loading values should be at least 0.30 to be considered significant, while factor loadings of 0.50 or greater are considered very significant. Ultimately, the number of significant loadings in each column of the factor matrix associated with one variable would need to be maximized (Hair et al., 1987). Thus, a dually loaded item would be placed in the factor with the higher loading.

Product-moment correlation coefficients were computed, and a principal components analysis with iterations was performed on the resulting correlation matrix. The 107-item questionnaire was subjected to a principal components analysis. The factor analysis yielded nineteen factors with eigenvalues greater than 1.00. Three items from the DOSC (questions #11, #15, and #74) had no factor loadings over 0.30 and hence could not be identified with any of the factors. The scree plot, which is the plot of the total variance associated with each factor, is shown in Figure 6. The factor structure of the questionnaire can be found in Table 17.

[Figure 6. Scree Plot from the Exploratory Factor Analysis (19 Factors).]

Table 17. Factor Structure of the Questionnaire (19 Factors)

Factor    Question Number
1         2, 3, 4, 5, 7, 8, 9, 10, 12, 13, 17, 18, 20, 21, 23, 24, 25, 26, 28, 30, 31, 32, 33, 34, 35, 36, 37, 39, 41, 42, 44, 46, 49, 50, 51, 52, 54, 55, 60, 61, 66, 68, 72, 73, 75, 79, 87, 88, 92, 93, 94, 97
2         1, 14, 16, 27, 40, 47, 53, 56, 63, 71, 91, 98, 99, 100, 101, 104
3         57, 58, 59, 102, 107
4         38, 62, 65, 69, 70, 89, 90, 105
5         43, 48
6         71
7         64, 85, 103
8         76, 84
9         80, 96
10        67, 95
11        6, 106
12        83
13        86
14        82
15        78
16        45
17        22
18        81
19        77

Since the original factor structure was not interpretable, the number of factors was reduced to six in another analysis. This factor structure can be found in Table 18.
Table 18. Factor Structure of the Questionnaire (6 Factors)

Factor    Question Number
1         4, 5, 7, 8, 10, 12, 18, 23, 30, 31, 32, 34, 35, 37, 39, 46, 49, 50, 51, 52, 54, 55, 57, 59, 60, 61, 66, 72, 75, 77, 78, 82, 86, 92, 93, 94, 96, 97, 99, 101
2         1, 6, 13, 14, 15, 16, 17, 21, 26, 27, 41, 42, 44, 47, 53, 63, 68, 71, 79, 81, 98, 104, 107
3         2, 3, 9, 24, 25, 28, 33, 36, 83, 84, 85, 87, 88, 95
4         11, 38, 45, 62, 65, 69, 70, 74, 90, 103, 106
5         43, 48, 67, 73, 76, 80, 91, 105
6         19, 20, 41, 100, 102

Confirmatory Factor Analyses

Confirmatory factor analysis may not be a complete answer to the issue of construct validity (see Hunter and Gerbing, 1982). The statistical method asks only whether items measure the same construct, not whether that construct is the right construct. Whether the items measure the right construct is a question of content, which is usually tested by looking at correlations between the scale and other constructs. At the level of the shame theme scales, the issue of the nature of the construct was dealt with solely in terms of an item content analysis: the items in each shame theme cluster were closely examined to see whether they were psychologically equivalent in both the affect expressed and the manner in which that affect was expressed.

Hunter and Gerbing (1982) noted that if the right research design is used, then confirmatory factor analysis can be used to assess item equivalence in two very different ways: "internal consistency" and "parallelism" (or "external equivalence"). If item responses differ from each other only by random error of measurement, then the item errors will not correlate with each other. The correlations between items within a scale should then satisfy a mathematical product rule discovered by Spearman (1904, cited in Hunter & Gerbing, 1982): his one-factor model. If the correlations between items within a scale satisfy the Spearman product rule, then the scale is said to be "internally consistent." However, this is a weak criterion for item equivalence. There is a stronger criterion for item equivalence: parallelism in the pattern of correlations between the items and important "outside" variables, such as the measures of the anxiety and life satisfaction factors used in the present study. If all items measure the same construct, then the item errors will not correlate with any outside variable. The correlations between the items in a unidimensional scale and any outside variable should satisfy a condition called "parallelism" (Tryon, 1939; Tryon and Bailey, 1970; Hunter, 1973; Hunter and Gerbing, 1982). This is a strong test for item equivalence. If an item is contaminated by some unintended variable, and if that contaminating variable is one of the outside variables (or is correlated with one), then the item will correlate more highly with that outside variable than will the other, uncontaminated items. Thus, failure to find parallelism not only shows an item to be contaminated, but it also identifies the contaminating variable (Hunter, 1986, 1987). Parallelism can be tested either by doing a confirmatory factor analysis including both the items and the outside variables (as noted by Hunter and Gerbing, 1982) or by computing the correlations between the items and the outside variables and examining that correlation matrix for parallelism (as noted by Hunter, 1973, and Tryon and Bailey, 1970).

To answer the question of whether the four modes of semantics measured the same construct, five models were tested. The first model specified a single factor in which all the items in the four different modes measured one general factor.
The second model specified a two-factor model in which one factor represented the positively worded items and the other factor represented the negatively worded items. The third model specified another two-factor model, with one factor representing the double negatives (Mode 4) and the second factor representing the affirmative semantic modes (Mode 1 and Mode 3) together with the "do not" form semantics (Mode 2). The fourth model specified three factors: one factor representing the affirmative semantic modes (Mode 1 and Mode 3), the second the "do not" form semantics (Mode 2), and the third the double negatives (Mode 4). The fifth model parameterized the four modes of semantics in the survey as four separate factors. The results from these five confirmatory factor analysis models are reported in Table 19. The 2-factor model (Modes 1, 2, & 3 vs. Mode 4) fit the data statistically and showed an overwhelming superiority over the other models. These results rendered strong indications of the inequivalence between the double negatives (Mode 4) and the rest of the items (Modes 1, 2, & 3). The 3-factor model also fit the data statistically; its results rendered some indication of inequivalence between the affirmative semantic modes (Modes 1 & 3), the "do not" form semantics (Mode 2), and the double negatives (Mode 4).

Table 19. Goodness-of-Fit Indices of the Five Models

Model                                                Chi-square    GFI      AGFI
1-factor (Modes 1, 2, 3, 4 as one general factor)    298.30        0.721    0.699
2-factor (Modes 1 & 4 vs. Modes 2 & 3)               317.50        0.689    0.532
2-factor (Modes 1, 2, 3 vs. Mode 4)                  222.67        0.956    0.889
3-factor (Modes 1 & 3 vs. Mode 2 vs. Mode 4)         259.70        0.938    0.874
4-factor (Modes 1, 2, 3, 4 as individual factors)    389.76        0.679    0.514

Note. GFI = goodness-of-fit index; AGFI = adjusted goodness-of-fit index.
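A sketch of how the best-fitting two-factor specification (Modes 1, 2, & 3 vs. Mode 4) might be expressed today is given below, using the third-party semopy package. The package choice, the model syntax, the factor names, and the item-naming convention are all assumptions for illustration; the original analysis predates these tools.

```python
# A sketch of the winning two-factor specification: all Mode 1, 2, and 3
# items load on one factor and the double-negative Mode 4 items on another.
# The semopy package and the m*_i column names are illustrative assumptions.
import semopy

def fit_two_factor(scored_items) -> None:
    """scored_items: DataFrame of item responses named m1_1..m4_22."""
    f1 = " + ".join(f"m{m}_{i}" for m in (1, 2, 3) for i in range(1, 23))
    f2 = " + ".join(f"m4_{i}" for i in range(1, 23))
    desc = f"""
    NotDoubleNeg =~ {f1}
    DoubleNeg =~ {f2}
    NotDoubleNeg ~~ DoubleNeg
    """
    model = semopy.Model(desc)
    model.fit(scored_items)
    print(semopy.calc_stats(model).T)   # chi-square, GFI, AGFI, RMSEA, etc.
```

Comparing the fit statistics of this model against the one-factor, the alternative two-factor, the three-factor, and the four-factor specifications would reproduce the kind of comparison summarized in Table 19.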
Summary

The data of this study were the responses of 849 students to a questionnaire. Preliminary steps were taken to ensure that the data were appropriately coded and sufficiently accurate before the start of the analysis. Statistical data analysis techniques of exploratory factor analysis, confirmatory factor analysis, analysis of variance, and reliability analysis were employed to answer the research questions.

CHAPTER V
SUMMARY, CONCLUSIONS, IMPLICATIONS, AND RECOMMENDATIONS

Summary of the Purposes and Procedures of the Study

A group of 849 Taiwanese students from two schools was selected for this study. These two schools were located in an urban setting. A questionnaire was constructed with 107 items, and the students' responses to this 107-item questionnaire were analyzed. A confirmatory factor analysis was conducted, and the reliability of the subscales was reported. Two research questions were formulated for this study, and statistical data analysis techniques of exploratory factor analysis, confirmatory factor analysis, analysis of variance, and reliability analysis were employed to answer them. The purpose of this study was to investigate how item wording affects rating responses. The first issue investigated was the influence of item wording on scale and item score means, distributions, reliabilities, and correlations between scales. The second issue examined was the item equivalence between different modes of item wording.

Discussion and Conclusion

Refinement of the survey development process is an essential component of any serious effort to enhance the reliability and validity of survey results. In the introductory chapter it was noted that item wording is an important consideration in survey development. A review of research on the impact of item wording on rating responses revealed a lack of consistent findings. The results of this study suggest a rather important conclusion for measurement instrument design: the inclusion of negatively worded items can result in less accurate responses and can therefore impair the validity of the obtained results. Thus, although the inclusion of negatively stated items may theoretically control or offset agreement response tendencies, their actual effect is to reduce response validity. This situation suggests that current recommendations concerning the desirability of including both positive and negative items on a questionnaire may be premature (and perhaps incorrect), and the inclusion of both apparently warrants much further investigation.

In examining the effects of item wording on item responses, this study has shed some interesting light on an issue that has heretofore been the focus of arguments based upon ambiguous data and results. Overall, the findings suggest that the use of double negatively worded items may result in the measurement of a different construct than is intended. This outcome is in direct contrast to the conventional psychometric recommendations summarized earlier. This seems to be generally a function of the double negatively worded items themselves, and not of their exertion of a strong contextual effect on the positive items.

The data from the present study provide strong evidence that the insertion of the word "not" has a profound influence on student responses. Two trends were indicated in the results. First, items that induced a more favorable response in the positive form induced a less favorable response in the negative form. In other words, respondents were less likely to indicate agreement by disagreeing with a negatively phrased item than to indicate agreement by agreeing with a positively phrased item. Second, items that induced an unfavorable response in the positive form were less likely to induce an unfavorable response in the negative form. In addition, the confirmatory factor analyses indicated that the factor structures were clearly different for the positive and the negative forms.

The findings from the present study suggest that caution should be exercised in the use of negative item phrasing. Although it may be useful to include some negative items to reduce response bias, these items need not be used in computing a total attitude score. Double negatively phrased items, in particular, should be used with great caution. Despite the conventional wisdom so often found in measurement textbooks, recent pronouncements by researchers in the area of item phrasing have suggested that negatively phrased items, especially double negatives, reduce the validity of a questionnaire. The present study clearly corroborates that position. The research to date suggests that a positive-to-negative transformation changes an item's psychometric characteristics and, more importantly, changes the construct that the item is intended to measure. However, the present study and the other studies that have been reviewed do not prove that positively phrased items are necessarily better indicators of attitude. Nevertheless, there are some hints that negatively phrased items are less valid. First, there is the plausible argument that respondents may not understand that they can indicate agreement by disagreeing with a negative statement.
Similarly, they may not understand that they can indicate disagreement by agreeing with a negative statement. A word of caution concerning the use of negatively worded items is appropriate: if negatively worded items are to be used, it would be wise to ensure, during scale development, that their inclusion does not present a methodological confound. Rather than having to contend with alternative interpretations some two decades after scale development (e.g., McGee et al., 1989; Tracy & Johnson, 1981), it would be preferable to ensure that constructs are not exclusively defined by negatively worded items during scale development.

Implications

Survey development is a difficult task. This study attempts to provide some insights that will begin to answer questions pertinent to the impact of item wording on rating responses. The results of these analyses have a clear implication for researchers who factor analyze data in which the wording of items is varied: such researchers should be cautious of factors loaded primarily with negatively keyed items. Likewise, consumers of this research should question substantive interpretations of such negative factors. Researchers should be especially cautious concerning negative factors when responses to questionnaires are "involuntary" or when there is a reason to sabotage the research effort. The respondents' data should be examined to detect unusual response patterns. If negative and positive items are recoded so as to be consistent, then a respondent whose primary responses on a 7-point scale are 5 and 6 would be suspect if the recoded negatively worded item responses were 2 and 3. Responses from these individuals would best be deleted prior to any further analyses. A more systematic analysis of these respondents is possible with the use of item response theory: latent trait analyses allow the determination of which item responses made by an individual are not well predicted by the IRT model, so unusual responses can be detected at the individual level. However, since the sample size and number-of-items requirements for IRT analyses are large, latent trait parameters may not be obtainable for many instruments; a simple consistency screen of the kind described here is sketched below.

This investigation confirms the findings of earlier studies. Taken together, these studies offer important implications for measurement practice and theory development. One practical implication is that double negative items should not be used. Items on a test or survey should be consistently positively worded with respect to the construct being measured. Positively and negatively worded items are not bipolar indicators of a common trait continuum, and therefore construct unidimensionality cannot be maintained by simply reversing the scale points associated with the negative items. Consequently, an inconsistent direction of wording is likely to change the intended factor structure of a test or survey. Researchers should not deliberately use double negatively worded items, even for the purpose of countering response-set effects. Including many double negatively worded items in a test may have the impact of altering the original operational definition of the underlying construct. In order to maintain the intended original factor structure, the connotation of the items must be consistent, in one direction or the other. Research may be needed to reexamine the construct validity of other tests that use a large number of double negatively worded items.
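Following the screening heuristic described above, the sketch below flags respondents whose recoded negative-item responses sit far from their positive-item responses. The threshold and the column names are illustrative assumptions, not values taken from the study.

```python
# A sketch of the response-pattern screen described above: after recoding,
# a consistent respondent should give similar mean responses to positively
# and negatively worded items. The 2-point threshold and the column-naming
# convention are illustrative assumptions.
import pandas as pd

def flag_inconsistent(df: pd.DataFrame, threshold: float = 2.0) -> pd.Series:
    """Return a boolean Series marking suspect respondents.

    df holds recoded responses; positive items are the m1_*/m4_* columns
    and (recoded) negative items are the m2_*/m3_* columns.
    """
    pos = df[[c for c in df.columns if c.startswith(("m1_", "m4_"))]].mean(axis=1)
    neg = df[[c for c in df.columns if c.startswith(("m2_", "m3_"))]].mean(axis=1)
    # e.g., mostly 5s and 6s on positive items but 2s and 3s on recoded
    # negative items yields a gap of about 3 points and is flagged.
    return (pos - neg).abs() > threshold

# suspects = flag_inconsistent(recoded)
# cleaned = recoded[~suspects]
```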
It makes sense that the presence of a construct, as indicated by a positively worded item, is not necessarily the opposite (a simple reversal of scale points) of the absence of that construct, as indicated by its negatively worded counterparts on a test. It seems that the constructs in measurement are inseparable from the way their item indicators are connoted. Researchers using Likert-type scales routinely recode, or flip, the response scales of the negatively worded items. Any effect arising from this measurement factor therefore needs to be avoided, considered, held constant, or separated from the effect of interest by cautious researchers.

The results of this study have important implications for researchers who analyze data in which the wording of items is varied. Questionnaire instructions might include a warning to potential respondents that some questions will be negatively keyed and that they should attend carefully to all items. The overall length of an instrument that uses the same response format may also be a concern: respondents may become fatigued or bored when they answer many like-sounding items. Trott and Jackson (1967) found that an acquiescence factor was strongly associated with the speed of presentation of personality items. It may be useful to experiment with the wording of directions and the length of questionnaires, as well as with the serial position of any negatively keyed items. Further, the context in which data are collected could be varied in an effort to assess the effect of context on the presence or absence of negative factors.

The possible effect of item wording on overall ratings is particularly relevant to many of the student and employer rating instruments available today. Increasingly, emphasis is being placed upon the need for valid and reliable means of assessing teaching and working performance. Rating scales used to evaluate a new project, person, or course of instruction often include both negatively and positively stated items about the object or person being evaluated. What has yet to be determined is the possible effect of item wording on raters' evaluations. Do negatively worded items encourage a more critical evaluation than positively worded items do? Negatively worded items may highlight the negative aspects or faults of the object or person being evaluated, or may serve to suggest unconsciously to the rater particular problem areas anticipated by the evaluator. If so, rating scale evaluations may be affected as much by the wording of the items as by the quality of the object or person being evaluated.

Several other researchers (Andrich, 1983; Campbell & Grissom, 1979; Simpson, Rentz, & Shrum, 1976) have investigated whether phrasing can influence overall attitude levels on different attitudinal questionnaires. These researchers have all concluded that item phrasing makes a difference. However, the results they report cannot easily be reconciled with each other or with the present study. One important differentiating feature is that in this study the word "not" was used to create parallel negative statements, whereas the other researchers created negative statements on an intuitive basis. Rorer (1965) suggested that this latter procedure often leads to negative statements that reflect different content or ideas; consequently, such statements are not direct opposites of the original positive statements.
It is perhaps because of this problem that many affective scales contain the word "not" or the prefix "un" to create a negative statement (Coopersmith, 1967; Marsh, Smith, Barnes & Butler, 1983; Piers, 1969).

Given the relative frequency with which a negative factor is reported in the literature and the ease with which such a factor is produced, researchers should be especially cautious when their factor analyses produce factors that are loaded primarily by negative items. Further, those who design questionnaires may also want to take steps to minimize problems during the construction of their instruments and with the directions that accompany them. On the basis of the data analyses, this study concludes that double negative and semantically positive item contents do not measure essentially the same construct. Furthermore, these double negative items, created by applying the "do not" form to negatively worded statements, introduce ambiguity and confusion. This problem deserves closer inspection through more specifically designed studies. The present study also suggests some interesting thoughts about measurement theory that may be worth pondering.

Cross Validation Using a New Sample

The problem of situational specificity is always a major concern in validity studies. Validity generalization research is based on the application of a particular set of meta-analytic methods (Hunter, Schmidt, & Jackson, 1982) to criterion-related validities of tests. This meta-analytic method was developed as a way of attacking a critically important problem in psychology: the problem of situational specificity. The belief in situational specificity was based on the empirical fact that considerable variability was present from study to study in observed validity coefficients, even when the jobs and tests studied appeared to be similar or identical. The explanation developed for this variability was that the factor structure of job performance differed from job to job. The conclusion was that validity studies must be conducted in every setting; that is, that validity evidence could not be generalized across settings.

Schmidt and Hunter (1981) hypothesized that most or all of the variance in study correlations across studies and settings was due to artifactual sources, such as sampling error, and not to real differences between jobs. Artifacts other than sampling error, such as differences between studies in measurement error and in range restriction, can also cause variance in study outcomes. Because the most common form of validity evidence is the correlation coefficient between predictor and criterion scores, it is important to recognize that restriction of the range of scores on the questionnaire may result in attenuation of the observed validity coefficient (a small numerical illustration follows below). One example is the instance in which the test being validated is used for selection purposes before its validity has been established. Other things being equal, the greater the variability among the observations, the greater the value of the correlation coefficient. Thus, restriction of range occurs on the questionnaire because of explicit selection on that scale. As the correlations among items increase, the test becomes more homogeneous in content (an increase in internal consistency). Moreover, when we compute statistics from a set of data, we obtain the best estimates for those particular data. If these statistics are used in further calculations, we capitalize on the idiosyncrasies of the original data and therefore overestimate in the second set of calculations.
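Before turning to cross-validation, the attenuation point above can be made concrete. The sketch below applies the standard correction for direct range restriction (Thorndike's Case II formula), under the assumption that selection occurred directly on the predictor; the numbers are hypothetical:

```python
import math

def correct_range_restriction(r_obs, sd_restricted, sd_unrestricted):
    """Estimate the unrestricted predictor-criterion correlation from a
    correlation observed in a directly range-restricted sample."""
    u = sd_unrestricted / sd_restricted   # u > 1 under restriction
    return (r_obs * u) / math.sqrt(1.0 + r_obs**2 * (u**2 - 1.0))

# Hypothetical: r = .30 observed in a selected group whose standard
# deviation is half that of the full applicant pool.
print(round(correct_range_restriction(0.30, 1.0, 2.0), 3))  # 0.532
```

The corrected value rises because, other things being equal, greater variability among the observations yields a larger correlation coefficient, exactly the point made above.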
In Cureton’s study (Cureton, 1950), the author said that one should not use a data set for conducting an item analysis and then use the results of that analysis to compute validity coefficients. When items are deleted from an original sample, some random errors are introduced. The cross validated sample will result in lower item correlations. 75 Cultural Specificity of the Ouestionnaira Crocker and Algina (1986) mentioned that the ultimate criterion for the number of factors to interpret is replicability. When the same variables are investigated in different studies, the factors that are replicated in the studies are those that should be interpreted. Due to situational specificity of Taiwan, the survey may not generate the same factors shown by this study when applied to a new sample. The items in the survey are very culturally specific. For instance, items such as "Compared to others, I feel like I don’t measure up," "I find that I don’t live up to my own standards and ideals," "I look down on myself because of my flawed character," "I feel so insignificant to others, as if I were invisible," and "I always seem to fall short of my aspirations," “I see myself as intact and without personal defects, " while familiar to students of Chinese ethnicity, can have very different implications when administered to students of different cultures. In general, Chinese think about social institutions such as school quite differently from American educators, seeing teachers as professionals with authority over their children's schooling. Teachers in the Chinese culture are accorded a higher status than teachers in the United States. They believe that parents are not supposed to interfere with school processes. Chinese people highly value formal education, and believe that high achievement brings honor and prestige to the family, while failure brings shame. The intense pressure upon children to succeed often leads to intergenerational conflicts, and many Chinese children suffer from test anxiety, social isolation, and low self-esteem because of their mediocre school performance. They have difficulties accepting learning 76 disabilities and depression, and believe that psychological distress is an indication of organic disorders and harmful to both the individual and the family. Confucian ideals, which include respect for elders, deferred gratification, and discipline, are a strong influence. Cohesion and harmony are valued above individual achievement. Hard work, duty, obligation, frugality, and responsibility are also priorities. Most Chinese parents teach their children to value educational achievement, respect authority, feel responsibility for relatives, and show self-control. Chinese parents tend to view school failure as a lack of will, and they address this problem by increasing parental restrictions. Chinese children tend to be more dependent, conforming, and willing to place family welfare over individual wishes than are American children. Self-effacement is a trait traditionally valued by Chinese culture. Chinese children tend to wait to participate, unless otherwise requested by the teacher. Having attention drawn to oneself, for example, having one's name put on the board for misbehaving, can bring considerable distress. Many Chinese children have been socialized to listen more than speak, to speak in a soft voice, and to be modest in dress and behavior. Discipline and obedience are highly valued in the Asian cultures, whereas creativity and freedom are important in Western cultures. 
The definition of appropriate student attitudes may also differ because of differences in the interpretation of students' behaviors.

Issues Concerning the Use of Tests

Attitude measures are measures of typical behavior and are distinguished from ability tests, which measure maximum performance (Cronbach, 1960). By measuring attitudes, we want to know what a person normally does rather than what he or she can do under exceptional motivation. Valid information about attitudes can be valuable to teachers, counselors, and students. Attitude inventories can help in identifying the problems and needs of students, provided that the students are truthful in answering the items. These inventories provide a more complete and holistic understanding of the students. However, the results should not be treated as the sole source of information; teachers and counselors can also identify problems and needs through observations and interviews. Since these tests usually have lower reliabilities and validities than cognitive tests, the interpretation of the results should be handled with great caution (Mehrens & Lehmann, 1984).

It is also important to consider the technical quality of test materials against accepted standards. The information about a test should include evidence of reliability and validity; information regarding the method of estimating reliability and the population on which it was measured; and the types of validity evidence, including validity relevant to the intended purpose of the test. Teachers and counselors need to ask themselves these questions: Is the test appropriate for the person who is being tested? How are the results going to be used? Are the test scores reliable enough? Does the test possess enough validity to be used for the purpose for which it is planned? Is the welfare of the student being taken into consideration in the choice and use of tests? Will confidentiality become an issue if the subject does not want to reveal himself or herself to the tester?

Another important issue is the competence of the teacher or counselor who will be administering the various available assessment instruments. Do those who use various tests have sufficient knowledge and understanding to select tests appropriately and to interpret their results? Since different tests demand different levels of competence for their use, users must recognize the limits of their competence and make use only of instruments for which they have adequate preparation and training.

Lastly, the presentation of test results also requires significant attention. Teachers and counselors should avoid labeling when communicating the results of a test to students. Labeling can stigmatize a person even when such terms can be justified. Labels not only suggest a lack of any chance to grow or change, but they may also become self-fulfilling prophecies. Instead, interpretations should be presented in terms of possible ranges of academic achievement or formulations of interventions to assist the individual in behaving more effectively.

Limitations

As an investigation of the impact of item wording on rating responses, the present study has a number of limitations. First, since the survey was developed according to the context of the United States, the questionnaire may not be relevant to situations in other countries; therefore, applications of this questionnaire to other cultures should be made with caution. Second, the subjects were students in Taiwan, and they do not represent a random sample from the population.
The results are therefore suggestive rather than definitive, and cannot be generalized to other populations without qualification. Third, the factor structure generated by the factor analyses may represent a chance phenomenon, which might not hold up in a second study. A cross-validation study is needed in order to validate the generalizability of the factor structure. Fourth, the conversion of item wording from one mode to another may add some confusion and ambiguity to the original meaning of the items. The translation of the questionnaire from English to Chinese may add further ambiguity to the original version of the questionnaire; it is sometimes difficult to find an appropriate translation of certain words despite one's best efforts. Nonetheless, the results provide practical implications for test developers and measurement researchers.

Recommendations

The study has raised a number of issues which future work should address. This research sheds some light on the effects of item wording on rating responses, and suggests other possible investigations of problems of interest to survey developers and educational researchers. Several directions can be suggested for future research on the effects of item wording on rating responses. One direction would be to replicate the study using larger samples of students from countries other than Taiwan. Replication with other forms might shed more light on the pattern of interaction between each of these factors and form. Only 107 items and 849 students were used in this study. A relevant question is the extent to which the results of the study can be generalized to students in other cultures. Thus, another direction for future research would be to replicate the study using subjects and items from other disciplines, such as politics and religion, which are more bipolar.

Further research might reveal answers to such questions as:
1) Will the factor structure remain the same when the questionnaire is administered to another sample in a different setting?
2) How will the content of the items affect the item wording, which in turn may affect the item responses?
3) Is there a relationship between item content, item wording, and rating scale?
4) What are the differences between male and female students in responding to this instrument?

Despite the work that still needs to be done in this area, this study provides some insights into the field of survey development. Researchers may be able to gain new insights resulting in more efficient survey construction. Further research into item wording and item responses would not only improve the accuracy of the inferences that may be made from these surveys, but may also have an important impact on survey development.

APPENDICES

APPENDIX A
Four Modes of Item Wording

[Appendix A table: each scale item presented in each of the four modes of item wording (Mode 1 to Mode 4); the landscape-printed table is not legible in this copy.]
APPENDIX B
Questionnaire in English

Code Number:

INSTRUCTIONS

For each item, fill in the circle on the answer sheet for that item which corresponds to the word or phrase that best describes yourself. Read the response options carefully before making your selection. These survey results will be used in a research study. Please read both the item and the response options carefully before selecting your answer. Your answers will be kept strictly confidential. Results of this survey will appear in summary or statistical form only, so that individuals cannot be identified. Thank you for your time and cooperation.

TO THE STUDENT

There are five possible responses to each statement: Never (1), Seldom (2), Sometimes (3), Often (4), Almost Always (5). For each statement select ONE response. Please mark the bubble which best indicates your agreement with the statement. There are no right or wrong responses.

1. I do not see myself as flawed and with personal defects.
2. I am pleased with myself.
3. I feel that my character is intact.
4. I am not as good as my friends.
5. I am not unhappy with myself.
6. I do not always fall short of my aspirations.
7. I feel like a failure.
8. I am a worthless person.
9. I look up to myself because of my good character.
10. I do not feel proud of myself.
11. I worry that my score on a test will not be one of the highest in class.
12. I feel there is something defective in my character.
13. I am a worthwhile person.
14. I do not feel like I am good for nothing.
15. Statements that some teachers make about my schoolwork hurt my feelings.
16. I do not feel ashamed of myself.
17. I am not disappointed with myself.
18. I am inferior to my friends.
19. Talking in front of class makes me feel nervous.
20. Compared to others, I feel like I don't measure up.
21. I do not look down on myself because of my good character.
22. If I could live my life over, I would change almost nothing.
23. I do not feel important.
24. I like myself.
25. I feel I am a complete success as a person.
26. I do not feel like a failure.
27. I do not feel unimportant.
28. I am satisfied with my life.
29. I am embarrassed to face my friends or family if I have made a low grade on a test or assignment.
30. I feel unimportant.
31. I don't feel adequate as a person.
32. I see myself as flawed and with personal defects.
33. The conditions of my life are excellent.
34. I don't feel like I am good for something.
35. I feel inadequate as a person.
36. I am satisfied with myself.
37. I feel that I am just not good enough.
38. I have trouble sleeping well the night before an important examination.
39. I feel like I am good for nothing.
40. I do not feel like a useless person.
41. Compared to other people I feel like I do measure up.
42. I don't dislike myself.
43. I find that I don't live up to my own standards or ideals.
44. I do not feel I am a complete failure as a person.
45. I avoid talking to my classmates about schoolwork because they might make fun of me.
46. I am not satisfied with myself.
47. I am not a worthless person.
48. I find that I live up to my own standards and ideals.
49. I feel like I am good enough.
50. I feel ashamed of myself.
51. I do not like myself.
52. I feel like a useless person.
53. I don't feel there is something defective in my character.
54. Compared to others, I feel like I am less of a person.
55. I feel so insignificant to others, as if I were invisible.
56. Compared to others, I do not feel like I am less of a person.
57. I feel worthy as a person.
58. So far I have gotten the important things I want in life.
59. I am not a worthwhile person.
60. I don't feel worthy as a person.
61. I feel like a useful person.
62. I feel so nervous about some of my classes that it is hard for me to attend.
63. I do not feel that I am just not good enough.
64. I do not feel significant enough to others that people notice me.
65. I am afraid to ask teachers to explain a difficult concept a second or third time.
66. I feel that I am inferior to most of my friends.
67. I always seem to live up to what I aspire to be.
I don’t feel I am defective as a person, as if 0 O O O O something is basically wrong with me. 69. I become frightened when a teacher calls 0 O O O O on me in class. 70. I am upset about so many things that I O O O O 0 cannot concentrate on or do my schoolwork. 71. I don’t feel inferior to most of my friends. 0 O O O 72. I don’t feel I am a complete success as a 0 person. 73. I find that I fall short of my own standards 0 O O O O or ideals. 74. I feel upset when I have to take a test. 0 O O O O 75. I dislike my self. 0 O O O O 76. In most ways my life is close to my ideal. O O O O O 77. I feel I am a complete failure as a person. 0 O O O O 91 Never Seldom Sometimes Often Almost Always l 2 3 4 5 78. I do not feel that I am superior to most of O O O O O my friends. 79. I do not feel worthless as a person. 0 O O 80. I always seem to fall short of my 0 O O O O aspirations. 81. I see myself as intact and without personal 0 O O O O defects. 82. I don’t feel like a usefiil person. 0 O O O O 83. I feel proud of myself. 0 O O O O 84. I am unhappy with myself. 0 O O O O 85. I feel successful. 0 O O O O 86. I don’t look up myself because of my 0 O O O O flawed character flaws. 87. I feel important. 0 O O O 88. I feel adequate as a person. 0 89. I don’t see myself as intact and without 0 O 0 personal defects. 90. I become tense and nervous when I am 0 O O O O studying. 91. I find that I don’t fall short to my own 0 O O O 0 standards or ideals. 92. I look down on myself because of my 0 O O O O flawed character. 93. I do not feel that I am good enough. 0 O O O O 94. I feel disappointed with myself. 0 O O O O 95. I amjust as good as my friends. 0 O O O O 96. I am not pleased with my self. 0 O O O O 97. I feel worthless as a person. 0 O O O O 98. I am not inferior to my fi'iends. O O O O O 92 Never Seldom Sometimes Often Almost Always l 2 3 4 5 99. 1 do not feel successful. 0 O O O O 100. I feel so significant that people notice 0 O O O 0 me. 101. I don’t feel my character is intact. O 102. I feel I am superior to most of my friends. 103. I worry about how well I am doing in my 0 O O O 0 classes. 104. I feel like I am good for something. 0 O O O O 105. I don’t always seem to live up to my 0 O aspirations. 106. I would be afraid to tell a teacher that he 0 O O O O or she made a mistake in explaining an assignment or in working a problem. 107. I don’t feel insignificant to others, as if I O O O O 0 were invisible. Gender: Grade: Age: 93 APPENDIX C Questionnaires in Chinese 94 58881 1'4? 453% a‘éfifi’ifi $815885?! WM? 57,? fitflfiafié‘r 1318165258 iiigfi‘zau, 581%;Bfi‘éafi- “41313881 :24?) 13317883 *$%%%QI$EI§EZRJ fifh’i’FZSz“ at , aamwas 88%fififi. 4”%K$%?3éfii¥a F=375~3*%#%VM§5%5=IL iaél’fiikifi. FIiVAIIfl/xfiai’ifiififii'lfl: ##451 éfiésfifivAt’F 8988141435? fi-“fifl’fi'ifil‘ffi‘éé‘lfi’fiz am 1%: 753:5: i314? ”fl? 1 2 3 4 5 fi-wfiééfififi‘lflfiifii £11444 ncfiaxiléAé’Jtafli 5285113 E§5%. Ffififia‘gfifii’ifii‘léa‘z» 95 8:5 5% #8 3* 8* 10. ll. 12. 13. 14. 15. 16. 17. 18. 19. fi$fiaéiflfifi 8888A 889688 81886A88§ 8888888888 fififié$8® 8$iififiififi 88188 fit—mafimmwA §336888A8888 8$mE688 88688888888 888888—8 81836A8888 ai—mfiflfiwA 881836—¥88 $88888¥8888 #88878 fi$868$ afiaaxxz fiw$ififlfli 88888888888 88 0 0000000000 0000 0000 96 000000000 00 0000 0000 0 0000000000 0000 0000 000000000 00 0000 0000 000000000 00 0000 0000 fi$ «a ifi fifi fit 1 2 3 4 5 20. fiwmtfifififitfiaa o o o o o EBA—3r 21. fiziaaaéhzififixkfé o o o o o txgaa 22. afllfifimifiifiififiéfi O o o o o fii$$fi$i£flifi 23. 
APPENDIX D
Descriptive Statistics of the Items

Table 21. Descriptive Statistics of the Items
Item   Mean     SD       Variance
1      3.0212   1.4182   2.011
2      3.4511   1.1182   1.250
3      3.2768   1.1886   1.413
4      3.2556   1.0813   1.169
5      2.3934   1.0294   1.060
6      2.8363   1.2268   1.505
7      3.5053   1.0496   1.102
8      4.1567   1.0795   1.165
9      3.3934   1.2154   1.477
10     3.5995   1.1492   1.321
11     3.4087   1.4918   2.225
12     3.7538   1.1136   1.240
13     3.4570   1.2231   1.496
14     3.1390   1.3905   1.933
15     3.9199   1.1516   1.326
16     3.3816   1.3908   1.934
17     3.3675   1.1902   1.417
18     3.4134   1.1502   1.323
19     2.9870   1.4303   2.046
20     2.8728   1.2364   1.529
21     3.4912   1.3238   1.763
22     2.3840   1.4146   2.001
23     3.7503   1.2519   1.567
24     3.5783   1.2291   1.511
25     2.7491   1.0379   1.077
26     2.9411   1.1228   1.261
27     2.9835   1.3675   1.870
28     3.3922   1.1927   1.423
29     2.5689   1.2969   1.682
30     3.8799   1.1845   1.403
31     3.4064   1.1507   1.324
32     4.1519   1.0160   1.032
33     3.0777   1.1710   1.371
34     3.7986   1.1759   1.383
35     4.2108   1.0252   1.051
36     3.1955   1.1470   1.315
37     3.9305   1.0511   1.105
38     3.9576   1.2046   1.451
39     3.9847   1.1121   1.237
40     3.2509   1.3879   1.926
41     2.3557   1.1173   1.248
42     3.5406   1.2935   1.673
43     3.6231   1.2021   1.445
44     3.1178   1.2411   1.540
45     4.2968   1.0054   1.011
46     3.6019   1.1642   1.355
47     3.2391   1.4027   1.968
48     3.2827   1.2683   1.609
49     3.1449   1.1570   1.339
50     4.1343   .9826    .965
51     4.0459   1.0078   1.016
52     4.1543   .9993    .999
53     3.0495   1.3484   1.818
54     3.4064   1.1352   1.289
55     3.8940   1.2279   1.508
56     2.9069   1.2226   1.495
57     3.4346   1.2212   1.491
58     2.9435   1.2350   1.525
59     3.9741   1.1261   1.268
60     3.9093   1.1574   1.340
61     3.4676   1.2050   1.452
62     4.1837   1.1208   1.256
63     3.1637   1.3685   1.873
64     3.1225   1.2790   1.636
65     3.0989   1.4801   2.191
66     3.4087   1.2137   1.473
67     2.9317   1.2785   1.634
68     3.3027   1.3835   1.914
69     3.4770   1.2121   1.469
70     3.0907   1.2738   1.623
71     2.9918   1.2853   1.652
72     3.4287   1.1368   1.292
73     3.3899   1.2053   1.453
74     3.8587   1.1436   1.308
75     4.0342   1.0914   1.191
76     2.7044   1.1036   1.218
77     3.9258   1.0958   1.201
78     3.3663   1.1641   1.355
79     3.0224   1.3867   1.923
80     3.5701   1.1886   1.413
81     2.7409   1.3126   1.723
82     3.7915   1.1676   1.363
83     3.1390   1.2110   1.467
84     2.4935   1.1473   1.316
85     3.0389   1.0934   1.195
86     3.7703   1.1504   1.323
87     3.2874   1.2102   1.464
88     3.1602   1.1504   1.323
89     3.4158   1.2477   1.557
90     4.2686   1.0306   1.062
91     3.0707   1.2756   1.627
92     4.0047   1.1451   1.311
93     3.4747   1.1583   1.342
94     3.7244   1.1230   1.261
95     2.9882   1.2102   1.464
96     3.7880   1.0278   1.056
97     4.0141   1.0858   1.179
98     2.9788   1.2159   1.478
99     3.5807   1.1131   1.239
100    2.3581   1.1519   1.327
101    3.6219   1.1963   1.431
102    2.6396   1.2023   1.445
103    2.6631   1.2893   1.662
104    3.4276   1.2275   1.507
105    3.4664   1.2186   1.485
106    3.7126   1.2381   1.533
107    2.7385   1.4706   2.163

APPENDIX E
Descriptive Statistics of Four Modes of Item Wording
$24 ~64: 862 H: H~.H 6%; £8.96 6 96.6 :9 H ~66: «$3 o-HH 44~2 ~44: 6 :64 E4: 362 «696 .23 48966 64 H 4689 482 £42 264~ ”RH: 292 9: H.H H H44 9696 9? H6966 :6 H 262 843 4H2: ~48.4 ”SH: 696.4 H6-.H $3.». .2699 9H: H .5965 :32 Son—dam 532 £5.65 :32 .2565 :32 u on... use: 4 962 m 962 ~ 6.62 H 962 $54.83 :5: .3 meta: .58— .? 85395 253.88: .NN 039,—. 107 .m—Hwomvm fig mfiuflfigm omnNH $86 mmoNH oowm.m HchH Hmmcd mwcmH Ram». :30 >8 3 a: 3: H 85 HUSH H .maoswbmmm wwmmH mcmwd can: Hohmfi ow HN.H $945 3%: Emmd >8 9 a: o>HH 3 800m mwnga H . . . . . . . . .68 8:8 we: H mwmh m ahmm H 343 m comm H mmNH m 32 H mem N oHHHoonH 35 38.895 8 H00: whomH mmwad mXHHH 35d mHmNH momhd NoHNH Emma .EwtonHEH bo> HouH H nmoYH Hammd 33H Sm H .v HcNHH H435 HmNNH. £38m .583 £23553 a Em H mNNH .H H Hwad wax: mmom.m HmH H.H momma vmoo. H 38.4” .Hsmmmooosm HouH H A893 9 mm H HVNH w: H.m ammoH wmmmd mom H.H Swim ammoH Havnd 3083 329:8 a Ea H HooH H . . . . . . $55088 33H 32 m HNH H.H H.433 m amnH H 82. m thN H 3.9 m 8H woow Ea H 8H: :3 H .6965 s62 5.56% :62 36:66 E62 $5.3 562 Ila—8:. v 9.52 m «H52 N ace—Z H £52 €9.69 - 654... 108 APPENDIX F Item Total Statistics 109 Table 23. Item-total Statistics (88 items) Scale Scale Corrected Kean Variance Iten- Squared Alpha it Item if Item Total Multiple 1! Item Deleted Deleted Correlation Correlation Deleted 01 295.2108 2823.6147 .2599 .3038 .9672 02 294.7809 2803.9142 .5036 .5052 .9666 Q3 294.9552 2800.3754 .5009 .5405 .9666 04 294.9764 2816.4027 .4115 .4525 .9668 05 295.8386 2919.3454 -.4999 .4655 .9683 06 295.3958 2842.8314 .1564 .2749 .9673 07 294.7267 2795.8663 .6111 .5641 .9664 08 294.0754 2789.2820 .6521 .6412 .9663 09 294.8386 2786.4232 .5993 .5053 .9664 010 294.6325 2801.107? .5128 .3768 .9666 012 294.4782 2809.2286 .4603 .5335 .9667 013 294.7750 2777.1156 .6686 .6023 .9663 014 295.0931 2810.9123 .3524 .3551 .9670 016 294.8504 2804.8915 .3935 .5178 .9669 017 294.8645 2799.2045 .5096 .4818 .9666 018 294.8186 2801.1345 .5121 .5478 .9666 020 295.3592 2815.3201 .3655 .4107 .9669 021 294.7409 2794.2653 .4915 .4781 .9666 023 294.4817 2783.4976 .6036 .5287 .9664 024 294.6537 2786.8115 .5894 .5456 .9664 025 295.4829 2794.5896 .6300 .5538 .9664 026 295.2909 2798.7938 .5450 .4635 .9665 027 295.2485 2813.8568 .3383 .3276 .9670 030 294.3522 2779.0350 .6755 .7050 .9663 031 294.8257 2802.9601 .4967 .4201 .9666 032 294.0801 2797.2412 .6190 .6095 .9664 034 294.4335 2794.1374 .5573 .4527 .9665 035 294.0212 2789.9312 .6816 .6864 .9663 036 295.0365 2779.5494 .6940 .6493 .9662 037 294.3015 2788.9609 .6732 .6447 .9663 039 294.2473 2786.0213 .6605 .6328 .9663 040 294.9812 2791.0963 .4894 .5006 .9666 041 295.8763 2802.5472 .5157 .4958 .9666 042 294.6914 2787.1264 .5565 .5248 .9665 043 294.6090 2816.8658 .3643 .4875 .9669 044 295.1143 2795.3985 .5171 .4718 .9666 046 294.6302 2794.6178 .5591 .5455 .9665 047 294.9929 2798.7028 .4321 .4327 .9668 048 294.9494 2800.0128 .4707 .5382 .9667 049 295.0872 2780.579? .6793 .6238 .9663 050 294.0978 2809.2110 .5244 .4524 .9666 051 294.1861 2794.6941 .6484 .6698 .9664 052 294.0777 2791.3076 .6866 .7079 .9663 053 295.1826 2805.9065 .3995 .4268 .9668 054 294.8257 2788.2455 .6278 .6223 .9664 055 294.3380 2776.2240 .6729 .6598 .9663 056 295.3251 2826.9414 .2797 .3559 .9671 057 294.7974 2765.1146 .7646 .7344 .9661 059 294.2580 2790.1280 .6170 .5605 .9664 060 294.3227 2785.9193 .6346 .5532 .9663 061 294.7644 2772.1968 .7183 .6596 .9662 063 295.0683 2801.7383 .4224 .4447 .9668 064 295.1095 2832.5458 .2249 .3280 .9672 066 294.8233 2779.0371 .6587 .6387 .9663 llO Table 23. 
Q67    295.3004        2810.9934         .3849               .4718              .9669
Q68    294.9293        2790.0658         .4982               .5109              .9666
Q71    295.2403        2822.0955         .3006               .3952              .9670
Q72    294.8033        2802.832?         .5041               .4995              .9666
Q73    294.8422        2802.7345         .4750               .5388              .9667
Q75    294.1979        2785.8405         .6751               .7021              .9663
Q77    294.3062        2782.5854         .7008               .7019              .9662
Q78    294.8657        2809.8640         .4342               .4262              .9667
Q79    295.2097        2808.0055         .3734               .4192              .9669
Q80    294.6620        2805.5118         .4596               .4987              .9667
Q81    295.4912        2804.8328         .4189               .3848              .9668
Q82    294.4405        2803.6005         .4839               .4000              .9666
Q83    295.0931        2776.9170         .6771               .6732              .9662
Q84    295.7385        2941.9410         -.6304              .6423              .9687
Q85    295.1932        2784.8164         .6828               .6663              .9663
Q86    294.4617        2790.1592         .6033               .5332              .9664
Q87    294.9446        2777.4014         .6737               .6537              .9663
Q88    295.0718        2785.3592         .6433               .5982              .9663
Q89    294.8163        2848.2303         .1127               .2130              .9674
Q91    295.1614        2800.9539         .4608               .4844              .9667
Q92    294.2273        2779.7584         .6934               .6727              .9662
Q93    294.7574        2791.0212         .5919               .5640              .9664
Q94    294.5077        2785.5403         .6580               .6557              .9663
Q95    295.2438        2789.8520         .5748               .4851              .9665
Q96    294.4441        2800.2472         .5837               .6450              .9665
Q97    294.2179        2782.3758         .7093               .7365              .9662
Q98    295.2532        2814.3474         .3796               .3736              .9669
Q99    294.6514        2796.5788         .5690               .5383              .9665
Q100   295.8740        2810.145?         .4367               .4556              .9667
Q101   294.6101        2805.5282         .4564               .4411              .9667
Q102   295.5925        2804.0979         .4654               .4837              .9667
Q104   294.8045        2782.4641         .6242               .5359              .9664
Q105   294.7656        2819.9650         .3350               .4455              .9669
Q107   295.4935        2829.4861         .2119               .2896              .9673

BIBLIOGRAPHY

Adorno, T. W., Frenkel-Brunswik, E., Levinson, D. J., & Sanford, R. N. (1950). The Authoritarian Personality. New York: Harper.
Ahlawat, K. S. (1985). On the negative valence items in self-report measures. Journal of General Psychology, 112(1), 89-99.
Anastasi, A. (1982). Psychological Testing (5th ed.). New York: Macmillan.
Anderson, L. W. (1981). Affective characteristics in the schools. Boston: Allyn & Bacon.
Andrich, D. (1983). Diagnosing and accounting for response sets provoked by items of a questionnaire. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, PQ.
Andrulis, R. S. (1977). Adult assessment: A sourcebook of tests and measurement for human behavior. Springfield, IL: Thomas.
Benson, J., & Hocevar, D. (1985). The impact of item phrasing on the validity of attitude scales for elementary school children. Journal of Educational Measurement, 22, 213-240.
Bentler, P. M., Jackson, D. N., & Messick, S. (1971). Identification of content and style: A two-dimensional interpretation of acquiescence. Psychological Bulletin, 76, 186-204.
Bentler, P. M., Jackson, D. N., & Messick, S. (1972). A rose by any other name. Psychological Bulletin, 77, 109-113.
Berg, I. A. (1961). Measuring deviant behavior by means of deviant response sets. In I. A. Berg & B. M. Bass (Eds.), Conformity and deviation (pp. 328-397). New York: Harper and Row.
Block, J. (1967). Remarks on Jackson's "review" of Block's challenge of response sets. Educational and Psychological Measurement, 27, 499-502.
Block, J. (1971). On further conjecture regarding acquiescence. Psychological Bulletin, 76, 205-210.
Block, J. (1972). The shifting definition of acquiescence. Psychological Bulletin, 78, 10-12.
Bloom, B. S. (1978). New learner: Implications for instruction and curriculum. Educational Leadership, 35, 563-576.
Campbell, N. O., & Grissom, S. (1979, April). Influence of item direction on student responses in attitude assessment. Paper presented at the 63rd annual meeting of the American Educational Research Association, San Francisco, CA.
Chang, L. (1995). Connotatively inconsistent test items. Applied Measurement in Education, 8(3), 199-209.
Chang, S. S., & Hunter, J. (1988). Phenomenology and the measurement of shame. Unpublished manuscript.
Coopersmith, S. (1967). The antecedents of self-esteem. San Francisco: Freeman.
Couch, A., & Keniston, K. (1960). Yeasayers and naysayers: Agreeing response set as a personality variable. Journal of Abnormal and Social Psychology, 60, 151-174.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, and Winston.
Cronbach, L. J. (1946). Response sets and test validity. Educational and Psychological Measurement, 6, 475-494.
Cronbach, L. J. (1950). Further evidence on response sets and test design. Educational and Psychological Measurement, 10, 3-31.
Cronbach, L. J. (1960). Essentials of Psychological Testing (2nd ed.). New York: Harper & Row.
Danis, S. G. (1974). The effect of attitude and scale format on polarization in social judgments. Dissertation, University of Georgia.
Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The satisfaction with life scale. Journal of Personality Assessment, 49(1), 71-75.
Dudycha, A. L., & Carpenter, J. B. (1973). Effects of item format on item discrimination and difficulty. Journal of Applied Psychology, 58, 116-121.
Edwards, A. L. (1953). The relationship between the judged desirability of a trait and the probability that the trait will be endorsed. Journal of Applied Psychology, 37, 90-93.
Edwards, A. L. (1955). Social desirability and Q-sorts. Journal of Consulting Psychology, 19, 464.
Edwards, A. L. (1957a). The social desirability variable in personality assessment and research. New York: Dryden.
Edwards, A. L. (1957b). Techniques of attitude scale construction. New York: Appleton-Century-Crofts.
Green, R. F. (1951). Does a selection situation induce testees to bias their answers on interest and temperament tests? Educational and Psychological Measurement, 11, 503-515.
Guilford, J. P. (1936). Psychometric methods. New York: McGraw-Hill.
Harasym, P. H. (1992). Evaluation of negation in stems of multiple-choice items. Evaluation and the Health Professions, 15(2), 198-220.
Hathaway, S. R., & McKinley, J. C. (1967). MMPI Manual (Rev. ed.). New York: Psychological Corporation.
Hunter, J. E. (1973). Methods of reordering the correlation matrix to facilitate visual inspection and preliminary cluster analysis. Journal of Educational Measurement, 10, 51-61.
Hunter, J. E., & Gerbing, D. W. (1982). Unidimensional measurement, second order factor analysis and causal models. In B. M. Staw & L. L. Cummings (Eds.), Research in Organizational Behavior (Vol. 4). Greenwich, CT: JAI Press.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage.
Huttenlocher, J. (1962). Some effects of negative instances on the formation of simple concepts. Psychological Reports, 11, 35-42.
Jackson, D. N. (1967a). Balance scales, item overlap and the stables of Augeas. Educational and Psychological Measurement, 27, 502-507.
Jackson, D. N. (1967b). Block's challenge of response sets. Educational and Psychological Measurement, 27, 207-219.
Jackson, D. N., & Messick, S. (1958). Content and style in personality assessment. Psychological Bulletin, 55, 243-252.
Jackson, D. N., & Messick, S. (1965). The nonvanishing variance component. American Psychologist, 20, 498.
Jackson, D. N., & Paunonen, S. V. (1980). Personality structure and assessment. Annual Review of Psychology, 31, 503-551.
Jacobs, A., & Barron, R. (1968). Falsification of the Guilford-Zimmerman Temperament Survey: II. Making a poor impression. Psychological Reports, 23, 1271-1277.
Jaroslovsky, R. (1988, July/August). What's on your mind, America? Psychology Today, 54-59.
Joreskog, K. G., & Sorbom, D. (1988). LISREL 7: A guide to the program and applications. Chicago: SPSS Inc.
Lemon, N. (1973). Attitudes and their Measurement. New York: John Wiley & Sons.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1-55.
Linn, R. L. (1990). Essentials of student assessment: From accountability to instructional aid. Teachers College Record, 91(3), 422-436.
Marsh, H. (1984). The bias of negatively worded items in rating scales for young children. Journal of Educational Psychology, 76, 420-431.
Marsh, H., Smith, I., Barnes, J., & Butler, S. (1983). Self-concept: Reliability, dimensionality, validity and the measurement of change. Journal of Educational Psychology, 75, 772-790.
Mehrens, W. A., & Lehmann, I. J. (1984). Measurement and Evaluation in Education and Psychology (3rd ed.). Orlando, FL: Holt, Rinehart, & Winston.
Michael, W. B., & Smith, R. A. (1976). The development and preliminary validation of three forms of a self-concept measure emphasizing school-related activities. Educational and Psychological Measurement, 36, 527-535.
Michael, W. B., Denny, B., Ireland-Galman, M., & Michael, J. J. (1987). The factorial validity of a college-level form of an academic self-concept scale. Educational Research Quarterly, 11(1), 34-39.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Ory, J. C. (1982). Item placement and wording effects on overall ratings. Educational and Psychological Measurement, 42, 767-775.
Osgood, C. E. (1952). The nature and measurement of meaning. Psychological Bulletin, 49, 197-237.
Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning. Urbana: University of Illinois Press.
Piers, E. (1969). The Piers-Harris children's self-concept scale. Nashville, TN: Counselor Recordings & Tests, Box 6184, Acklen Station.
Radcliffe, J. A. (1966). A note on questionnaire faking with 16 PFQ and MPI. Australian Journal of Psychology, 18, 154-157.
Ramsay, J. O. (1973). The effect of number of categories in rating scales on precision of estimation of scale values. Psychometrika, 38, 513-529.
Remmers, H. H., & Ewart, E. (1941). Reliability of multiple-choice measuring instruments as a function of the Spearman-Brown prophecy formula, III. Journal of Educational Psychology, 32, 61-66.
Robinson, J. P., & Shaver, P. R. (1973). Measures of social psychological attitudes. Ann Arbor, MI: Survey Research Center, Institute for Social Research.
Rorer, L. G. (1965). The great response-style myth. Psychological Bulletin, 63, 129-156.
Rotter, G. S. (1972). Attitudinal points of agreement and disagreement. Journal of Social Psychology, 86, 211-218.
Rotter, G. S., & Barton, P. (1970). Attitudes of some New Jersey teachers. N.J.E.A. Review, 28-29.
Rotter, J. B. (1966). Generalized expectancies for internal versus external control of reinforcement. Psychological Monographs, 80(1, Whole No. 609).
Samelson, F. (1972). Response style: A psychologist's fallacy. Psychological Bulletin, 78, 13-16.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128-1137.
Schmitt, N., & Stults, D. M. (1985). Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement, 9, 367-373.
Schriesheim, C. A., & Hill, K. D. (1981). Controlling acquiescence response bias by item reversal: The effect on questionnaire validity. Educational and Psychological Measurement, 41, 1101-1114.
Schriesheim, C. A., & Kerr, S. (1974). Psychometric properties of the Ohio State leadership scales. Psychological Bulletin, 81, 756-765.
Scott, W. A. (1968). Attitude measurement. In G. Lindzey (Ed.), The handbook of social psychology (2nd ed., Vol. 2). Reading, MA: Addison-Wesley.
Shaw, M. E., & Wright, J. M. (1967). Scales for the measurement of attitudes. New York: McGraw-Hill.
Simpson, R. D., Rentz, R. R., & Shrum, J. W. (1976). Influence of instrument characteristics on student responses in attitude assessment. Journal of Research in Science Teaching, 13, 275-281.
Spielberger, C. D., Gorsuch, R. L., & Lushene, R. E. (1970). Test manual for the State-Trait Anxiety Inventory. Palo Alto, CA: Consulting Psychologists Press.
Stricker, L. J. (1969). "Test-wiseness" on personality scales. Journal of Applied Psychology Monographs, 53(3, Pt. 2).
Symonds, P. M. (1931). Diagnosing personality and conduct. New York: Appleton-Century.
Thacker, J. W., Fields, M. W., & Tetrick, L. (1989). The factor structure of union commitment: An application of confirmatory factor analysis. Journal of Applied Psychology, 74, 228-232.
Thorne, F. C. (1978). Methodological advances in the validation of inventory items, scales, profiles, and interpretation. Journal of Clinical Psychology, 34(2), 283-301.
Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33, 529-554.
Tittle, C. R., & Hill, R. J. (1967). Attitude measurement and prediction of behavior: An evaluation of conditions and measurement techniques. Sociometry, 30, 199-213.
Towne, D. C. (1967). Influences exerted upon subject responses by the response scale structured elements of attitude scales. Dissertation, Cornell University, Ithaca, NY.
Trott, D. J., & Jackson, D. N. (1967). An experimental analysis of acquiescence. Journal of Experimental Research in Personality, 2, 278-288.
Tryon, R. C. (1939). Cluster analysis. Ann Arbor, MI: Edwards Brothers.
Tryon, R. C., & Bailey, D. E. (1970). Cluster Analysis. New York: McGraw-Hill.
Violato, C., & Marini, A. E. (1989). Effects of stem orientation and completeness of multiple-choice items on item difficulty and discrimination. Educational and Psychological Measurement, 49, 287-295.
Wason, P. C. (1961). Responses to affirmative and negative binary statements. British Journal of Psychology, 52, 133-142.
Wesman, A. G. (1952). Faking personality test scores in a simulated employment situation. Journal of Applied Psychology, 36, 112-113.
Wiggins, J. S. (1966). Social desirability estimation and "faking good" well. Educational and Psychological Measurement, 26, 329-341.
Wiggins, J. S. (1973). Personality and prediction. Reading, MA: Addison-Wesley.
Winkler, J. D., Kanouse, D. E., & Ware, J. E., Jr. (1981, August). Controlling for acquiescence response set in scale development. Paper presented at the 90th annual meeting of the American Psychological Association, Los Angeles, CA.
Zern, D. (1967). Effects of variations in question-phrasing on true-false answers by grade-school children. Psychological Reports, 20, 527-533.