INVESTIGATING THE EFFECTS OF ITEM WORDING ON RATING RESPONSES

By

Annie Woo

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Counseling, Educational Psychology and Special Education

1999

ABSTRACT

INVESTIGATING THE EFFECTS OF ITEM WORDING ON RATING RESPONSES

By Annie Woo

The purpose of the study was to investigate how item wording affected rating responses. Semantically negative and positive items measuring the same construct, in the form of a Likert-type scale, were tested on a sample of students enrolled in middle schools in Taipei, the capital of Taiwan. The psychometric properties of each item (mean, standard deviation, skewness, and kurtosis) were examined as a function of four modes of item wording. The four modes were: Mode 1 (regular), "I like myself"; Mode 2 (negated), "I do not like myself"; Mode 3 (polar opposite), "I dislike myself"; and Mode 4 (negated polar opposite), "I do not dislike myself."

A hierarchical measurement model of the relationship between the modes of item wording and the responses was constructed. A hierarchical confirmatory analysis estimated the correlations between the item scores and the four modes of item wording. The correlations between four versions of a shame scale and a measure of anxiety (DOSC Anxiety Factor) and of life satisfaction (Satisfaction with Life Scale) were also computed. Pearson correlation coefficients were obtained for each of these subscales (Mode 1 to Mode 4), the DOSC Anxiety Factor, the Satisfaction with Life Scale, and gender. The Pearson correlation coefficients ranged from -0.133 to 0.898. Gender had low correlations with the four modes, the DOSC Anxiety Factor, and the Satisfaction with Life Scale, ranging from -0.133 to 0.128, suggesting little relationship between gender and these scales.

The MANOVA results showed that responses to the modes of item wording differed significantly between males and females. The F-ratios for Mode 1, Mode 2, and Mode 3 were all significant at the 0.05 level; Mode 4 was not significant at the 0.05 level. These results were consistent with those of the correlational analyses. It appears that Mode 4, with its double-negative semantics, introduced some ambiguity into the items.

To determine whether the four modes of semantics measured the same construct, five models were tested in a confirmatory factor analysis. The two-factor model (Modes 1, 2, and 3 vs. Mode 4) fit the data statistically and showed an overwhelming superiority over the other models. These results gave strong indications of the inequivalence between double negatives (Mode 4) and the rest of the items (Modes 1, 2, and 3).
Though it may be useful to include some negative items to reduce a response bias, the findings from the present study suggest that special caution should be exercised in the use of double-negative item phrasing. Despite the conventional wisdom so often found in measurement textbooks, recent findings by researchers in the area of item phrasing have suggested that negatively phrased items, especially double negatives, may reduce the validity of a questionnaire. The present study clearly corroborates this position.

Copyright by
ANNIE WOO
1999

Dedicated to Paul and Nathan

ACKNOWLEDGMENTS

This dissertation has been completed with the help of many people. First of all, I wish to express my gratitude to the members of my Dissertation Committee. I wish to thank Dr. William Mehrens, who served as the chairman of my committee, for his concern and guidance in the completion of this study. His friendship and willingness to offer suggestions are most appreciated. I also wish to thank Dr. Betsy Becker for her statistical expertise and suggestions in the data analysis, Dr. Teresa Tatto for her assistance during all stages of my study, and Dr. Margot Kurtz for her constructive suggestions as a member of the committee.

I am ever grateful to Margaret Gunn for her assistance in the editing and proofreading of this dissertation. I wish to express my appreciation also to my best friend, Christine Lau, for her encouragement and assistance at some of the crucial stages of my study. Without her help, this study would have taken much longer to complete.

I especially appreciate the support and cooperation of all the students, teachers, and school administrators who participated in this study. I will always remember their contributions of time and effort.

Finally and most of all, I wish to thank my husband, Paul, for his love, patience, understanding, and support during the completion of my dissertation. I appreciate the many sacrifices he made to make it all possible.

TABLE OF CONTENTS

LIST OF TABLES .......... ix
LIST OF FIGURES .......... xi

CHAPTER
I. STATEMENT OF THE PROBLEM
   Introduction .......... 1
   Purpose of the Study .......... 4
   Significance of the Study .......... 4
   Research Questions .......... 7
   Overview .......... 8

II. REVIEW OF LITERATURE
   Introduction .......... 9
   Response Style .......... 9
   Item Wording .......... 12
      Counterbalance of Positive and Negative Items .......... 12
      Item Reversal .......... 16
   Summary .......... 24

III. RESEARCH DESIGN AND PROCEDURE
   Introduction .......... 29
   Development of the Test Instrument .......... 29
      General Shame Scale .......... 29
      Dimensions of Self-Concept (DOSC) .......... 31
      Satisfaction with Life Scale .......... 32
      Questionnaire Development .......... 33
   Selection of the Sample .......... 37
   Procedures of Test Administration .......... 37
   Quality Control Screening .......... 38
   Data Analyses .......... 38
      Confirmatory Factor Analysis .......... 40
      Exploratory Factor Analysis .......... 41
      Parallelism .......... 42
   Summary .......... 42

IV. ANALYSIS AND INTERPRETATION OF THE DATA
   Introduction .......... 44
   Characteristics of the Sample .......... 44
   Analyses of the Questionnaire Data .......... 46
   Answers to the Research Questions .......... 48
      Research Question 1 .......... 48
         Histogram Analyses .......... 48
         Reliability Analyses .......... 52
         Correlational Analyses .......... 53
      Research Question 2 .......... 57
         MANOVA Analyses .......... 57
         Exploratory Analyses .......... 60
         Confirmatory Analyses .......... 63
   Summary .......... 66

V. SUMMARY, CONCLUSIONS, IMPLICATIONS, AND RECOMMENDATIONS
   Summary of the Purpose and Procedures of the Study .......... 67
   Discussion and Conclusion .......... 68
   Implications .......... 70
   Cross Validation Using a New Sample .......... 74
   Cultural Specificity of the Questionnaire .......... 76
   Issues Concerning the Use of Tests .......... 78
   Limitations .......... 79
   Recommendations .......... 80

APPENDICES
   A. Four Modes of Item Wording .......... 83
   B. Sample Questionnaire in English .......... 86
   C. Sample Questionnaire in Chinese .......... 94
   D. Descriptive Statistics of the Items .......... 102
   E. Descriptive Statistics of Four Modes of Item Wording .......... 106
   F. Item-Total Statistics .......... 109

BIBLIOGRAPHY .......... 112

LIST OF TABLES

Table 1. Summary of Findings Related to the Study .......... 23-24
Table 2. The General Shame Scale (Chang & Hunter, 1988) .......... 35
Table 3. Dimensions of Self-Concept (DOSC) (Anxiety Factor) (Michael & Smith, 1976) .......... 36
Table 4. Satisfaction with Life Scale (Diener et al., 1985) .......... 36
Table 5. Gender of the Participants .......... 45
Table 6. Age of the Participants .......... 45
Table 7. Grade Level of the Participants .......... 45
Table 8. Participants' Age and Gender by Grade .......... 46
Table 9. Coding Format of the Subscales in the Questionnaire .......... 47
Table 10. Tests of Normality .......... 49
Table 11. Results of Reliability Analysis on Four Modes of the General Shame Scale (88 Items) in the Questionnaire .......... 52
Table 12. Mean, Standard Deviation, and Cronbach's Alpha Coefficients for the Subscales .......... 53
Table 13. Correlations between the 4 Modes, DOSC-Anxiety Factor, Satisfaction with Life Scale, and Gender .......... 54
Table 14. Items with Corrected Item-Total Correlation Less Than 0.35 .......... 56
Table 15. Descriptive Statistics of Four Semantic Modes by Gender .......... 58
Table 16. MANOVA Tables by Gender for the Four Semantic Mode Subtotal Scores .......... 58-59
Table 17. Factor Structure of the Questionnaire (19 Factors) .......... 62
Table 18. Factor Structure of the Questionnaire (6 Factors) .......... 63
Table 19. Goodness-of-Fit Indices of Five Models .......... 66
Table 20. Four Modes of Item Wording .......... 84-85
Table 21. Descriptive Statistics of the Items .......... 103-105
Table 22. Descriptive Statistics of Four Modes of Item Wording .......... 107-108
Table 23. Item-Total Statistics .......... 110-111
LIST OF FIGURES

Figure 1. Histogram of Totals of 107 Items on the Questionnaire .......... 49
Figure 2. Histogram of Mode 1 Scores on the Questionnaire .......... 50
Figure 3. Histogram of Mode 2 Scores on the Questionnaire .......... 50
Figure 4. Histogram of Mode 3 Scores on the Questionnaire .......... 51
Figure 5. Histogram of Mode 4 Scores on the Questionnaire .......... 51
Figure 6. Scree Plot from the Exploratory Factor Analysis (19 Factors) .......... 61

CHAPTER I

STATEMENT OF THE PROBLEM

Introduction

Conventional wisdom has suggested that psychological measures should be constructed to contain an even balance of positively and negatively worded items, so as to counteract response biases such as agreement response tendencies. The practice of using a balance of positively and negatively phrased items in an affective instrument stems from recommendations found in the literature. Most textbooks and publications listing recommendations concerning attitude scale construction suggest that questionnaire items should include both positively and negatively worded item stems (Anastasi, 1982; Wiggins, 1973). Nunnally (1978) specifically advocated the reduction of response styles by having an item pool "...divided evenly between positive and negative statements" (p. 605), and asserted that "stylistic variance ...can be mostly eliminated by ensuring that an instrument is constructed so that there is a balance of items keyed 'agree' and 'disagree'" (p. 669). Other psychometricians have made similar statements. The general consensus in the literature has been that measures should have both positive and negative items (Scott, 1968; Anastasi, 1982).

The early data reported in support of response styles are not without challenge. Samelson (1972) stated that the researchers (Bentler, Jackson, & Messick, 1971, 1972; Couch & Keniston, 1960; Jackson, 1967a, 1967b; Jackson & Messick, 1958, 1965) failed to clarify the conceptual meaning of response styles and used an incorrect logical model in interpretation. The root of the problem seemed to be that all discrepant responses were defined as acquiescence, a mistake of the sort called "the psychologist's fallacy" (Samelson, 1972), which refers to confusing the researcher's own standpoint with that of the mental fact about which he or she is reporting. Block (1967, 1971, 1972) argued that acceptance (the tendency to ascribe characteristics to oneself, regardless of the direction of item keying) was not likely to be of appreciable import for understanding the nature of responses to structured personality inventories.
Rorer (1965) stated that response styles must be distinguished from response sets. He defined a response set as "the criteria according to which a respondent evaluates item content when selecting his answer," whereas a response style was defined as "a way or a manner of responding, such as the tendency to select some particular response option independently of the item content" (Rorer, 1965, p. 151). Rorer also felt it necessary to distinguish between achievement examinations and personality, attitude, and interest inventories when assessing the extent to which styles affect answers to items. On examinations, but not on inventories, there are right answers, and there are items whose answers the respondent must guess. Inferences concerning an individual's response style might be made on the basis of his or her responses to a number of examinations by comparing the proportion of answers in any given category with the proportion keyed for that category, and by considering the proportion of wrong answers in each of the response categories. Rorer concluded that response styles were of no more than trivial importance in determining rating responses.

Despite the challenges, the practice of using positive and negative item phrasing continues to receive widespread endorsement. Psychometricians have suggested the use of an equal number of positively and negatively worded items as a way of reducing the possibility of a response style influencing the responses to affective instruments (Anderson, 1981; Mehrens & Lehmann, 1984). The commonly referenced, and often followed, recommendation to use an equal number of positive and negative items rests on two assumptions. The first is that the items, whether positively or negatively phrased, measure the same construct. The second is that balancing positive and negative item phrasing yields a more valid index. Although these assumptions are widely accepted, there is little research on their tenability. In fact, the assumption that negative and positive items measure the same construct is so prevalent among test developers that it seems to be accepted without question.

The problem arises from test constructors' common practice of using negative items based on unverified assumptions. Generally, in verbal self-report measures of latent traits, it is assumed that, given standard testing conditions, an examinee's responses are determined by item content, examinee characteristics, and, to some extent, instrument artifacts. Most Likert-type scales include a balance of semantically negative and positive valence items, with the intent of ridding the instrument of the effects of certain response styles. It is further assumed that both negative and positive items measure the same trait. However, the empirical verification of this assumption seems to have received little attention.

Over the past several years, numerous questions have arisen pertaining to the impact of item wording on rating responses. Specifically, which item wording format is to be preferred? Is one format superior to the other, and under what constraints? How do different modes of item wording affect rating responses? Are we measuring the same construct if we use positive and negative items? If subjects respond differently to the same item stem when the item wording format varies, could the items be regarded as nonequivalent in the same sense as content-parallel achievement items that vary in difficulty level?
More research on the effects of item wording on rating responses is needed.

Purpose of the Study

The purpose of the study was to investigate how item wording affected rating responses. Four modes of item wording were considered: Mode 1 (regular), "I like myself"; Mode 2 (negated), "I do not like myself"; Mode 3 (polar opposite), "I dislike myself"; and Mode 4 (negated polar opposite), "I do not dislike myself." Specifically, the effects of the mode of item wording on scale and item score means, distributions, and reliabilities were examined. The correlations between the modes of item wording and student responses were also studied.

Significance of the Study

Anderson (1981) has argued that affective characteristics facilitate desired cognitive goals of the schooling process and are, in themselves, desired goals of the schooling process. Similarly, Bloom (1978) stated that schools should produce "independent learners" who are able to engage in higher-level thinking, develop confidence in their abilities, and possess a degree of social responsibility. In education, increasing emphasis is being placed on the need for valid and reliable means of assessing affective outcomes.

Most affective outcomes are currently assessed through attitude surveys. Surveys have been used widely in the measurement of attitudes and opinions. They are also a popular method for evaluating student achievement in performance-based or constructed-response assessment. Furthermore, surveys are the predominant method for eliciting judgments from students on course and instructor effectiveness. Given that surveys are so widely used in the social sciences, both as research tools and in practical applications, item development becomes an important consideration in their construction.

The literature dealing with item construction is voluminous. Hundreds of articles have been published concerning issues such as the use of ratings, their reliability and validity, and potential biasing factors. Because of conflicting findings in this literature, however, it is difficult for reviewers to identify general trends. The possible effect of item wording on overall ratings is particularly relevant to many of today's available rating scales. Yet the current body of literature leaves numerous questions unanswered. What has yet to be determined is the possible effect of positively and negatively worded items on raters' evaluations. Do negatively worded items "encourage" a more critical evaluation than positively worded items? Existing studies on response schemes have not directly addressed these questions or the issues of general validity they raise. Further research is needed to determine whether one format is more susceptible than another to rating errors of leniency or other sources of invalidity.

In survey and evaluation research, much emphasis has been placed on the development of the item stems of the questionnaire being used. Negatively worded items may highlight the negative aspects or faults of the object or person, or may unconsciously suggest to the respondent particular problem areas anticipated by the researcher. If so, rating scale evaluations may be affected as much by the wording of the items as by the quality of the object or person being evaluated.
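To make the four wording modes defined above concrete, the following sketch generates all four phrasings of a simple item stem. This is a minimal illustration in Python; the function name make_modes and its inputs are assumptions introduced here for exposition, not materials from the study (the study's actual items appear in its Appendix A).

```python
# Illustrative sketch: constructing the four semantic modes of a rating item.
# The helper name and inputs are hypothetical; "do not" assumes a
# first-person stem like the study's examples.

def make_modes(subject: str, verb: str, antonym: str, obj: str) -> dict:
    """Return the four semantic modes of a simple item stem."""
    return {
        "mode_1_regular":                f"{subject} {verb} {obj}",
        "mode_2_negated":                f"{subject} do not {verb} {obj}",
        "mode_3_polar_opposite":         f"{subject} {antonym} {obj}",
        "mode_4_negated_polar_opposite": f"{subject} do not {antonym} {obj}",
    }

if __name__ == "__main__":
    for name, stem in make_modes("I", "like", "dislike", "myself").items():
        print(f"{name}: {stem}")
    # mode_1_regular: I like myself
    # mode_2_negated: I do not like myself
    # mode_3_polar_opposite: I dislike myself
    # mode_4_negated_polar_opposite: I do not dislike myself
```

Note that Modes 2 and 3 each apply one reversal (grammatical negation or antonym substitution), while Mode 4 applies both, which is what makes it a double negative.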
The possible effect of different modes of item wording on overall ratings is particularly relevant to many of today's available student rating instruments. It has yet to be determined whether different modes of item wording influence rating responses. Both positive and negative items are commonly used in educational and psychological measurement. Over the past several years, numerous questions have arisen pertaining to item wording. Despite the large amount of research on rating methodology, however, there have been relatively few conclusions concerning this measurement issue. In addition, the factors that bring some degree of control over the distributional parameters of rating scales have received relatively little attention, and there is little research on the factors that influence the meaning that subjects apply to response options when responding to rating prompts. Investigating this problem will bring some understanding of how the different modes of item wording influence rating responses.

The underlying premise of this research is that item wording influences scale and item score means, distributions, and reliabilities. This study makes a start toward illustrating this premise by analyzing the results of applying four modes of item wording to a survey. The study will also provide valuable information on an essential dimension of instrument development, namely, knowledge and understanding of the effects of item wording on rating responses. The resulting information should be valuable for educators and researchers whose focus is developing effective rating scales.

This research study intends to enhance our understanding of measurement issues in item development in two ways. First, the study will lead to general conclusions on the overall relationship between item wording and rating responses. Second, the study will provide some insight into the equivalence of items across different modes of item wording. The results of the study should prove useful to administrators and faculty members who use surveys to assess affective outcomes, and to educational researchers who are looking for state-of-the-art research on survey item construction.

Research Questions

The present study was designed to provide answers to the following questions:

1. What is the influence of item wording on scale and item score means, distributions, reliabilities, and correlations between scales?

2. Are items equivalent among four different modes of item wording? The four modes are: Mode 1 (regular), "I like myself"; Mode 2 (negated), "I do not like myself"; Mode 3 (polar opposite), "I dislike myself"; and Mode 4 (negated polar opposite), "I do not dislike myself."

Overview

This chapter has presented the problem, purpose, significance, and research questions of the study. In Chapter II, a review of the literature related to the study is presented. Chapter III describes the procedures and design of the study. The analysis and interpretation of the data are presented in Chapter IV. In Chapter V, the conclusions and implications of the study are presented.

CHAPTER II

REVIEW OF LITERATURE

Introduction

The purpose of this study was to investigate the effects of item wording on responses to a rating scale. In developing a new survey or utilizing existing surveys, the researcher needs to examine how the wording affects subjects' responses. A number of studies have focused on various aspects of item characteristics and their possible influence on reliability and variance.
Numerous researchers have also investigated the impact of item wording on rating responses. Research studies were reviewed to determine current professional opinion regarding the impact of item wording on rating scale responses. This chapter is divided into two major sections. The first section deals with response style; the second focuses on item wording. Some overall conclusions follow.

Response Style

The development of scales to assess attitude poses complex methodological problems for the test constructor. Self-report, paper-and-pencil verbal measures are most commonly used in behavioral research and assessment. Among the various types of verbal self-report measures, Likert-type scales are the most popular, mainly because the Likert method is conceptually simple and practically straightforward. However, one of the major sources of criticism of self-report data centers on the susceptibility of self-report measures to various response sets that pose a continuing threat to construct validity.

A good deal of research in this area was conducted in the 1950s and 1960s. Cronbach (1950) examined the effects of selected response sets on the validity of cognitive instruments, and some corrective procedures were suggested. He also identified acquiescence as a response tendency that favors affirmative responses over negative responses. Couch and Keniston (1960) called this tendency "yea-or-nay-saying," wherein respondents consistently select in one direction, either positive or negative. The hypothesis was that some individuals have a general disposition on the positive/negative continuum regardless of the content of the items. Various types of response sets were identified and their effects investigated. Jacobs and Barron (1968), Green (1951), Radcliffe (1966), Stricker (1969), Wesman (1952), and Wiggins (1973) investigated the influence of social desirability in personality measurement. Couch and Keniston (1960) examined the impact of an acquiescence response set. Berg (1961) identified the deviant response set and hypothesized that it was an important dimension of personality. In fact, the literature on response styles accumulated to the point that by 1970 there had been several major reviews of the literature and even reviews of the reviews (Nunnally, 1978). Bentler, Jackson, and Messick (1972), Jackson (1967a, 1967b), Jackson and Paunonen (1980), Rorer (1965), and Samelson (1972) are some of the researchers who expressed their views regarding response sets.
Most of this research has utilized measures of the California F -scale (Adorno, Frenkel-Brunswik, Levinson, and Sanford, 1950), the Minnesota Multiphasic Personality Inventory (MMPI) (Hathaway and McKinley, 1967), and the Personality Research Form (PRF) (Jackson, 1967a). Although this research has been valuable in questioning the interpretation of subject responses to these measures, it has not resolved the issue of response style relevance. The argument has been heated over whether or not response styles exist, and, if so, whether they impact upon research results in a meaningful way. Rorer (1965), for example, concluded that “the inference that response styles are an important variable in personality inventories is not warranted on the basis of the evidence now available” (p.150). Jackson and Messick (1965) responded to Rorer with extensive criticism of both his data and his conclusions. The results of their study suggested that the inclusion of negatively worded items can result in less accurate responses and therefore impair the validity of obtained results. Thus, although the inclusion of negatively stated items may theoretically control or offset agreement response tendencies, their actual effect is to reduce response validity. This situation suggests that the current recommendation concerning the desirability of including both positive and negative items on a questionnaire may be premature and apparently warrants much further investigation. Item Wording Surveys are widely used in education and psychology. Because of both their widespread use and their importance, the construction of survey items has been heavily researched; yet, concerns remain about what factors influence rating responses. One of the often expressed concerns is how the item wording affects students’ responses, and numerous research studies have investigated the extent to which item wording affects the rating response. By no means, though, is there total agreement on the extent of the relationship. In fact, while some investigators have found a strong relationship between item wording and rating responses, others have found no relationship at all. CounteLbaiance of positive and negative items Psychometricians recommended counterbalancing the questions which were asked, so that a positive response to one question and a negative response to another both contributed towards increasing the score on the measure as a whole (Lemon, 1973; Likert, 1932; Edwards, 1957a). This consensus has found its way to many specialty areas in educational and psychological testing. Likert (1932) suggested that those “two kinds of 12 statements ought to be distributed throughout the attitude test in a chance or haphazard manner.” (p. 91) Most Likert-type scales include a balance of semantically negative and positive valence items with the intent to rid the instrument of the effects of certain response styles. It is further assumed that both negative and positive items measure the same trait. For example, Schriesheim and Kerr (1974) indicated that subject agreement response tendencies can usually be controlled by having an adequate number of negatively-worded items, and, based upon their investigation, the major existing instruments measuring perceived leader behavior were inadequate in terms of not having sufficient negative items. The authors concluded in their review that revised scales were needed, and that these measures should have a larger number of negatively worded items to offset acquiescence response biases. 
A statement of caution concerning the use of negatively worded items is appropriate. Because their use may introduce covariance, some researchers have begun to question the utility of negatively worded items (e.g., Thacker, Fields, & Tetrick, 1989). Nonetheless, Schmitt and Stults (1985) suggested that covariance introduced by the direction of item wording does not necessarily result in a methodological confound that distorts conceptual interpretation. If negatively worded items areito be used, however, it would be wise to ensure that their inclusion does not present a methodological confound. It is preferable that constructs are not exclusively defined by negatively worded items during scale development. Instead, scales should have equal numbers of positively and negatively worded items. 13 A number of studies have focused on various aspects of scale characteristics and their possible influence on reliability, variance, and correlations. In Simpson, Rentz, and Shrum’s (1976) study, questions concerning six socially-significant topics were designed. Items representing each concept were written with two criteria: strong wording versus mild wording, and positive stance versus negative stance. Each item was consequently written in four forms so as to fit in one of each possible category: “mild-positive,” “mild- 99 “ negative, strong-positive,” and “strong-negative.” The authors found that wording influenced responses more than the content of the items - an outcome suggesting the influence of an “agree-disagree” response set. They also found that the positive versus negative item structure influenced students’ responses more than mild versus strong wording. Moreover, the authors found that the students tended to agree more with mildly worded positive items than strongly stated positive items. When reacting to the same concepts worded in the negative, students disagreed more with the strongly worded counterparts of each pair. Within some of the topical areas, students' responses were influenced more by an “agree-disagree” set than by the intended meaning of the items. Items written at a higher “emotional level” tended to elicit stronger responses when they were stated in 3 disagree format than when written in an agree format. The extremity of attitude conveyed by the wording of the item also affects the mean response and variability of response. J aroslovsky (1988) examined the effects of wording on poll results. In his study, the respondents provided different answers when the same question was asked in two different ways: “Do you think the United States should allow public speeches against democracy?” and “Do you think the United States 14 should forbid public speech against democracy?” The author noted that wording and context are two of the principal and closely related sources that public opinion experts refer to as “response biases” in polls. He found that even a small change in how a question was asked could trigger connotations or interpretations in the respondents’ minds that could have a major effect on how the question was answered. Moreover, he found that answers to even identical questions could vary from poll to poll, depending upon how the question is juxtaposed with others in the same survey. The effect of wording on polls may not be obvious to many people. However, words convey tones that can have a substantial effect on the answers. In 1940, researcher Donald Rugg had pollster Elmo Roper ask similar questions of two separate national samples. 
The researchers found that support for free speech was 21 percentage points larger among those asked whether speeches “should be forbidden” than those who were asked whether speeches “should not be allowed.” The experiment was replicated 35 years later; researchers asked the same set of questions. The results showed a substantial increase in respondents’ willingness to tolerate free speech. People remained more willing to “not allow” speeches against democracy than to “forbid” them. Winkler, Kanouse, and Ware (1981) examined a technique of control for acquiescence-response sets. Logical or polar opposites were used to reverse regular items. Each concept was defined by a set of matched, contradictory statements. No adverse effects of using contradictory statements were found. The authors recommended using a balance of regular and polar items for controlling acquiescence effects. 15 Item reversal Researchers (Anastasi, 1982; Nunnally, 1978) suggested using an equal number of positive and negative valence statements in scales to minimize the influence of the response sets triggered by item content, the tendency to agree, and the tendency to mark in the left or right columns. However, it was difficult to determine whether the meaning of an item had been virtually reversed. Rorer (1965) investigated this aspect of item wording and concluded that many reversal pairs of items did not turn out to be reversals. Inspection of the items indicated that in many instances there was nothing at all inconsistent or contradictory about rejecting both the original and the reversed item. He indicated that while more extreme reversals result in lower correlations than less extreme reversals, the more extreme reversals are simply permitting a greater number of consistent rejections of both forms of the item. Many studies have been conducted on the impact of reverse-scored items on survey results. In general, measurement specialists recommend a mixture of both regular and reverse-scored items in order to guard against various response biases such as acquiescence and agreement response tendency (Anastasi, 1982; Nunnally, 1978). However, there is agreement, based on intuition, that negative statements are more difficult to understand, and there is a study that suggests that negatively phrased items are less valid. Schriesheirn and Hill (1981) used undergraduates to investigate the effects of positive and negative item phrasing on the validity of the responses to three forms of the same questionnaire. The authors studied the effect of item wording on questionnaire reliability and validity with data from 280 undergraduates who read a scenario describing 16 a hypothetical leader’s behavior, and then completed one of four different questionnaires to describe that leader. The authors examined the effects of item wording on the accuracy of responses to Form XII of the Leadership Behavior Description Questionnaire (LBDQ- XII) (Stogdill, 1963). Three forms of the questionnaires (all regular items, all negated items, and a mix of the two forms of items) were randomly distributed. The responses were compared to the LBDQ-XII descriptions of a fictitious supervisor to the known levels of behavior actually shown to each subject. The fictitious supervisor was portrayed on a one-page script given to each subject. The authors found that polar opposite and negated polar opposite items had significantly lower coefficient alpha internal- consistency reliabilities as compared with those for the regular and negated regular items. 
Chang (1995) examined the psychometric equivalence of negative and positive items. Some researchers call negative items "semantically negative," meaning they have a negative meaning. Such a definition lacks accuracy. "Semantic" refers to the formal meaning, or the nature of a statement free from value judgment or sentiment; the value judgment or feeling of a word or statement is represented by its connotation. For example, the words frugal and miserly have the same formal meaning, whereas one has a positive or neutral connotation and the other has a negative connotation. Similarly, a test item can be connoted as positive or negative, whereas its semantics are free from positive or negative sentiment. Chang suggested that defining items as semantically negative is incorrect. Items can be defined not in terms of their manifest syntax or semantics but in terms of their underlying connotation. The opposite connotations of items represent two directions of a latent construct continuum of which the items, or their semantics, are indicators. Chang defined a test item as connotatively consistent or connotatively inconsistent when the connotation of the item either agrees with (conforms to) or disagrees with (contradicts) that shared by the majority of the items making up a test or a subscale of a test. He examined the equivalence between connotatively consistent and connotatively inconsistent items on a 4-point Likert-type scale using confirmatory factor analysis. His study concluded that these items measured correlated but distinct traits. He also suggested that connotatively inconsistent items should not be used: items on a test or questionnaire should be connotatively consistent with the construct being measured.

Ahlawat (1985) concluded that semantically negative and positive item contents do not measure essentially the same construct. His study was based on a sample of Jordanian middle school students using an Arabic translation of the State-Trait Anxiety Inventory (Spielberger et al., 1970). Four sets of items were constructed with distinct modes of semantic presentation. He suggested that double-negative items create cognitive complexity for students, which may end in confusion. On the basis of correlational and variance-related analyses, he concluded that semantically negative and positive item contents do not measure essentially the same construct. Furthermore, the author suggested that more cognitive steps are required to decode or unravel a negatively worded statement, making the task more difficult than responding to a positively worded statement. The findings of this study question the common practice of using both negative and positive valence items in a scale and making decisions on the basis of students' total scores. This problem deserves closer inspection through more specifically designed studies.

Benson and Hocevar (1985) examined the effect of item phrasing on the validity of Likert-type attitude scales. Three content-parallel forms of the same questionnaire were developed to assess student attitudes toward integration. The forms differed only in terms of item phrasing. The first contained 15 positively phrased items. The second form contained the same 15 items phrased negatively.
The third form contained the same 15 items, eight phrased positively and seven phrased negatively. The words "was not" were used to create each negative statement from its positive counterpart. The study reported strong evidence that the insertion of the word "not" has a profound influence on student responses. Items that induced a favorable response on the positive form induced a less favorable response on the negative form. Respondents were less likely to indicate agreement by disagreeing with a negatively phrased item than by agreeing with a positively phrased item. Moreover, items that induced an unfavorable response on the positive form were less likely to induce an unfavorable response on the negative form. The analyses provided evidence that changing positive statements into negative statements may affect the psychometric characteristics of an item, though they did not provide conclusive evidence that positive and negative items measure the same construct in different ways. The results indicated that subjects had difficulty expressing agreement by disagreeing with the negated items. To examine whether a different construct was being measured, two models were contrasted: one composed of two factors whose correlation was fixed (the undifferentiated model), and a second in which the correlation between the two factors was estimated (the differentiated model). The authors found that the differentiated two-factor model provided a better fit to the data than the undifferentiated two-factor model. It was suggested that positive-to-negative transformations change not only an item's psychometric characteristics but also the construct that the item is intended to measure.

Campbell and Grissom (1979) examined the effect of item phrasing on attitude scale items. Two forms were developed, the first containing all regularly scored items and the second consisting of items designed to be their logical opposites (polar). The results of factor analyses suggested that the two formats measured different constructs. The authors also indicated that negated or polar attitude scale items were not equivalent to reversals of the regular items.

Schmitt and Stults (1985) suspected that a small number of respondents who were careless in their responses might be responsible for the appearance of negative factors composed only of reverse-scored items. The objective of their study was to show how a "negative factor" can be produced by a relatively small number of careless respondents. A frequently occurring phenomenon in the analysis of personality or attitude scale items is that all or nearly all negatively keyed questionnaire items define a single factor. Although substantive interpretations of these negative factors are usually attempted, this study demonstrated that the negative factor could be produced by a relatively small portion of respondents who failed to attend to the negative-positive wording of the items and did not notice that some items were opposite in meaning to the majority. In a series of simulations, the proportion of "careless" respondents and the proportion of negatively keyed items were varied for data generated from three different correlation matrices reflecting different levels of item intercorrelation. The results indicated that when only a small portion of the respondents were careless (ten percent), a clearly definable negative factor was generated. The authors cautioned about the use of item reversal and the interpretation of reverse-scored items.
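The mechanism Schmitt and Stults describe is easy to reproduce. The sketch below is an illustrative NumPy simulation under assumed parameters, not their actual design: all items measure one trait, ten percent of respondents ignore the reverse wording, and the reverse-scored items end up sharing surplus covariance with one another, which a factor analysis would pick up as a "negative factor."

```python
# Illustrative simulation of the "careless respondent" effect: 10% of
# respondents answer negatively keyed items as if they were positive,
# and the reverse-scored items develop surplus within-set covariance.
import numpy as np

rng = np.random.default_rng(0)
n, k_pos, k_neg, p_careless = 2000, 6, 4, 0.10

trait = rng.normal(size=(n, 1))
careless = rng.random(n) < p_careless

# Positive items: response = trait + noise.
pos = trait + rng.normal(scale=0.8, size=(n, k_pos))

# Negative items: attentive respondents reverse direction (-trait);
# careless respondents answer as if the items were positively worded.
direction = np.where(careless, 1.0, -1.0)[:, None]
neg_raw = direction * trait + rng.normal(scale=0.8, size=(n, k_neg))
neg = -neg_raw  # reverse-score (a sign flip leaves correlations intact)

r = np.corrcoef(np.hstack([pos, neg]), rowvar=False)

def mean_offdiag(block):
    """Mean correlation in a block, excluding the diagonal if square."""
    if block.shape[0] == block.shape[1]:
        return block[~np.eye(block.shape[0], dtype=bool)].mean()
    return block.mean()

print("mean r within positive items:", round(mean_offdiag(r[:k_pos, :k_pos]), 3))
print("mean r within reversed items:", round(mean_offdiag(r[k_pos:, k_pos:]), 3))
print("mean r across the two sets: ", round(mean_offdiag(r[:k_pos, k_pos:]), 3))
```

With even a small careless fraction, the cross-set correlations fall below the within-set correlations, so the reverse-scored items separate into their own factor despite the data being generated from a single trait.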
In summary, it was concluded that scales with negatively keyed items frequently lead to the identification of a factor defined wholly or mostly by those negatively keyed items. Other literature cited above indicates that this finding is relatively widespread, occurring in a variety of research areas. Marsh (1984), who found a negative item factor in a self-concept measure for elementary school children, reported similar results. Other researchers have also reported negative factors (Campbell & Grissom, 1979; Simpson et al., 1976), adding support to the notion that negative phrasing may actually change the construct that an item is intended to measure.

Harasym (1992) reported, from a study with approximately 200 first-year nursing students, evidence that the use of negation (e.g., not, except) should be limited in the stems of multiple-choice test items, and that a single-response, negatively worded item should be converted to a multiple-response, positively worded item.

Several other researchers (Andrich, 1983; Campbell & Grissom, 1979; Simpson, Rentz, & Shrum, 1976) have investigated whether phrasing can influence overall attitude levels on different attitudinal questionnaires. These researchers all concluded that item phrasing makes a difference. However, their results cannot easily be corroborated with one another. At issue is that the word "not" was used in Andrich's study (1983) to create parallel negative statements, whereas the other researchers created negative statements by item reversal (Campbell & Grissom, 1979; Simpson, Rentz, & Shrum, 1976). Rorer (1965) suggested that this latter procedure often leads to negative statements that reflect different content or ideas; consequently, such statements are not direct opposites of the original positive statements. It is because of this problem that many affective scales use the word "not" to create negative statements (Coopersmith, 1967; Marsh, Smith, Barnes, & Butler, 1983; Piers, 1969).

The findings that relate to the present study are summarized in Table 1. For each study, the table shows the investigator, year of publication, and a summary of findings.

Table 1. Summary of Findings Related to the Study

1. Likert (1932); Edwards (1957); Lemon (1973): Recommended counterbalancing survey items so that a positive response to one question and a negative response to another both contribute toward the score on the measure.
2. Cronbach (1946, 1950): There is a tendency for people to favor affirmative responses over negative responses.
3. Couch & Keniston (1960): Labeled the tendency of respondents to consistently select in one direction, either positive or negative, "yea-or-nay-saying."
4. Rorer (1965): Many reversed pairs of items do not turn out to be reversals.
5. Simpson, Rentz, & Shrum (1976): Wording influenced responses more than the content of the items.
6. Campbell & Grissom (1979): Negated or polar attitude scale items are not equivalent to reversals of regular items.
7. Schriesheim & Hill (1981): The regular wording format yielded more accurate responses than the negated or mixed formats; the lowest respondent accuracy was found in the mixed format.
8. Winkler, Kanouse, & Ware (1981): No adverse effects of using contradictory statements were found; a balance of regular and polar items was recommended for controlling acquiescence effects.
9. Ahlawat (1985): Semantically negative and positive item contents do not measure essentially the same construct.
10. Benson & Hocevar (1985): A detrimental effect would occur if the number of regular and reverse-scored statements were balanced.
11. Schmitt & Stults (1985): All or nearly all negatively keyed items in a scale often define a single factor; the study demonstrated that this negative factor could be produced by a portion of respondents who fail to attend to the negative-positive wording of the items.
12. Jaroslovsky (1988): The wording of an item affects the mean response and the variability of responses.
13. Harasym (1992): A single-response, negatively worded item should be converted to a multiple-response, positively worded item.
14. Chang (1995): Connotatively consistent and connotatively inconsistent items measured correlated but distinct traits; items on a test or questionnaire should be connotatively consistent with the construct being measured.
Summary

Surveys are used extensively in a wide range of assessments. The increase in popularity of surveys as measures of affective outcomes has consequently focused a great deal of attention on their validity. It is important to understand the ways in which people use survey items and, especially, the factors that can influence the responses given. Many researchers have provided extensive reviews of research in this area. Previous research has not addressed the defensibility or accuracy of the assumption of construct equivalence with regard to using both positive and negative items in a survey.

The literature review has provided insights into the effects of response format on rating scales. The various research studies suggest that item wording does account for a certain portion of test score variance. If we are interested in the construct validity of an instrument, then measures should be taken to account for this stylistic variance (Nunnally, 1978, p. 660). However, the cited studies have shown inconsistent findings with respect to the effects of item wording on rating scales. The desirability of including a mixture of regular and reverse-scored items on attitude and questionnaire measuring instruments has yet to be determined conclusively. Research studies have yielded inconsistent and ambiguous support for balancing regular and reverse-scored items (Bentler, Jackson, & Messick, 1972; Jackson & Messick, 1965; Rorer, 1965). This circumstance raises serious concern about whether both regular and reverse-scored items should be included on a measuring instrument. The results of these studies indicate that further investigation is needed.

Although the recommendations of some authors conform with conventional psychometric advice, the experience of those authors suggests that the use of negative items may have at least some dysfunctional consequences. In their experience, negatively worded items often reduce scale reliability, and they may elicit response biases or measure unintended aspects of the constructs under investigation. In any event, since measurement validity requires instrument reliability, these impressions suggest that the use of negative items may not be cost-free.
In any event, as measurement validity requires instrument reliability, these impressions suggest that the use of negative items may not be cost-free.

The investigators have raised serious questions about the influence of item wording on student responses. Because of conflicting findings, however, it is difficult to draw firm conclusions. Furthermore, several complexities in this research literature add to the difficulty of drawing overall generalizations concerning the impact of item wording. The research to date suggests that positive-to-negative transformations change an item's psychometric characteristics and, more importantly, change the construct that an item is intended to measure. However, the studies that have been reviewed do not show that positively phrased items are necessarily better indicators of attitude. Nevertheless, there is some indication that negatively phrased items are less valid. There is the plausible argument that respondents may not understand that they can indicate agreement with a concept by disagreeing with a negative statement. Marsh (1984) provided support for the above contention. Despite the conventional wisdom so often found in measurement textbooks, pronouncements by researchers in the study of item phrasing have been unanimous that negatively phrased items reduce the validity of a questionnaire (Andrich, 1983; Campbell & Grissom, 1979; Marsh, 1984; Schriesheim & Hill, 1981; Simpson et al., 1976).

One measure almost universally adopted in Likert-type scales to minimize the influence of the response set triggered by item content, the tendency to agree, and the tendency to mark in the left or right columns is to include an equal number of positively and negatively valenced statements in the scales. An actual reversal of the meaning of an item, however, may be hard to achieve. To complicate matters still further, it is not easy to determine whether the meaning of an item has been virtually reversed. Rorer (1965) has provided many examples of reversed pairs of items that on close scrutiny turn out not to be reversals.

The assumption that negative and positive items measure the same construct is so widely prevalent among test developers that it seems to be accepted almost without question. Test constructors' conviction in this assumption is further fortified by empirically obtained high indices of homogeneity, contrary to psychometricians' warning that homogeneity neither implies nor guarantees the unidimensionality of the trait being measured by the test. The problem arises out of the test constructors' common practice of using negative items based on unverified assumptions. Generally, in verbal self-report measures of latent traits, it is assumed that, given standard testing conditions, an examinee's response is determined by item contents, examinee characteristics, and, to some extent, instrument artifacts. Most Likert-type scales include a balance of semantically negative- and positive-valence items with the intent of ridding the instrument of the effects of certain response styles. It is further assumed that both negative and positive items measure the same trait. The empirical verification of this assumption seems to have received little attention from researchers. The purpose of this study was to investigate the effects of different modes of item presentation on the responses of various groups of high school students.
Taking a simple case of semantic change, a sentence with a positive import, for example, "I like myself," can be reversed in meaning either by replacing the word "like" with one of its antonyms (e.g., dislike) or by using a grammatically negative mode (e.g., "I do not like myself"). Similarly, a negatively valenced sentence can be reversed in meaning by using the same transformations. Following this scheme, four distinct modes of semantic presentation can easily be defined: Mode 1 (regular), "I like myself"; Mode 2 (negated), "I do not like myself"; Mode 3 (polar opposite), "I dislike myself"; and Mode 4 (negated polar opposite), "I do not dislike myself."

More specifically, the study looks into the following questions: What is the influence of the wording of rating scale items on rating scales? What is the influence of item wording on scale and item score means, distributions, reliabilities, and correlations between scales? What is the item equivalence between different modes of item wording? Do different modes of presenting the seemingly same content measure the same trait? Does the grammatically negative mode incorporate ambiguity into the items? How much variance is due to format factors? Does stating a concept in a positive manner affect reactions differently than stating the same concept in a negative manner? Do students agree with positively stated items to the same extent that they disagree with the concept when stated negatively? These questions warrant further investigation of the effects of item wording on rating responses.

CHAPTER III
RESEARCH DESIGN AND PROCEDURE

Introduction

The purpose of this study was to investigate the effects of item wording on rating responses. This chapter includes a description of the development of the test instrument, the selection of the sample, the procedures of test administration, the quality control screening, and the statistical procedures for examining the research questions of this study.

Development of the Test Instrument

General Shame Scale

The General Shame Scale was developed by Chang & Hunter (1988). On the basis of an examination of major writings in the shame literature, everyday language, and clients' verbal reports, the authors developed and tested scales measuring each of six shame themes: disappointment with oneself, feelings of inferiority, feelings of defectiveness, feelings of worthlessness, feelings of unimportance, and feelings of falling short of one's own standards and ideals. Items were written to avoid reference to other affects such as embarrassment or self-consciousness. These shame themes were highly correlated and were shown by confirmatory factor analysis to measure one underlying factor. The specific factors were shown to be uncorrelated with measures of emotional and social function once general shame was partialed out. The authors then combined the shame theme scales to form a general shame scale. Initially, 65 items were written altogether for the six shame theme scales. Mainly on the basis of content meanings, the 22 best items were retained. The reliability of the 22-item general shame scale was .95. The shame theme scales are presented in Table 2.

For this study, four versions of the items in four distinct modes of semantics were prepared (Appendix A).
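Before the four versions are described in detail, the sketch below shows one way the four wording modes defined above could be generated programmatically from a base statement and an antonym for its key word. The helper function and the antonym pair are illustrative assumptions, not part of the study's materials; the actual items were written and reviewed by hand.

```python
# A minimal sketch of the four-mode wording scheme described above.
# The helper name and the antonym lookup are illustrative assumptions;
# the study's items were constructed and reviewed manually.

def four_modes(subject: str, verb: str, antonym: str, obj: str) -> dict:
    """Generate the four semantic modes for a statement like 'I like myself'."""
    return {
        "Mode 1 (regular)": f"{subject} {verb} {obj}.",
        "Mode 2 (negated)": f"{subject} do not {verb} {obj}.",
        "Mode 3 (polar opposite)": f"{subject} {antonym} {obj}.",
        "Mode 4 (negated polar opposite)": f"{subject} do not {antonym} {obj}.",
    }

for mode, sentence in four_modes("I", "like", "dislike", "myself").items():
    print(f"{mode}: {sentence}")
# Mode 1 (regular): I like myself.
# Mode 2 (negated): I do not like myself.
# Mode 3 (polar opposite): I dislike myself.
# Mode 4 (negated polar opposite): I do not dislike myself.
```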
In the first version, the statements were presented by semantically positive words or phrases, for example, "I like myself." In the second version, the statements were presented by semantically positive words or phrases structured in grammatically negative sentences: the 22 sentences of the first version were transformed into "do not" sentences, for example, "I do not like myself." In the third version, the statements were reversed in meaning by replacing the key word with one of its antonyms (e.g., "like" became "dislike," as in "I dislike myself"). Similarly, in the fourth version, the items in the third version were transformed into "do not" sentences, for example, "I do not dislike myself." Since English is not the first language of the author of this study, the items were carefully reviewed by a native speaker to ensure the accuracy of the conversion across the four modes of item wording.

Chang & Hunter (1988) have shown that shame relates very substantially to emotional and social problems. In their study, people high in shame were high in anxiety (r = 0.86) and low in life satisfaction (r = -0.73). Hence, for the sake of validation, two other scales, which measure anxiety and life satisfaction, were used in this study.

Dimensions of Self-Concept (DOSC)

The Dimensions of Self-Concept (DOSC) (Michael & Smith, 1976) was developed to measure non-cognitive factors associated with self-concept in a school setting. There are two main purposes for the DOSC: a) to identify those students who might experience difficulty in their schoolwork because of their perceptions of a low degree of self-esteem; and b) to diagnose, for purposes of counseling or guidance by the teacher, professional counselor, or administrator, those dimensions as well as the specific activities associated with them that might contribute to low self-esteem and might impair learning capabilities relative to negative affectivity. The DOSC is a self-report instrument that reflects the perceptions that students have regarding each statement of the five main dimensions: aspiration, anxiety, academic interest and satisfaction, leadership and initiative, and identification vs. alienation. The five factor dimensions measured by the DOSC scales are described as follows:

1. Level of Aspiration. This factor is a manifestation of patterns of behavior that portray the degree to which achievement levels and academic activities of students are consistent with their perceptions of their potentials in terms of scholastic aptitude or past and current attainments.

2. Anxiety. This factor reflects behavior patterns and perceptions associated with emotional instability, a lack of objectivity, and a heightened concern about tests and the preservation of self-esteem in relation to academic performance.

3. Academic Interest and Satisfaction. This factor portrays the love of learning and the pleasure gained by students in doing academic work and in studying new subject matter.

4. Leadership and Initiative. This factor appears to represent those behavior patterns and perceptions that are associated with star-like qualities, in which a student has an opportunity to demonstrate his mastery of knowledge, to help others, to give direction to group activities, to become a respected expert whom others consult, and to put forth sound suggestions for classroom activities.

5. Identification vs. Alienation. This factor is intended to represent the extent to which a student feels that he has been accepted as part of the academic community and has been regarded by his teachers and peers as a significant person.

The 14-item Anxiety Factor was used in this study (Table 3).
A 0.82 coefficient alpha was reported for the 14-item scale (Michael & Smith, 1976).

Satisfaction with Life Scale

The Satisfaction with Life Scale (Diener et al., 1985) is a five-item scale that measures global life satisfaction. The scale is designed to assess global life satisfaction and does not tap related constructs such as positive affect or loneliness. The purpose of the scale is to obtain an overall judgment of the respondent's life in order to measure the concept of life satisfaction. In the initial phase of item construction, 48 self-report items were generated. A five-item scale was formed after a factor analysis and an examination of the semantics. Diener et al. (1985) reported a high reliability (0.87 coefficient alpha) for the scale. It also correlates moderately to highly with other measures of subjective well-being. The scale is well suited for use with different age groups. The high correlations with personality indicators of well-being suggested that the scale might also be useful in clinical settings (Table 4).

Questionnaire Development

The 107 items were constructed and translated into equivalent Chinese structures. The preservation of the connotation of meaning and the essence of the original sentences was the top priority. Furthermore, the items (in Chinese) were edited for clarity and readability by school teachers, who were requested to edit each item with respect to its clarity and suitability for middle and high school students. The Chinese version of the questionnaire was refined on the basis of the teachers' comments. When the English version of the questionnaire was given to the Chinese teacher who assisted with the translation, she raised a concern regarding the rating scale. In the English version, the anchors of the rating scale are "Never," "Seldom," "Sometimes," "Often," and "Almost Always." She pointed out that the last scale anchor, "Almost Always," might not be an appropriate word choice; in the Chinese version, "Almost Always" was changed to "Always."

In order to ensure the appropriate wording of the survey and the students' understanding of it, a pilot test was conducted in two classrooms totaling about 80 to 90 students. The subsequent item analysis served as a basis for further refinement of the translation and wording of the survey. Based on the recommendations of the teachers and the students, the researcher made some very slight changes to the wording of some of the survey items. The items were then assembled in random order into a questionnaire with instructions. A copy of the survey can be found in Appendix B; the survey in Chinese can be found in Appendix C. Subjects also completed items on their gender and age.

Table 2. The General Shame Scale (Chang & Hunter, 1988)

(1) Disappointment with Oneself
1. I don't like myself.
2. I am pleased with myself.
3. I am disappointed in myself.
4. I feel ashamed of myself.

(2) Feelings of Inferiority
5. I feel like I am just not quite good enough.
6. I feel that I am inferior to most of my friends.
7. Compared to others, I feel like I don't measure up.
8. I am just as good as my friends.

(3) Feelings of Defectiveness
9. I feel inadequate as a person.
10. I feel there is something defective in my character.
11. I look down on myself because of my flawed character.
12. I see myself as intact and without personal defects.

(4) Feelings of Worthlessness
13. I feel worthless as a person.
14. I feel like a useless person.
15. I feel like I am good for nothing.
16. I feel I am a complete failure as a person.
17. I feel like a failure.
18. I am a worthwhile person.

(5) Feelings of Unimportance
19. I feel unimportant.
20. I feel so insignificant to others, as if I were invisible.

(6) Falling Short of Own Standards and Ideals
21. I always seem to fall short of my aspirations.
22. I find that I don't live up to my own standards and ideals.

Table 3. Dimensions of Self-Concept (DOSC) (Anxiety Factor) (Michael & Smith, 1976)

1. Statements that some teachers make about my schoolwork hurt my feelings.
2. I feel so nervous about some of my classes that it is hard for me to attend.
3. I become tense and nervous when I am studying.
4. I am upset about so many things that I cannot concentrate on or do my schoolwork.
5. I worry about how well I am doing in my classes.
6. I am afraid to ask teachers to explain a difficult concept a second or third time.
7. I avoid talking to my classmates about schoolwork because they might make fun of me.
8. I become frightened when a teacher calls on me in class.
9. Talking in front of class makes me feel nervous.
10. I feel upset when I have to take a test.
11. I would be afraid to tell a teacher that he or she made a mistake in explaining an assignment or in working a problem.
12. I have trouble sleeping well the night before an important examination.
13. I am embarrassed to face my friends or family if I have made a low grade on a test or assignment.
14. I worry that my score on a test will not be one of the highest in class.

Table 4. Satisfaction with Life Scale (Diener et al., 1985)

1. I am satisfied with my life.
2. The conditions of my life are excellent.
3. In most ways, my life is close to my ideal.
4. So far, I have gotten the important things I want in life.
5. If I could live my life over, I would change almost nothing.

Selection of the Sample

Participation in the study was solicited from students enrolled in middle schools. The subjects in this study were students studying in the sixth and seventh grades in two schools located in Taipei, the capital of Taiwan. Although the schools were not randomly selected, they were considered typical of other schools within the district in terms of achievement, socioeconomic status, and ethnic background. In Taiwan, there are usually 35 to 40 students in one classroom. Twenty-two classrooms were sampled.

Procedures of Test Administration

The questionnaires were administered during class sessions. No titles were printed on the questionnaires. The questionnaire required approximately 20 minutes to complete. Before responding, students were told only that the survey dealt with opinions about themselves. Subjects were instructed to use all five points on the rating scale to provide an accurate reflection of their opinions and were encouraged to respond to all items. The teacher read the survey instructions aloud as the students followed along. Students were told that their responses would be completely anonymous; no names or other identifying information were collected. Students were told that the survey was completely voluntary, that they did not have to participate, and that they could leave unanswered any questions they thought were too personal. The teachers left their classrooms while the questionnaires were administered.

Quality Control Screening

Quality control procedures were developed with prior surveys to screen for incomplete or otherwise unusable responses. Students were instructed to respond to each item of the questionnaire.
After the questionnaires were completed, the students' responses were entered into an SPSS data file. A printout of this file was obtained, and each of the entries was checked for mis-entry against the students' surveys. All missing responses were coded "9". Each questionnaire was carefully inspected for any detectable abnormalities, and aberrant cases were discarded from the sample. A screening procedure was applied to exclude students who were not taking the survey seriously: if a student marked one particular option constantly throughout the survey, that survey was discarded. A questionnaire was also considered unusable if the student did not answer any items.

Data Analyses

Items representing each mode were summed to obtain a subtotal score for each mode. The total test score was obtained by adding the four subtotal scores. It was reasoned that if subjects' responses were mainly determined by item contents, as hypothesized, then all four categories of items would tap the same source trait (construct), and consequently, the correlation coefficients among the pairs of mode subtotal scores would be high; otherwise, they would be low. The Mode 1 (regular, "I like myself") and Mode 4 (negated polar opposite, "I do not dislike myself") variables were assumed to be positive aspects, and the Mode 2 (negated, "I do not like myself") and Mode 3 (polar opposite, "I dislike myself") variables were assumed to be negative aspects of the same construct. The intermode Pearson product-moment correlation coefficients were computed. Scale responses were scored by reversing all negatively worded questionnaire statements. Items representing each version were summed to obtain a subtotal score for each mode. A higher score on any of the subscales or on the total scale indicated a more positive attitude toward a certain aspect. The means, standard deviations, and reliabilities of the six shame theme scales and the general shame scale were computed. The intercorrelations between the six shame themes were also computed.

Hunter and Gerbing (1982) noted that if the right research design is used, then confirmatory factor analysis can be used to assess items using two criteria: "internal consistency" and "parallelism" (or "external equivalence"). If item responses differ from each other only by random error of measurement, then the item errors will not correlate with each other. The correlations between items within a scale should then satisfy a mathematical product rule discovered by Spearman (1904, cited in Hunter & Gerbing, 1982), his "one factor model." If the correlations between items within a scale satisfy the Spearman product rule, then the scale is said to be "internally consistent." This is a weak criterion for item equivalence. Hunter and Gerbing (1982) also noted that there is a stronger criterion for item equivalence: parallelism in the pattern of correlations between the items and important "outside" variables, such as the measures of emotional and social functioning used in the present study. If all items measure the same construct, then the item errors will not correlate with outside variables. The correlations between the items in a unidimensional scale and any outside variable should satisfy a condition called "parallelism" (Tryon, 1939; Tryon & Bailey, 1970; Hunter, 1973; Hunter & Gerbing, 1982). This is a strong test for item equivalence.
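As an illustration of the screening and scoring steps described above, the following sketch removes straight-line responders, reverse-scores the negatively worded modes, and computes the mode subtotals. The column-naming convention (m1_1 through m4_22) and the use of pandas are assumptions made for illustration; the original data handling was carried out in SPSS.

```python
# A minimal sketch of the quality-control screening and mode scoring,
# assuming responses sit in a pandas DataFrame with columns m1_1..m1_22,
# m2_1..m2_22, m3_1..m3_22, m4_1..m4_22 on a 1-5 scale (9 = missing).
# These names are illustrative; the study's data were coded in SPSS.
import pandas as pd

def score_modes(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    item_cols = [c for c in df.columns if c.startswith(("m1_", "m2_", "m3_", "m4_"))]
    df[item_cols] = df[item_cols].mask(df[item_cols] == 9)   # 9 was coded missing
    # Discard respondents who marked one option constantly (straight-lining)
    # and respondents who answered no items at all.
    constant = df[item_cols].nunique(axis=1) == 1
    empty = df[item_cols].isna().all(axis=1)
    df = df[~(constant | empty)].copy()
    # Reverse-score the negatively worded modes (Modes 2 and 3) so that a
    # higher score indicates a more positive attitude.
    neg_cols = [c for c in item_cols if c.startswith(("m2_", "m3_"))]
    df[neg_cols] = 6 - df[neg_cols]                           # 1<->5, 2<->4
    # Mode subtotals and the total score.
    for mode in ("m1", "m2", "m3", "m4"):
        cols = [c for c in item_cols if c.startswith(mode + "_")]
        df[mode.upper()] = df[cols].sum(axis=1)
    df["TOTAL"] = df[["M1", "M2", "M3", "M4"]].sum(axis=1)
    return df

# scored = score_modes(raw)
# print(scored[["M1", "M2", "M3", "M4"]].corr())   # intermode Pearson correlations
```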
Confirmatory Factor Analysis

A confirmatory factor analysis was run on the items organized into the predicted four clusters (modes of item wording). The analysis was used to examine the quality of each scale. Confirmatory factor analysis is a method for computing the correlations for constructs from the correlations for observed measures, so long as the measures obey the assumed measurement model. Each scale was checked for homogeneity of content, for internal consistency, and for external consistency or parallelism. The first analysis was done to assess each of the four scales separately. If there was no significant departure in fit, each of the four scales would prove internally consistent. The items within scales would be parallel in relationship to the other scales and to that extent would appear to be equivalent to each other.

A hierarchical measurement model was constructed with regard to the relationship between the modes of item wording and the responses. A hierarchical confirmatory analysis estimated the correlations between the item scores and the four modes of item wording. The correlations between the four scales and each of the measures of anxiety and life satisfaction were also computed. A confirmatory factor analysis of the four modes of item wording together fits the data only if each scale is parallel to each other scale in its pattern of correlations. A stronger test of parallelism was obtained by testing the items in each scale for parallelism in terms of how the items related to outside variables. To do this, each mode of item wording was tested separately for parallelism in relationship to the measures of anxiety and life satisfaction. If the first stage of analysis shows that each of the scales is unidimensional, then a second stage of analysis can test the hypothesis that the four modes of item wording are each measures of one underlying trait. This analysis first checks to see whether the separate scales define specific factors or whether they are identical to each other. If the scales are not highly correlated with each other, then there are specific factors that differentiate the shame constructs from one another.

Exploratory Factor Analysis

To see if there might be some completely unanticipated dimension in the data, an exploratory factor analysis of the items was also run. The communalities were estimated as the largest correlation. The principal axis factors were followed by VARIMAX rotation. The eigenvalue cutoff for the number of factors was set at 1.00. An examination of the resulting factors would show whether the clusters matched the a priori clusters. In other words, when the items were blindly grouped together using the highest loading from the varimax factors, the clusters thus formed should be similar to the original clusters (modes of item wording).
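A rough modern equivalent of this exploratory step is sketched below: the eigenvalue-greater-than-one rule is applied to the item correlation matrix, a varimax-rotated solution is extracted, and items are blindly grouped by their highest loading. The use of scikit-learn and the variable names are assumptions for illustration; the dissertation's analysis used SPSS with principal axis factoring, which this sketch only approximates.

```python
# A sketch of the exploratory step: retain factors with eigenvalue >= 1.00,
# then extract a varimax-rotated solution and group items by highest loading.
# scikit-learn's FactorAnalysis approximates, but does not reproduce, the
# principal axis factoring used in the study.
import numpy as np
from sklearn.decomposition import FactorAnalysis

def explore_factors(items: np.ndarray) -> dict:
    """items: respondents x items matrix of scored responses."""
    corr = np.corrcoef(items, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]       # descending order
    n_factors = int((eigenvalues >= 1.0).sum())        # latent root criterion
    fa = FactorAnalysis(n_components=n_factors, rotation="varimax")
    fa.fit(items)
    loadings = fa.components_.T                        # items x factors
    # Blindly assign each item to the factor with its highest absolute loading.
    assignment = np.abs(loadings).argmax(axis=1)
    return {"n_factors": n_factors, "loadings": loadings, "assignment": assignment}

# result = explore_factors(scored_items)
# print(result["n_factors"])   # 19 factors were retained in the study's first run
```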
Parallelism

Parallelism is the basis for the use of the correction for attenuation to eliminate the bias in correlations produced by error of measurement. Thus the test for parallelism is the test that directly justifies the use of the correction formulas. Since the correction formulas are implicit in confirmatory factor analysis, it is the test for parallelism that is the heart of the assumptions of confirmatory factor analysis. Items in each of the four scales were examined in terms of how they correlated with outside variables. If the items in a cluster all measure the same affect, then the items in that cluster should correlate in a parallel way with each of the outside variables. Parallelism was tested by computing the correlations between the items and the outside variables (anxiety and life satisfaction), and by examining that correlation matrix for parallelism (as noted by Hunter, 1973, and Tryon and Bailey, 1970).

The most direct way to check for equivalence is to correlate scales measuring the same thing (though it is important to correct for the attenuation produced by random error of measurement). If two measures are equivalent, then they correlate in exactly the same way with other variables. The correlations between all four scales, anxiety (DOSC Anxiety Factor), and life satisfaction (Satisfaction with Life Scale) were computed and were corrected for attenuation due to error of measurement.
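The classical correction for attenuation divides an observed correlation by the geometric mean of the two reliabilities. A minimal sketch, with purely hypothetical numbers, follows; the function name is an illustrative assumption.

```python
# Classical correction for attenuation: the correlation between true scores
# is estimated as r_xy / sqrt(r_xx * r_yy), where r_xx and r_yy are the
# reliabilities of the two measures. The example numbers are hypothetical.
from math import sqrt

def disattenuate(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Correct an observed correlation for attenuation due to measurement error."""
    return r_xy / sqrt(r_xx * r_yy)

# e.g., an observed scale correlation of 0.50 with reliabilities 0.90 and 0.75:
print(round(disattenuate(0.50, 0.90, 0.75), 3))   # 0.609
```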
Summary

The research design of this study was aimed at addressing two main issues concerning the effect of item wording on rating responses. The first issue was the influence of item wording on scale and item score means, distributions, reliabilities, and correlations between scales. The second issue was the item equivalence between different modes of item wording. Two Taiwanese schools were selected for this purpose, both in an urban setting in Taipei. A questionnaire was constructed and administered to a total of 861 sixth and seventh grade students from the two schools. Demographic information was also obtained regarding the students' grade, gender, and age. All this information was coded into an SPSS file, cleaned, and subjected to data analysis. Descriptive statistics, correlational analyses, and reliability analyses were obtained. A confirmatory factor analysis was also conducted to further analyze the data, and various other statistical tests were performed on the results of the survey data.

In summary, the findings of several other researchers clearly suggest that item wording can make a difference. They have not shown, however, that a different construct was being measured. This study provides a second set of analyses to test whether altering item wording had an effect on the nature of the construct being measured. The results will have important implications for measurement instrument design by addressing the question: just what are the effects of item wording on item responses? The purpose of this current research was, therefore, to explore the item wording issue further and to clarify its implications for the validity of questionnaires.

CHAPTER IV
ANALYSES AND INTERPRETATION OF THE DATA

Introduction

This chapter presents the data analyses. First, a general description of the characteristics of the sample will be presented. This will be followed by an account of the factor analyses of the questionnaire data. The manner in which the analyses were conducted will be explained, and the implications of the results for survey development will be discussed. Finally, the results for the two research questions of the study, together with their interpretations, will be reported.

Characteristics of the Sample

A total of 861 surveys were returned. All the surveys were carefully inspected before data entry. Eleven surveys were discarded because the participants marked the same score on each item, and one student did not fill out the survey. A total of 849 usable surveys were included in this study. They were completed by students from two middle schools in Taipei. There were 443 (52.2%) male and 406 (47.8%) female students. All students were from an urban setting. Their ages ranged from 12 to 17: eleven (1.3%) students were twelve years old; 152 (17.9%) were thirteen; 451 (53.1%) were fourteen; 228 (26.9%) were fifteen; 5 (0.6%) were sixteen; and 2 (0.2%) were seventeen. Two hundred and fifty-six students (30.3%) were in the sixth grade and 593 students (69.8%) were in the seventh grade. Tables 5, 6, 7, and 8 show the distribution of students by gender, age, and grade.

Table 5. Gender of the Participants

Gender    Frequency    Percent
Male      443          52.2%
Female    406          47.8%
Total     849          100.0%

Table 6. Age of the Participants

Age      Frequency    Percent
12       11           1.3%
13       152          17.9%
14       451          53.1%
15       228          26.9%
16       5            0.6%
17       2            0.2%
Total    849          100.0%

Table 7. Grade Level of the Participants

Grade        Frequency    Percent
6th Grade    256          30.3%
7th Grade    593          69.8%

Table 8. Participants' Gender by Grade

Grade        Male    Female    Total
6th Grade    131     125       256
7th Grade    312     281       593
Total        443     406       849

Analyses of the Questionnaire Data

The 107 items were coded into six subscales, as shown in Table 9. The four modes of the shame scale represented four different modes of semantics, for a total of 88 items (as described in Chapter III). In Mode 1, the statements were presented by semantically positive words or phrases. In Mode 2, the statements were presented by semantically positive words or phrases structured in grammatically negative sentences; the 22 sentences of Mode 1 were transformed into "do not" sentences. In Mode 3, the sentences were reversed in meaning by replacing each adjective with one of its antonyms. Similarly, in Mode 4, the items in Mode 3 were transformed into "do not" sentences. DOSC-Anxiety is a 14-item scale that measured student academic anxiety. The Satisfaction with Life Scale is a 5-item scale that measured global life satisfaction.

Table 9. Coding Format of the Subscales in the Questionnaire

Mode 1: 2, 3, 9, 13, 24, 25, 36, 41, 48, 49, 57, 61, 67, 81, 83, 85, 87, 88, 95, 100, 102, 104
Mode 2: 4, 10, 20, 23, 31, 34, 43, 46, 51, 59, 60, 64, 72, 78, 82, 86, 89, 93, 96, 99, 101, 105
Mode 3: 7, 8, 12, 18, 30, 32, 35, 37, 39, 50, 52, 54, 55, 66, 73, 75, 77, 80, 84, 92, 94, 97
Mode 4: 1, 5, 6, 14, 16, 17, 21, 26, 27, 40, 42, 44, 47, 53, 56, 63, 68, 71, 79, 91, 98, 107
DOSC-Anxiety Factor: 11, 15, 19, 29, 38, 45, 62, 65, 69, 70, 74, 90, 103, 106
Satisfaction with Life Scale: 22, 28, 33, 58, 76

Answers to the Research Questions

Research Question 1: What is the influence of item wording on scale and item score means, distributions, reliabilities, and correlations between scales?

Histogram Analyses

To determine whether the sample chosen was representative of a normal population, a histogram of the students' scores on the questionnaire was plotted (see Figure 1). The mean and standard deviation of the 107-item questionnaire were 298.2 and 53.52, respectively (Appendix D). The item descriptive statistics for the four modes can be found in Appendix E. The plot showed that the distribution of the total scores was not normal; the test of normality was significant at the p < 0.0001 level. For purposes of comparison, the points of a normal curve based on all valid values of the scores were superimposed on the histogram. Four other histograms were plotted for the four modes of items in the questionnaire (see Figures 2, 3, 4, and 5). These distributions were also found to be non-normal, except for Mode 1 (Table 10). The means of the four modes were 68.5 (Mode 1), 78.7 (Mode 2), 83.1 (Mode 3), and 67.9 (Mode 4). The means of Mode 2 and Mode 3 were higher than those of Mode 1 and Mode 4.
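The normality tests reported in Table 10 below use the Kolmogorov-Smirnov statistic with estimated parameters (the Lilliefors variant). A sketch of an equivalent check follows; the use of statsmodels and the column names are assumptions for illustration, since the original tests were run in SPSS.

```python
# A sketch of the normality check reported in Table 10: a Kolmogorov-Smirnov
# test against the normal distribution with estimated mean and variance
# (the Lilliefors variant). Column names are illustrative assumptions.
from statsmodels.stats.diagnostic import lilliefors

def normality_table(scored) -> None:
    """scored: DataFrame with TOTAL and M1-M4 subtotal columns."""
    for col in ("TOTAL", "M1", "M2", "M3", "M4"):
        stat, pval = lilliefors(scored[col].dropna(), dist="norm")
        print(f"{col}: statistic = {stat:.3f}, p = {pval:.4f}")

# In the study, only Mode 1 did not depart significantly from normality.
```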
Table 10. Tests of Normality (Kolmogorov-Smirnov)

Scale               Statistic    Sig.
Total of 4 Modes    .047         .000
Mode 1              .028         .142
Mode 2              .053         .0001
Mode 3              .084         .0001
Mode 4              .043         .001

[Figure 1. Histogram of the totals of the 107 items on the questionnaire (Mean = 298.2, Std. Dev = 53.52).]

[Figure 2. Histogram of Mode 1 scores on the questionnaire (Mean = 68.5, Std. Dev = 17.37).]

[Figure 3. Histogram of Mode 2 scores on the questionnaire (Mean = 78.7, Std. Dev = 14.84).]

[Figure 4. Histogram of Mode 3 scores on the questionnaire (Mean = 83.1, Std. Dev = 16.15).]

[Figure 5. Histogram of Mode 4 scores on the questionnaire (Mean = 67.9, Std. Dev = 15.80).]

Reliability Analysis

Based on the sample of 849 students, mean scores and standard deviations were obtained, and test reliability for the total scores of the four modes of the General Shame Scale (88 items) was calculated. The internal consistency reliability was estimated using coefficient alpha (Cronbach, 1951). The results are shown in Table 11. In the analysis of variance, the F statistic for the variation between items was significant (F = 152.86, p < 0.001), indicating that the items had significantly different means. This finding was confirmed by the large Hotelling's T-squared statistic (T-squared = 4084.47), a test for the equality of means: its F statistic (F = 42.19, p < 0.001) was significant and indicated that the hypothesis that the items had equal means in the population could be rejected. The 88-item test was reliable, with Cronbach's alpha at 0.9670.

Table 11. Results of Reliability Analysis on the Four Modes of the General Shame Scale (88 Items) in the Questionnaire

Analysis of Variance

Source of Variation    Sum of Sq.     DF       Mean Square    F           Prob.
Between People         27606.2646     848      32.5546
Within People          93563.2159     73863    1.2667
Between Measures       14289.9870     87       164.2527       152.8626    .0000
Residual               79273.2289     73776    1.0745
Total                  121169.4806    74711    1.6218

Grand Mean = 3.3890
Hotelling's T-Squared = 4084.4675, F = 42.1867, Prob. = .0000
Degrees of Freedom: Numerator = 87, Denominator = 762
Reliability Coefficients (88 items): Alpha = .9670; Standardized item alpha = .9682

Another reliability analysis was conducted on each of the four modes of the questionnaire, the anxiety factor, and the life satisfaction factor. The results are shown in Table 12. The Cronbach's alpha coefficients for the four modes were 0.9392 (Mode 1), 0.9059 (Mode 2), 0.9383 (Mode 3), and 0.8891 (Mode 4), showing that the four modes had about the same reliability. The Cronbach's alpha coefficients for the anxiety factor and the life satisfaction factor were 0.7524 and 0.6268, respectively.

Table 12. Means, Standard Deviations, and Cronbach's Alpha Coefficients for the Subscales

Subscale               Mean     Standard Deviation    Cronbach's Alpha
Mode 1                 68.50    17.37                 0.9392
Mode 2                 78.70    14.84                 0.9059
Mode 3                 83.10    16.15                 0.9383
Mode 4                 67.93    15.80                 0.8891
DOSC-Anxiety Factor    34.51    8.51                  0.7524
Life Satisfaction      14.50    3.89                  0.6268
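Coefficient alpha can be computed directly from the item variances and the total-score variance. A minimal sketch is given below; the function name is an illustrative assumption, and the formula is the standard Cronbach (1951) definition.

```python
# Cronbach's alpha from the standard definition (Cronbach, 1951):
# alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score)).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of scored responses."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# e.g., alpha for the 88 shame items, or for each 22-item mode separately:
# print(cronbach_alpha(shame_items))   # reported as 0.9670 in Table 11
```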
Correlational Analyses

Items representing each mode were summed to obtain a subtotal score for each mode, and total test scores were obtained by adding the four subtotal scores. It was reasoned that if subjects' responses were mainly determined by item contents, then all four categories of items would tap the same source trait (construct), and, consequently, the correlation coefficients among the pairs of mode subtotal scores would be high; otherwise, they would be low. The Mode 1 and Mode 4 variables were assumed to be positive, and Modes 2 and 3 were assumed to be negative, aspects of the same construct.

Pearson correlation coefficients were obtained for each of the subscales (Mode 1 to Mode 4), the DOSC-Anxiety Factor, the Satisfaction with Life Scale, and gender. Inspection of these correlations made clear that the subscales correlated with each other. The results of the correlational analyses are presented in Table 13. The Pearson correlation coefficients ranged from -0.133 to 0.898. All subscales were significantly correlated with each other, with the exception of the correlation between the Satisfaction with Life Scale and gender. Gender had low correlations with the four modes, the DOSC-Anxiety Factor, and the Satisfaction with Life Scale, ranging from -0.133 to 0.128; there appears to be little relationship between gender and these scales.

Table 13. Correlations between the Four Modes, DOSC-Anxiety Factor, Satisfaction with Life Scale, and Gender

          Mode 1    Mode 2    Mode 3    Mode 4    DOSC     Life      Gender
Mode 1    1.000
Mode 2    0.660     1.000
Mode 3    0.672     0.898     1.000
Mode 4    0.656     0.315     0.346     1.000
DOSC      0.325     0.465     0.491     0.166     1.000
Life      0.616     0.452     0.442     0.345     0.219    1.000
Gender    -0.110    -0.133    -0.129    -0.055    0.128    -0.002    1.000

The major problem this study set out to investigate concerned the construct validity of the differential semantic modes of item presentation. Among the six correlation coefficients between the pairs of the four semantic mode scores, all are significantly different from zero (p < 0.05). However, the double-negative Mode 4 item score showed a smaller relationship with both the Mode 2 and Mode 3 item scores. Mode 2 and Mode 3 contained, overall, negative-valence items, whereas Modes 1 and 4 had positive-valence items. The largest correlation coefficient in the set (0.898) was between Mode 2 and Mode 3. On the basis of the correlational evidence, it is conspicuous that, by and large, Mode 4 seems to measure a different construct from Mode 2 and Mode 3 in spite of their deceptive content similarity. On the other hand, the value (0.97) of coefficient alpha, which is generally taken as evidence of the homogeneous nature of the test components, is quite impressive.

An inspection of the item-remainder correlation coefficients revealed that 8 of the 88 items had corrected item-total correlations of less than .35 (Appendix F). Of these eight items, two belong to Mode 2 and six to Mode 4 (Table 14). The salient common feature of these modes is that they are characterized by the grammatically negative form. This produces evidence in favor of the argument that the "do not" form of sentences, even with an otherwise simple structure, creates ambiguity and confusion, and that double negatives, as in Mode 4, add to this confusion.

Table 14. Items with Corrected Item-Total Correlations Less Than 0.35

Item                                                                         Corrected Item-Total Correlation

Mode 2
Q64. I do not feel significant enough to others that people notice me.      0.2249
Q89. I don't see myself as intact and without personal defects.             0.1127

Mode 4
Q1. I do not see myself as flawed and with personal defects.                0.2599
Q5. I am not unhappy with myself.                                           -0.4999
Q6. I do not always fall short of my aspirations.                           0.1564
Q27. I do not feel unimportant.                                             0.3383
Q56. Compared to others, I do not feel like I am less of a person.          0.2797
Q71. I don't feel inferior to most of my friends.                           0.3006

Research Question 2: What is the item equivalence between different modes of item wording?

MANOVA Analyses

One may hypothesize that if all four modes of presentation measured the same construct in the same way in all the groups of students, then one should be able to reach the same conclusions on the basis of any of the previously defined five dependent variables (the four mode subtotal scores and the total score). With male and female as the group determiners, a multivariate analysis of variance (MANOVA) was conducted. The results of the MANOVA are presented in Tables 15 and 16. Inspection of each row in Table 16, showing the F-ratios calculated from the four different sets of scores (the four item modes), reveals that the four F-ratio values were far from equivalent. The MANOVA results showed that the responses to the modes of item wording differ significantly between males and females. The F-ratios of Mode 1, Mode 2, and Mode 3 were all significant at the 0.05 level; Mode 4 was not significant at the 0.05 level. These results were similar to those of the correlational analyses. It seems that Mode 4, which has double-negative semantics, introduced some ambiguity into the items.

Table 15. Descriptive Statistics of the Four Semantic Modes by Gender

Mode      Gender    Mean       Std. Deviation
Mode 1    Male      70.3273    16.6181
          Female    66.5025    17.9645
          Total     68.4982    17.3704
Mode 2    Male      80.5553    14.1062
          Female    76.6010    15.3556
          Total     78.6643    14.8404
Mode 3    Male      85.1400    15.1455
          Female    80.9606    16.9337
          Total     83.1413    16.1517
Mode 4    Male      68.7585    16.0670
          Female    67.0222    15.4791
          Total     67.9282    15.8031

Table 16. MANOVA Tables by Gender for the Four Semantic Mode Subtotal Scores

Multivariate Tests

Effect       Test                  Value     F           Hypothesis df    Error df    Sig.    Eta Squared
Intercept    Pillai's Trace        .975      8215.575    4.000            844.000     .000    .975
             Wilks' Lambda         .025      8215.575    4.000            844.000     .000    .975
             Hotelling's Trace     38.936    8215.575    4.000            844.000     .000    .975
             Roy's Largest Root    38.936    8215.575    4.000            844.000     .000    .975
Gender       Pillai's Trace        .019      4.062       4.000            844.000     .003    .019
             Wilks' Lambda         .981      4.062       4.000            844.000     .003    .019
             Hotelling's Trace     .019      4.062       4.000            844.000     .003    .019
             Roy's Largest Root    .019      4.062       4.000            844.000     .003    .019

Tests of Between-Subjects Effects

Source             Dependent Variable    Type III Sum of Squares    df     Mean Square    F            Sig.    Eta Squared
Corrected Model    Mode 1                3099.210                   1      3099.210       10.385       .001    .012
                   Mode 2                3312.574                   1      3312.574       15.294       .000    .018
                   Mode 3                3700.347                   1      3700.347       14.409       .000    .017
                   Mode 4                638.661                    1      638.661        2.562        .110    .003
Intercept          Mode 1                3966279.422                1      3966279.422    13290.652    .000    .940
                   Mode 2                5232215.283                1      5232215.283    24157.626    .000    .966
                   Mode 3                5844726.448                1      5844726.448    22758.468    .000    .964
                   Mode 4                3905689.591                1      3905689.591    15667.897    .000    .949
Gender             Mode 1                3099.210                   1      3099.210       10.385       .001    .012
                   Mode 2                3312.574                   1      3312.574       15.294       .000    .018
                   Mode 3                3700.347                   1      3700.347       14.409       .000    .017
                   Mode 4                638.661                    1      638.661        2.562        .110    .003
Error              Mode 1                252767.037                 847    298.426
                   Mode 2                183448.755                 847    216.586
                   Mode 3                217522.692                 847    256.815
                   Mode 4                211139.956                 847    249.280
Total              Mode 1                4239381.000                849
                   Mode 2                5440436.000                849
                   Mode 3                6089921.000                849
                   Mode 4                4129263.000                849
Corrected Total    Mode 1                255866.247                 848
                   Mode 2                186761.329                 848
                   Mode 3                221223.039                 848
                   Mode 4                211778.617                 848
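A sketch of the gender MANOVA on the four mode subtotals is shown below. The use of statsmodels and the column names are assumptions for illustration; the original analysis was run in SPSS.

```python
# A sketch of the MANOVA on the four mode subtotal scores with gender as
# the grouping factor. Column names (M1-M4, gender) are illustrative
# assumptions; the study's analysis was carried out in SPSS.
from statsmodels.multivariate.manova import MANOVA

def gender_manova(scored):
    """scored: DataFrame with numeric M1-M4 subtotals and a gender column."""
    mv = MANOVA.from_formula("M1 + M2 + M3 + M4 ~ gender", data=scored)
    # Reports Pillai's trace, Wilks' lambda, Hotelling-Lawley trace, and
    # Roy's greatest root for the intercept and the gender effect.
    print(mv.mv_test())

# Follow-up univariate F-tests per mode (as in Table 16) could use, e.g.,
# statsmodels.formula.api.ols("M1 ~ gender", scored) with anova_lm.
```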
Exploratory Factor Analyses

The survey was subjected to a principal components analysis as the initial method of factor extraction. The two criteria used to select the model were obtained from suggestions offered by Rummel (1970) and Hair, Anderson, and Tatham (1987): a) statistical indicators for selecting the 'best' number of factors, and b) an analysis of the content of the factors. Selection rules for the best number of factors have been developed for eigenvalues, scree plots, percent of variance accounted for by the model, overlap in factor loadings, and loading values. The following selection rules applied. First, a factor must have an eigenvalue of one or greater to be considered significant. Second, a scree plot indicates the maximum number of factors to extract at the point where the plot becomes horizontal, that is, where the curve first begins to straighten out; as a general rule, the scree tail test will result in at least one more factor being considered significant than will the latent root criterion (Hair et al., 1987). Third, the percent of variance accounted for should be as great as possible in considering the best number of factors. Fourth, the choice of the best number of factors should have the least amount of overlap in the total item factor loadings. Finally, an item is assigned on the basis of its largest absolute factor loading. Loading values should be at least 0.30 to be considered significant, while factor loadings of 0.50 or greater are considered very significant. Ultimately, the number of significant loadings in each column of the factor matrix associated with one variable would need to be maximized (Hair et al., 1987). Thus, a dually loaded item would be placed in the factor with the higher loading.

Product-moment correlation coefficients were computed, and a principal components analysis with iterations was performed on the resulting correlation matrix. The 107-item questionnaire was subjected to a principal components analysis. The factor analysis yielded nineteen factors with eigenvalues greater than 1.00. Three items from the DOSC (questions #11, #15, and #74) had no factor loadings over 0.30 and hence could not be identified with any of the factors. The scree plot, which is the plot of the total variance associated with each factor, is shown in Figure 6. The factor structure of the questionnaire can be found in Table 17.

[Figure 6. Scree Plot from the Exploratory Factor Analysis (19 Factors).]

Table 17. Factor Structure of the Questionnaire (19 Factors)

Factor    Question Number
1         2, 3, 4, 5, 7, 8, 9, 10, 12, 13, 17, 18, 20, 21, 23, 24, 25, 26, 28, 30, 31, 32, 33, 34, 35, 36, 37, 39, 41, 42, 44, 46, 49, 50, 51, 52, 54, 55, 60, 61, 66, 68, 72, 73, 75, 79, 87, 88, 92, 93, 94, 97
2         1, 14, 16, 27, 40, 47, 53, 56, 63, 71, 91, 98, 99, 100, 101, 104
3         57, 58, 59, 102, 107
4         38, 62, 65, 69, 70, 89, 90, 105
5         43, 48
6         71
7         64, 85, 103
8         76, 84
9         80, 96
10        67, 95
11        6, 106
12        83
13        86
14        82
15        78
16        45
17        22
18        81
19        77

Since the original factor structure was not interpretable, the number of factors was reduced to six in another analysis. This factor structure can be found in Table 18.
Table 18. Factor Structure of the Questionnaire (6 Factors)

Factor    Question Number
1         4, 5, 7, 8, 10, 12, 18, 23, 30, 31, 32, 34, 35, 37, 39, 46, 49, 50, 51, 52, 54, 55, 57, 59, 60, 61, 66, 72, 75, 77, 78, 82, 86, 92, 93, 94, 96, 97, 99, 101
2         1, 6, 13, 14, 15, 16, 17, 21, 26, 27, 41, 42, 44, 47, 53, 63, 68, 71, 79, 81, 98, 104, 107
3         2, 3, 9, 24, 25, 28, 33, 36, 83, 84, 85, 87, 88, 95
4         11, 38, 45, 62, 65, 69, 70, 74, 90, 103, 106
5         43, 48, 67, 73, 76, 80, 91, 105
6         19, 20, 41, 100, 102

Confirmatory Factor Analyses

Confirmatory factor analysis may not be a complete answer to the issue of construct validity (see Hunter and Gerbing, 1982). The statistical method asks only whether items measure the same construct, not whether that construct is the right construct. Whether the items measure the right construct is a question of content, which is usually tested by looking at correlations between the scale and other constructs. At the level of the shame theme scales, the issue of the nature of the construct was dealt with solely in terms of an item content analysis: the items in each shame theme cluster were closely examined to see whether they were psychologically equivalent in both the affect expressed and the manner in which that affect was expressed.

Hunter and Gerbing (1982) noted that if the right research design is used, then confirmatory factor analysis can be used to assess item equivalence in two very different ways: "internal consistency" and "parallelism" (or "external equivalence"). If item responses differ from each other only by random error of measurement, then the item errors will not correlate with each other. The correlations between items within a scale should then satisfy a mathematical product rule discovered by Spearman (1904, cited in Hunter & Gerbing, 1982): his one-factor model. If the correlations between items within a scale satisfy the Spearman product rule, then the scale is said to be "internally consistent." However, this is a weak criterion for item equivalence. There is a stronger criterion for item equivalence: parallelism in the pattern of correlations between the items and important "outside" variables, such as the measures of the anxiety and life satisfaction factors used in the present study. If all items measure the same construct, then the item errors will not correlate with any outside variable. The correlations between the items in a unidimensional scale and any outside variable should satisfy a condition called "parallelism" (Tryon, 1939; Tryon and Bailey, 1970; Hunter, 1973; Hunter and Gerbing, 1982). This is a strong test for item equivalence. If an item is contaminated by some unintended variable, and if that contaminating variable is one of the outside variables (or is correlated with one), then the item will correlate more highly with that outside variable than will the other, uncontaminated items. Thus, failure to find parallelism not only shows an item to be contaminated, but it also identifies the contaminating variable (Hunter, 1986, 1987). Parallelism can be tested either by doing a confirmatory factor analysis including both the items and the outside variables (as noted by Hunter and Gerbing, 1982) or by computing the correlations between the items and the outside variables and examining that correlation matrix for parallelism (as noted by Hunter, 1973, and Tryon and Bailey, 1970).

To answer the question of whether the four modes of semantics measured the same construct, five models were tested. The first model specified a single factor in which all the items in the four different modes measured one general factor.
The second model specified a two-factor model in which one factor represented the positively worded items and the other factor represented the negatively worded items. The third model specified another two-factor model, with one factor representing the double negatives (Mode 4) and the second factor representing the affirmative semantic modes (Mode 1 and Mode 3) together with the "do not" form semantics (Mode 2). The fourth model specified three factors: one factor representing the affirmative semantic modes (Mode 1 and Mode 3), the second the "do not" form semantics (Mode 2), and the third the double negatives (Mode 4). The fifth model parameterized the four modes of semantics in the survey as four separate factors. The results from these five confirmatory factor analysis models are reported in Table 19. The 2-factor model (Modes 1, 2, & 3 vs. Mode 4) fit the data statistically and showed an overwhelming superiority over the other models. These results rendered strong indications of the inequivalence between the double negatives (Mode 4) and the rest of the items (Modes 1, 2, & 3). The 3-factor model also fit the data statistically; its results rendered some indication of inequivalence between the affirmative semantic modes (Modes 1 & 3), the "do not" form semantics (Mode 2), and the double negatives (Mode 4).

Table 19. Goodness-of-Fit Indices of the Five Models

Model                                                Chi-square    GFI      AGFI
1-factor (Modes 1, 2, 3, 4 as one general factor)    298.30        0.721    0.699
2-factor (Modes 1 & 4 vs. Modes 2 & 3)               317.50        0.689    0.532
2-factor (Modes 1, 2, 3 vs. Mode 4)                  222.67        0.956    0.889
3-factor (Modes 1 & 3 vs. Mode 2 vs. Mode 4)         259.70        0.938    0.874
4-factor (Modes 1, 2, 3, 4 as individual factors)    389.76        0.679    0.514

Note. GFI = goodness-of-fit index; AGFI = adjusted goodness-of-fit index.
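A sketch of how the best-fitting two-factor specification (Modes 1, 2, & 3 vs. Mode 4) might be expressed today is given below, using the third-party semopy package. The package choice, the model syntax, the factor names, and the item-naming convention are all assumptions for illustration; the original analysis predates these tools.

```python
# A sketch of the winning two-factor specification: all Mode 1, 2, and 3
# items load on one factor and the double-negative Mode 4 items on another.
# The semopy package and the m*_i column names are illustrative assumptions.
import semopy

def fit_two_factor(scored_items) -> None:
    """scored_items: DataFrame of item responses named m1_1..m4_22."""
    f1 = " + ".join(f"m{m}_{i}" for m in (1, 2, 3) for i in range(1, 23))
    f2 = " + ".join(f"m4_{i}" for i in range(1, 23))
    desc = f"""
    NotDoubleNeg =~ {f1}
    DoubleNeg =~ {f2}
    NotDoubleNeg ~~ DoubleNeg
    """
    model = semopy.Model(desc)
    model.fit(scored_items)
    print(semopy.calc_stats(model).T)   # chi-square, GFI, AGFI, RMSEA, etc.
```

Comparing the fit statistics of this model against the one-factor, the alternative two-factor, the three-factor, and the four-factor specifications would reproduce the kind of comparison summarized in Table 19.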
Summary

The data of this study were the responses of 849 students to a questionnaire. Preliminary steps were taken to ensure that the data were appropriately coded and sufficiently accurate before the start of the analysis. Statistical data analysis techniques of exploratory factor analysis, confirmatory factor analysis, analysis of variance, and reliability analysis were employed to answer the research questions.

CHAPTER V
SUMMARY, CONCLUSIONS, IMPLICATIONS, AND RECOMMENDATIONS

Summary of the Purposes and Procedures of the Study

A group of 849 Taiwanese students from two schools was selected for this study. These two schools were located in an urban setting. A questionnaire was constructed with 107 items, and the students' responses to this 107-item questionnaire were analyzed. A confirmatory factor analysis was conducted, and the reliability of the subscales was reported. Two research questions were formulated for this study, and statistical data analysis techniques of exploratory factor analysis, confirmatory factor analysis, analysis of variance, and reliability analysis were employed to answer them. The purpose of this study was to investigate how item wording affects rating responses. The first issue investigated was the influence of item wording on scale and item score means, distributions, reliabilities, and correlations between scales. The second issue examined was the item equivalence between different modes of item wording.

Discussion and Conclusion

Refinement of the survey development process is an essential component of any serious effort to enhance the reliability and validity of survey results. In the introductory chapter it was noted that item wording is an important consideration in survey development. A review of research on the impact of item wording on rating responses revealed a lack of consistent findings. The results of this study suggest a rather important conclusion for measurement instrument design: the inclusion of negatively worded items can result in less accurate responses and can therefore impair the validity of the obtained results. Thus, although the inclusion of negatively stated items may theoretically control or offset agreement response tendencies, their actual effect is to reduce response validity. This situation suggests that current recommendations concerning the desirability of including both positive and negative items on a questionnaire may be premature (and perhaps incorrect), and the inclusion of both apparently warrants much further investigation.

In examining the effects of item wording on item responses, this study has shed some interesting light on an issue that has heretofore been the focus of arguments based upon ambiguous data and results. Overall, the findings suggest that the use of double negatively worded items may result in the measurement of a different construct than is intended. This outcome is in direct contrast to the conventional psychometric recommendations summarized earlier. This seems to be generally a function of the double negatively worded items themselves, and not of their exertion of a strong contextual effect on the positive items.

The data from the present study provide strong evidence that the insertion of the word "not" has a profound influence on student responses. Two trends were indicated in the results. First, items that induced a more favorable response in the positive form induced a less favorable response in the negative form. In other words, respondents were less likely to indicate agreement by disagreeing with a negatively phrased item than to indicate agreement by agreeing with a positively phrased item. Second, items that induced an unfavorable response in the positive form were less likely to induce an unfavorable response in the negative form. In addition, the confirmatory factor analyses indicated that the factor structures were clearly different for the positive and the negative forms.

The findings from the present study suggest that caution should be exercised in the use of negative item phrasing. Although it may be useful to include some negative items to reduce response bias, these items need not be used in computing a total attitude score. Double negatively phrased items, in particular, should be used with great caution. Despite the conventional wisdom so often found in measurement textbooks, recent pronouncements by researchers in the area of item phrasing have suggested that negatively phrased items, especially double negatives, reduce the validity of a questionnaire. The present study clearly corroborates that position. The research to date suggests that a positive-to-negative transformation changes an item's psychometric characteristics and, more importantly, changes the construct that the item is intended to measure. However, the present study and the other studies that have been reviewed do not prove that positively phrased items are necessarily better indicators of attitude. Nevertheless, there are some hints that negatively phrased items are less valid. First, there is the plausible argument that respondents may not understand that they can indicate agreement by disagreeing with a negative statement.
Similarly, they may not understand that they can indicate disagreement by agreeing with a negative statement. A word of caution concerning the use of negatively worded items is appropriate: if negatively worded items are to be used, it would be wise to ensure, during scale development, that their inclusion does not present a methodological confound. Rather than having to contend with alternative interpretations some two decades after scale development (e.g., McGee et al., 1989; Tracy & Johnson, 1981), it would be preferable to ensure that constructs are not exclusively defined by negatively worded items during scale development.

Implications

Survey development is a difficult task. This study attempts to provide some insights that will begin to answer questions pertinent to the impact of item wording on rating responses. The results of these analyses have a clear implication for researchers who factor analyze data in which the wording of items is varied: such researchers should be cautious of factors loaded primarily with negatively keyed items. Likewise, consumers of this research should question substantive interpretations of such negative factors. Researchers should be especially cautious concerning negative factors when responses to questionnaires are "involuntary" or when there is a reason to sabotage the research effort. The respondents' data should be examined to detect unusual response patterns. If negative and positive items are recoded so as to be consistent, then a respondent whose primary responses on a 7-point scale are 5 and 6 would be suspect if the recoded negatively worded item responses were 2 and 3. Responses from these individuals would best be deleted prior to any further analyses. A more systematic analysis of these respondents is possible with the use of item response theory: latent trait analyses allow the determination of which item responses made by an individual are not well predicted by the IRT model, so unusual responses can be detected at the individual level. However, since the sample size and number-of-items requirements for IRT analyses are large, latent trait parameters may not be obtainable for many instruments; a simple consistency screen of the kind described here is sketched below.

This investigation confirms the findings of earlier studies. Taken together, these studies offer important implications for measurement practice and theory development. One practical implication is that double negative items should not be used. Items on a test or survey should be consistently positively worded with respect to the construct being measured. Positively and negatively worded items are not bipolar indicators of a common trait continuum, and therefore construct unidimensionality cannot be maintained by simply reversing the scale points associated with the negative items. Consequently, an inconsistent direction of wording is likely to change the intended factor structure of a test or survey. Researchers should not deliberately use double negatively worded items, even for the purpose of countering response-set effects. Including many double negatively worded items in a test may have the impact of altering the original operational definition of the underlying construct. In order to maintain the intended original factor structure, the connotation of the items must be consistent, in one direction or the other. Research may be needed to reexamine the construct validity of other tests that use a large number of double negatively worded items.
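Following the screening heuristic described above, the sketch below flags respondents whose recoded negative-item responses sit far from their positive-item responses. The threshold and the column names are illustrative assumptions, not values taken from the study.

```python
# A sketch of the response-pattern screen described above: after recoding,
# a consistent respondent should give similar mean responses to positively
# and negatively worded items. The 2-point threshold and the column-naming
# convention are illustrative assumptions.
import pandas as pd

def flag_inconsistent(df: pd.DataFrame, threshold: float = 2.0) -> pd.Series:
    """Return a boolean Series marking suspect respondents.

    df holds recoded responses; positive items are the m1_*/m4_* columns
    and (recoded) negative items are the m2_*/m3_* columns.
    """
    pos = df[[c for c in df.columns if c.startswith(("m1_", "m4_"))]].mean(axis=1)
    neg = df[[c for c in df.columns if c.startswith(("m2_", "m3_"))]].mean(axis=1)
    # e.g., mostly 5s and 6s on positive items but 2s and 3s on recoded
    # negative items yields a gap of about 3 points and is flagged.
    return (pos - neg).abs() > threshold

# suspects = flag_inconsistent(recoded)
# cleaned = recoded[~suspects]
```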
It makes sense that the presence of a construct, as indicated by a positively worded item, is not necessarily the opposite (a simple reversal of scale points) of the absence of that construct, as indicated by its negatively worded counterparts on a test. It seems that the constructs in measurement are inseparable from the way their item indicators are connoted. Researchers using Likert-type scales routinely recode, or flip, the response scales of the negatively worded items. Any effect arising from this measurement factor therefore needs to be avoided, considered, held constant, or separated from the effect of interest by cautious researchers.

The results of this study have important implications for researchers who analyze data in which the wording of items is varied. Questionnaire instructions might include a warning to potential respondents that some questions will be negatively keyed and that they should attend carefully to all items. The overall length of an instrument that uses the same response format may also be a concern: respondents may become fatigued or bored when they answer many like-sounding items. Trott and Jackson (1967) found that an acquiescence factor was strongly associated with the speed of presentation of personality items. It may be useful to experiment with the wording of directions and the length of questionnaires, as well as with the serial position of any negatively keyed items. Further, the context in which data are collected could be varied in an effort to assess the effect of context on the presence or absence of negative factors.

The possible effect of item wording on overall ratings is particularly relevant to many of the student and employer rating instruments available today. Increasingly, emphasis is being placed upon the need for valid and reliable means of assessing teaching and working performance. Rating scales used to evaluate a new project, person, or course of instruction often include both negatively and positively stated items about the object or person being evaluated. What has yet to be determined is the possible effect of item wording on raters' evaluations. Do negatively worded items encourage a more critical evaluation than positively worded items do? Negatively worded items may highlight the negative aspects or faults of the object or person being evaluated, or may serve to suggest unconsciously to the rater particular problem areas anticipated by the evaluator. If so, rating scale evaluations may be affected as much by the wording of the items as by the quality of the object or person being evaluated.

Several other researchers (Andrich, 1983; Campbell & Grissom, 1979; Simpson, Rentz, & Shrum, 1976) have investigated whether phrasing can influence overall attitude levels on different attitudinal questionnaires. These researchers have all concluded that item phrasing makes a difference. However, the results they report cannot easily be reconciled with each other or with the present study. One important differentiating feature is that in this study the word "not" was used to create parallel negative statements, whereas the other researchers created negative statements on an intuitive basis. Rorer (1965) suggested that this latter procedure often leads to negative statements that reflect different content or ideas; consequently, such statements are not direct opposites of the original positive statements.
It is perhaps because of this problem that many affective scales contain the word "not" or the prefix "un" to create a negative statement (Coopersmith, 1967; Marsh, Smith, Barnes & Butler, 1983; Piers, 1969).

Given the relative frequency with which a negative factor is reported in the literature and the ease with which such a factor is produced, researchers should be especially cautious when their factor analyses produce factors that are loaded primarily by negative items. Further, those who design questionnaires may also want to take steps to minimize problems during the construction of their instruments and with the directions that accompany them. On the basis of the data analyses, this study concludes that double negative and semantically positive item contents do not measure essentially the same construct. Furthermore, these double negative items, created by applying the "do not" form to negatively worded statements, introduce ambiguity and confusion. This problem deserves closer inspection through more specifically designed studies. The present study also suggests some interesting thoughts about measurement theory that may be worth pondering.

Cross Validation Using a New Sample

The problem of situational specificity is always a major concern in validity studies. Validity generalization research is based on the application of a particular set of meta-analytic methods (Hunter, Schmidt, & Jackson, 1982) to criterion-related validities of tests. This meta-analytic method was developed as a way of attacking a critically important problem in psychology: the problem of situational specificity. The belief in situational specificity was based on the empirical fact that considerable variability was present from study to study in observed validity coefficients, even when the jobs and tests studied appeared to be similar or identical. The explanation developed for this variability was that the factor structure of job performance differed from job to job. The conclusion was that validity studies must be conducted in every setting; that is, that validity evidence could not be generalized across settings.

Schmidt and Hunter (1981) hypothesized that most or all of the variance in study correlations across studies and settings was due to artifactual sources, such as sampling error, and not to real differences between jobs. Artifacts other than sampling error, such as differences between studies in measurement error and in range restriction, can also cause variance in study outcomes. Because the most common form of validity evidence is the correlation coefficient between predictor and criterion scores, it is important to recognize that restriction of the range of scores on the questionnaire may result in attenuation of the observed validity coefficient (a small numerical illustration follows below). One example is the instance in which the test being validated is used for selection purposes before its validity has been established. Other things being equal, the greater the variability among the observations, the greater the value of the correlation coefficient. Thus, restriction of range occurs on the questionnaire because of explicit selection on that scale. As the correlations among items increase, the test becomes more homogeneous in content (an increase in internal consistency). Moreover, when we compute statistics from a set of data, we obtain the best estimates for those particular data. If these statistics are used in further calculations, we capitalize on the idiosyncrasies of the original data and therefore overestimate in the second set of calculations.
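Before turning to cross-validation, the attenuation point above can be made concrete. The sketch below applies the standard correction for direct range restriction (Thorndike's Case II formula), under the assumption that selection occurred directly on the predictor; the numbers are hypothetical:

```python
import math

def correct_range_restriction(r_obs, sd_restricted, sd_unrestricted):
    """Estimate the unrestricted predictor-criterion correlation from a
    correlation observed in a directly range-restricted sample."""
    u = sd_unrestricted / sd_restricted   # u > 1 under restriction
    return (r_obs * u) / math.sqrt(1.0 + r_obs**2 * (u**2 - 1.0))

# Hypothetical: r = .30 observed in a selected group whose standard
# deviation is half that of the full applicant pool.
print(round(correct_range_restriction(0.30, 1.0, 2.0), 3))  # 0.532
```

The corrected value rises because, other things being equal, greater variability among the observations yields a larger correlation coefficient, exactly the point made above.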
In Cureton’s study (Cureton, 1950), the author said that one should not use a data set for conducting an item analysis and then use the results of that analysis to compute validity coefficients. When items are deleted from an original sample, some random errors are introduced. The cross validated sample will result in lower item correlations. 75 Cultural Specificity of the Ouestionnaira Crocker and Algina (1986) mentioned that the ultimate criterion for the number of factors to interpret is replicability. When the same variables are investigated in different studies, the factors that are replicated in the studies are those that should be interpreted. Due to situational specificity of Taiwan, the survey may not generate the same factors shown by this study when applied to a new sample. The items in the survey are very culturally specific. For instance, items such as "Compared to others, I feel like I don’t measure up," "I find that I don’t live up to my own standards and ideals," "I look down on myself because of my flawed character," "I feel so insignificant to others, as if I were invisible," and "I always seem to fall short of my aspirations," “I see myself as intact and without personal defects, " while familiar to students of Chinese ethnicity, can have very different implications when administered to students of different cultures. In general, Chinese think about social institutions such as school quite differently from American educators, seeing teachers as professionals with authority over their children's schooling. Teachers in the Chinese culture are accorded a higher status than teachers in the United States. They believe that parents are not supposed to interfere with school processes. Chinese people highly value formal education, and believe that high achievement brings honor and prestige to the family, while failure brings shame. The intense pressure upon children to succeed often leads to intergenerational conflicts, and many Chinese children suffer from test anxiety, social isolation, and low self-esteem because of their mediocre school performance. They have difficulties accepting learning 76 disabilities and depression, and believe that psychological distress is an indication of organic disorders and harmful to both the individual and the family. Confucian ideals, which include respect for elders, deferred gratification, and discipline, are a strong influence. Cohesion and harmony are valued above individual achievement. Hard work, duty, obligation, frugality, and responsibility are also priorities. Most Chinese parents teach their children to value educational achievement, respect authority, feel responsibility for relatives, and show self-control. Chinese parents tend to view school failure as a lack of will, and they address this problem by increasing parental restrictions. Chinese children tend to be more dependent, conforming, and willing to place family welfare over individual wishes than are American children. Self-effacement is a trait traditionally valued by Chinese culture. Chinese children tend to wait to participate, unless otherwise requested by the teacher. Having attention drawn to oneself, for example, having one's name put on the board for misbehaving, can bring considerable distress. Many Chinese children have been socialized to listen more than speak, to speak in a soft voice, and to be modest in dress and behavior. Discipline and obedience are highly valued in the Asian cultures, whereas creativity and freedom are important in Western cultures. 
The definition of appropriate student attitudes may also differ because of differences in the interpretation of students' behaviors.

Issues Concerning the Use of Tests

Attitude measures are measures of typical behavior and are distinguished from ability tests, which measure maximum performance (Cronbach, 1960). By measuring attitudes, we want to know what a person normally does rather than what he or she can do under exceptional motivation. Valid information about attitudes can be valuable to teachers, counselors, and students. Attitude inventories can help in identifying the problems and needs of students, provided that the students are truthful in answering the items. These inventories provide a more complete and holistic understanding of the students. However, the results should not be treated as the sole source of information; teachers and counselors can also identify problems and needs through observations and interviews. Since these tests usually have lower reliabilities and validities than cognitive tests, the interpretation of the results should be handled with great caution (Mehrens & Lehmann, 1984).

It is also important to consider the technical quality of test materials against accepted standards. The information about a test should include evidence of reliability and validity; information regarding the method of estimating reliability and the population on which it was measured; and the types of validity evidence, including validity relevant to the intended purpose of the test. Teachers and counselors need to ask themselves these questions: Is the test appropriate for the person who is being tested? How are the results going to be used? Are the test scores reliable enough? Does the test possess enough validity to be used for the purpose for which it is planned? Is the welfare of the student being taken into consideration in the choice and use of tests? Will confidentiality become an issue if the subject does not want to reveal himself or herself to the tester?

Another important issue is the competence of the teacher or counselor who will be administering the various available assessment instruments. Do those who use various tests have sufficient knowledge and understanding to select tests appropriately and to interpret their results? Since different tests demand different levels of competence for their use, users must recognize the limits of their competence and make use only of instruments for which they have adequate preparation and training.

Lastly, the presentation of test results also requires significant attention. Teachers and counselors should avoid labeling when communicating the results of a test to students. Labeling can stigmatize a person even when such terms can be justified. Labels not only suggest a lack of any chance to grow or change, but they may also become self-fulfilling prophecies. Instead, interpretations should be presented in terms of possible ranges of academic achievement or formulations of interventions to assist the individual in behaving more effectively.

Limitations

As an investigation of the impact of item wording on rating responses, the present study has a number of limitations. First, since the survey was developed according to the context of the United States, the questionnaire may not be relevant to situations in other countries; therefore, applications of this questionnaire to other cultures should be made with caution. Second, the subjects were students in Taiwan, and they do not represent a random sample from the population.
The results are therefore suggestive rather than definitive, and cannot be generalized to other populations without qualification. Third, the factor structure generated by the factor analyses may represent a chance phenomenon, which might not hold up in a second study. A cross-validation study is needed in order to validate the generalizability of the factor structure. Fourth, the conversion of item wording from one mode to another may add some confusion and ambiguity to the original meaning of the items. The translation of the questionnaire from English to Chinese may add further ambiguity to the original version of the questionnaire; it is sometimes difficult to find an appropriate translation of certain words despite one's best efforts. Nonetheless, the results provide practical implications for test developers and measurement researchers.

Recommendations

The study has raised a number of issues which future work should address. This research sheds some light on the effects of item wording on rating responses, and suggests other possible investigations of problems of interest to survey developers and educational researchers. Several directions can be suggested for future research on the effects of item wording on rating responses. One direction would be to replicate the study using larger samples of students from countries other than Taiwan. Replication with other forms might shed more light on the pattern of interaction between each of these factors and form. Only 107 items and 849 students were used in this study. A relevant question is the extent to which the results of the study can be generalized to students in other cultures. Thus, another direction for future research would be to replicate the study using subjects and items from other disciplines, such as politics and religion, which are more bipolar.

Further research might reveal answers to such questions as:
1) Will the factor structure remain the same when the questionnaire is administered to another sample in a different setting?
2) How will the content of the items affect the item wording, which in turn may affect the item responses?
3) Is there a relationship between item content, item wording, and rating scale?
4) What are the differences between male and female students in responding to this instrument?

Despite the work that still needs to be done in this area, this study provides some insights into the field of survey development. Researchers may be able to gain new insights resulting in more efficient survey construction. Further research into item wording and item responses would not only improve the accuracy of the inferences that may be made from these surveys, but may also have an important impact on survey development.

APPENDICES

APPENDIX A
Four Modes of Item Wording

[Appendix A table: each scale item presented in each of the four modes of item wording (Mode 1 to Mode 4); the landscape-printed table is not legible in this copy.]
APPENDIX B
Questionnaire in English

Code Number:

INSTRUCTIONS

For each item, fill in the circle on the answer sheet for that item which corresponds to the word or phrase that best describes yourself. Read the response options carefully before making your selection. These survey results will be used in a research study. Please read both the item and the response options carefully before selecting your answer. Your answers will be kept strictly confidential. Results of this survey will appear in summary or statistical form only, so that individuals cannot be identified. Thank you for your time and cooperation.

TO THE STUDENT

There are five possible responses to each statement: Never (1), Seldom (2), Sometimes (3), Often (4), Almost Always (5). For each statement select ONE response. Please mark the bubble which best indicates your agreement with the statement. There are no right or wrong responses.

1. I do not see myself as flawed and with personal defects.
2. I am pleased with myself.
3. I feel that my character is intact.
4. I am not as good as my friends.
5. I am not unhappy with myself.
6. I do not always fall short of my aspirations.
7. I feel like a failure.
8. I am a worthless person.
9. I look up to myself because of my good character.
10. I do not feel proud of myself.
11. I worry that my score on a test will not be one of the highest in class.
12. I feel there is something defective in my character.
13. I am a worthwhile person.
14. I do not feel like I am good for nothing.
15. Statements that some teachers make about my schoolwork hurt my feelings.
16. I do not feel ashamed of myself.
17. I am not disappointed with myself.
18. I am inferior to my friends.
19. Talking in front of class makes me feel nervous.
20. Compared to others, I feel like I don't measure up.
21. I do not look down on myself because of my good character.
22. If I could live my life over, I would change almost nothing.
23. I do not feel important.
24. I like myself.
25. I feel I am a complete success as a person.
26. I do not feel like a failure.
27. I do not feel unimportant.
28. I am satisfied with my life.
29. I am embarrassed to face my friends or family if I have made a low grade on a test or assignment.
30. I feel unimportant.
31. I don't feel adequate as a person.
32. I see myself as flawed and with personal defects.
33. The conditions of my life are excellent.
34. I don't feel like I am good for something.
35. I feel inadequate as a person.
36. I am satisfied with myself.
37. I feel that I am just not good enough.
38. I have trouble sleeping well the night before an important examination.
39. I feel like I am good for nothing.
40. I do not feel like a useless person.
41. Compared to other people I feel like I do measure up.
42. I don't dislike myself.
43. I find that I don't live up to my own standards or ideals.
44. I do not feel I am a complete failure as a person.
45. I avoid talking to my classmates about schoolwork because they might make fun of me.
46. I am not satisfied with myself.
47. I am not a worthless person.
48. I find that I live up to my own standards and ideals.
49. I feel like I am good enough.
50. I feel ashamed of myself.
51. I do not like myself.
52. I feel like a useless person.
53. I don't feel there is something defective in my character.
54. Compared to others, I feel like I am less of a person.
55. I feel so insignificant to others, as if I were invisible.
56. Compared to others, I do not feel like I am less of a person.
57. I feel worthy as a person.
58. So far I have gotten the important things I want in life.
59. I am not a worthwhile person.
60. I don't feel worthy as a person.
61. I feel like a useful person.
62. I feel so nervous about some of my classes that it is hard for me to attend.
63. I do not feel that I am just not good enough.
64. I do not feel significant enough to others that people notice me.
65. I am afraid to ask teachers to explain a difficult concept a second or third time.
66. I feel that I am inferior to most of my friends.
67. I always seem to live up to what I aspire to be.
I don’t feel I am defective as a person, as if 0 O O O O something is basically wrong with me. 69. I become frightened when a teacher calls 0 O O O O on me in class. 70. I am upset about so many things that I O O O O 0 cannot concentrate on or do my schoolwork. 71. I don’t feel inferior to most of my friends. 0 O O O 72. I don’t feel I am a complete success as a 0 person. 73. I find that I fall short of my own standards 0 O O O O or ideals. 74. I feel upset when I have to take a test. 0 O O O O 75. I dislike my self. 0 O O O O 76. In most ways my life is close to my ideal. O O O O O 77. I feel I am a complete failure as a person. 0 O O O O 91 Never Seldom Sometimes Often Almost Always l 2 3 4 5 78. I do not feel that I am superior to most of O O O O O my friends. 79. I do not feel worthless as a person. 0 O O 80. I always seem to fall short of my 0 O O O O aspirations. 81. I see myself as intact and without personal 0 O O O O defects. 82. I don’t feel like a usefiil person. 0 O O O O 83. I feel proud of myself. 0 O O O O 84. I am unhappy with myself. 0 O O O O 85. I feel successful. 0 O O O O 86. I don’t look up myself because of my 0 O O O O flawed character flaws. 87. I feel important. 0 O O O 88. I feel adequate as a person. 0 89. I don’t see myself as intact and without 0 O 0 personal defects. 90. I become tense and nervous when I am 0 O O O O studying. 91. I find that I don’t fall short to my own 0 O O O 0 standards or ideals. 92. I look down on myself because of my 0 O O O O flawed character. 93. I do not feel that I am good enough. 0 O O O O 94. I feel disappointed with myself. 0 O O O O 95. I amjust as good as my friends. 0 O O O O 96. I am not pleased with my self. 0 O O O O 97. I feel worthless as a person. 0 O O O O 98. I am not inferior to my fi'iends. O O O O O 92 Never Seldom Sometimes Often Almost Always l 2 3 4 5 99. 1 do not feel successful. 0 O O O O 100. I feel so significant that people notice 0 O O O 0 me. 101. I don’t feel my character is intact. O 102. I feel I am superior to most of my friends. 103. I worry about how well I am doing in my 0 O O O 0 classes. 104. I feel like I am good for something. 0 O O O O 105. I don’t always seem to live up to my 0 O aspirations. 106. I would be afraid to tell a teacher that he 0 O O O O or she made a mistake in explaining an assignment or in working a problem. 107. I don’t feel insignificant to others, as if I O O O O 0 were invisible. Gender: Grade: Age: 93 APPENDIX C Questionnaires in Chinese 94 58881 1'4? 453% a‘éfifi’ifi $815885?! WM? 57,? fitflfiafié‘r 1318165258 iiigfi‘zau, 581%;Bfi‘éafi- “41313881 :24?) 13317883 *$%%%QI$EI§EZRJ fifh’i’FZSz“ at , aamwas 88%fififi. 4”%K$%?3éfii¥a F=375~3*%#%VM§5%5=IL iaél’fiikifi. FIiVAIIfl/xfiai’ifiififii'lfl: ##451 éfiésfifivAt’F 8988141435? fi-“fifl’fi'ifil‘ffi‘éé‘lfi’fiz am 1%: 753:5: i314? ”fl? 1 2 3 4 5 fi-wfiééfififi‘lflfiifii £11444 ncfiaxiléAé’Jtafli 5285113 E§5%. Ffififia‘gfifii’ifii‘léa‘z» 95 8:5 5% #8 3* 8* 10. ll. 12. 13. 14. 15. 16. 17. 18. 19. fi$fiaéiflfifi 8888A 889688 81886A88§ 8888888888 fififié$8® 8$iififiififi 88188 fit—mafimmwA §336888A8888 8$mE688 88688888888 888888—8 81836A8888 ai—mfiflfiwA 881836—¥88 $88888¥8888 #88878 fi$868$ afiaaxxz fiw$ififlfli 88888888888 88 0 0000000000 0000 0000 96 000000000 00 0000 0000 0 0000000000 0000 0000 000000000 00 0000 0000 000000000 00 0000 0000 fi$ «a ifi fifi fit 1 2 3 4 5 20. fiwmtfifififitfiaa o o o o o EBA—3r 21. fiziaaaéhzififixkfé o o o o o txgaa 22. afllfifimifiifiififiéfi O o o o o fii$$fi$i£flifi 23. 
APPENDIX D
Descriptive Statistics of the Items

Table 21. Descriptive Statistics of the Items
Item   Mean     SD       Variance
1      3.0212   1.4182   2.011
2      3.4511   1.1182   1.250
3      3.2768   1.1886   1.413
4      3.2556   1.0813   1.169
5      2.3934   1.0294   1.060
6      2.8363   1.2268   1.505
7      3.5053   1.0496   1.102
8      4.1567   1.0795   1.165
9      3.3934   1.2154   1.477
10     3.5995   1.1492   1.321
11     3.4087   1.4918   2.225
12     3.7538   1.1136   1.240
13     3.4570   1.2231   1.496
14     3.1390   1.3905   1.933
15     3.9199   1.1516   1.326
16     3.3816   1.3908   1.934
17     3.3675   1.1902   1.417
18     3.4134   1.1502   1.323
19     2.9870   1.4303   2.046
20     2.8728   1.2364   1.529
21     3.4912   1.3238   1.763
22     2.3840   1.4146   2.001
23     3.7503   1.2519   1.567
24     3.5783   1.2291   1.511
25     2.7491   1.0379   1.077
26     2.9411   1.1228   1.261
27     2.9835   1.3675   1.870
28     3.3922   1.1927   1.423
29     2.5689   1.2969   1.682
30     3.8799   1.1845   1.403
31     3.4064   1.1507   1.324
32     4.1519   1.0160   1.032
33     3.0777   1.1710   1.371
34     3.7986   1.1759   1.383
35     4.2108   1.0252   1.051
36     3.1955   1.1470   1.315
37     3.9305   1.0511   1.105
38     3.9576   1.2046   1.451
39     3.9847   1.1121   1.237
40     3.2509   1.3879   1.926
41     2.3557   1.1173   1.248
42     3.5406   1.2935   1.673
43     3.6231   1.2021   1.445
44     3.1178   1.2411   1.540
45     4.2968   1.0054   1.011
46     3.6019   1.1642   1.355
47     3.2391   1.4027   1.968
48     3.2827   1.2683   1.609
49     3.1449   1.1570   1.339
50     4.1343   .9826    .965
51     4.0459   1.0078   1.016
52     4.1543   .9993    .999
53     3.0495   1.3484   1.818
54     3.4064   1.1352   1.289
55     3.8940   1.2279   1.508
56     2.9069   1.2226   1.495
57     3.4346   1.2212   1.491
58     2.9435   1.2350   1.525
59     3.9741   1.1261   1.268
60     3.9093   1.1574   1.340
61     3.4676   1.2050   1.452
62     4.1837   1.1208   1.256
63     3.1637   1.3685   1.873
64     3.1225   1.2790   1.636
65     3.0989   1.4801   2.191
66     3.4087   1.2137   1.473
67     2.9317   1.2785   1.634
68     3.3027   1.3835   1.914
69     3.4770   1.2121   1.469
70     3.0907   1.2738   1.623
71     2.9918   1.2853   1.652
72     3.4287   1.1368   1.292
73     3.3899   1.2053   1.453
74     3.8587   1.1436   1.308
75     4.0342   1.0914   1.191
76     2.7044   1.1036   1.218
77     3.9258   1.0958   1.201
78     3.3663   1.1641   1.355
79     3.0224   1.3867   1.923
80     3.5701   1.1886   1.413
81     2.7409   1.3126   1.723
82     3.7915   1.1676   1.363
83     3.1390   1.2110   1.467
84     2.4935   1.1473   1.316
85     3.0389   1.0934   1.195
86     3.7703   1.1504   1.323
87     3.2874   1.2102   1.464
88     3.1602   1.1504   1.323
89     3.4158   1.2477   1.557
90     4.2686   1.0306   1.062
91     3.0707   1.2756   1.627
92     4.0047   1.1451   1.311
93     3.4747   1.1583   1.342
94     3.7244   1.1230   1.261
95     2.9882   1.2102   1.464
96     3.7880   1.0278   1.056
97     4.0141   1.0858   1.179
98     2.9788   1.2159   1.478
99     3.5807   1.1131   1.239
100    2.3581   1.1519   1.327
101    3.6219   1.1963   1.431
102    2.6396   1.2023   1.445
103    2.6631   1.2893   1.662
104    3.4276   1.2275   1.507
105    3.4664   1.2186   1.485
106    3.7126   1.2381   1.533
107    2.7385   1.4706   2.163

APPENDIX E
Descriptive Statistics of Four Modes of Item Wording
$24 ~64: 862 H: H~.H 6%; £8.96 6 96.6 :9 H ~66: «$3 o-HH 44~2 ~44: 6 :64 E4: 362 «696 .23 48966 64 H 4689 482 £42 264~ ”RH: 292 9: H.H H H44 9696 9? H6966 :6 H 262 843 4H2: ~48.4 ”SH: 696.4 H6-.H $3.». .2699 9H: H .5965 :32 Son—dam 532 £5.65 :32 .2565 :32 u on... use: 4 962 m 962 ~ 6.62 H 962 $54.83 :5: .3 meta: .58— .? 85395 253.88: .NN 039,—. 107 .m—Hwomvm fig mfiuflfigm omnNH $86 mmoNH oowm.m HchH Hmmcd mwcmH Ram». :30 >8 3 a: 3: H 85 HUSH H .maoswbmmm wwmmH mcmwd can: Hohmfi ow HN.H $945 3%: Emmd >8 9 a: o>HH 3 800m mwnga H . . . . . . . . .68 8:8 we: H mwmh m ahmm H 343 m comm H mmNH m 32 H mem N oHHHoonH 35 38.895 8 H00: whomH mmwad mXHHH 35d mHmNH momhd NoHNH Emma .EwtonHEH bo> HouH H nmoYH Hammd 33H Sm H .v HcNHH H435 HmNNH. £38m .583 £23553 a Em H mNNH .H H Hwad wax: mmom.m HmH H.H momma vmoo. H 38.4” .Hsmmmooosm HouH H A893 9 mm H HVNH w: H.m ammoH wmmmd mom H.H Swim ammoH Havnd 3083 329:8 a Ea H HooH H . . . . . . $55088 33H 32 m HNH H.H H.433 m amnH H 82. m thN H 3.9 m 8H woow Ea H 8H: :3 H .6965 s62 5.56% :62 36:66 E62 $5.3 562 Ila—8:. v 9.52 m «H52 N ace—Z H £52 €9.69 - 654... 108 APPENDIX F Item Total Statistics 109 Table 23. Item-total Statistics (88 items) Scale Scale Corrected Kean Variance Iten- Squared Alpha it Item if Item Total Multiple 1! Item Deleted Deleted Correlation Correlation Deleted 01 295.2108 2823.6147 .2599 .3038 .9672 02 294.7809 2803.9142 .5036 .5052 .9666 Q3 294.9552 2800.3754 .5009 .5405 .9666 04 294.9764 2816.4027 .4115 .4525 .9668 05 295.8386 2919.3454 -.4999 .4655 .9683 06 295.3958 2842.8314 .1564 .2749 .9673 07 294.7267 2795.8663 .6111 .5641 .9664 08 294.0754 2789.2820 .6521 .6412 .9663 09 294.8386 2786.4232 .5993 .5053 .9664 010 294.6325 2801.107? .5128 .3768 .9666 012 294.4782 2809.2286 .4603 .5335 .9667 013 294.7750 2777.1156 .6686 .6023 .9663 014 295.0931 2810.9123 .3524 .3551 .9670 016 294.8504 2804.8915 .3935 .5178 .9669 017 294.8645 2799.2045 .5096 .4818 .9666 018 294.8186 2801.1345 .5121 .5478 .9666 020 295.3592 2815.3201 .3655 .4107 .9669 021 294.7409 2794.2653 .4915 .4781 .9666 023 294.4817 2783.4976 .6036 .5287 .9664 024 294.6537 2786.8115 .5894 .5456 .9664 025 295.4829 2794.5896 .6300 .5538 .9664 026 295.2909 2798.7938 .5450 .4635 .9665 027 295.2485 2813.8568 .3383 .3276 .9670 030 294.3522 2779.0350 .6755 .7050 .9663 031 294.8257 2802.9601 .4967 .4201 .9666 032 294.0801 2797.2412 .6190 .6095 .9664 034 294.4335 2794.1374 .5573 .4527 .9665 035 294.0212 2789.9312 .6816 .6864 .9663 036 295.0365 2779.5494 .6940 .6493 .9662 037 294.3015 2788.9609 .6732 .6447 .9663 039 294.2473 2786.0213 .6605 .6328 .9663 040 294.9812 2791.0963 .4894 .5006 .9666 041 295.8763 2802.5472 .5157 .4958 .9666 042 294.6914 2787.1264 .5565 .5248 .9665 043 294.6090 2816.8658 .3643 .4875 .9669 044 295.1143 2795.3985 .5171 .4718 .9666 046 294.6302 2794.6178 .5591 .5455 .9665 047 294.9929 2798.7028 .4321 .4327 .9668 048 294.9494 2800.0128 .4707 .5382 .9667 049 295.0872 2780.579? .6793 .6238 .9663 050 294.0978 2809.2110 .5244 .4524 .9666 051 294.1861 2794.6941 .6484 .6698 .9664 052 294.0777 2791.3076 .6866 .7079 .9663 053 295.1826 2805.9065 .3995 .4268 .9668 054 294.8257 2788.2455 .6278 .6223 .9664 055 294.3380 2776.2240 .6729 .6598 .9663 056 295.3251 2826.9414 .2797 .3559 .9671 057 294.7974 2765.1146 .7646 .7344 .9661 059 294.2580 2790.1280 .6170 .5605 .9664 060 294.3227 2785.9193 .6346 .5532 .9663 061 294.7644 2772.1968 .7183 .6596 .9662 063 295.0683 2801.7383 .4224 .4447 .9668 064 295.1095 2832.5458 .2249 .3280 .9672 066 294.8233 2779.0371 .6587 .6387 .9663 llO Table 23. 
Q67    295.3004        2810.9934         .3849               .4718              .9669
Q68    294.9293        2790.0658         .4982               .5109              .9666
Q71    295.2403        2822.0955         .3006               .3952              .9670
Q72    294.8033        2802.832?         .5041               .4995              .9666
Q73    294.8422        2802.7345         .4750               .5388              .9667
Q75    294.1979        2785.8405         .6751               .7021              .9663
Q77    294.3062        2782.5854         .7008               .7019              .9662
Q78    294.8657        2809.8640         .4342               .4262              .9667
Q79    295.2097        2808.0055         .3734               .4192              .9669
Q80    294.6620        2805.5118         .4596               .4987              .9667
Q81    295.4912        2804.8328         .4189               .3848              .9668
Q82    294.4405        2803.6005         .4839               .4000              .9666
Q83    295.0931        2776.9170         .6771               .6732              .9662
Q84    295.7385        2941.9410         -.6304              .6423              .9687
Q85    295.1932        2784.8164         .6828               .6663              .9663
Q86    294.4617        2790.1592         .6033               .5332              .9664
Q87    294.9446        2777.4014         .6737               .6537              .9663
Q88    295.0718        2785.3592         .6433               .5982              .9663
Q89    294.8163        2848.2303         .1127               .2130              .9674
Q91    295.1614        2800.9539         .4608               .4844              .9667
Q92    294.2273        2779.7584         .6934               .6727              .9662
Q93    294.7574        2791.0212         .5919               .5640              .9664
Q94    294.5077        2785.5403         .6580               .6557              .9663
Q95    295.2438        2789.8520         .5748               .4851              .9665
Q96    294.4441        2800.2472         .5837               .6450              .9665
Q97    294.2179        2782.3758         .7093               .7365              .9662
Q98    295.2532        2814.3474         .3796               .3736              .9669
Q99    294.6514        2796.5788         .5690               .5383              .9665
Q100   295.8740        2810.145?         .4367               .4556              .9667
Q101   294.6101        2805.5282         .4564               .4411              .9667
Q102   295.5925        2804.0979         .4654               .4837              .9667
Q104   294.8045        2782.4641         .6242               .5359              .9664
Q105   294.7656        2819.9650         .3350               .4455              .9669
Q107   295.4935        2829.4861         .2119               .2896              .9673

BIBLIOGRAPHY

Adorno, T. W., Frenkel-Brunswik, E., Levinson, D. J., & Sanford, R. N. (1950). The Authoritarian Personality. New York: Harper.
Ahlawat, K. S. (1985). On the negative valence items in self-report measures. Journal of General Psychology, 112(1), 89-99.
Anastasi, A. (1982). Psychological Testing (5th ed.). New York: Macmillan.
Anderson, L. W. (1981). Affective characteristics in the schools. Boston: Allyn & Bacon.
Andrich, D. (1983). Diagnosing and accounting for response sets provoked by items of a questionnaire. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, PQ.
Andrulis, R. S. (1977). Adult assessment: A sourcebook of tests and measurement for human behavior. Springfield, IL: Thomas.
Benson, J., & Hocevar, D. (1985). The impact of item phrasing on the validity of attitude scales for elementary school children. Journal of Educational Measurement, 22, 213-240.
Bentler, P. M., Jackson, D. N., & Messick, S. (1971). Identification of content and style: A two-dimensional interpretation of acquiescence. Psychological Bulletin, 76, 186-204.
Bentler, P. M., Jackson, D. N., & Messick, S. (1972). A rose by any other name. Psychological Bulletin, 77, 109-113.
Berg, I. A. (1961). Measuring deviant behavior by means of deviant response sets. In I. A. Berg & B. M. Bass (Eds.), Conformity and deviation (pp. 328-397). New York: Harper and Row.
Block, J. (1967). Remarks on Jackson's "review" of Block's challenge of response sets. Educational and Psychological Measurement, 27, 499-502.
Block, J. (1971). On further conjecture regarding acquiescence. Psychological Bulletin, 76, 205-210.
Block, J. (1972). The shifting definition of acquiescence. Psychological Bulletin, 78, 10-12.
Bloom, B. S. (1978). New learner: Implications for instruction and curriculum. Educational Leadership, 35, 563-576.
Campbell, N. O., & Grissom, S. (1979, April). Influence of item direction on student responses in attitude assessment. Paper presented at the 63rd annual meeting of the American Educational Research Association, San Francisco, CA.
Chang, L. (1995). Connotatively inconsistent test items. Applied Measurement in Education, 8(3), 199-209.
Chang, S. S., & Hunter, J. (1988). Phenomenology and the measurement of shame. Unpublished manuscript.
Coopersmith, S. (1967). The antecedents of self-esteem. San Francisco: Freeman.
Couch, A., & Keniston, K. (1960). Yeasayers and naysayers: Agreeing response set as a personality variable. Journal of Abnormal and Social Psychology, 60, 151-174.
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, and Winston.
Cronbach, L. J. (1946). Response sets and test validity. Educational and Psychological Measurement, 6, 475-494.
Cronbach, L. J. (1950). Further evidence on response sets and test design. Educational and Psychological Measurement, 10, 3-31.
Cronbach, L. J. (1960). Essentials of Psychological Testing (2nd ed.). New York: Harper & Row.
Danis, S. G. (1974). The effect of attitude and scale format on polarization in social judgments. Dissertation, University of Georgia.
Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The satisfaction with life scale. Journal of Personality Assessment, 49(1), 71-75.
Dudycha, A. L., & Carpenter, J. B. (1973). Effects of item format on item discrimination and difficulty. Journal of Applied Psychology, 58, 116-121.
Edwards, A. L. (1953). The relationship between the judged desirability of a trait and the probability that the trait will be endorsed. Journal of Applied Psychology, 37, 90-93.
Edwards, A. L. (1955). Social desirability and Q-sorts. Journal of Consulting Psychology, 19, 464.
Edwards, A. L. (1957a). The social desirability variable in personality assessment and research. New York: Dryden.
Edwards, A. L. (1957b). Techniques of attitude scale construction. New York: Appleton-Century-Crofts.
Green, R. F. (1951). Does a selection situation induce testees to bias their answers on interest and temperament tests? Educational and Psychological Measurement, 11, 503-515.
Guilford, J. P. (1936). Psychometric methods. New York: McGraw-Hill.
Harasym, P. H. (1992). Evaluation of negation in stems of multiple-choice items. Evaluation and the Health Professions, 15(2), 198-220.
Hathaway, S. R., & McKinley, J. C. (1967). MMPI Manual (Rev. ed.). New York: Psychological Corporation.
Hunter, J. E. (1973). Methods of reordering the correlation matrix to facilitate visual inspection and preliminary cluster analysis. Journal of Educational Measurement, 10, 51-61.
Hunter, J. E., & Gerbing, D. W. (1982). Unidimensional measurement, second order factor analysis and causal models. In B. M. Staw & L. L. Cummings (Eds.), Research in Organizational Behavior (Vol. 4). Greenwich, CT: JAI Press.
Hunter, J. E., Schmidt, F. L., & Jackson, G. B. (1982). Meta-analysis: Cumulating research findings across studies. Beverly Hills, CA: Sage.
Huttenlocher, J. (1962). Some effects of negative instances on the formation of simple concepts. Psychological Reports, 11, 35-42.
Jackson, D. N. (1967a). Balance scales, item overlap and the stables of Augeas. Educational and Psychological Measurement, 27, 502-507.
Jackson, D. N. (1967b). Block's challenge of response sets. Educational and Psychological Measurement, 27, 207-219.
Jackson, D. N., & Messick, S. (1958). Content and style in personality assessment. Psychological Bulletin, 55, 243-252.
Jackson, D. N., & Messick, S. (1965). The nonvanishing variance component. American Psychologist, 20, 498.
Jackson, D. N., & Paunonen, S. V. (1980). Personality structure and assessment. Annual Review of Psychology, 31, 503-551.
Jacobs, A., & Barron, R. (1968). Falsification of the Guilford-Zimmerman Temperament Survey: II. Making a poor impression. Psychological Reports, 23, 1271-1277.
Jaroslovsky, R. (1988, July/August). What's on your mind, America? Psychology Today, 54-59.
Joreskog, K. G., & Sorbom, D. (1988). LISREL 7: A guide to the program and applications. Chicago: SPSS Inc.
Lemon, N. (1973). Attitudes and their Measurement. New York: John Wiley & Sons.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 140, 1-55.
Linn, R. L. (1990). Essentials of student assessment: From accountability to instructional aid. Teachers College Record, 91(3), 422-436.
Marsh, H. (1984). The bias of negatively worded items in rating scales for young children. Journal of Educational Psychology, 76, 420-431.
Marsh, H., Smith, I., Barnes, J., & Butler, S. (1983). Self-concept: Reliability, dimensionality, validity and the measurement of change. Journal of Educational Psychology, 75, 772-790.
Mehrens, W. A., & Lehmann, I. J. (1984). Measurement and Evaluation in Education and Psychology (3rd ed.). Orlando, FL: Holt, Rinehart, & Winston.
Michael, W. B., & Smith, R. A. (1976). The development and preliminary validation of three forms of a self-concept measure emphasizing school-related activities. Educational and Psychological Measurement, 36, 527-535.
Michael, W. B., Denny, B., Ireland-Galman, M., & Michael, J. J. (1987). The factorial validity of a college-level form of an academic self-concept scale. Educational Research Quarterly, 11(1), 34-39.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Ory, J. C. (1982). Item placement and wording effects on overall ratings. Educational and Psychological Measurement, 42, 767-775.
Osgood, C. E. (1952). The nature and measurement of meaning. Psychological Bulletin, 49, 197-237.
Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The measurement of meaning. Urbana: University of Illinois Press.
Piers, E. (1969). The Piers-Harris children's self-concept scale. Nashville, TN: Counselor Recordings & Tests, Box 6184, Acklen Station.
Radcliffe, J. A. (1966). A note on questionnaire faking with 16 PFQ and MPI. Australian Journal of Psychology, 18, 154-157.
Ramsay, J. O. (1973). The effect of number of categories in rating scales on precision of estimation of scale values. Psychometrika, 38, 513-529.
Remmers, H. H., & Ewart, E. (1941). Reliability of multiple-choice measuring instruments as a function of the Spearman-Brown prophecy formula, III. Journal of Educational Psychology, 32, 61-66.
Robinson, J. P., & Shaver, P. R. (1973). Measures of social psychological attitudes. Ann Arbor, MI: Survey Research Center, Institute for Social Research.
Rorer, L. G. (1965). The great response-style myth. Psychological Bulletin, 63, 129-156.
Rotter, G. S. (1972). Attitudinal points of agreement and disagreement. Journal of Social Psychology, 86, 211-218.
Rotter, G. S., & Barton, P. (1970). Attitudes of some New Jersey teachers. N.J.E.A. Review, 28-29.
Rotter, J. B. (1966). Generalized expectancies for internal versus external control of reinforcement. Psychological Monographs, 80(1, Whole No. 609).
Samelson, F. (1972). Response style: A psychologist's fallacy. Psychological Bulletin, 78, 13-16.
Schmidt, F. L., & Hunter, J. E. (1981). Employment testing: Old theories and new research findings. American Psychologist, 36, 1128-1137.
Schmitt, N., & Stults, D. M. (1985). Factors defined by negatively keyed items: The result of careless respondents? Applied Psychological Measurement, 9, 367-373.
Schriesheim, C. A., & Hill, K. D. (1981). Controlling acquiescence response bias by item reversal: The effect on questionnaire validity. Educational and Psychological Measurement, 41, 1101-1114.
Schriesheim, C. A., & Kerr, S. (1974). Psychometric properties of the Ohio State leadership scales. Psychological Bulletin, 81, 756-765.
Scott, W. A. (1968). Attitude measurement. In G. Lindzey (Ed.), The handbook of social psychology (2nd ed., Vol. 2). Reading, MA: Addison-Wesley.
Shaw, M. E., & Wright, J. M. (1967). Scales for the measurement of attitudes. New York: McGraw-Hill.
Simpson, R. D., Rentz, R. R., & Shrum, J. W. (1976). Influence of instrument characteristics on student responses in attitude assessment. Journal of Research in Science Teaching, 13, 275-281.
Spielberger, C. D., Gorsuch, R. L., & Lushene, R. E. (1970). Test manual for the State-Trait Anxiety Inventory. Palo Alto, CA: Consulting Psychologists Press.
Stricker, L. J. (1969). "Test-wiseness" on personality scales. Journal of Applied Psychology Monographs, 53(3, Pt. 2).
Symonds, P. M. (1931). Diagnosing personality and conduct. New York: Appleton-Century.
Thacker, J. W., Fields, M. W., & Tetrick, L. (1989). The factor structure of union commitment: An application of confirmatory factor analysis. Journal of Applied Psychology, 74, 228-232.
Thorne, F. C. (1978). Methodological advances in the validation of inventory items, scales, profiles, and interpretation. Journal of Clinical Psychology, 34(2), 283-301.
Thurstone, L. L. (1928). Attitudes can be measured. American Journal of Sociology, 33, 529-554.
Tittle, C. R., & Hill, R. J. (1967). Attitude measurement and prediction of behavior: An evaluation of conditions and measurement techniques. Sociometry, 30, 199-213.
Towne, D. C. (1967). Influences exerted upon subject responses by the response scale structured elements of attitude scales. Dissertation, Cornell University, Ithaca, NY.
Trott, D. J., & Jackson, D. N. (1967). An experimental analysis of acquiescence. Journal of Experimental Research in Personality, 2, 278-288.
Tryon, R. C. (1939). Cluster analysis. Ann Arbor, MI: Edwards Brothers.
Tryon, R. C., & Bailey, D. E. (1970). Cluster Analysis. New York: McGraw-Hill.
Violato, C., & Marini, A. E. (1989). Effects of stem orientation and completeness of multiple-choice items on item difficulty and discrimination. Educational and Psychological Measurement, 49, 287-295.
Wason, P. C. (1961). Responses to affirmative and negative binary statements. British Journal of Psychology, 52, 133-142.
Wesman, A. G. (1952). Faking personality test scores in a simulated employment situation. Journal of Applied Psychology, 36, 112-113.
Wiggins, J. S. (1966). Social desirability estimation and "faking good" well. Educational and Psychological Measurement, 26, 329-341.
Wiggins, J. S. (1973). Personality and prediction. Reading, MA: Addison-Wesley.
Winkler, J. D., Kanouse, D. E., & Ware, J. E., Jr. (1981, August). Controlling for acquiescence response set in scale development. Paper presented at the 90th annual meeting of the American Psychological Association, Los Angeles, CA.
Zern, D. (1967). Effects of variations in question-phrasing on true-false answers by grade-school children. Psychological Reports, 20, 527-533.