This is to certify that the thesis entitled ALTERNATIVE RESPONSE DEFINITIONS IN INSTRUCTIONAL RATING SCALES presented by Barbara Houghton Showers has been accepted toward fulfillment of the requirements for the Ph.D. degree. Date: August 9, 1973.

ABSTRACT

ALTERNATIVE RESPONSE DEFINITIONS IN INSTRUCTIONAL RATING SCALES

BY Barbara Houghton Showers

Over the years many efforts have been made to improve student ratings of teacher effectiveness. This study represents another such effort. It is concerned with the particular problem of the leniency bias shown by many students in rating their instructors. By leniency bias is meant the tendency of students to use only the two or three highest options in rating their instructors. The harmful effect of this bias is to reduce discrimination between instructors to the extent that small differences in mean ratings produce large differences in reported rankings. The idea which gave rise to the present study was that leniency bias could be reduced by changing the wording of the response options. It was hoped that a different wording would increase the range of options used by student raters and improve discrimination between instructors.

The major reason for conducting the study was to improve an existing Likert-type student instructional rating scale. Since the content of the scale was well established in its creation, the study was focused on manipulating the response options to reduce the amount of lenient responding present with the existing scale. Two alternative response definitions were chosen to compare with the existing Likert format response definitions. The three response formats were: (1) fixed alternative Likert cues (SA-SD), (2) fixed alternative evaluative cues (superior-inferior), and (3) multiple choice short descriptive cues.

A concurrent purpose of the study was to test two claims made in the literature concerning the bias-proneness of certain response definitions. The claims tested were:

a. Evaluative cues are more susceptible to bias than other cues.
b. Fixed response alternatives are more susceptible to bias than descriptive multiple choice alternatives.

It was hypothesized that the evaluative format would produce the most lenient responses, and the descriptive format the least lenient responses. It was also hypothesized that the least lenient response cue format would prove to have the greatest rater reliability, since a reduction in lenient responding would make possible improved discrimination between instructors.
To conduct the study, three instructional rating forms differing primarily in response cue format were developed and administered to random subgroups of the classes of 23 instructors. Leniency bias was measured by finding the closeness of each item mean to the midpoint of the rating scale. Since student ratings were overwhelmingly concentrated at the upper end of the scale, the format that gave the lowest mean was regarded as the least biased. The hypothesis of no differences in mean ratings (leniency bias) was tested with a two-way multivariate analysis of variance design, instructor by treatment, where the response cue formats were the three treatments and the 17 items were 17 dependent variables. Scheffé post hoc analyses tested alternate hypotheses that the evaluative format would produce the most lenient items, the Likert format the next most lenient, and the descriptive format the least lenient. The hypothesis of no differences in rater reliabilities was tested by comparing confidence intervals about the reliability estimate for each item. Non-overlapping confidence intervals would indicate significant differences in rater reliabilities.

The results of the study indicated that the evaluative format of the instructional rating scale was less prone to leniency bias than the other response formats. The evaluative format had less lenient means than either the descriptive or Likert formats for the majority of items. The Likert format, which was the format of the rating scale currently in use at the university, was found to be the most often prone to leniency bias.

Claims made in the literature concerning the proneness to bias of fixed alternative response formats in general, and evaluative formats in particular, were found not to hold with student ratings of instruction. Fixed alternative evaluative response cues were found to be the least susceptible to leniency bias in this study, while multiple choice descriptive response cues were found to be moderately susceptible, and fixed alternative Likert response cues most susceptible. Situational variables such as the purposes of the ratings and the experiences of the student raters were hypothesized to be somewhat responsible for this outcome.

The reduction in lenient responding produced by the evaluative format items was not large enough to result in a significant increase in rater reliability. No significant differences were found in rater reliabilities among the three response cue formats.

In all, the study was partially successful in obtaining its ends--succeeding in reducing lenient responding by changing the response mode, but failing to reduce it sufficiently to improve the rater reliability of the instructional rating form items. Since the evaluative format items of the instructional rating scale were found to be least prone to leniency bias, comparable in rater reliabilities to Likert and descriptive formats, and most consistent with the experiences of the raters and the normative purposes of the rating task, it was concluded that they were the best choice of the three formats to improve the existing instructional rating scale.
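As a purely computational aside, the omnibus test described in this abstract can be set out compactly. The sketch below is not the original 1973 analysis (which predates such software); the file name "ratings.csv" and the column names ("instructor", "format", "item01" through "item17") are hypothetical stand-ins, with one row assumed per student rater. It shows how an instructor-by-treatment multivariate analysis of variance on the 17 items might be specified in Python with pandas and statsmodels.

    # Hypothetical data layout: one row per student rater, a column naming the
    # instructor, a column naming the response cue format (Likert, evaluative,
    # or descriptive), and 17 item-rating columns.
    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    ITEMS = [f"item{i:02d}" for i in range(1, 18)]
    ratings = pd.read_csv("ratings.csv")

    # Two-way MANOVA, instructor by treatment (response cue format), with the
    # 17 items treated as joint dependent variables.
    formula = " + ".join(ITEMS) + " ~ C(instructor) + C(format)"
    print(MANOVA.from_formula(formula, data=ratings).mv_test())

The Scheffé post hoc contrasts of item means and the confidence intervals about the reliability estimates would be computed in separate steps; the sketch shows only the omnibus test.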
ALTERNATIVE RESPONSE DEFINITIONS IN INSTRUCTIONAL RATING SCALES

BY Barbara Houghton Showers

A DISSERTATION
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY
Department of Counseling, Personnel Services, and Educational Psychology
1973

DEDICATION

To Donald and Lucille Houghton

ACKNOWLEDGMENTS

There are many people who have contributed directly and indirectly to this thesis. I would especially like to thank Dr. Robert Ebel, Chairman of my Guidance Committee, for his advice and counsel throughout my doctoral program; Dr. Leroy Olson, whose ideas and professional experience were contributed at many points in the development of the thesis; Drs. William Mehrens, Robert Craig, and Stephen Yelon for their contributions as members of the committee; the instructors in the departments of Education, Humanities, Natural Science, and Social Science who volunteered to take part in the study; and all the members of the Office of Evaluation Services for their cordially contributed aid in developing the forms, scoring them, and making the data ready for analysis. The financial support of the U.S. Office of Education through a Research Directors Training Program fellowship enabled me to complete my doctoral studies at Michigan State University.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES

Chapter
I. THE PROBLEM
   Introduction
   Purposes, Rationales, and Problems
   Psychometric Characteristics of an Ideal Student Instructional Rating Form
   Impetus for the Study
   Purpose
   Hypotheses
   Summary of Response Cue Literature
   Overview
II. REVIEW OF THE LITERATURE
   Introduction
   Studies of Response Cues
   Data on the Reliability of Cue Types
   Studies of Intraclass Rater Reliability
   Summary
III. DESIGN AND PROCEDURES
   Introduction
   Sample
   Instruments
   Design
   Hypotheses
   Analysis
   Summary
IV. RESULTS
   Introduction
   Results Concerning Leniency Bias
   Results Concerning Rater Reliability
   Summary of Results of the Study
V. SUMMARY AND CONCLUSIONS
   Summary
   Conclusions
   Discussion

BIBLIOGRAPHY
APPENDIX

LIST OF TABLES

Uses of instructional rating forms in seven universities
Number of student raters responding to each form for each instructor
Pretest variances of two forms of each item of the descriptive scale
F-ratios, instructor by treatment MANOVA
Univariate F tests, each dependent variable (each item)
Contrasts of item means
Table of item means for the three response cue formats
Number and percentage of extreme instructor means
Item reliabilities for a single average rater
Item reliabilities for 20 raters

LIST OF FIGURES
1.1. Illustration of the differences between the actual norm group distribution and a conventional norm group distribution
3.1. The MSU student instructional rating form
3.2. The experimental evaluative form
3.3. The experimental descriptive graphic form
3.4. Design of the experiment
4.1. Graph of item means for the three response cue formats
4.2. Graph of item reliabilities for a single average rater

CHAPTER I

THE PROBLEM

Introduction

Over the years many efforts have been made to improve student ratings of teacher effectiveness. This study represents another such effort. It is concerned with the particular problem of the leniency bias shown by many students in rating their instructors. By leniency bias is meant the tendency of students to use only the two or three highest options in rating their instructors. The harmful effect of this bias is to reduce discrimination between instructors to the extent that small differences in mean ratings produce large differences in reported rankings. The idea which gave rise to the present study was that leniency bias could be reduced by changing the wording of the response options. It was hoped that a different wording would increase the range of options used by student raters and improve discrimination between instructors. The setting of the study is described below, first from the broad perspective of purposes, rationales, and problems of student evaluation of instructors, and second from the more specific perspective of events leading to this study.

Purposes, Rationales, and Problems

The purposes that student ratings can serve are threefold. They can be called normative, diagnostic, and informative (Gillmore, 1972). The normative purpose is served when the results of student evaluations are used by the department to help decide promotions, salary increases, and teaching assignments. The diagnostic purpose is served when an instructor makes use of the results of the student evaluation to improve his course. The informative purpose is served when the results of student ratings are made available to other students as they make decisions about selection of courses and instructors.

While student evaluation of instruction can be carried out in many ways, it is frequently accomplished by means of a single all-purpose questionnaire or rating form. It is recognized that the questionnaire is not always the most direct or most informative source of information to all instructors in all disciplines, but it is probably the most often used.

A review of some student instructional rating systems currently in use in the universities shows several variations on the themes described above. Some universities stress primarily the diagnostic purposes for instructor self-improvement, while others see student ratings as inputs into a larger system for departmental accountability. A review of seven rating forms indicates that all are used for instructor self-improvement, three are used for some form of departmental accountability, and two may be used to aid students in selecting courses. A closer look at their rationales is presented below.

Instructor Improvement (Diagnostic Uses)

The Office of Evaluation Services of the University of South Florida concludes that student ratings of instruction are appropriate for instructor self-improvement but not for helping determine salary advancement or tenure.
‘Its conclusion was based on the finding that there was a variation in average ratings between courses and depart- ments, preventing one overall scale from being applied to all faculty. It felt ratings to be valuable for self- improvement, however, since, "If the instrument is designed to measure opinion of teaching functions and reliability is established, then its validity is assumed" (Caldwell, 1971, p. 3). An even more cautious stance is taken at Southern Illinois University-Carbondale. Thomas Tyler of the Testing Center there suggests that in order to build a tradition for evaluation, instructor rating should be presented in a very non-threatening manner. The results should go only to the instructor who may then, if he wishes, release them to the department chairman or the student publication. Consistent with this philosophy, the 810 form includes some forced choice items of the "non-evaluative" type, such as: The one thing this instructor did best was to: a. deliver good lectures. b. encourage class participation. c. understand and sympathize with students. d. prepare a well organized course. e. make good quizzes and examinations. This type of item provides information to the instructor without a good-bad connotation (Tyler, 1972). At Northwestern University, student evaluation of instruction is carried on by an outside agency, namely Educational Testing Service, which was asked by the Associated Student Government of Northwestern in 1970 to develop a questionnaire to gather student ratings of courses and instruction. The resulting instrument, SIR (Student Instructional Report), is now being marketed commercially by ETS (Centra, 1972). The primary goals of this instrument are teacher self-improvement feedback and provision of a high quality source of information for published student critiques of courses and instruction. The use of the Purdue Rating Scale for Instruction is described in the manual as primarily for instructor self-improvement feedback. The writers stress the volun- tary and confidential nature of the use of ratings, but note that 65% of the instructors in a study felt themselves benefited by the ratings and that 83% of the total sample among students, instructors, and administrators expressed belief that additional improvement would be possible with continued use of the scale. The manual of the scale does provide comparative data in the form of percentile ranks, but use of the scale for departmental evaluation is not encouraged. Departmental Accountability (Normative Uses) Unlike most of the scales developed for diagnostic uses only, the University of Illinois scale was developed with the philosophy that measurement is more useful when comparative results are available. When an instructor administers the scale, his results are compared with other instructors of his own academic rank, with those at the same course level, with other instructors in his particu- lar department or college, and with all courses at the university. A shortened form of the original is being made available containing general summarized questions specifically designed to be used by departmental decision- makers to evaluate instruction. It is hoped that this form will become one input into a total instructional evalu- ation scheme (Aleamoni, 1972). Student ratings of instruction are one of three inputs into the Faculty Appraisal System at Bowling Green University. They are the primary "point of View" by which the teaching dimension of faculty activity is evaluated. 
The system attempts to get the people closest to the activity to be the raters instead of placing full responsibility in the hands of the department chairman; thus, students are the primary raters of teaching, while faculty peers and the chairman rate scholarly productivity and service (Swanson and Sisson, 1971). This system uses the University of Illinois scale to gather student ratings since the scale objectives are compatible with those of the system.

The use of student ratings of instruction has been made a mandatory procedure by the Academic Council at Michigan State University. In 1969 the Council approved the following procedures "as a means to assist in improving the evaluation of instruction. . . .

a. Each of the teaching faculty (including graduate assistants) at MSU regardless of rank or tenure is required to use the Student Instruction Rating Report to evaluate at least one course in every quarter in which he teaches and every separate course he teaches at least once a year.
b. The results generated by the Instructional Rating Report shall be evaluated at the departmental level in order to help determine individual effectiveness. Appropriate procedures for the execution of this evaluation shall be determined according to departmental or residential faculty prerogatives.
c. The department chairman will be asked to describe in his annual report the steps which have been taken by the department or residential college to improve instruction" (MSU Faculty Handbook, 1971, p. 42).

The Student Instruction Rating Report is a machine-scored 21-item questionnaire on which normative data have been developed. The instructor receives a printout of his rating results giving mean, standard deviation, and percentile ranks for each item and each subscale.

Informative Uses

A separate discussion of the use of ratings for student publications (informative uses) was not conducted since few of the seven universities utilized this function of student ratings of instruction. Table 1.1 presents a summary of the uses of instructional rating scales at the seven universities discussed.

TABLE 1.1.--Uses of instructional rating forms in seven universities.

University        Normative(a)   Diagnostic(b)   Informative(c)
Bowling Green     X              X
Michigan State    X              X
Northwestern                     X               X
Purdue                           X
SIU-Carbondale                   X               optional
Illinois          X              X
South Florida                    X

(a) Comparisons are made with other instructors.
(b) Instructor uses results to improve instruction.
(c) Results are made available to students to choose courses.

Status and Deficiencies of Student Ratings

There are many ways to evaluate instruction other than by the use of student ratings. These are, for example, the evaluation by chairmen of departments, by deans, by colleagues, by alumni, by amount and quality of scholarly research and publication, by informal student opinion, by committee evaluation, grade distribution in classes, student examination performance, enrollment in elective courses, course syllabi and exams, classroom visits, and other more informal methods. Why do many universities use student ratings as at least one input into their teacher evaluation systems?
Spencer and Aleamoni at the University of Illinois suggest that since the students are the prime beneficiaries of the instruction, they appear to be the most logical evaluators of the quality and effectiveness of the course elements: In addition, student opinions should indicate areas of rapport, degrees of communication, or the existence of problems and thereby help instructors as well as educational researchers describe and define the learning environment more concretely and objectively than they could through other types of measurements (1969, p. l). Remmers and Weisbrodt of Purdue take a somewhat dimmer view of students' capabilities but advocate student ratings for this reason: Whether the student's judgment is correct is largely beside the point.The real point is that his attitude toward the instructor is a vital factor in the total learning situation. . . . Nor has the teacher any choice as to whether he will be 'rated' by his students. Such rating goes on in every classroom everywhere. The only real choice the instructor has is whether he wants to know what these ratings are. If he chooses to get this knowledge, he is in a position to profit thereby. He will have obtained the possibility of control of one of the important elements in the total learning situation (1965, p. 1). These two statements illustrate the primary argu- ments for the use of student opinion in the evaluation of teachers. Students are the day-to-day consumers of the instructional product and in addition, the success of instruction appears to depend on their positive attitude toward the learning environment. Validity of student opinion.--Although students have the most opportunity to observe instruction, questions have been raised about the influence of such variables as the student's sex, GPA, major, or personality on his or her ratings of instruction. The majority of the review of student rating research conducted by Costin, GreenOugh, and Menges in the Spring, 1971 Review of Educational Research is devoted to this topic. It appears that feW‘ strong or consistent relationships have been found between student demographic variables and student opinion of instruction, indicating that student opinion is not apt to be biased by factors other than the instruction received. After reviewing some fifty studies on the subject, Costin etafl” (pp. 520 and 530), concluded: ”’10 1. "Correlations between course rating and grade received, when observed at all, tended to be small." 2. "Majors tended to rate courses more highly than non-majors in some cases." 3. "Students required to take a course sometimes rated it lower than those for whom it was an elective." 4. "Upperclass students occasionally gave higher ratings than underclassmen." 5. Experienced or higher ranking instructors usually received higher ratings than did their less experienced colleagues." 6. "A number of studies found no significant differences in overall ratings of teaching made by men and women students, or in the ratings received by men and women teachers." I John Centra noted in the Student Instructional Report (ETS) three additional points: 7. "Students on campus and alumni agree on average ratings of the same instructors" (r's between .40 and .68). 8. "Student needs (as meaSured by the Edwards Personal Preference Schedule) were found to influence some items on the Purdue Rating Scale (Rezler, 1965)." 9. "When teacher personality measures and student ratings have been correlated the relationship has been generally negligible (Borg, 1957; Bendig, 1955)." 
However, 11 both Centra and Costin et al., suggest this area has not been conclusively researched. University of South Florida researchers, Remmers of Purdue University, and Wilson and Hildebrandt in California found in their research an additional rela- tionship: lO. Differences in ratings can be expected between departments or courses within specific colleges. Few demographic variables had consistent, strong effects on student opinion in the research reviewed. Only the experience of the instructor and department affili- ation repeatedly appeared to show differences in ratings. While student demographic characteristics have not been shown to bias their ratings of instruction in most cases, further questions have been raised regarding the validity of the rating form itself as a criterion of teaching effectiveness. Concerns have been expressed over (1) deficiency of the rating form alone as a criterion of teaching effectiveness, (2) contamination of ratings by halo effect and question ambiguity, (3) scale unit bias due to "generosity errors," and (4) criterion distortion by imprOper weighting of results. Investigations of these concerns suggest that the validity of the rating form as a criterion of teaching effectiveness depends on its prOper use with other inputs Us teacher evaluation and on its susceptibility to scale unit biases. The framework for 12 the critique of the rating form as a criterion measure is provided by Brogden and Taylor (1950), who originally out- lined the four types of bias described above as possible criticisms of any criterion measure. Studies pertaining to each criterion bias as it relates to student ratings of instruction are detailed below. Later in the report, possible characteristics of a good instructional rating scale are derived from this discussion. Criterion deficiency.--Several authors make the point that student ratings of instruction would be deficient as the sole criterion of teaching effectiveness, but it has been demonstrated also that student ratings represent one stable part of such a criterion. Costin et a1. (1971), report that students repeatedly cite: (1) knowledge of subject, (2) organization of course content, (3) enthusi- astic attitude toward teaching and subject, and (4) interest in students, as attributes of most importance in teaching effectiveness, but the correlations between student ratings and faculty peer ratings or department chairmen's ratings in various studies ranged from .08 to .63. It appears that student ratings are a stable but relatively independent part of a larger criterion of teaching effectiveness. Criterion contamination.-—Major measurement texts often cite halo effect and ambiguity of the quality to be observed as major influences affecting the rater's 13 ability to rate accurately in any situation. Halo effect was not often mentioned in studies of student ratings of instruction, but where it was investigated (Remmers, 1934; Hodgson, 1958), little influence was found. Several authors offered general suggestions for wording a rating scale in order to reduce ambiguity of the quality to be observed, but none specifically considered student ratings of instruction. Cronbach (1960) suggested that such words as "average" and "excellent" be replaced by specific descriptions of behavior. 
Both he and Thorndike and Hagen (1961) suggested that abstract terms such as "leadership" or "personality" not be rated, but rather more overt, 'directly observable characteristics be rated, such as "pleasant speaking voice," or "appearing at ease at social gatherings." Oppenheim (1966) spoke of defining a frame of reference in a rating scale with much the same intent-- to make sure every judge agrees on the meaning of the trait to be rated. He claimed the increased specificity of traits to be rated also would tend to decrease the halo effect by making raters less able to generalize their ratings. Criterion scale unit bias.-—Sca1e unit bias seems to be a particular problem of all rating scales. Piling up of ratings at the upper end of the scale, failure to employ lower scale units, piling up in the center of the scale have all been frequently reported in research with 14 rating scales. The "generosity error" causing piling up of ratings at the upper end of the scale is a persistent problem when people rate other people. As Thorndike and Hagen put it, "There seems to be a widespread unwillingness on the part of raters to damn a fellow man with a low rating" (1961, p. 344). This is apparently true of stu- dents in their evaluation of instructors. Two examples of generosity error in an instructional rating scale can be taken from forms administered at Michigan State University and the University of Iowa. At MSU the average rating on SIRS in 1971 fell between 1.7 and 2.5 on a five point scale (Office of Evaluation Services, MSU, 1969), and at Iowa the average rating on a 1952 experimental form ranged from 1.5 to 2.6 on a five point scale (Stuit and Ebel, 1952). Attempts have been made to counteract generosity error by manipulating response options. Evaluators at SIU-Carbondale report some success using a favorable midpoint on a five point scale and using no disparaging' options (instead of "terrible" use "needs considerable improvement") in order to encourage raters to use the full range of the scale (Tyler, 1972). Amiel Sharon (1970) developed a forced choice student instructional rating scale and discovered that choosing between two to four equally favorable statements to describe the instructor was resistent to bias but it could no longer produce a profile of the instructor's strengths and weaknesses. 15 Criterion distortion.--Criterion distortion arises out of the improper assignment of weights to the several elements of a criterion measure. The relative importance of student ratings of instruction among all inputs in evaluating teaching effectiveness depends on the particular philoSOphy of the college or department. As Brogden and Taylor point out, adequate empirical procedures for deriving any criterion combination have yet to be developed. In summary, the validity of student ratings as a criterion of teaching effectiveness depends on their use in combination with other inputs, such as faculty opinion and department chairman's evaluation, and on their suscepti- bility to halo effects and generosity errors. Research cited showed that halo effects were apparently minimal with student rating of instruction and that generosity of ratings might sometimes be discouraged by manipulating the wording of the response Options. Reliabilityiof ratings.--To complete the background information available to one considering a study of student ratings of instruction, studies to date concerning the reliability of student ratings will be summarized. 
Most have reported moderate to high coefficients, indicating that reliability is not a serious problem in this area of rating scale construction. Several approaches were taken to measuring the reliability of student ratings of instruction. Some 16 considered the stability of ratings over time, others obtained internal consistency estimates via Cronbach's alpha, and still others considered item and rater reli- ability. Early studies of stability of student opinion over time yielded correlations of .87 to .89 over periods of two weeks to a year (Costin et a1., 1971). Recent studies confirmed the stability of student ratings over time (Costin, 1968; Wilson, Hildebrandt, and Dienst, 1971). Reported split—half reliabilities were .79 and .92 for two instructional rating scales (Lovell and Haner, 1955; Spencer and Aleamoni, 1969), while use of Cronbach's alpha on subscales produced internal consistency reli- abilities ranging from .58 to .985 in studies of five instructor rating scales (Gillmore, U. of Illinois, 1972; Hildebrandt, Wilson and Dienst, 1971; Centra, Educational Testing Service, 1972; MSU Technical Report, 1969; Tyler, SIU-Carbondale, 1972). Item reliabilities reported for four scales ranged from .40 to .96 with median values greater than .80 (Gillmore, 1972; Remmers and Weisbrodt, 1965; Coffman, 1954; and Deshpande et a1., 1970). It can be seen that student raters report their opinions with moderate to high consistency within the form and with high stability over time. It appears that their opinion can be trusted to be more than a whim of the moment 17 and thus could prove useful to an instructional evalu- ation system. Psychometric Characteristics of an Ideal Student Instructional Rating Form The functions, rationales, and problems of student rating of instruction have been described, and the progress of reliability and validity studies has been discussed. It now remains to discuss the psychometric qualities of an ideal instructor rating scale and to describe the problems with an existing scale that led to undertaking this study. The psychometric qualities of an ideal scale can be determined by keeping in mind the purposes for which the scale is used and the possible pitfalls to scale con- struction that were noted in the criterion validity studies. From the functional point of View, when the scale results are used normatively by the department to help decide promotions, salary increases, etc., the ratings for good and poor instructors should be as widely different as possible to clearly distinguish between the recipients and non-recipients of the commendations. Psychometrically, the scale should discriminate between good and poor instructors for normative uses. When the scale results are used diagnostically by the instructor to discover areas of teaching difficulty, the individual item mean ratings should represent close 18 agreement among the students in the class on each trait that each item represents. In most rating forms each item concerns one aspect of teaching, such as course organization or student's opportunity to ask questions. It should be possible that students agree on a high rating for one item and also agree on a low rating for another item, thus giving a clear direction to the instructor for self-improvement. The combined psychometric attributes of discrimination between good and poor instructors and close agreement of students within a given class on each item can be measured by the intraclass rater reliability coefficient. 
This statistic compares the amount of variation in ratings between instructors for an item with the amount of variation in ratings within each instructor's class for that item. If there is as much variation in the ratings given to a single instructor by his students as there is between the scores of all instructors rated, then the statistic returns a value close to zero, indicating that the differences between good and poor instructors are indistinguishable from the difference in the student opinions of one single instructor. This could happen when the item is So ambiguous that the students are unable to agree on its meaning, or when both good and poor instruc- tors are so alike on a particular trait that it is not useful to include it for diagnostic or normative uses. A high rater reliability for an item would indicate that 19 it was both discriminating and unambiguous--at least to the student raters whose opinions were being sought. Whether the scale is used for normative or diagnos- tic purposes, it must be kept in mind that the task of the scale is to solicit judgments from untrained student raters. Each question must contain enough information to make its intent clear but not so much that the rater is unable to digest it on the first reading. The format Of the questions andresponsecmmions becomes important in helping the rater to digest the information given and in helping him to return a response that reflects his true opinion. When the rating scale format is such that the rater finds himself always making the same response, even to widely different questions, then the format is encour- aging response set biases. Three major types Of response bias possible with rating scales are leniency (same as "generosity error"), central tendency, and halo effect. Leniency bias occurs when raters use only the high response Options on a scale, central tendency occurs when raters use only the middle options, and halo effect occurs when a.rater rates all traits of one ratee alike because of his general impression of the ratee. Instructor rating scales have been found to be most susceptible to the leniency bias Of high ratings Of all instructors. Such a bias works against ability to discriminate between good and poor instructors as well as perhaps against the validity Of the rating form 20 itself. The amount of leniency bias in a given response cue format can be measured by finding the closeness of the mean of all instructors rated to the midpoint of the response scale, assuming that teaching ability is normally distributed about the midpoint of the scale. Inspection of the variation in ratings about the midpoint could rule out the presence of central tendency in this situation. Impetus for the Study The ideal student instructional rating scale would be free of response biases, discriminate between good and poor instructors, and have unambiguous questions on which all raters could agree for each instructor. Such a scale would have to be carefully developed from items selected for appropriate content as well as for their psychometric characteristics. The impetus for undertaking this study was created when such a carefully developed scale was found to still possess a strong tendency to leniency bias in the ratings produced. Even after a substantial data base had been established over a five year period, the mean item responses on the scale ranged from 1.7 to 2.5 on a five point continuum where 1 is the highest rating. 
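To make the leniency index just described concrete, the short Python sketch below computes the distance of each item mean from the scale midpoint on the existing form's coding, where 1 is the highest rating and 3 is the midpoint of the five-point continuum. The item names and the particular mean values are illustrative only, chosen to fall within the 1.7 to 2.5 range reported for the existing scale.

    MIDPOINT = 3.0   # midpoint of the five-point continuum

    # Illustrative item means in the reported 1.7-2.5 range (1 = highest rating).
    item_means = {"item_01": 1.7, "item_02": 2.0, "item_03": 2.3, "item_04": 2.5}

    for item, mean in item_means.items():
        leniency = MIDPOINT - mean   # larger values indicate more lenient responding
        print(f"{item}: mean = {mean:.1f}, distance from midpoint = {leniency:.1f}")

Under the working assumption that teaching ability is distributed about the midpoint of the scale, an unbiased set of ratings would yield distances near zero.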
This essentially psychometric problem was compounded by instructor confusion in interpretation generated when (1) the results were reported to instructors in percentile ranks, and (2) a University policy was approved whereby every instructor was required to use the rating form in at least one course he taught every term and report the results to his department chairman. The confusion became apparent when the leniency of the ratings on the scale caused reports of performance at the 50th percentile or below to be given to instructors whose classes only "agreed" and did not "strongly agree" to some of the statements in the questionnaire. This occurred because the distribution on which the percentile ranks were based (the five year data base) was centered about the high item mean rather than about the midpoint of the five point continuum as one would conventionally interpret an "average" value. Thus, when the 50th percentile rank was legitimately assigned to the mean value of the item and lower ranks were given accordingly, an instructor with a score less than the mean would be given a less than 50th percentile rank even if the score was still well above the midpoint of the five point scale (Figure 1.1).

[Figure 1.1.--Illustration of the differences between the actual norm group distribution and a conventional norm group distribution. Panels: Conventional Distribution Model, with the 50th percentile at the scale midpoint, and Actual Distribution Model, with the 50th percentile at the high observed item mean.]

It appeared that a study was needed to attempt to reduce the leniency of students' responses so that the 50th percentile could be more conventionally interpreted as a midscale value.

For purposes of such a study, it was not considered productive to create a new set of items for the scale, since discovering the appropriate content was well done in 1967 when an elaborate selection system was set up to determine what questions were to be on the form. Faculty and students were polled as to the usefulness and appropriateness of a large pool of potential items, and 56 items with the greatest consensus of favorable opinion were pretested, yielding the 21 most discriminating in five areas that became the final scale. The recent Review of Educational Research (1971) review of studies of teacher effectiveness showed that most instructor rating scales developed had at least four of the five factors present in this scale, indicating that there is agreement as to what constitutes teacher effectiveness.

Given the well-established content of the scale, it seemed more reasonable to look at the possible effect of manipulation of response options in reducing the leniency bias problem than to revise the entire scale. It was hypothesized that since altering response cues had reportedly reduced leniency response bias in some studies (Smith, 1967; Stockford and Bissell, 1949; Guilford, 1954; Cronbach, 1950), it might do so here, and might also improve rater reliability of the student instructional rating form by increasing the amount of scale used so that there could be maximum latitude for discrimination between instructors. Thus, a study was devised in order to compare the abilities of different response cues to reduce leniency of response and to improve rater reliabilities of the items. The choice of response cues was to be based on attributes of the different response cue types reported in the literature.
Rater reliabilities were calculated because the focus of this coefficient was on consistency of student agreement in ratings and on the students' ability to discriminate between instructors when using a particular response cue format.

Purpose

The purpose of the study was to compare the effects of alternate response definitions on the leniency and rater reliabilities of student instructional rating form items. After a review of the literature on response cue types, three response formats were selected for the study. They were defined as: (1) fixed alternative Likert cues (SA-SD), (2) fixed alternative evaluative cues (superior-inferior), and (3) multiple choice short descriptive cues. In addition to finding the least biased and most reliable of these item types for student rating of instruction, the study tested two claims concerning leniency bias that were made in the literature:

a. Evaluative cues are more susceptible to response bias than other cues.
b. Fixed response alternatives are more susceptible to bias than descriptive multiple choice alternatives.

Hypotheses

Null Hypotheses

H1. There are no differences in mean ratings of instructors between items with Likert, evaluative, and descriptive response cue formats.
H2. There are no differences in rater reliabilities between items with Likert, evaluative, and descriptive response cue formats.

Alternative Hypotheses

H1a. The mean ratings with the evaluative format will be significantly more lenient than the mean ratings for the Likert and descriptive formats.
H1b. The mean ratings with descriptive cue formats will be significantly less lenient than the mean ratings for Likert and evaluative cue formats.
H2a. The descriptive response cue format will have significantly more rater reliability than the Likert or evaluative cue formats.

Summary of Response Cue Literature

The literature on response cues is presented in detail in Chapter II, but it is briefly summarized here to help explain the selection of response types for the study. The major part of the literature on response cue types does not deal specifically with student ratings of instruction. Rating scales have been more often used for sociological studies of behavior, personnel evaluation, and psychological or vocational counseling. But some generalities appear to have emerged from these diverse uses that might be expected to hold in the student rating situation.

To begin with, there are several generally accepted formats for response cues. Each type provides the rater with a slightly different task, though the different types have been used interchangeably in the same rating situations. As Guilford (1954) defines them, they are: numeric, descriptive graphic, standard, cumulated points, and forced choice. The numeric scale provides a number continuum from which the rater assigns a number value to a ratee's trait or behavior, while the descriptive graphic scale adds descriptive words, sentences, or paragraphs to define points on the continuum, and the rater chooses the description that best fits the ratee. A standard (evaluative) scale provides a real or assumed norm group against which to compare the ratee, and the rater's task is to judge whether the ratee is average, above average, etc., with respect to the group. The cumulated points (Likert) scale provides several statements to which the rater agrees or disagrees in varying intensities.
His 27 responses are then summed to arrive at his overall opinion of the ratee. The forced choice scale provides the rater with two-to—four equally favorable or unfavorable state- ments from which the rater must choose the ones most descriptive of the ratee. Many authors hypothesize that some of these tasks, when done repetitively in a questionnaire, are more sus- ceptible to response biases than others. For example, Cronback (1950) distinguishes between these scales according to whether the alternatives are the same or different for every question rated. He Opines that raters are less likely to develop a set response when faced with different alternatives every time than when faced with the same alternatives for every question. (Fixed alternative response options would be presented by numeric, standard, and cumulated points scales, while multiple choice I options would be presented by descriptive graphic and forced Choice scales.) Other authors cite other charac— teristics of the response Options which they hypothesize could make the rating task more or less susceptible to response bias. Most discussions of rating scale techniques dwell on practices in avoiding response set with the various response Option types. These practices include manipu- lating extremeness of cues, direction Of scales, spacing of cues along a continuum, balance of favorable and 28 unfavorable cues, presence or absence of neutral or unde- cided cues, and concreteness of descriptions of cues. Evidence reported in the literature on these practices is summarized below. Where the evidence for some statements is contradictory in part, or non-experimental, the state- ments are listed as claims to be further tested. Details concerning the studies contributing evidence to each statement are presented in Chapter II. Summary,Statements l. The Optimal number of options for each question is five to seven when untrained raters are used. 2. The presence Of a neutral point increases the ambiguity of the scale. 3. Reduction in leniency bias due to reversing the direction of the scale within a questionnaire may increase the errors in rating. 4. Leniency bias may be reduced by the presence of more favorable than unfavorable response options. 5. Numeric, sentence, or paragraph cue lengths may reduce leniency bias, if the cues are not too long, but cue length has no apparent effect on the rater reli- ability Of untrained groups of raters. Claims to be Further Tested l. Evaluative cues are more susceptible to response bias than other cues. 29 2. Fixed response alternatives are more sus- ceptible to bias than descriptive multiple choice alternatives. 3. Reported rater reliabilities for instructor rating scales currently in use roughly rank the cue types in decreasing order as descriptive (.87 and .86), Likert (.84), and evaluative (.81). The study tested the two claims concerning leniency bias in a controlled setting by comparing the means of the Likert, evaluative, and descriptive response cue formats in equivalent rating situations with student evaluation of instructors. The choice of cue types was guided by the derived results above. In order to study the question of bias in a controlled setting, the number of Options for each cue type was held constant at five, within the range of optimally reliable numbers of options for untrained raters. Although the scale to be improved was a Likert scale having a neutral midpoint, the alternative scales did not have Neutral as an option in order to decrease the likelihood of ambiguity. 
None of the questions com— pared were stated in the opposite direction from the others. (Four such questions existed in the scale to be improved but were omitted from the analysis for this and further reasons--see Chapter III.) The balance of favora- ble and unfavorable cues was held constant in this study in order to isolate the effect of cue type (Likert, 30 evaluative, descriptive) on lenient responding. Likert cues were chosen for the study because the scale to be improved was in Likert format. Descriptive cues were chosen because they were the most often recommended to reduce rater biases (in spite of the few negative find- ings reported). Evaluative cues were included to test the claim that they were the most bias-prone cue type. Rater reliabilities were also compared across the three response formats where previously only comparisons with numeric formats had been made experimentally, and the descriptive format was hypothesized to have the greatest rater reliability. This hypothesis was based on the assumption that the descriptive format would be the least bias-prone and would produce greater rater reliabilities than the other formats by improving discrimination between instructors. The rater reliabilities of existing scales seemed to concur with this prediction. Overview In Chapter II, the literature on response cues and rater reliability is reviewed in detail. The design and procedures of the study are discussed in Chapter III, and the results concerning the leniency bias and rater reli- ability of the three response cue formats are presented in Chapter IV. Conclusions and discussion of the results appear in Chapter V along with a summary of the study problem, theory, and methodology. CHAPTER II REVIEW OF THE LITERATURE Introduction The studies of response cues reviewed in this chapter provided the theoretical groundwork for the selec- tion of the three response cue types compared in this study. The studies of rater reliability also reviewed in this chapter provided the information necessary to formulate and test the hypothesis concerning rater reliability of the instructional rating forms. Studies concerning the background and setting of the problem itself were pre- sented in Chapter I. The major part of the literature on response cues and on rater reliability does not deal specifically with student ratings of instruction. Rating scales have been more often used for sociological studies of behavior, personnel evaluation, and psychological or vocational counseling. But some generalities appear to have emerged from these diverse uses that might be expected to hold in the student rating situation. The studies are presented in detail in the following sections under the headings, "Studies of Response Cues," "Data on the Reliability of Cue Types," and "Studies of Intraclass Rater Reliability." 31 32 The Summary section which follows presents the generalities derived from the studies and summarizes the evidence con- tributing to each statement. Studies of Response Cues Types Detailed Guilford (1954) defines five broad categories Of response cues: numeric, descriptive graphic, standard, cumulated points, and forced choice. Similar categories are defined by Thorndike and Hagen (1961) under correspond- ing titles: frequency of occurrence or typicality, behavioral statement, man-to—man, and present-absent. They add percentage of group and ranking to the list. 
Oppenheim (1966) mentions Thurstone and Likert type scales whose response Options would probably fit into the "present-absent" and "cumulated points" categories already mentioned. Levinthal et al. (1971), discuss a scale format of real-ideal discrepancies, and Cronbach (1950) distin- guishes between multiple choice and fixed format response cues. On Response Set Most discussions of rating scale techniques dwell on practices in avoiding response set with the various response Option types. These practices include manipu- lating extremeness of cues, direction of scales, spacing Of cues along a continuum, balance of favorable and 33 unfavorable cues, presence or absence of neutral or unde- cided cues, and concreteness Of descriptions of cues. The diverse uses of rating scales led different researchers to study these practices in different contexts, however; hence, the rating situations and the variables manipulated by the experimentor are inconsistent from one study to the next--from foremen rating subordinates to mental patients rating their self-concepts. But they indicate what manipulations of response cues have been made, and their outcomes. The studies are categorized here according to the practices on which they provide data. Their results are further condensed into a series of general statements appearing in the Summary section of this chapter. Number of options.--The effect of number of options on leniency Can be seen in a study by Hillmer reported by Edwards (1970). After administering a nine-point scale, Hillmer selected the two options on either side of the item median and readministered the two-choice scale. Instead of an equal distribution of choices about the median, 73% chose the higher of the two options given. Direction Of scale.--Elliott (1961) tested Likert items on the same positively and negatively worded topics and found that tendency to agree with the direction of the statement was apparent for middle and low aptitude subjects, but not for high aptitude subjects, whose scores remained relatively stable. 34 Madden and Bourdin (1964) compared orientation and numbering of nine-point scales and found statistically significant differences between the scale means. The greatest difference seemed to be between the horizontal graphic scale numbered 1 to 9 which produced the least lenient ratings, and the vertical scale numbered +4 to -4 which produced the most lenient ratings. But no means fell below the scale midpoint. Reversing directions of scales within a question- naire is argued on intuitive grounds by Oppenheim (1966) that it forces raters to stay alert and doesn't allow them to create a habit of marking every question in the same place. On equally intuitive grounds, Guilford (1954) claims that reversing scale directions generates more rater errors than it does unbiased responses. Spacing and balance Of cues.--Regarding spacing and balance of favorable and unfavorable cues in a graphic scale, Guilford notes, "To counteract leniency error, the cues on the favorable side may be more widely spaced and more numerous than those on the unfavorable side" (1954, p. 268). In practice, Tyler (1972) chose a favorable midscale anchor making three favorable cues out of five in the SIU instructor rating form and still found that few mean ratings fell below item midpoints, though some reduc- tions were obtained. 
35 Follman (1973) carries Guilford's advice to the extreme in comparing the conventional five-point balanced evaluative scale having two favorable Options to three other five-point scales each having one more favorable option than the last. The most favorable scale created was, "Above Average"; "Superior"; "Excellent"; "Superb", "Perfect." The students gave the instructor a mean rating between the first and second highest Options ("Above Average" and "Superior") on the conventional scale, and between the second and third highest Options on each of the succeeding more favorably weighted scales. It appeared that a favorable midpoint helped reduce leniency bias, but that more favorably weighted scales had little further effect on leniency. Other approaches to cue balance and spacing tend to favor equally weighted cue distributions. Champney (1941) favored equal spacing and balance of cues to the extent that he devised a pretest of one placement akin to the Thurstone equal-appearing intervals technique that allowed him to determine a scale value for each cue on the continuum and pick out unambiguous, equally spaced high, medium, and low cues for the final scale. Amiel Sharon (1970) found he was able to avoid leniency bias in student ratings of instruction by using forced choice scale items which balanced favorable state- ments against each other, but he notes that it could 36 not be used for diagnostic purposes since it only gave a single overall score for each instructor. Presence or absence of neutral.-—Regarding the presence or absence of a neutral point on the cue continuum, Guilford and Jorgensen (1938) found a tendency to bimodality in distributions which they thought were unimodal. This was more serious with the numeric than the graphic scale. Since the point of lowest frequency in the numeric scale was at the indifference category, they suggested elimina- ting the indifference category in numeric scales and not mentioning indifference in a graphic scale except as attached to a point. Cronbach, in "Response Sets and Test Validity" (1946), opts for those practices which will reduce ambi- guity, one of which:h5,in his judgment, eliminating the neutral response option. Holdaway (1971) found results contrary to those of Guilford and Jorgensen in his study of response distri- butions in a Likert scale with and without a neutral point. His distributions peaked at the "Agree" option and declined on either side whether a neutral point was present or not. But a greater percentage chose the disa— gree Option when no "Neutral" choice was available, or when the N was placed after the SA-SD scale. 37 Concreteness of cue descriptions.--Both Cronbach (1946) and Guilford (1954) stress the importance of clarity and specificity of cues. Guilford states, "Avoid using cues Of a very general character, such as 'excel- 1ent,’ 'superior,' 'average,' 'poor,‘ and the like" (1954, p. 293). But Symonds (1931) points out that the diffi— culty of vocabulary should be considered, taking care to avoid unusual words even though they are highly descriptive and meaningful, such as "slovenly" for "very careless in dress." Concerning lack of specificity of evaluational cues, Stockford and Bissell (1949) recount a study in which values from 1 to 100 were assigned by 200 raters to cues which could be used in a rating scale. The ranges and standard deviations of the values for those cues which contained evaluative words ("average," "excellent," etc.) 
The ranges and standard deviations of the values for those cues which contained evaluative words ("average," "excellent," etc.) were significantly greater than the ranges and standard deviations of the non-evaluative cues.

At one time it was thought that the man-to-man scale would provide the concreteness of description necessary to avoid leniency response set in an evaluative type of cue. But in their development of a man-to-man instructor rating scale, Stuit and Ebel (1955) note that the norms they derived all lay in the upper half of the five-point scale, with an overall mean of 2.04 for 267 classes. The instructors may have been a select group, but the ratings were very high for such a large number of classes.

The effect of multiple choice versus fixed alternative cues on leniency bias has been studied in several ways, with uncertain results. Smith reports that acquiescence response set is best dealt with by constructing items that avoid the agree-disagree format in favor of "contentful alternatives" (Smith, 1967, p. 88), but he does not substantiate his claims. Similarly, Cronbach hypothesizes that multiple choice items are least susceptible to bias. He states:

    Item forms using fixed response categories are particularly open to criticism. The attitude test pattern, A, a, U, d, D, is open to the following response sets: Acquiescence . . . , evasiveness . . . , and tendency to go to extremes. . . . (1950, p. 21)

Elliott (1961) claimed this was not the case in her study, where most acquiescence occurred with items in multiple choice rather than fixed alternative format, but she did not make the items more descriptive than the existing Likert alternatives restated in sentence form.

Champney (1941), in his work with the Fels Parent Behavior Rating Scale, opts for long cue explanations if the raters are trained but short cue explanations if the raters are not. Bryan (1944) appeared to confirm this opinion with untrained student raters when he found no difference between mean ratings of given instructors when the cue alone was used (excellent, good, average, etc.) and when the cue followed by a paragraph explanation was used. Finn (1972), also using untrained raters, found no differences in mean ratings between cues which were paragraph explanations and numeric cues. In the cases of both Finn and Bryan the paragraphs were several sentences long, rather than a few descriptive words. Stockford and Bissell (1949), on the other hand, found that errors of leniency were less for ratings made on sentence-length descriptive graphic scales than for those made on single-word evaluative scales.

Data on the Reliability of Cue Types

The primary line of experimentation has involved increasing the number of response options to some optimally reliable point. Guilford (1954) discusses this research and concludes that five to seven options is a conservative choice, and that the optimal number to use depends on the ease of rating the trait and the training and motivation of the raters. Mattell and Jacoby (1971) point out that most research on this question has dealt with internal consistency measures. They found no differences in test-retest reliability of 2- to 19-option Likert scales using untrained student raters. But Finn (1972) confirms that five to seven options give optimal inter-judge agreement on each item with untrained student raters. (His formula for inter-judge agreement is r = 1 - var(observed)/var(random).)
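Finn's agreement index can be made concrete with a small sketch in Python. The reading of var(random) as the variance of a uniform spread over the k scale points--(k^2 - 1)/12 for integer options 1 through k--is an assumption supplied here for illustration, not something stated in the text above, and the function name is simply a label for this sketch.

# A minimal sketch of an inter-judge agreement index of the form r = 1 - var(observed)/var(random).
# Assumptions: ratings are integers on a 1..k scale, var(observed) is the variance of the raters'
# responses to one item, and var(random) is the variance of a discrete uniform distribution over
# the k scale points, (k**2 - 1) / 12.

def finn_agreement(ratings, k):
    """Inter-judge agreement for one item rated by several judges on a 1..k scale."""
    n = len(ratings)
    mean = sum(ratings) / n
    var_observed = sum((x - mean) ** 2 for x in ratings) / n
    var_random = (k ** 2 - 1) / 12        # variance if every option were equally likely
    return 1 - var_observed / var_random

# Example: 20 students rate one item on a 5-point scale, mostly agreeing on the high end.
ratings = [5, 5, 4, 5, 4, 4, 5, 5, 4, 5, 5, 4, 5, 5, 4, 5, 4, 5, 5, 4]
print(finn_agreement(ratings, k=5))       # about 0.88: high agreement, little spread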
Other experimentation has compared the rater reliabilities of various verbal cues to numeric cues. J. B. Taylor et al. claim from their previous research that "whereas numerical rating scales show a typical inter-judge reliability in the r = .40 to .60 range, example anchored scales typically show reliabilities in the .70 to .99 range--and this with untrained raters" (Taylor et al., 1972, p. 544). Their examples are short behavioral statements anchored to a point on a thermometer-like scale. Peters and McCormick (1966) found significant differences in single-rater intraclass reliabilities between numeric and one-sentence job-task anchored scales, but the differences vanished when the r's were stepped up by the Spearman-Brown formula to become the reliabilities of mean ratings from n raters. Similarly, Finn (1972) found no differences in stepped-up intraclass rater reliabilities between numeric and paragraph-length cues.

Some collegiate instructor rating scales report rater reliabilities. Since the scales use different cue types, it is possible to make a rough comparison of cue type reliabilities in this way. Rater reliabilities are available for the Purdue scale (Remmers and Weisbrodt, 1965), the Oklahoma A&M scale (Coffman, 1954), the Georgia Tech scale (Deshpande et al., 1970), and the U. of Illinois scale (Gillmore, 1972). The first is descriptive graphic in part and evaluative in part, the second is descriptive graphic, the third is a five-point frequency of occurrence scale, and the fourth is a Likert type with no neutral option. The median reliabilities are .87 and .86 for the descriptive graphic scales, .84 for the Likert type scale, .81 for the evaluative scale, and .79 for the frequency scale. Numbers of raters averaged at least 20 per class in each calculation.

Studies of Intraclass Rater Reliability

Methods of estimating rater reliability are discussed by Ebel, Lindquist, Stanley, Cronbach, Rajaratnam and Gleser, Remmers, Medley and Mitzel, Guilford, and Brown, Mendenhall and Beaver. Most are analysis of variance procedures, predominantly the intraclass correlation coefficient. Medley and Mitzel (1963), Guilford (1954), and Brown, Mendenhall, and Beaver (1968) consider only the two-way analysis of variance case where instructors and raters are completely crossed in the design, i.e., where every rater rates all instructors. This design is not comparable to the student rating of instruction situation, where it is unlikely that any rater rates more than one instructor in the study. Ebel (1951), Lindquist (1953), and Stanley (1971) allude to generalized intraclass reliabilities where the raters may be different for each instructor. Ebel concludes, after discussing three formulas applicable to rating situations--average intercorrelation (Peters and Van Voorhis), the intraclass formula, and the generalized formula for the reliability of averages (Horst)--that the intraclass correlation formula is most versatile, allowing one to include or exclude "between raters" variance from the error term. (One would include between-raters variance in the error term in the student rating of instructors situation, since all raters do not rate all instructors.) Also, as Engelhart (1959) points out, both a single-rater estimate and an n-rater estimate can be obtained with the intraclass coefficient, while Horst gives only the n-rater case. In addition, estimates of precision can be readily calculated from an intraclass correlation. Both Ebel and Lindquist explain how to calculate confidence intervals for the intraclass coefficient.
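Because the discussion here turns repeatedly on the difference between single-rater and n-rater reliabilities, a brief Spearman-Brown sketch may help. The single-rater values of .40 and .60 are round illustrative figures only (they echo the range Taylor et al. quote for numeric scales); they are not data from Peters and McCormick or any other study cited here.

# Illustrative only: how the Spearman-Brown step-up compresses differences between
# single-rater reliabilities once ratings are averaged over n raters.

def spearman_brown(r_single, n):
    """Reliability of the mean of n raters, given the single-rater reliability."""
    return n * r_single / (1 + (n - 1) * r_single)

for r in (0.40, 0.60):                    # say, a numeric scale versus an anchored scale
    print(r, [round(spearman_brown(r, n), 2) for n in (1, 5, 10, 20, 25)])
# With 20-25 raters per class, both .40 and .60 step up past .93, which is why differences
# that are clear for single raters can vanish for class-mean ratings.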
Cronbach, Rajaratnam, and Gleser (1963) explain how the use of the intraclass formula allows one to generalize from randomly selected samples of raters to the reliability of raters in general. This is particularly desirable in determining the reliability of student ratings of instruction, since the particular group of students rating each instructor is certain to be different every time.

The intraclass coefficient for the "average" rater can be stepped up by the Spearman-Brown formula to give the reliability of a number of raters (Stanley, 1971). Remmers provided empirical verification of this use of the Spearman-Brown formula in two often-quoted experiments with the reliability of student ratings of instruction. He concluded that judgments were equivalent to test items in the sense of the Spearman-Brown formula and that the formula could predict within one standard deviation the reliabilities empirically obtained (Remmers, 1927 and 1931). In another study, Remmers (1934) determined average single-rater reliabilities for high school and college students for three items with 57 teachers. The non-stepped-up reliabilities reported for college students averaged .290 ± .102 for the "interest in subject" item, .429 ± .094 for the "presentation of subject matter" item, and .354 ± .038 for the "stimulating intellectual curiosity" item. These results seem to illustrate that the reported instructor rating form item reliabilities in the .80's and .90's are substantially affected by the number of raters assumed in the Spearman-Brown formula.

Summary

Most evidence reported here on the effects of cue types on response set and rater reliability can be categorized as either conclusions which most studies confirm or claims for which inconclusive or possibly contradictory evidence was found. Summary statements are presented below, along with a review of the evidence contributing to each.

Summary Statements

1. The optimal number of options for each question is five to seven when untrained raters are used.

This conclusion is derived from the combined results of studies of rater reliability and studies of leniency bias. Guilford derives the five to seven estimate from his review of rater reliability studies, and Finn specifically confirms with untrained student raters that five to seven options produce optimal rater reliability. While no such specific result is found with regard to the effect of the number of options on leniency bias, Hillmer's example of the strong increase in leniency when the number of options was reduced from nine to two indicates the potential biasing effect of too few options on rater judgment.

2. The presence of a neutral point increases the ambiguity of the scale.

The studies of Guilford and Jorgensen and of Holdaway support Cronbach's contention that the neutral response option causes ambiguity in rater responses. In Guilford and Jorgensen's study, raters avoided choosing the neutral option when it was the midpoint of a numeric continuum, while in Holdaway's study of the Likert format, the Undecided option was chosen if it was the scale midpoint but not if it was placed at the end of the scale. This variety of reactions to the neutral option supports the contention that raters are uncertain of its meaning in a rating scale.

3. Reduction in leniency bias due to reversing the direction of the scale within a questionnaire may increase the errors in reading.
Although Oppenheim argues on intuitive grounds that reversing question direction within a scale forces raters to stay alert, Elliott discovered that only the scores of high aptitude raters remain stable regardless of question direction, while middle and low aptitude raters tend to agree with the direction of the statement. This supports Guilford's contention that most raters cannot be relied on to remain aware of changes in scale direction.

[Table 3.2.--Pretest variances of two forms of each item on the descriptive scale: the body of this multi-page table is not legible in this reproduction.]
In the process of writing the evaluative and descriptive response forms of the original Likert questions, it was discovered that four items concerning the topic of Course Demands (#13-16) were not parallel in scale format to the other items. For most items in the questionnaire, the first option on the scale was the highest rating an instructor could receive, but for these items the third option was the highest rating. This was due to the wording of the four questions, such that the first option was a response of "too much" and the fifth option was a response of "too little." For example, in the question, "The instructor attempted to cover too much material. (SA-SD)," a response of "SA" meant "too much material" and a response of "SD" meant "too little material."

The difficulty with this change in scale format was the inability of the study to compare the mean ratings of the items where "3" was the highest rating to the mean ratings of their counterparts written in evaluative format, where "1" ("Superior") was the highest rating. The descriptive format could have been written to correspond to either scale, but no satisfactory transformation of all three scales was seen to be possible. This was not felt to be a condemnation of any one scale, but rather an unforeseen difficulty in the study. It was concluded that the comparison of the remaining 17 items would give sufficient grounds to answer the hypotheses of the study, so the four questions were omitted from the analysis.
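The scoring problem can be seen with a small illustration; the transformations below are generic examples written for this sketch, not scales or scorings proposed in the study.

# Why items 13-16 could not simply be rescored to match the other items. For an item whose best
# answer is an endpoint, reversing direction is a linear relabeling (x -> 6 - x on a five-point
# scale). For an item whose best answer is the midpoint ("too much" ... "too little"), closeness
# to 3 is what matters, and folding the scale loses the direction of the error, so its mean is
# not comparable to the mean of an item where "1" ("Superior") is the highest rating.

def reverse_key(x):                       # direction reversal on a 1-5 scale
    return 6 - x

def distance_from_midpoint(x):            # 0 is best; "too much" and "too little" collapse together
    return abs(x - 3)

print([reverse_key(x) for x in (1, 2, 3, 4, 5)])             # [5, 4, 3, 2, 1]
print([distance_from_midpoint(x) for x in (1, 2, 3, 4, 5)])  # [2, 1, 0, 1, 2]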
Figures 3.1, 3.2, and 3.3 are photographic reproductions of the machine-scorable forms administered in the study.

Design

Three instructional rating forms differing primarily in response cue format were developed and administered to randomly equivalent thirds of each class of 23 instructors. Each instructor was given a packet containing the three forms arranged alternately so that (assuming a random start) each form would be automatically distributed to random thirds of the class. Each student received one form.

Directions were given to the instructors to administer the forms just as they had administered the instructional rating form in the past. Differences in administration, if present, were considered a legitimate potential source of variance in instructors. The instructors were told, and could pass on to their students, that a new form was being tried out. But neither they nor their students were informed of the research hypotheses regarding leniency or reliability. The answer sheets were collected and machine-scored and the data punched onto cards. Instructors were assured of anonymity of results.
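The alternating packet arrangement can be sketched as follows; the class size, the form labels, and the random start are illustrative only, and the snippet shows nothing beyond what the paragraph above describes.

# Forms cycled in a fixed order through a packet: given an effectively random starting point,
# each form reaches roughly a third of the class.
import random

forms = ["Likert", "Evaluative", "Descriptive"]
class_size = 31                                   # hypothetical class
start = random.randrange(3)                       # the "random start" assumed above
assignment = [forms[(start + i) % 3] for i in range(class_size)]
print({f: assignment.count(f) for f in forms})    # roughly equal thirds, e.g. 11/10/10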
Figure 3.1.--The MSU student instructional rating form. [The reproduced form is not legible in this copy; its response key is SA (strongly agree), A (agree), N (neither agree nor disagree), D (disagree), SD (strongly disagree).]

Figure 3.2.--The experimental evaluative form. [The reproduced form is not fully legible in this copy; its response key is 1-Superior (exceptionally good course or instructor), 2-Above Average (better than the typical course or instructor), 3-Average (typical of courses or instructors), 4-Below Average (not as good as the typical), 5-Inferior (one of the worst).]
Figure 3.3.--The experimental descriptive graphic form. [The reproduced form is not fully legible in this copy; each item offers a five-point continuum with short descriptive cues at the high, middle, and low points, e.g., item 1 (the instructor's enthusiasm when presenting course material): "vibrant, stimulating" -- "sometimes inspired" -- "apathetic."]
Generalizability of Results

The nonrandom selection of instructors did not affect the comparison of rating forms to each other, since all instructors were rated with all forms by randomly equivalent groups of students. The fact that the instructors were volunteers did affect the ability to compare their mean ratings and rater reliabilities to other means and reliabilities reported in the literature.

Hypotheses

Null Hypotheses

H1. There are no differences in mean ratings of instructors between items with Likert, evaluative, and descriptive response cue formats.

H2. There are no differences in rater reliabilities between items with Likert, evaluative, and descriptive response cue formats.

Alternative Hypotheses

H1a. The mean ratings with the evaluative format will be significantly more lenient than the mean ratings for the Likert and descriptive formats.

H1b. The mean ratings with descriptive cue formats will be significantly less lenient than the mean ratings for Likert and evaluative cue formats.

H2a. The descriptive response cue format will have significantly more rater reliability than the Likert or evaluative cue formats.

Analysis

The hypothesis of no differences in mean ratings was tested with a two-way multivariate analysis of variance design, instructor by treatment, where the response cue formats were the three treatments and the seventeen usable items were seventeen dependent variables. The data were first tested for the presence of interaction between instructors and treatments. A nonsignificant interaction would allow an overall F-test, α = .05, for the main effect of treatment to determine whether the item means of any format were significantly different from the item means of any other format over all the items. Individual item F-values were also inspected to determine sources of variance, with the understanding that lack of independence among the items prevents each individual F from having a known constant error. Scheffe post hoc analyses tested alternate hypotheses H1a and H1b. A packaged computer program written by Jeremy Finn was available to do the multivariate analysis of variance.

The hypothesis of no differences in rater reliabilities was tested by comparing confidence intervals about the reliability estimates for each format of each item. Overlap of confidence intervals would indicate no significant differences in rater reliabilities, with the probability of no Type I error being (1 - α)^3. If the number of items with non-overlapping intervals was greater than chance, the null hypothesis was rejected. (Fisher's r to z transformation was not used because the rater reliability estimate for this rating situation includes between-raters variance in the error term, whereas the Pearson reliability estimate excludes it. This would likely make the distribution of this coefficient different from the coefficient on which Fisher based his r to z transformation.)
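The "greater than chance" criterion can be made concrete with a binomial tail probability, computed under the same independence assumption the results chapter invokes for this rough argument (citing Sakoda et al., 1954). The sketch below is illustrative and is not a reproduction of the study's own calculations.

# If each of 17 item-level tests has probability alpha of coming out significant by chance,
# how likely are c or more significant results? Independence of the tests is assumed here.
from math import comb

def prob_at_least(c, n=17, alpha=0.05):
    return sum(comb(n, k) * alpha ** k * (1 - alpha) ** (n - k) for k in range(c, n + 1))

for c in (2, 3, 7):
    print(c, round(prob_at_least(c), 5))
# Seven or more chance significances out of 17 has probability well under .001, the kind of
# figure cited in the next chapter when seven of 17 contrasts favor one format.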
Since the coefficient used to calculate the rater reliabilities was the intraclass rater reliability coefficient written in analysis of variance terms, it was possible to use the Finn MANOVA program to find the necessary components to generate the reliability estimates by hand. The necessary mean squares were obtained from a one-way raters-nested-within-instructors design with 17 dependent variables (the items) within each treatment group. Thus, since there were three treatment groups, three separate MANOVA's were necessary to generate the data for the rater reliability estimates.

The rater reliabilities and the confidence intervals were generated and tested by hand. The formula for the reliability of one average rater and the Spearman-Brown formula for the reliability of the average of k raters are:

    r_11 = (MSB - MSE) / [MSB + (k_0 - 1) MSE]    and    r_kk = k r_11 / [1 + (k - 1) r_11]

where

    MSB  = mean square between instructors
    MSE  = mean square within instructors
    k_0  = average number of ratings per instructor in the sample
         = [Σk_i - (Σk_i²)/(Σk_i)] / (n - 1)
    n    = number of instructors
    k_i  = number of ratings of each instructor
    r_11 = reliability of one average rater
    r_kk = reliability of an average of k ratings
    k    = number of ratings for which a prediction of reliability is desired

The estimate of precision, as detailed by Lindquist (1953), is found by determining the upper and lower bounds of F and substituting them into the formula for r_11, where r_11 is rewritten as r_11 = (F_0 - 1) / [F_0 + (k_0 - 1)]. The (100 - 2α)% confidence interval for r_11 becomes:

    (F_L - 1) / [F_L + (k_0 - 1)]  <  r_11  <  (F_U - 1) / [F_U + (k_0 - 1)]

where

    F_0 = MSB/MSE in the sample,  F_L = F_0 / F(α),  F_U = F_0 · F(α).

Figure 3.4 illustrates the design of the multivariate analysis of variance to be conducted for the tests of both hypotheses.

Figure 3.4.--Design of the experiment. [The figure is not legible in this reproduction.]
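These formulas can be traced numerically with a short sketch. Every input below--the class sizes, the mean squares, the tabled F(α) value, and the choice of k = 25 ratings--is invented for illustration and is not taken from the study's data; f_crit simply stands for the tabled critical value of F called for by Lindquist's procedure.

# A numerical sketch of the intraclass reliability and confidence-interval formulas above.

def k0(class_sizes):
    """k_0 = (sum k_i - (sum k_i**2)/(sum k_i)) / (n - 1), the average number of ratings."""
    n, s, s2 = len(class_sizes), sum(class_sizes), sum(k * k for k in class_sizes)
    return (s - s2 / s) / (n - 1)

def r11(msb, mse, k_0):
    """Reliability of one average rater."""
    return (msb - mse) / (msb + (k_0 - 1) * mse)

def rkk(r_1, k):
    """Spearman-Brown: reliability of the average of k ratings."""
    return k * r_1 / (1 + (k - 1) * r_1)

def r11_interval(msb, mse, k_0, f_crit):
    """(100 - 2*alpha)% interval for r_11 from F_L = F_0/F(alpha) and F_U = F_0*F(alpha)."""
    f0 = msb / mse
    to_r = lambda f: (f - 1) / (f + (k_0 - 1))
    return to_r(f0 / f_crit), to_r(f0 * f_crit)

sizes = [28, 31, 25, 40, 22]                      # hypothetical class sizes
k_0 = k0(sizes)
r_1 = r11(msb=6.0, mse=0.9, k_0=k_0)              # hypothetical mean squares
r_lo, r_hi = r11_interval(msb=6.0, mse=0.9, k_0=k_0, f_crit=1.6)
print(round(r_1, 3), round(rkk(r_1, 25), 3), (round(r_lo, 3), round(r_hi, 3)))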
Summary

Three instructional rating forms differing primarily in response cue format were developed and administered to random thirds of the classes of 23 instructors. The three response cue formats were defined as (1) fixed alternative Likert cues (SA-SD), (2) fixed alternative evaluative cues (superior-inferior), and (3) multiple choice short descriptive cues. The questions used were the first 21 questions of the MSU student instructional rating scale. The questions were in Likert format on the original scale and were used unchanged. The question wording was altered slightly on the other two forms to accommodate the evaluative and descriptive response cues. The descriptive response cues were pretested to determine the least ambiguous descriptors for the final form.

The study was designed to test the effect of the alternate response definitions on the leniency bias and rater reliability of the three forms. Leniency bias was measured by finding the closeness of each item mean to the midpoint of the rating scale. Since student ratings were overwhelmingly concentrated at the upper end of the scale, the format that gave the lowest mean was regarded as the least biased.

In addition to finding the least biased and most reliable response cue format for student rating of instruction, the study tested two claims made in the literature as they applied to the bias of lenient responding:

a. Evaluative cues are more susceptible to bias than other cues.

b. Fixed response alternatives are more susceptible to bias than descriptive multiple choice alternatives.

The test was limited to the question of leniency bias since this was the major problem with student ratings of instruction.

The hypothesis of no differences in mean ratings was tested with a two-way multivariate analysis of variance design, instructor by treatment, where the response cue formats were the three treatments and the 17 usable items were 17 dependent variables. Scheffe post hoc analyses tested alternate hypotheses that the evaluative format would produce the most lenient items, the Likert format the next most lenient, and the descriptive format the least lenient. The hypothesis of no differences in rater reliabilities was tested by comparing confidence intervals about the reliability estimate for each format of each item. Non-overlapping confidence intervals would indicate significant differences in rater reliabilities.

CHAPTER IV

RESULTS

Introduction

The study was designed to test the leniency bias-proneness and rater reliability of three response cue formats. The major reason for conducting the study was to improve an existing Likert-type student instructional rating scale. Since the content of the scale was well established in its creation, the study was focussed on manipulating the response options to reduce the amount of lenient responding present with the existing scale.

Two alternative response definitions were chosen to compare with the existing Likert format response definitions. The descriptive graphic format was chosen as the most often recommended format for reducing leniency bias. It was hypothesized that this format would produce the least lenient responses from student raters of instruction. The evaluative format was chosen as a second alternative for purposes of contrast because it was claimed to be the most bias-prone response format. It was hypothesized that the evaluative format would produce the most lenient responses, and the descriptive the least lenient responses. It was also hypothesized that the least lenient response cue format would prove to have the greatest rater reliability.

A concurrent purpose of the study was to test two claims made in the literature concerning the bias-proneness of certain response definitions. The testing of the claims as they applied to the bias of lenient responding was compatible with the major objective of the study. The claims to be tested were:

1. Evaluative cues are more susceptible to response bias than other cues.

2. Fixed response alternatives are more susceptible to bias than descriptive multiple choice alternatives.

If both of these claims were to hold true for leniency bias, the hypothesized order of response cue types, from most to least lenient, would be fixed alternative evaluative cues, fixed alternative Likert cues, and multiple choice short descriptive cues.

To conduct the study, three instructional rating forms differing primarily in response cue format were developed and administered to random thirds of the classes of 23 instructors.
Leniency bias was measured by finding the closeness of each item mean to the midpoint of the rating scale. Since student ratings were overwhelmingly concentrated at the upper end of the scale, the format that gave the lowest mean was regarded as the least biased.

The hypothesis of no differences in mean ratings (leniency bias) was tested with a two-way multivariate analysis of variance design, instructor by treatment, where the response cue formats were the three treatments and the 17 usable items were 17 dependent variables. Scheffe post hoc analyses tested alternate hypotheses that the evaluative format would produce the most lenient items, the Likert format the next most lenient, and the descriptive format the least lenient.

The hypothesis of no differences in rater reliabilities was tested by comparing confidence intervals about the reliability estimate for each item. Non-overlapping confidence intervals would indicate significant differences in rater reliabilities.

The following sections present the results concerning leniency bias and the results concerning rater reliability of the three instructional rating forms with alternate response definitions.

Results Concerning Leniency Bias

The test of Hypothesis 1 was carried out by the test of the main effect of treatment in the two-way, instructor by treatment, analysis of variance design. Hypothesis 1 stated:

H1. There are no differences in mean ratings of instructors between items with Likert, evaluative, and descriptive response cue formats.
Inspection of the individual item F tests of treatment effect indi- cated that most items were contributing to the significant overall difference in item means found between response cue formats. Such an inspection was used to indicate sources of differences, though lack of independence among the items prevented each individual F from having a known constant error. The large number of significant item F's did indicate that the effect of response cue format was present with most items and not limited to a few (Table 4.2). To summarize the results thus far, the null hypothesis of no differences in item means was rejected, establishing that there were significant differences in leniency between the three instructional rating forms. 82 TABLE 4.2.-~Univariate F tests, each dependent variable (each item). Variable Mean Square Univariate F p less than Item 1: I-enthusiasma 10.1345 20.4571 .0001 Item 2: I—interest 5.6580 10.8606 .0001 Item 3: I-examples 7.8597 9.0612 .0002 Item 4: I-concern 7.1532 8.4603 .0003 F“ Item 5: S-interestb 20.7714 21.7417 .0001 - Item 6: S-attention 22.6673 31.6897 .0001 ‘ Item 7: S-challenge 11.5925 11.6609 .0001 i Item 8: S-competence 8.1133 7.9277 .0004 4 Item 9: opinions 4.4629 6.0640 .0025 P Item 10: new ideas 11.0392 14.9030 .0001 Item 11: questions 27.4122 46.3598 .0001 ‘ Item 12: discussion 7.2542 8.5008 .0003 g; Item 17: unity of topics 2.4900 2.8058 .0610 Item 18: organization 11.2256 13.7279 .0001 Item 19: note-taking 3.9678 3.3147 .0368 Item 20: course outline 9.1462 10.4170 .0001 Item 21: enjoyment 3.4328 2.8815 .0566 a"I" stands for "Instructor." b"S" stands for "Student." Scheffe post hoc analysis of the directions of the differences between individual item means indicated that the predicted directions of differences were only partially correct. The alternate hypotheses were: H1a. The mean ratings with the evaluative format will be significantly more lenient than the mean ratings for the Likert and descriptive formats. 1b. The mean ratings with descriptive cue formats will be significantly less lenient than the mean ratings for Likert and evaluative cue formats. In order for both alternate hypotheses to be correct, item means in the three response cue formats would 83 have had to have been ordered so that the evaluative format produced the most lenient items, the Likert the next most lenient, and the descriptive the least lenient. A contrast of Likert and descriptive item formats showed that the descriptive format was in fact less lenient than the Likert format as predicted. Seven descriptive item means were significantly less lenient than their Likert format counterparts, one was significantly different in the opposite direction, and the rest were not signifi- cantly different. The probability that seven out of 17 tests would be significant by chance alone, a = .05, is less than .001 assuming independent tests, so it was con- cluded that the descriptive format cues were significantly less lenient than the Likert format response cues (Sakoda et a1., 1954). The ordering of the item means in the study data differed from the ordering predicted by the alternate hypotheses in that the evaluative format produced completely opposite results to those predicted. Instead of being the most lenient response format, it was found to be the least lenient of all the formats. 
Post hoc analysis showed that the evaluative format was significantly less lenient than the Likert format in 15 out of 17 items, and that it was significantly less lenient than the descriptive format in 10 out of 17 items. Since the majority of items produced this effect, it was concluded that the evaluative format was the least lenient format in this study. The contrasts of item means are presented in Table 4.3.

The combined results of the tests of significance of the contrasts between all item means are represented in the final column of the table by orderings with "less than" signs depicting significant differences and "equal" signs depicting non-significant differences. The items are grouped according to the factors established when the original scale was constructed in order to compare the performances of items on the same topics. Due to the

TABLE 4.3.--Contrasts of item means.

Item               L-E      D-E     L-D      Order
Factor 1: Instructor Involvement
1: I-enthusiasm    -.29*    .02     -.31*    L