AN ANALYSIS OF SOME OF THE SOURCES OF VARIATION INVOLVED IN RATING SPEECHES

By
MARGARET MARY ANDERSON

A THESIS

Submitted to the Graduate School of Michigan State College of Agriculture and Applied Science in partial fulfilment of the requirements for the degree of MASTER OF ARTS

Division of Education

1945

This is to certify that the thesis entitled "An Analysis of Some of the Sources of Variation Involved in Rating Speeches," presented by Mary Margaret Anderson, has been accepted towards fulfilment of the requirements for the M. A. degree in Education.

Major Professor

Date: September 1945

ACKNOWLEDGMENT

I wish to express my sincere appreciation to Dr. Paul L. Dressel for his interest, suggestions, and helpful guidance throughout this study.

CONTENTS

Introduction
Purposes of Study
Earlier Studies Reviewed
Procedure and Organization of Data
Conclusions
Suggestions
Bibliography

TABLES

I.   Speech Rating Scale
II.  Individual Table of Test Results
III. Room-to-Room Variation Among Student Ratings
IV.  Room-to-Room Variation Among Faculty Ratings
V.   Total Room-to-Room Variation Among Faculty Ratings
VI.  Means and Standard Deviations on Two Qualities
VII. Analysis of Variance Results for Room 145

INTRODUCTION

In connection with the Basic College Written and Spoken English course which was introduced last fall at Michigan State College, a six-hour Comprehensive Examination was given at the close of the fall quarter. A year's credit for freshman English was granted to those students who satisfactorily passed this examination, while the other students were obliged to complete their year of work in English and take another Comprehensive Examination at a later date.

Since there were some students with one and two terms of English completed under the old program, eligibility for taking this examination was automatically granted to such students, while students with only one term of English under the new program, and with entrance test scores and high school English records meeting the required standards, were granted permission by the Dean upon the recommendation of their counselor or instructor. Of the one hundred and sixty-nine students who took the examination, one hundred and twenty were given full credit for the course.

Speech being one of the important phases of freshman English, the second half of the first test session was devoted to preparing and delivering a two-minute speech on some aspect of the library. To obtain a random grouping in each of the eight rooms where the speeches were to be given, a card containing a room assignment and a speech number was handed to each student as he left the earlier session. Twenty-one such cards had been made previously for each of the eight rooms. A list of suggested topics was distributed as the session began, and the students were allowed twenty minutes to prepare their speeches.
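The card-dealing scheme described above amounts to a simple random assignment of speakers to rooms and speaking positions. The following sketch is illustrative only: the room labels and student identifiers are hypothetical, and the original assignment was, of course, done by hand.

```python
import random

ROOMS = ["A", "B", "C", "D", "E", "F", "G", "H"]  # hypothetical labels for the eight rooms
CARDS_PER_ROOM = 21

# One card per (room, speech number) pair, shuffled and then dealt to the
# students in the order in which they leave the first test session.
cards = [(room, number) for room in ROOMS for number in range(1, CARDS_PER_ROOM + 1)]
random.shuffle(cards)

students = [f"student_{i:03d}" for i in range(1, len(cards) + 1)]
assignments = dict(zip(students, cards))

print(assignments["student_001"])  # (room, speech number) for the first student to leave
```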
Three instructors were assigned to act as faculty raters in each of the eight rooms, and this rating was used to decide the student's grade on the speech. Each student was also rated by every other student, but on only one quality, the five qualities being taken in successive order, thus giving approximately four different ratings per quality per room.

Purposes of Study

The purposes of this study are (1) to compare the faculty and student ratings; (2) to study the independence of the five qualities used in rating and the ten-point scale; (3) to determine the reliability of the number of raters; (4) to determine the major sources of variance in ratings; and (5) to make possible suggestions for the improvement of speech testing.

EARLIER STUDIES REVIEWED

Studies of speaking skill in which the participants are brought into the speaking situation are far from numerous. Nichols (7, pp. 385-391) developed a written test which appears to correlate more closely with oral performance than other written tests, but it was designed more for courses in which the knowledge and application of the principles of speech were the main objectives and no speeches were given. As yet, it is only in the experimental stage as far as actual results are concerned.

Thompson (12, pp. 87-91) realized that the accuracy of judgment could be increased by (1) a panel of raters, (2) a training program for increasing raters' skill, and (3) a better yardstick for measuring speaking skill. Turning his attention particularly to the improvement of the third item, he conducted three experiments to determine the accuracy of various rating techniques, with the following general conclusions:

1. The grading system and the linear system are approximately equal in accuracy, with the slight margin apparently in favor of the linear scale but not statistically significant. Nine different letter grades were used here, however, and the linear scale included nine points (0-8).

2. Comparing the use of letter grades and the Bryan-Wilke Scale, each technique was used with an approximately equal degree of accuracy, although the letter system is more practical because of its simplicity.

3. The paired-comparisons method of evaluating speaking skill is superior to the rank-order method and should be used when the problem is one of ranking speakers. Because the ratings must be made after all the speeches have been delivered, this method is limited to small groups.

Experiments have brought various results concerning the number of points used, the number of raters, and the types of ratings. Guilford (4, pp. 263-283) made the general statement that the number of points used on the scale depends upon the raters, their ability to discriminate, and their motivation in making the ratings. Conklin (1) found that for untrained persons a maximum of five points should be used, while Symonds (11, pp. 456-461) states that seven is the optimal number for greatest reliability. Rugg (9, pp. 425-438) states that pooled ratings of not less than three independent judges should be used, while Symonds (11, pp. 456-461) demands at least eight. Much depends, of course, on the particular trait and the manner of securing ratings. Symonds (11, pp. 456-461) concludes that the results of ratings are as reliable as those obtained from the ranking method, and Conklin and Sutherland (2, pp. 44-57) found ratings were less variable from one judge to another than were rankings.
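Why pooling several independent judges improves accuracy can be illustrated with a small simulation. The sketch below is not part of the original study; the "true ability" scores, the size of the rating error, and the rater counts are assumed purely for illustration.

```python
import random
import statistics  # statistics.correlation requires Python 3.10 or later

def pooled_rating_accuracy(n_speakers=200, max_raters=8, error_sd=1.5, seed=1):
    """Correlation between speakers' assumed 'true' ability and the mean of
    k raters' scores, for k = 1 .. max_raters. Each rater's score is modeled
    as true ability plus independent random error."""
    rng = random.Random(seed)
    ability = [rng.gauss(6.0, 1.0) for _ in range(n_speakers)]
    ratings = [[a + rng.gauss(0.0, error_sd) for a in ability] for _ in range(max_raters)]
    results = {}
    for k in range(1, max_raters + 1):
        pooled = [statistics.mean(ratings[r][i] for r in range(k)) for i in range(n_speakers)]
        results[k] = statistics.correlation(ability, pooled)
    return results

if __name__ == "__main__":
    for k, r in pooled_rating_accuracy().items():
        print(f"{k} rater(s): correlation with true ability = {r:.2f}")
```

Under these assumptions the correlation rises steeply from one rater to three and more slowly thereafter, which is consistent with recommendations such as Rugg's minimum of three judges.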
PROCEDURE AND ORGANIZATION OF DATA

The speakers, identified by number only, were rated by three raters on five qualities according to the scale given below:

Table I. Speech Rating Scale

Points on which speaker is rated: 10 (high)  9  8  7  6  5  4  3  2  1 (low)

Qualities rated:
    Physical Control
    Vocal Control
    Point (Controlling Idea or Theme Sentence)
    Sense of Communication
    Achievement of Purpose (development of point: specific, appropriate, interesting, relevant)

Although the main topics for each room were the same, the judges and students were not. As mentioned above, the students also rated every other student, but only on one quality at a time, the five qualities being considered in successive order. Comparisons between the two groups of raters were thus based on a single quality for each student and not on the total score. Medians were computed and used for comparison as well as for giving the students a numerical score. Means for each group were used in studying room-to-room variation among the qualities for both faculty and student results and, together with the standard deviations, gave an indication of the relationship between the standards of the two rating groups. Correlations between student and faculty ratings for each room were also computed.

The analysis of variance involved setting up an individual table for each of the one hundred and sixty-nine students. An example appears below:

Table II. Individual Table of Test Results

                         Qualities
Raters        1      2      3      4      5     Totals
  1           8      5      7      7      7       34
  2           5      6      6      5      5       27
  3           5      4      4      2      2       17
Totals       18     15     17     14     14       78

This shows the scores for one student on all five qualities as given by the three raters. By combining these tables within each of the eight rooms and computing both the variances and the interactions, tables similar to Table VII were set up. It is from these tables that the analysis of the sources of variation is found.

CONCLUSIONS

1. Room-to-Room Variation Among Faculty and Student Ratings:

Since the students were chosen at random for each of the eight rooms, there was reason to expect that a comparison of their average ratings among the eight rooms would reveal no significant differences. The results appear in Table III and Table IV, with the right-hand column indicating the level of significance or non-significance, as the case may be.

Table III. Room-to-Room Variation Among Student Ratings

Quality                  Source of      Sum of Squares   Degrees of   Mean Square   Significance
                         Variance       of Deviations    Freedom      Deviation
Physical Control         Within Rooms        36.00           31          1.16
                         Among Rooms         11.90            7          1.70         N.S.
                         Total               47.90           38
Vocal Control            Within Rooms        20.33           25           .81
                         Among Rooms         13.73            7          1.96         5 per cent
                         Total               34.06           32
Point                    Within Rooms        33.38           24          1.39
                         Among Rooms         17.10            7          2.44         N.S.
                         Total               50.48           31
Communication            Within Rooms        40.11           23          1.74
                         Among Rooms         11.74            7          1.68         N.S.
                         Total               51.85           30
Achievement of Purpose   Within Rooms        31.19           24          1.28
                         Among Rooms          2.05            7           .29         N.S.
                         Total               33.24           31

(N.S. denotes a non-significant difference; "5 per cent" denotes a difference significant at the 5 per cent level.)

Table IV. Room-to-Room Variation Among Faculty Ratings

Quality                  Source of      Sum of Squares   Degrees of   Mean Square   Significance
                         Variance       of Deviations    Freedom      Deviation
Physical Control         Within Rooms        39.60           32          1.24
                         Among Rooms         30.18            7          4.11         1 per cent
                         Total               69.78           39
Vocal Control            Within Rooms        72.00           25          2.88
                         Among Rooms         49.64            7          7.09         5 per cent
                         Total              121.64           32
Point                    Within Rooms        40.25           24          1.68
                         Among Rooms         31.22            7          4.76         5 per cent
                         Total               71.47           31
Communication            Within Rooms        61.25           24          2.55
                         Among Rooms         36.97            7          5.28         N.S.
                         Total               98.22           31
Achievement of Purpose   Within Rooms        41.75           24          1.74
                         Among Rooms         17.72            7          2.53         N.S.
                         Total               59.72           31

(N.S. denotes a non-significant difference; "1 per cent" and "5 per cent" denote differences significant at those levels.)
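The within-rooms and among-rooms entries in Tables III and IV (and in Table V below) follow the usual one-way breakdown of a set of ratings grouped by room. The sketch below is an illustrative reconstruction of that computation, not the author's original worksheet, and the sample data are hypothetical.

```python
import statistics

def room_to_room_breakdown(ratings_by_room):
    """Within-rooms and among-rooms sums of squares, degrees of freedom,
    and mean square deviations for ratings grouped by room."""
    all_ratings = [x for room in ratings_by_room for x in room]
    grand_mean = statistics.mean(all_ratings)

    ss_within = sum((x - statistics.mean(room)) ** 2
                    for room in ratings_by_room for x in room)
    ss_among = sum(len(room) * (statistics.mean(room) - grand_mean) ** 2
                   for room in ratings_by_room)
    df_within = len(all_ratings) - len(ratings_by_room)
    df_among = len(ratings_by_room) - 1

    return {
        "Within Rooms": (ss_within, df_within, ss_within / df_within),
        "Among Rooms": (ss_among, df_among, ss_among / df_among),
        "Total": (ss_within + ss_among, df_within + df_among, None),
    }

if __name__ == "__main__":
    # Hypothetical ratings, one list per room.
    rooms = [[6.2, 7.1, 6.8, 7.4], [7.6, 6.7, 7.0, 7.3], [5.9, 6.4, 6.1, 6.6]]
    for source, (ss, df, msd) in room_to_room_breakdown(rooms).items():
        msd_text = f"{msd:.2f}" if msd is not None else "-"
        print(f"{source}: SS = {ss:.2f}, df = {df}, mean square = {msd_text}")
```

An among-rooms mean square much larger than the within-rooms mean square is what signals the room-to-room differences discussed below.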
From these tables, it appears that the groups of students in the various rooms had more nearly uniform grading standards than did the faculty. Although the "among rooms" variance for the faculty raters is significant in only three of the five qualities, it is noticeable in every case that this variance is considerably greater than the "within rooms" variance. In other words, the students were more in agreement as to the rating a speaker should get on these qualities, while the faculty varied in their judgments. A further study of each room gave no evidence that the faculty raters in any particular room caused this great variation.

2. Total Room-to-Room Variation Among Faculty Ratings:

It had been planned to combine the faculty ratings and assign grades on the basis of all one hundred and sixty-nine ratings, but when an analysis of the room-to-room variation of the faculty ratings, given in Table V, indicated a significant excess of the variance among rooms over the variance within a room, it became necessary to make grade assignments separately from the distributions within each room.

Table V. Total Room-to-Room Variation Among Faculty Ratings

Source of Variance   Sum of Squares of Deviations   Degrees of Freedom   Mean Square Deviation
Within Rooms                  5794.87                      160                  36.22
Among Rooms                   2208.28                        7                 315.47
Total                         8003.15                      167

This large variation indicated that a student's luck in drawing a room assignment was more important than giving a good speech.

3. Comparison of Means and Standard Deviations:

Students rated the speeches higher than did the faculty in most cases, as exemplified by Table VI. Here we have included the averages for only two qualities, Physical Control and Vocal Control, but the other three show similar results.

Table VI. Means and Standard Deviations on Two Qualities

                   Physical Control                          Vocal Control
           Means               Standard Deviations    Means               Standard Deviations
Room   Student   Faculty      Student   Faculty      Student   Faculty   Student   Faculty
120      6.21      6.74         .77       .75          7.17      7.32       .19       .90
124      7.62      6.66        1.10      1.11          7.32      6.13      1.28      2.26
125      7.71      7.30        1.43      1.22          8.34      7.32       .33       .75
128      7.81      6.97         .39       .57          6.39      4.66       .57       .86
140      7.16      7.33         .81       .88          7.02      6.24       .69      1.65
144      7.65      5.39         .58       .68          7.10      4.72       .65       .46
145      7.66      7.26         .39       .90          7.41      6.57       .45      1.46
146      7.51      7.66         .93      1.11          8.29      7.84       .88       .83

Although the amount of difference between the means of the two groups varies, the greatest difference for all five qualities appears in Room 144. Two out of the three raters in this room were speech instructors who had not participated in the teaching of the English course. Since the variance among rooms is no more than a measure of the variation among the room means, a comparison of the range of faculty and student means in the above table bears out the significant results obtained in Tables III and IV.

The average deviations from the mean within each room, as measured by the standard deviation, vary from room to room for both students and faculty, but in most cases the faculty deviations are the larger. Hence, the faculty not only rated the speeches lower on the average but also showed greater variation in their ratings. Large variation is generally desirable since it results from finer discrimination in the quality measured.
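The entries of Table VI are simply the mean and standard deviation of each rating group's scores within each room. A minimal sketch of that computation follows; the room numbers and ratings here are hypothetical, not the study's data, and statistics.stdev gives the sample form of the standard deviation (the thesis does not state which form was used).

```python
import statistics

def mean_and_sd(ratings_by_room):
    """Mean and standard deviation of the ratings given in each room."""
    return {room: (statistics.mean(values), statistics.stdev(values))
            for room, values in ratings_by_room.items()}

if __name__ == "__main__":
    # Hypothetical Physical Control ratings, keyed by room number.
    student = {"120": [6.0, 6.5, 6.2, 6.1, 6.3], "124": [7.4, 7.9, 7.5, 7.7, 7.6]}
    faculty = {"120": [6.9, 6.6, 6.7], "124": [6.4, 6.9, 6.7]}
    student_stats, faculty_stats = mean_and_sd(student), mean_and_sd(faculty)
    for room in student:
        s_mean, s_sd = student_stats[room]
        f_mean, f_sd = faculty_stats[room]
        print(f"Room {room}: student mean {s_mean:.2f} (SD {s_sd:.2f}), "
              f"faculty mean {f_mean:.2f} (SD {f_sd:.2f})")
```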
4. Correlation between Room Means:

The correlations between faculty and student ratings range from .49 to .86, with most of them being above .60. Here it was necessary to pair the mean of twenty student ratings on Quality 1 with the mean of three faculty ratings on Quality 1, and so on for all the students, making certain that the ratings were for the same quality and the same student in each case. With correlations of this size, it appears that the two groups of raters tended to agree on the relative standing of the speakers, even though the students were consistently rating higher than the faculty.

5. Reliability of a Rater:

Although we would have liked to have a satisfactory method for computing the reliability of a rater, this seems impossible with the present study, since identical speeches were not and never could be given. By means of correlations, the relationship between the ratings of three raters and one rater, and between the ratings of three raters and two raters, was computed. The three-to-one comparison gave correlations from .55 to .74, and the three-to-two comparison gave correlations ranging from .79 to .88. These ranges do not include Room 144, where the results were quite different from the other rooms. This method is based on the assumption that if two raters correlated very highly with three raters, it would be useless to use three raters. From the results, however, it is clear that two raters are better than one, but that two do not correlate highly enough with three raters to warrant accepting the hypothesis and using only two raters.

6. Analysis of Variance Results:

In order to investigate the sources of variation leading to the discrepancy in the various ratings, an analysis of variance technique was employed, and the computed results for each room were set up in tabular form as shown in Table VII. Details of the computation, analysis, and test of significance are not given here but may be found in such references as Rider (8, pp. 117-161) and Snedecor (10, pp. 179-248). The sum of squares of deviations divided by the number of degrees of freedom (for each main category, one less than the number of persons or qualities involved) gives the mean square deviation for each category.

Table VII. Analysis of Variance Results for Room 145

Source                             Sum of Squares    Degrees of    Mean Square
                                   of Deviations     Freedom       Deviation
Raters                                 468.25              2         234.125
Students                               236.55             20          11.827
Qualities                               34.73              4           8.682
Raters x Qualities                      28.45              8           3.556
Students x Qualities                   230.34             80           2.879
Students x Raters                      401.89             40          10.047
Students x Raters x Qualities          272.08            160           1.701

Ideally, it would seem that the variance should be spread about as follows:

1. Low variance among the Raters would exist if they were in agreement on the various ratings.

2. Large variance among the Students would show that the raters were recognizing the differences in ability and ranking the students accordingly.

3. Low variance among the Qualities would result from the fact that quality variations would be eliminated in averaging over a large group of students.

4. Low variance should exist in the interaction of Qualities and Raters, to show consistency of all raters in the ratings of the five qualities.

5. The interaction of Students and Qualities should be high, since individual students would be expected to show differences on the various qualities.

6. The interaction of Students and Raters should be low, since good raters should rate each student in the same manner.

7. The interaction of Students, Raters, and Qualities should be small, since most sources of variation are already accounted for.
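The entries in Table VII come from the usual factorial breakdown of each room's students-by-raters-by-qualities score table. The sketch below is an illustrative reconstruction of the main-effect and one interaction sum of squares, not the author's original computation; the first student's scores are taken from Table II, and the remaining scores are hypothetical.

```python
import statistics
from itertools import product

def variance_breakdown(scores):
    """scores[s][r][q] = rating of student s by rater r on quality q.
    Returns sums of squares, degrees of freedom, and mean square deviations
    for the Raters, Students, and Qualities categories and the Raters x
    Qualities interaction (the remaining interactions follow the same pattern)."""
    S, R, Q = len(scores), len(scores[0]), len(scores[0][0])
    grand = statistics.mean(scores[s][r][q]
                            for s, r, q in product(range(S), range(R), range(Q)))

    rater_mean = [statistics.mean(scores[s][r][q] for s, q in product(range(S), range(Q)))
                  for r in range(R)]
    student_mean = [statistics.mean(scores[s][r][q] for r, q in product(range(R), range(Q)))
                    for s in range(S)]
    quality_mean = [statistics.mean(scores[s][r][q] for s, r in product(range(S), range(R)))
                    for q in range(Q)]
    rq_mean = [[statistics.mean(scores[s][r][q] for s in range(S)) for q in range(Q)]
               for r in range(R)]

    ss = {
        "Raters": S * Q * sum((m - grand) ** 2 for m in rater_mean),
        "Students": R * Q * sum((m - grand) ** 2 for m in student_mean),
        "Qualities": S * R * sum((m - grand) ** 2 for m in quality_mean),
        "Raters x Qualities": S * sum(
            (rq_mean[r][q] - rater_mean[r] - quality_mean[q] + grand) ** 2
            for r, q in product(range(R), range(Q))),
    }
    df = {"Raters": R - 1, "Students": S - 1, "Qualities": Q - 1,
          "Raters x Qualities": (R - 1) * (Q - 1)}
    return {k: (ss[k], df[k], ss[k] / df[k]) for k in ss}

if __name__ == "__main__":
    # First student's scores are those of Table II; the other two students are hypothetical.
    scores = [
        [[8, 5, 7, 7, 7], [5, 6, 6, 5, 5], [5, 4, 4, 2, 2]],
        [[7, 7, 8, 6, 7], [6, 5, 6, 6, 5], [4, 4, 5, 3, 3]],
        [[9, 8, 8, 7, 8], [7, 6, 7, 6, 6], [6, 5, 5, 4, 4]],
    ]
    for source, (ss_val, dof, ms) in variance_breakdown(scores).items():
        print(f"{source}: SS = {ss_val:.2f}, df = {dof}, mean square = {ms:.3f}")
```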
With this brief explanation of what we would like to find in our study, let us examine the results. In every room the amount of variance among the raters greatly exceeded that of the students and qualities, as shown by Table VII, a typical example. This is exactly what we would not expect if the raters were in agreement on standards and qualities. Although the variance among students, ranging from 7.939 to 37.375, is not large in comparison with the variance among the raters, it is significant in most cases, thereby indicating some spread among the students, but far from the amount needed to compare favorably with the variance among the raters.

The amount of variance among the qualities ranges from .587 to 31.265, with two rooms showing significant results. With a group of students selected at random, as this group was, it seems plausible that the average of all students on each quality should fall somewhere near the center of the scale; that is, result in a small amount of variance. Such favorable results were found in six of the eight rooms. It may be that the students in the other two rooms were quite different groups and should show a group average away from the center on some of the qualities, or it may be that the raters were emphasizing one quality more than another in making their ratings.

The variance due to the interaction of raters and qualities, which should ideally be low, shows a range from 1.367 to 7.631, and these amounts are significant in seven of the eight rooms. Here a tendency on the part of the rater to rate one quality high and another low is revealed. The student and quality interaction variance ranges from 1.087 to 2.879. This is significant in a majority of the rooms but still rather low, when a large amount of variance is necessary to show the expected individual differences on the various qualities. The variance due to the interaction of students and raters is highly significant in all cases, again showing that the raters did not agree well on the ratings of individual students.

These analysis of variance results may be summarized in a few general statements:

1. The variance among the raters far exceeds that among the students, although it is the latter group that should have the large spread.

2. The interaction variance among the qualities and students is not large enough to assure us that the raters were distinguishing among the five qualities.

3. The raters also show no consistent standard for rating the students on the five qualities, nor do they rank the students in the same manner.

SUGGESTIONS

1. Because of the great variance among the faculty ratings, it would seem advisable to attempt some method for increasing the raters' skill.

2. From our reliability results, the number of raters in each room should not be reduced but increased if possible. The raters should be chosen from among the instructors of the course, or at least all raters should be very clear on the standards appropriate to the course.

3. Although the experiment by Thompson (12, pp. 87-91) shows that ratings by grades and by numbers are approximately equal in accuracy, his study had nine points in each technique tested. Conklin (1) found that for untrained raters no more than five points should be included on the scale, while Symonds (11, pp. 456-461) states that seven is the optimal number for greatest reliability. Since there is a tendency not to use the two end scores, thinking that possibly some later speaker will be a little better or even worse than the extreme speaker now being rated, the customary number of divisions on the scale probably should be increased; hence the five points, corresponding to the five letter grades, probably could be increased by two without causing error. But if the scale of ten points is to be continued, the correspondence between the five letter grades ordinarily used and the ten points should be thoroughly understood by the raters.
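To illustrate the kind of correspondence the last suggestion calls for, one possible mapping from the ten-point scale to five letter grades is sketched below. The cut-off points and the lowest grade letter are assumed for illustration only and are not taken from the study.

```python
def letter_grade(points):
    """Map a score on the ten-point speech scale to one of five letter grades.
    The cut-offs here are hypothetical."""
    if points >= 9:
        return "A"
    if points >= 7:
        return "B"
    if points >= 5:
        return "C"
    if points >= 3:
        return "D"
    return "F"

if __name__ == "__main__":
    for p in range(10, 0, -1):
        print(p, letter_grade(p))
```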
4. The five qualities do not seem to have identical meanings to all raters, so a more complete explanation of each quality, and possibly a revision of the list, might lower this variance.

5. As pointed out by Thompson (12, pp. 87-91), judges' evaluations and interpretations are bound to differ somewhat, but both techniques and qualities can be controlled to lessen the difference.

BIBLIOGRAPHY

1. Conklin, E. S., "The Scale of Values Method for Studies in Genetic Psychology," University of Oregon Publication, Vol. II (1923), No. 1.

2. Conklin, E. S., and J. W. Sutherland, "A Comparison of the Scale of Values Method with the Order of Merit Method," Journal of Experimental Psychology, Vol. VI (1923), pp. 44-57.

3. Guilford, J. P., Fundamental Statistics in Psychology and Education, McGraw-Hill Book Company, New York, 1942, pp. 273-284.

4. Guilford, J. P., Psychometric Methods, McGraw-Hill Book Company, New York, 1936, pp. 263-283.

5. Lindquist, E. F., Statistical Analysis in Educational Research, Houghton Mifflin Company, New York, 1940, pp. 173-179.

6. Newcomb, T., "An Experiment Designed to Test the Validity of a Rating Technique," Journal of Educational Psychology, Vol. XXII (1931), pp. 279-289.

7. Nichols, Ralph G., "Case Method of Speech Examination," Quarterly Journal of Speech, Vol. XXVII (1941), pp. 385-391.

8. Rider, P. R., An Introduction to Modern Statistical Methods, John Wiley and Sons, Inc., New York, 1939, pp. 117-161.

9. Rugg, H. O., "Is the Rating of Human Character Practicable?" Journal of Educational Psychology, Vol. XII (1921), pp. 425-438, 485-501.

10. Snedecor, George W., Statistical Methods, Collegiate Press, Inc., Ames, Iowa, 1938, pp. 179-248.

11. Symonds, P. M., "On the Loss of Reliability in Ratings Due to Coarseness of the Scale," Journal of Experimental Psychology, Vol. VII (1924), pp. 456-461.

12. Thompson, Wayne, "Is There a Yardstick for Measuring Speaking Skill?" Quarterly Journal of Speech, Vol. XXIX (1943), pp. 87-91.