AN ANALYSIS OF SOME OF THE SOURCES OF VARIATION INVOLVED IN RATING SPEECHES

By
MARGARET MARY ANDERSON

A THESIS

Submitted to the Graduate School of Michigan State College of Agriculture and Applied Science in partial fulfilment of the requirements for the degree of MASTER OF ARTS

Division of Education

1945

This is to certify that the thesis entitled "An Analysis of Some of the Sources of Variation Involved in Rating Speeches," presented by Mary Margaret Anderson, has been accepted towards fulfilment of the requirements for the M. A. degree in Education.

Major Professor

Date: September 1945

ACKNOWLEDGMENT

I wish to express my sincere appreciation to Dr. Paul L. Dressel for his interest, suggestions, and helpful guidance throughout this study.

CONTENTS

Introduction
Purposes of Study
Earlier Studies Reviewed
Procedure and Organization of Data
Conclusions
Suggestions
Bibliography

TABLES

I.   Speech Rating Scale
II.  Individual Table of Test Results
III. Room-to-Room Variation Among Student Ratings
IV.  Room-to-Room Variation Among Faculty Ratings
V.   Total Room-to-Room Variation Among Faculty Ratings
VI.  Means and Standard Deviations on Two Qualities
VII. Analysis of Variance Results for Room 145

INTRODUCTION

In connection with the Basic College Written and Spoken English course which was introduced last fall at Michigan State College, a six-hour Comprehensive Examination was given at the close of the fall quarter. A year's credit for freshman English was granted to those students who satisfactorily passed this examination, while the other students were obliged to complete their year of work in English and take another Comprehensive Examination at a later date.

Since there were some students with one and two terms of English completed under the old program, eligibility for taking this examination was automatically granted to such students, while students with only one term of English under the new program, and with entrance test scores and high school English records meeting the required standards, were granted permission by the Dean upon the recommendation of their counselor or instructor. Of the one hundred and sixty-nine students who took the examination, one hundred and twenty were given full credit for the course.

Speech being one of the important phases of freshman English, the second half of the first test session was devoted to preparing and delivering a two-minute speech on some aspect of the library. To obtain a random grouping in each of the eight rooms where the speeches were to be given, a card containing a room assignment and a speech number was handed to each student as he left the earlier session. Twenty-one such cards had been made previously for each of the eight rooms. A list of suggested topics was distributed as the session began, and the students were allowed twenty minutes to prepare their speeches.
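The card-dealing scheme described above amounts to a simple random assignment of speakers to rooms and speaking positions. The following sketch is illustrative only: the room labels and student identifiers are hypothetical, and the original assignment was, of course, done by hand.

```python
import random

ROOMS = ["A", "B", "C", "D", "E", "F", "G", "H"]  # hypothetical labels for the eight rooms
CARDS_PER_ROOM = 21

# One card per (room, speech number) pair, shuffled and then dealt to the
# students in the order in which they leave the first test session.
cards = [(room, number) for room in ROOMS for number in range(1, CARDS_PER_ROOM + 1)]
random.shuffle(cards)

students = [f"student_{i:03d}" for i in range(1, len(cards) + 1)]
assignments = dict(zip(students, cards))

print(assignments["student_001"])  # (room, speech number) for the first student to leave
```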
Three instructors were assigned to act as faculty raters in each of the eight rooms, and this rating was used to decide the student's grade on the speech. Each student was also rated by every other student, but on only one quality, the five qualities being taken in successive order, thus giving approximately four different ratings per quality per room.

Purposes of Study

The purposes of this study are (1) to compare the faculty and student ratings; (2) to study the independence of the five qualities used in rating and the ten-point scale; (3) to determine the reliability of the number of raters; (4) to determine the major sources of variance in ratings; and (5) to make possible suggestions for the improvement of speech testing.

EARLIER STUDIES REVIEWED

Studies of speaking skill in which the participants are brought into the speaking situation are far from numerous. Nichols (7, pp. 385-391) developed a written test which appears to correlate more closely with oral performance than other written tests, but it was designed more for courses in which the knowledge and application of the principles of speech were the main objectives and no speeches were given. As yet, it is only in the experimental stage as far as actual results are concerned.

Thompson (12, pp. 87-91) realized that the accuracy of judgment could be increased by (1) a panel of raters, (2) a training program for increasing raters' skill, and (3) a better yardstick for measuring speaking skill. Turning his attention particularly to the improvement of the third item, he conducted three experiments to determine the accuracy of various rating techniques, with the following general conclusions:

1. The grading system and the linear system are approximately equal in accuracy, with the slight margin apparently in favor of the linear scale but not statistically significant. Nine different letter grades were used here, however, and the linear scale included nine points (0-8).

2. Comparing the use of letter grades and the Bryan-Wilke Scale, each technique was used with an approximately equal degree of accuracy, although the letter system is more practical because of its simplicity.

3. The paired-comparisons method of evaluating speaking skill is superior to the rank-order method and should be used when the problem is one of ranking speakers. Because the ratings must be made after all the speeches have been delivered, this method is limited to small groups.

Experiments have brought various results concerning the number of points used, the number of raters, and the types of ratings. Guilford (4, pp. 263-283) made the general statement that the number of points used on the scale depends upon the raters, their ability to discriminate, and their motivation in making the ratings. Conklin (1) found that for untrained persons a maximum of five points should be used, while Symonds (11, pp. 456-461) states that seven is the optimal number for greatest reliability. Rugg (9, pp. 425-438) states that pooled ratings of not less than three independent judges should be used, while Symonds (11, pp. 456-461) demands at least eight. Much depends, of course, on the particular trait and the manner of securing ratings. Symonds (11, pp. 456-461) concludes that the results of ratings are as reliable as those obtained from the ranking method, and Conklin and Sutherland (2, pp. 44-57) found ratings were less variable from one judge to another than were rankings.
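Why pooling several independent judges improves accuracy can be illustrated with a small simulation. The sketch below is not part of the original study; the "true ability" scores, the size of the rating error, and the rater counts are assumed purely for illustration.

```python
import random
import statistics  # statistics.correlation requires Python 3.10 or later

def pooled_rating_accuracy(n_speakers=200, max_raters=8, error_sd=1.5, seed=1):
    """Correlation between speakers' assumed 'true' ability and the mean of
    k raters' scores, for k = 1 .. max_raters. Each rater's score is modeled
    as true ability plus independent random error."""
    rng = random.Random(seed)
    ability = [rng.gauss(6.0, 1.0) for _ in range(n_speakers)]
    ratings = [[a + rng.gauss(0.0, error_sd) for a in ability] for _ in range(max_raters)]
    results = {}
    for k in range(1, max_raters + 1):
        pooled = [statistics.mean(ratings[r][i] for r in range(k)) for i in range(n_speakers)]
        results[k] = statistics.correlation(ability, pooled)
    return results

if __name__ == "__main__":
    for k, r in pooled_rating_accuracy().items():
        print(f"{k} rater(s): correlation with true ability = {r:.2f}")
```

Under these assumptions the correlation rises steeply from one rater to three and more slowly thereafter, which is consistent with recommendations such as Rugg's minimum of three judges.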
PROCEDURE AND ORGANIZATION OF DATA

The speakers, identified by number only, were rated by three raters on five qualities according to the scale given below:

Table I. Speech Rating Scale

Points on which speaker is rated: 10 (high)  9  8  7  6  5  4  3  2  1 (low)

Qualities rated:
    Physical Control
    Vocal Control
    Point (Controlling Idea or Theme Sentence)
    Sense of Communication
    Achievement of Purpose (development of point: specific, appropriate, interesting, relevant)

Although the main topics for each room were the same, the judges and students were not. As mentioned above, the students also rated every other student, but only on one quality at a time, the five qualities being considered in successive order. Comparisons between the two groups of raters were thus based on a single quality for each student and not on the total score. Medians were computed and used for comparison as well as for giving the students a numerical score. Means for each group were used in studying room-to-room variation among the qualities for both faculty and student results and, together with the standard deviations, gave an indication of the relationship between the standards of the two rating groups. Correlations between student and faculty ratings for each room were also computed.

The analysis of variance involved setting up an individual table for each of the one hundred and sixty-nine students. An example appears below:

Table II. Individual Table of Test Results

                         Qualities
Raters        1      2      3      4      5     Totals
  1           8      5      7      7      7       34
  2           5      6      6      5      5       27
  3           5      4      4      2      2       17
Totals       18     15     17     14     14       78

This shows the scores for one student on all five qualities as given by the three raters. By combining these tables within each of the eight rooms and computing both the variances and the interactions, tables similar to Table VII were set up. It is from these tables that the analysis of the sources of variation is found.

CONCLUSIONS

1. Room-to-Room Variation Among Faculty and Student Ratings:

Since the students were chosen at random for each of the eight rooms, there was reason to expect that a comparison of their average ratings among the eight rooms would reveal no significant differences. The results appear in Table III and Table IV, with the right-hand column indicating the level of significance or non-significance, as the case may be.

Table III. Room-to-Room Variation Among Student Ratings

Quality                  Source of      Sum of Squares   Degrees of   Mean Square   Significance
                         Variance       of Deviations    Freedom      Deviation
Physical Control         Within Rooms        36.00           31          1.16
                         Among Rooms         11.90            7          1.70         N.S.
                         Total               47.90           38
Vocal Control            Within Rooms        20.33           25           .81
                         Among Rooms         13.73            7          1.96         5 per cent
                         Total               34.06           32
Point                    Within Rooms        33.38           24          1.39
                         Among Rooms         17.10            7          2.44         N.S.
                         Total               50.48           31
Communication            Within Rooms        40.11           23          1.74
                         Among Rooms         11.74            7          1.68         N.S.
                         Total               51.85           30
Achievement of Purpose   Within Rooms        31.19           24          1.28
                         Among Rooms          2.05            7           .29         N.S.
                         Total               33.24           31

(N.S. denotes a non-significant difference; "5 per cent" denotes a difference significant at the 5 per cent level.)

Table IV. Room-to-Room Variation Among Faculty Ratings

Quality                  Source of      Sum of Squares   Degrees of   Mean Square   Significance
                         Variance       of Deviations    Freedom      Deviation
Physical Control         Within Rooms        39.60           32          1.24
                         Among Rooms         30.18            7          4.11         1 per cent
                         Total               69.78           39
Vocal Control            Within Rooms        72.00           25          2.88
                         Among Rooms         49.64            7          7.09         5 per cent
                         Total              121.64           32
Point                    Within Rooms        40.25           24          1.68
                         Among Rooms         31.22            7          4.76         5 per cent
                         Total               71.47           31
Communication            Within Rooms        61.25           24          2.55
                         Among Rooms         36.97            7          5.28         N.S.
                         Total               98.22           31
Achievement of Purpose   Within Rooms        41.75           24          1.74
                         Among Rooms         17.72            7          2.53         N.S.
                         Total               59.72           31

(N.S. denotes a non-significant difference; "1 per cent" and "5 per cent" denote differences significant at those levels.)
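The within-rooms and among-rooms entries in Tables III and IV (and in Table V below) follow the usual one-way breakdown of a set of ratings grouped by room. The sketch below is an illustrative reconstruction of that computation, not the author's original worksheet, and the sample data are hypothetical.

```python
import statistics

def room_to_room_breakdown(ratings_by_room):
    """Within-rooms and among-rooms sums of squares, degrees of freedom,
    and mean square deviations for ratings grouped by room."""
    all_ratings = [x for room in ratings_by_room for x in room]
    grand_mean = statistics.mean(all_ratings)

    ss_within = sum((x - statistics.mean(room)) ** 2
                    for room in ratings_by_room for x in room)
    ss_among = sum(len(room) * (statistics.mean(room) - grand_mean) ** 2
                   for room in ratings_by_room)
    df_within = len(all_ratings) - len(ratings_by_room)
    df_among = len(ratings_by_room) - 1

    return {
        "Within Rooms": (ss_within, df_within, ss_within / df_within),
        "Among Rooms": (ss_among, df_among, ss_among / df_among),
        "Total": (ss_within + ss_among, df_within + df_among, None),
    }

if __name__ == "__main__":
    # Hypothetical ratings, one list per room.
    rooms = [[6.2, 7.1, 6.8, 7.4], [7.6, 6.7, 7.0, 7.3], [5.9, 6.4, 6.1, 6.6]]
    for source, (ss, df, msd) in room_to_room_breakdown(rooms).items():
        msd_text = f"{msd:.2f}" if msd is not None else "-"
        print(f"{source}: SS = {ss:.2f}, df = {df}, mean square = {msd_text}")
```

An among-rooms mean square much larger than the within-rooms mean square is what signals the room-to-room differences discussed below.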
From these tables, it appears that the groups of students in the various rooms had more nearly uniform grading standards than did the faculty. Although the "among rooms" variance for the faculty raters is significant in only three of the five qualities, it is noticeable in every case that this variance is considerably greater than the "within rooms" variance. In other words, the students were more in agreement as to the rating a speaker should get on these qualities, while the faculty varied in their judgments. A further study of each room gave no evidence that the faculty raters in any particular room caused this great variation.

2. Total Room-to-Room Variation Among Faculty Ratings:

It had been planned to combine the faculty ratings and assign grades on the basis of all one hundred and sixty-nine ratings, but when an analysis of the room-to-room variation of the faculty ratings, given in Table V, indicated a significant excess of the variance among rooms over the variance within a room, it became necessary to make grade assignments separately from the distributions within each room.

Table V. Total Room-to-Room Variation Among Faculty Ratings

Source of Variance   Sum of Squares of Deviations   Degrees of Freedom   Mean Square Deviation
Within Rooms                  5794.87                      160                  36.22
Among Rooms                   2208.28                        7                 315.47
Total                         8003.15                      167

This large variation indicated that a student's luck in drawing a room assignment was more important than giving a good speech.

3. Comparison of Means and Standard Deviations:

Students rated the speeches higher than did the faculty in most cases, as exemplified by Table VI. Here we have included the averages for only two qualities, Physical Control and Vocal Control, but the other three show similar results.

Table VI. Means and Standard Deviations on Two Qualities

                   Physical Control                          Vocal Control
           Means               Standard Deviations    Means               Standard Deviations
Room   Student   Faculty      Student   Faculty      Student   Faculty   Student   Faculty
120      6.21      6.74         .77       .75          7.17      7.32       .19       .90
124      7.62      6.66        1.10      1.11          7.32      6.13      1.28      2.26
125      7.71      7.30        1.43      1.22          8.34      7.32       .33       .75
128      7.81      6.97         .39       .57          6.39      4.66       .57       .86
140      7.16      7.33         .81       .88          7.02      6.24       .69      1.65
144      7.65      5.39         .58       .68          7.10      4.72       .65       .46
145      7.66      7.26         .39       .90          7.41      6.57       .45      1.46
146      7.51      7.66         .93      1.11          8.29      7.84       .88       .83

Although the amount of difference between the means of the two groups varies, the greatest difference for all five qualities appears in Room 144. Two out of the three raters in this room were speech instructors who had not participated in the teaching of the English course. Since the variance among rooms is no more than a measure of the variation among the room means, a comparison of the range of faculty and student means in the above table bears out the significant results obtained in Tables III and IV.

The average deviations from the mean within each room, as measured by the standard deviation, vary from room to room for both students and faculty, but in most cases the faculty deviations are the larger. Hence, the faculty not only rated the speeches lower on the average but also showed greater variation in their ratings. Large variation is generally desirable since it results from finer discrimination in the quality measured.
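The entries of Table VI are simply the mean and standard deviation of each rating group's scores within each room. A minimal sketch of that computation follows; the room numbers and ratings here are hypothetical, not the study's data, and statistics.stdev gives the sample form of the standard deviation (the thesis does not state which form was used).

```python
import statistics

def mean_and_sd(ratings_by_room):
    """Mean and standard deviation of the ratings given in each room."""
    return {room: (statistics.mean(values), statistics.stdev(values))
            for room, values in ratings_by_room.items()}

if __name__ == "__main__":
    # Hypothetical Physical Control ratings, keyed by room number.
    student = {"120": [6.0, 6.5, 6.2, 6.1, 6.3], "124": [7.4, 7.9, 7.5, 7.7, 7.6]}
    faculty = {"120": [6.9, 6.6, 6.7], "124": [6.4, 6.9, 6.7]}
    student_stats, faculty_stats = mean_and_sd(student), mean_and_sd(faculty)
    for room in student:
        s_mean, s_sd = student_stats[room]
        f_mean, f_sd = faculty_stats[room]
        print(f"Room {room}: student mean {s_mean:.2f} (SD {s_sd:.2f}), "
              f"faculty mean {f_mean:.2f} (SD {f_sd:.2f})")
```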
4. Correlation between Room Means:

The correlations between faculty and student ratings range from .49 to .86, with most of them being above .60. Here it was necessary to pair the mean of twenty student ratings on Quality 1 with the mean of three faculty ratings on Quality 1, and so on for all the students, making certain that the ratings were for the same quality and the same student in each case. With correlations of this size, it appears that the two groups of raters tended to agree on the relative standing of the speakers, even though the students were consistently rating higher than the faculty.

5. Reliability of a Rater:

Although we would have liked to have a satisfactory method for computing the reliability of a rater, this seems impossible with the present study, since identical speeches were not and never could be given. By means of correlations, the relationship between the ratings of three raters and one rater, and between the ratings of three raters and two raters, was computed. The three-to-one comparison gave correlations from .55 to .74, and the three-to-two comparison gave correlations ranging from .79 to .88. These ranges do not include Room 144, where the results were quite different from the other rooms. This method is based on the assumption that if two raters correlated very highly with three raters, it would be useless to use three raters. From the results, however, it is clear that two raters are better than one, but that two do not correlate highly enough with three raters to warrant accepting the hypothesis and using only two raters.

6. Analysis of Variance Results:

In order to investigate the sources of variation leading to the discrepancy in the various ratings, an analysis of variance technique was employed, and the computed results for each room were set up in tabular form as shown in Table VII. Details of the computation, analysis, and test of significance are not given here but may be found in such references as Rider (8, pp. 117-161) and Snedecor (10, pp. 179-248). The sum of squares of deviations divided by the number of degrees of freedom (for each main category, one less than the number of persons or qualities involved) gives the mean square deviation for each category.

Table VII. Analysis of Variance Results for Room 145

Source                             Sum of Squares    Degrees of    Mean Square
                                   of Deviations     Freedom       Deviation
Raters                                 468.25              2         234.125
Students                               236.55             20          11.827
Qualities                               34.73              4           8.682
Raters x Qualities                      28.45              8           3.556
Students x Qualities                   230.34             80           2.879
Students x Raters                      401.89             40          10.047
Students x Raters x Qualities          272.08            160           1.701

Ideally, it would seem that the variance should be spread about as follows:

1. Low variance among the Raters would exist if they were in agreement on the various ratings.

2. Large variance among the Students would show that the raters were recognizing the differences in ability and ranking the students accordingly.

3. Low variance among the Qualities would result from the fact that quality variations would be eliminated in averaging over a large group of students.

4. Low variance should exist in the interaction of Qualities and Raters, to show consistency of all raters in the ratings of the five qualities.

5. The interaction of Students and Qualities should be high, since individual students would be expected to show differences on the various qualities.

6. The interaction of Students and Raters should be low, since good raters should rate each student in the same manner.

7. The interaction of Students, Raters, and Qualities should be small, since most sources of variation are already accounted for.
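The entries in Table VII come from the usual factorial breakdown of each room's students-by-raters-by-qualities score table. The sketch below is an illustrative reconstruction of the main-effect and one interaction sum of squares, not the author's original computation; the first student's scores are taken from Table II, and the remaining scores are hypothetical.

```python
import statistics
from itertools import product

def variance_breakdown(scores):
    """scores[s][r][q] = rating of student s by rater r on quality q.
    Returns sums of squares, degrees of freedom, and mean square deviations
    for the Raters, Students, and Qualities categories and the Raters x
    Qualities interaction (the remaining interactions follow the same pattern)."""
    S, R, Q = len(scores), len(scores[0]), len(scores[0][0])
    grand = statistics.mean(scores[s][r][q]
                            for s, r, q in product(range(S), range(R), range(Q)))

    rater_mean = [statistics.mean(scores[s][r][q] for s, q in product(range(S), range(Q)))
                  for r in range(R)]
    student_mean = [statistics.mean(scores[s][r][q] for r, q in product(range(R), range(Q)))
                    for s in range(S)]
    quality_mean = [statistics.mean(scores[s][r][q] for s, r in product(range(S), range(R)))
                    for q in range(Q)]
    rq_mean = [[statistics.mean(scores[s][r][q] for s in range(S)) for q in range(Q)]
               for r in range(R)]

    ss = {
        "Raters": S * Q * sum((m - grand) ** 2 for m in rater_mean),
        "Students": R * Q * sum((m - grand) ** 2 for m in student_mean),
        "Qualities": S * R * sum((m - grand) ** 2 for m in quality_mean),
        "Raters x Qualities": S * sum(
            (rq_mean[r][q] - rater_mean[r] - quality_mean[q] + grand) ** 2
            for r, q in product(range(R), range(Q))),
    }
    df = {"Raters": R - 1, "Students": S - 1, "Qualities": Q - 1,
          "Raters x Qualities": (R - 1) * (Q - 1)}
    return {k: (ss[k], df[k], ss[k] / df[k]) for k in ss}

if __name__ == "__main__":
    # First student's scores are those of Table II; the other two students are hypothetical.
    scores = [
        [[8, 5, 7, 7, 7], [5, 6, 6, 5, 5], [5, 4, 4, 2, 2]],
        [[7, 7, 8, 6, 7], [6, 5, 6, 6, 5], [4, 4, 5, 3, 3]],
        [[9, 8, 8, 7, 8], [7, 6, 7, 6, 6], [6, 5, 5, 4, 4]],
    ]
    for source, (ss_val, dof, ms) in variance_breakdown(scores).items():
        print(f"{source}: SS = {ss_val:.2f}, df = {dof}, mean square = {ms:.3f}")
```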
With this brief explanation of what we would like to find in our study, let us examine the results. In every room the amount of variance among the raters greatly exceeded that of the students and qualities, as shown by Table VII, a typical example. This is exactly what we would not expect if the raters were in agreement on standards and qualities. Although the variance among students, ranging from 7.939 to 37.375, is not large in comparison with the variance among the raters, it is significant in most cases, thereby indicating some spread among the students, but far from the amount needed to compare favorably with the variance among the raters.

The amount of variance among the qualities ranges from .587 to 31.265, with two rooms showing significant results. With a group of students selected at random, as this group was, it seems plausible that the average of all students on each quality should fall somewhere near the center of the scale; that is, result in a small amount of variance. Such favorable results were found in six of the eight rooms. It may be that the students in the other two rooms were quite different groups and should show a group average away from the center on some of the qualities, or it may be that the raters were emphasizing one quality more than another in making their ratings.

The variance due to the interaction of raters and qualities, which should ideally be low, shows a range from 1.367 to 7.631, and these amounts are significant in seven of the eight rooms. Here a tendency on the part of the rater to rate one quality high and another low is revealed. The student and quality interaction variance ranges from 1.087 to 2.879. This is significant in a majority of the rooms but still rather low, when a large amount of variance is necessary to show the expected individual differences on the various qualities. The variance due to the interaction of students and raters is highly significant in all cases, again showing that the raters did not agree well on the ratings of individual students.

These analysis of variance results may be summarized in a few general statements:

1. The variance among the raters far exceeds that among the students, although it is the latter group that should have the large spread.

2. The interaction variance among the qualities and students is not large enough to assure us that the raters were distinguishing among the five qualities.

3. The raters also show no consistent standard for rating the students on the five qualities, nor do they rank the students in the same manner.

SUGGESTIONS

1. Because of the great variance among the faculty ratings, it would seem advisable to attempt some method for increasing the raters' skill.

2. From our reliability results, the number of raters in each room should not be reduced but increased if possible. The raters should be chosen from among the instructors of the course, or at least all raters should be very clear on the standards appropriate to the course.

3. Although the experiment by Thompson (12, pp. 87-91) shows that ratings by grades and by numbers are approximately equal in accuracy, his study had nine points in each technique tested. Conklin (1) found that for untrained raters no more than five points should be included on the scale, while Symonds (11, pp. 456-461) states that seven is the optimal number for greatest reliability. Since there is a tendency not to use the two end scores, thinking that possibly some later speaker will be a little better or even worse than the extreme speaker now being rated, the customary number of divisions on the scale probably should be increased; hence the five points, corresponding to the five letter grades, probably could be increased by two without causing error. But if the scale of ten points is to be continued, the correspondence between the five letter grades ordinarily used and the ten points should be thoroughly understood by the raters.
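To illustrate the kind of correspondence the last suggestion calls for, one possible mapping from the ten-point scale to five letter grades is sketched below. The cut-off points and the lowest grade letter are assumed for illustration only and are not taken from the study.

```python
def letter_grade(points):
    """Map a score on the ten-point speech scale to one of five letter grades.
    The cut-offs here are hypothetical."""
    if points >= 9:
        return "A"
    if points >= 7:
        return "B"
    if points >= 5:
        return "C"
    if points >= 3:
        return "D"
    return "F"

if __name__ == "__main__":
    for p in range(10, 0, -1):
        print(p, letter_grade(p))
```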
4. The five qualities do not seem to have identical meanings to all raters, so a more complete explanation of each quality, and possibly a revision of the list, might lower this variance.

5. As pointed out by Thompson (12, pp. 87-91), judges' evaluations and interpretations are bound to differ somewhat, but both techniques and qualities can be controlled to lessen the difference.

BIBLIOGRAPHY

1. Conklin, E. S., "The Scale of Values Method for Studies in Genetic Psychology," University of Oregon Publication, Vol. II (1923), No. 1.

2. Conklin, E. S., and J. W. Sutherland, "A Comparison of the Scale of Values Method with the Order of Merit Method," Journal of Experimental Psychology, Vol. VI (1923), pp. 44-57.

3. Guilford, J. P., Fundamental Statistics in Psychology and Education, McGraw-Hill Book Company, New York, 1942, pp. 273-284.

4. Guilford, J. P., Psychometric Methods, McGraw-Hill Book Company, New York, 1936, pp. 263-283.

5. Lindquist, E. F., Statistical Analysis in Educational Research, Houghton Mifflin Company, New York, 1940, pp. 173-179.

6. Newcomb, T., "An Experiment Designed to Test the Validity of a Rating Technique," Journal of Educational Psychology, Vol. XXII (1931), pp. 279-289.

7. Nichols, Ralph G., "Case Method of Speech Examination," Quarterly Journal of Speech, Vol. XXVII (1941), pp. 385-391.

8. Rider, P. R., An Introduction to Modern Statistical Methods, John Wiley and Sons, Inc., New York, 1939, pp. 117-161.

9. Rugg, H. O., "Is the Rating of Human Character Practicable?" Journal of Educational Psychology, Vol. XII (1921), pp. 425-438, 485-501.

10. Snedecor, George W., Statistical Methods, Collegiate Press, Inc., Ames, Iowa, 1938, pp. 179-248.

11. Symonds, P. M., "On the Loss of Reliability in Ratings Due to Coarseness of the Scale," Journal of Experimental Psychology, Vol. VII (1924), pp. 456-461.

12. Thompson, Wayne, "Is There a Yardstick for Measuring Speaking Skill?" Quarterly Journal of Speech, Vol. XXIX (1943), pp. 87-91.