CULTURAL EXPERIENCE AS RATER-TEXT INTERACTION

By

Chen Wang

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF ARTS

Teaching English to Speakers of Other Languages

2010

ABSTRACT

CULTURAL EXPERIENCE AS RATER-TEXT INTERACTION

By Chen Wang

This study investigates whether raters' cultural background is a source of rater-text interaction in L2 writing assessment. Twenty-seven ESL/EFL teachers participated in the study. They were divided into three groups: raters who are American but have never been to China; raters who are American and have taught English in China for more than one year; and raters who are native speakers of Chinese. All raters received rater training before rating. In the rating experiment, all raters assigned a score on a scale from 1 to 5, judged the best and worst feature of each essay, and provided comments. No significant differences were found between the scores given by the two groups of American raters, but significant differences were found between the Chinese raters and the two groups of American raters. The qualitative data, which consisted mainly of the raters' comments, demonstrated that each group of raters attended to different aspects of the essays. The paper also discusses the possible influences of raters' cultural background on the scoring of second language writing assessment. Finally, limitations of the study and suggestions for future studies of raters in writing assessment are given.

TABLE OF CONTENTS

List of Tables
List of Figures
CHAPTER 1  Introduction
CHAPTER 2  Literature Review
CHAPTER 3  Method
    Participants
    Materials
    Procedure
    Analysis
CHAPTER 4  Findings and Results
    Reliability
    Quantitative data
    Qualitative data
CHAPTER 5  Discussion
CHAPTER 6  Conclusion
APPENDICES
BIBLIOGRAPHY

LIST OF TABLES

Table 1  Rater backgrounds
Table 2  Tests of between-subjects effects
Table 3  Post hoc test for multiple comparisons
Table 4  Best features and worst features identified by raters
Table 5  An overview of raters' comments

LIST OF FIGURES

Figure 1  Categories of raters' comments
Figure 2  Estimated marginal means
Figure 3  Raters' positive comments
Figure 4  Raters' negative comments

CHAPTER 1

Introduction

Researchers in the field of writing assessment have identified several factors that may influence the test scores of test takers, including the writing task, the written text itself, the scale being used, characteristics of the raters, characteristics of the writers other than their writing ability, and various contextual factors (Weigle, 2002). These variables may cause problems concerning test validity. For example, examinees' scores may differ depending on the task they write about or the rater who evaluates their essays. In order to increase the validity of writing tests, research should focus on identifying and explaining the sources of variability in authentic writing assessment contexts and how they influence the accuracy of essay test scores (Broad, 2003; Cumming, 1997; Deville & Chalhoub-Deville; Huot, 2002). The overall aim of such research is ultimately to inform test developers and those who use test scores in how to check for, avoid, or correct construct-irrelevant variance that may appear in test scores.

Much research on second language writing assessment has focused on rater variability as a potential source of measurement error. It is well known that human raters introduce subjective factors into the scoring procedure; for instance, the raters themselves may become tired, prefer one writing style over another, or have personal experience that influences the way they assign scores. Rater variability may pose test fairness issues and reduce a test's reliability and validity (Kondo-Brown, 2002). Test takers may get different scores depending on the raters who rate them or when they rate. It is of great importance to identify the sources of rater variability, and "if found to exist, to address them accordingly" (Johnson & Lim, 2009, p. 486).
The commonly used measures for statistically estimating rater variability are inter-rater reliability (rater agreement) and intra-rater reliability (self-consistency). Weigle (2002) defines inter-rater reliability as "the tendency of different raters to give the same scores to the same scripts" and intra-rater reliability as "the tendency of a rater to give the same score to the same script on different occasions" (p. 135). However, "the measures of inter- and intra-rater reliability only tell us the product of the writing assessment, yet nothing do we know about the process" (Johnson & Lim, 2009, p. 486). Connor-Linton (1995b) explained that "if we do not know what raters are doing (and why they are doing it), then we do not know what their ratings mean" (p. 763).

The study of the rating process can address many questions in the area of writing assessment. Cumming (1990) argues that an understanding of the rating process could reveal the validity of different rating scales and procedures, and he noted that a focus on rater reliability alone is not adequate. Lumley (2002) stated that the investigation of testing validity and reliability is concerned with how the process of rating is conducted. Studying the rating process can improve our understanding of the backwash effects of different rating scales and procedures, and may illuminate the kinds of evaluation that are consistent with instructional goals and teacher beliefs about the construction of writing scales (Connor-Linton, 1995b).

CHAPTER 2

Literature Review

Rater Bias Patterns

One area of research that addresses problems with the rating process is rater bias patterns. There is a range of different forms of rater variability (Bachman & Palmer, 1996; McNamara, 1996; Weigle, 2002; Lumley, 2005): raters may interpret the scoring rubrics differently; raters may show various degrees of leniency and severity in their scoring; and raters may comply with the rating scales to different degrees. This category of study is largely quantitative, characterized by large numbers of rated essays and the use of advanced statistical tools such as two-mode clustering and many-facet Rasch measurement.

Eckes (2008) hypothesized that experienced raters may fall into different categories that value scoring criteria to different degrees. In his study, sixty-four raters who participated in a large-scale writing assessment were asked to indicate on a four-point scale how much importance they attached to scoring criteria that covered many aspects of writing, such as fluency, completeness, and grammatical correctness. Eckes used many-facet Rasch analysis and two-mode clustering and found six distinguishable rater types. For example, the "syntax" type refers to raters who had a strong focus on language ability and task completeness.

Schaefer (2008) studied rater bias among 40 native English speakers who rated essays similar to those of the TOEFL Test of Written English (TWE). The essays were examined on a rating scale that contained six categories: Content, Organization, Style and Quality of Expression, Language Use, Mechanics, and Fluency. Schaefer found recurring rater bias through a Rasch analysis and concluded that there were several subgroups of bias patterns. One subgroup of raters rated Content and/or Organization severely and Language Use and/or Mechanics leniently, or vice versa. Another subgroup showed more severity to more advanced writers and more leniency to lower level writers, or vice versa.
Kondo-Brown (2002) used four experienced Japanese teachers of English to rate students' essays. Although inter-rater and intra-rater reliability were very high, the raters differed significantly in their severity and leniency patterns. She found that raters were more likely to show severity toward writers of extremely low or high ability, and suggested that clearer criteria or more rater training are needed for test takers at the extreme levels.

Rating Procedure and Rater Background

In response to Cumming's (1990) and Connor-Linton's (1995b) calls for more research on the rating procedure, mainly through analyzing think-aloud protocols produced by raters while they rate essays, Cumming, Kantor, and Powers (2002) explored raters' decision-making procedures. A general sequence of decision-making through the rating process was identified and proved to be largely consistent when the framework was applied in a follow-up study using different raters and different writing tasks. Cumming, Kantor, and Powers found that "raters attended more extensively to rhetoric and ideas (compared to language) in compositions they scored high than in compositions they scored low" (p. 67). Teachers of English as a second/foreign language, compared to EMT (English-mother-tongue) teachers, attended more extensively to the language of the essays than to their ideas and organization. Most teacher raters were aware that their teaching experience and rating experience had some influence on their rating behavior.

When scoring essays, raters play two roles: first as a reader, then as a judge. Like other readers, raters bring their reading expectations and use reading strategies while scoring essays. Several studies have addressed the relationship between readers' backgrounds and the effects those backgrounds bring to the rating process. Kobayashi and Rinnert (1996) conducted a study that focused on readers' different backgrounds, such as their L1, academic status, amount of writing instruction, and experience in evaluating EFL students' essays. Four groups of participants joined the study: Japanese students who had not received formal English writing instruction, Japanese students who were experienced English writers, Japanese EFL teachers, and native-speaker English teachers. It was found that culturally influenced rhetorical patterns made a difference in the raters' evaluations. For example, Japanese students without English writing instruction gave better ratings to essays with a Japanese rhetorical pattern, and native English teachers gave higher scores to essays with American rhetorical patterns. However, there was no significant difference in the evaluations between native English teachers and experienced Japanese EFL teachers. In the wake of this study, Rinnert and Kobayashi expanded their work in 2001, using four groups of readers in Japan with different amounts of exposure to ESL writing to investigate whether their 1996 findings would be replicated.
The researchers worked out a list of differing characteristics of English writing and Japanese writing and manipulated 16 essays to carry various textual features of Japanese and English, such as "overall organization moving from general to specific" and "a thesis statement aiming to convince the reader of a particular position" as typical features of English writing, in contrast with "overall organization moving from specific to general" and "no strong specific position taken by the writer, thus leaving more up to the reader" as typical features of Japanese writing (Rinnert & Kobayashi, 2001, p. 192). They found that more experienced readers had a tendency to favor the features of the second language (L2), while less experienced readers showed a preference for the features of their first language (L1).

Clearly, human raters may exhibit various interactions with the texts they rate or have biases when scoring essays from ESL learners. Researchers have contributed extensively to the study of the effects of raters' backgrounds on their rating processes. Although raters' background is a complicated issue, researchers often target one or two traits of raters' characteristics in order to construct a finely tuned study. To summarize, factors such as the professional background, linguistic background, and cultural background of raters could, alone or together, have an effect on the rating procedure.

Professional Backgrounds

It is easy to understand that professional background may influence raters' behavior. Rating experience matters: expert raters and lay raters clearly differ in their performance. Teaching experience plays a role as well; whether raters are English-as-a-first-language or English-as-a-second-language teachers may make a difference in their reactions to ESL essays. Shohamy, Gordon, and Kraemer (1992) had four groups of participants who were distinguished from each other by their professions (whether they were EFL teachers or native English speakers with other occupations) and their training experience (whether they received training before rating the essays). Through an ANOVA, significant differences were found only between raters who differed on training. Professional background in this study did not yield any significant difference, and the researchers argued for the importance of rater training before rating is actually performed.

Brown (1991) set out to find whether there was any difference between the scores given by English teachers and those given by ESL teachers. Sixteen teachers in total participated and rated 112 essays from students who were native speakers of English or ESL learners. No significant differences were found between the two groups of raters in general. However, the qualitative data, in which raters were required to label the best and worst features of each essay, showed that raters may reach the same score through different interpretations of the essays. For example, English teachers considered Syntax and Cohesion more important, whereas ESL raters valued Organization more highly, even though both groups assigned similar scores to the test takers.

Weigle, Boldt, and Valsecchi (2003) conducted a pilot study on different faculty members' responses to students' essays. The 16 raters represented four different professional backgrounds: ESL, English, history, and psychology.
They all graded two types of essays: text-responsible (TR) essays, in which students respond to and demonstrate their understanding of specific texts, usually a short article, and non-text-responsible (NTR) essays, in which students write about their personal ideas and arguments on a general topic. Raters across all backgrounds ranked content as the most important criterion for their judgments. Yet raters from the English department appeared to weight grammar more heavily than the other raters did. Across the different writing tasks, English teachers and ESL teachers treated the tasks differently, in that they valued content more in TR essays and grammar more in NTR essays. The history teachers, however, treated all the tasks evenly, and the psychology teachers did not demonstrate a clear pattern.

Linguistic Background

With more and more ESL teachers who are non-native speakers becoming raters of second language writing, several studies have explored the potentially different perceptions held by native-speaking (NS) raters and non-native-speaking (NNS) raters. Shi (2001) conducted a study to examine NS and NNS teachers' evaluations of essays by Chinese college students. Twenty-three teachers in each category participated in the study and ranked 10 essays according to the quality of the writing. The raters were required to write and rank three reasons for their rating. No significant differences were found between the two groups of raters; however, they differed from each other in their comments concerning the positive and negative features of the essays. NS teachers gave more positive comments regarding content and language, while NNS teachers made more negative comments on the organization and length of the essays. It was also found that NS teachers tended to give lower scores even though they gave more positive comments than NNS teachers.

In a recent study, Johnson and Lim (2009) employed a quantitative approach to find out whether there was any rater bias pattern caused by the different L1 backgrounds of raters and, more importantly, by the L1 backgrounds of the second language writers. Seventeen raters, including four NNS raters, scored the writing part of the Michigan English Language Assessment Battery (MELAB). It is important to note that most of the raters were considered experienced because they had all received rater training and had earned a certificate to serve as long-term raters in the University of Michigan language assessment program. The essay pool came from a large variety of examinees from different language backgrounds. The researchers applied FACETS and found that no rater bias could be attributed to the language backgrounds of the raters. The small number of NNS raters, including one Korean rater who was born in Korea but mostly raised in the US, may account for this result. Clearly, there is a need to conduct research with a larger sample of NNS raters.

An Educational Testing Service report (Erdosy, 2004) investigated the rating procedures of four raters who represented mixed linguistic and professional backgrounds. Concurrent verbal protocols were used to produce qualitative data to explain the sources of potential rater variability. The researcher acknowledged the limited generalizability of this study but found that, without a scoring rubric, the raters were very likely to refer to their teaching experience, and NNS raters might refer to their experience as ESL learners.
It was further argued that cultural and linguistic backgrounds may be more important than academic backgrounds in influencing raters' behavior. Rating experience may affect rating strategies but has limited effects on the construction of rating criteria. Erdosy suggested that future studies should involve more rater participants to make the results more generalizable.

Cultural Background

A few studies have found that raters' backgrounds may interact with the text characteristics of second language learners. Kaplan (1966) suggested that writers' rhetorical patterns vary across cultures. Despite many problems with his characterization of the way different cultures write, contrastive rhetoric has become an important area of investigation. It has been found that students with little L2 writing experience are likely to draw on their L1 knowledge in their L2 writing (Connor, 1996). Rinnert and Kobayashi (2001) proposed five distinctive features of Japanese and English writing. For example, "a thesis statement aiming to convince the reader of a particular position" in English writing is contrasted with "no strong specific position taken by the writer, thus leaving more up to the reader" in Japanese writing (p. 192). The findings were consistent with earlier studies in that Japanese students with less exposure to English writing favored Japanese rhetorical patterns. Hamp-Lyons (1989) found that experience with other languages could change NS raters' reactions to the English writing of learners from those language communities: the teachers were found to be more tolerant of features of L1 patterns in students' English writing.

In addition, some other studies have reported on the different intellectual traditions held by different cultures. Ballard and Clanchy (1991) identified different learning approaches and thinking patterns between students from Asia and from the Western world. Asian students were encouraged by their education systems to adopt a reproductive learning approach characterized by rote learning and memorization; challenging the teacher or textbook and thinking critically were less favored in the classroom. In contrast, students in Western societies have adopted a speculative approach to learning that involves applying critical thinking, questioning, and evaluation. Wang (2009) compared test takers' perspectives on the differences between GRE writing and TOEFL writing and found that students held negative opinions about GRE writing because the topics in the writing prompts required critical thinking and logical reasoning, which they had not been taught in schools in China. Wang also found that the Chinese test takers often avoided describing a person or event related to Chinese culture as a seemingly preferred or even learned test-taking strategy, because they thought the American raters would have a hard time understanding their examples.

Research Question

The previous research has identified various rater bias patterns that can be partially attributed to raters' backgrounds, which may interact with the text features of students' essays. As raters' professional backgrounds and linguistic backgrounds have been addressed in several studies, I think there is a need for a study that explores the impact of raters' cultural experience on their rating procedures, given the limited research focus in this area. The literature also suggests that such experiments should involve more raters to make the findings more generalizable and robust.
Hence, the current study uses a larger sample of participants and addresses two variables related to the raters' backgrounds: their cultural and linguistic backgrounds. The research question that guides this study is: Do raters with different amounts of exposure to and experience with the culture of test takers score the test takers' essays differently? I hypothesize that raters' differing amounts and types of exposure to the test takers' culture will influence the scores the raters assign to the test takers' essays. Furthermore, I expect that the more experience the raters have with the culture, the easier it will be for them to understand the content of the L2 learners' essays. Consequently, those who have more exposure to and experience with the culture may be more tolerant of the prevalence of the test takers' L1 rhetorical features, which could result in the raters assigning higher scores or producing more positive comments concerning the quality of the essays.

CHAPTER 3

Method

Participants

An email invitation to participate in the study was sent to potential participants by the researcher. For participants who were near the university where the experiment took place, the researcher met the participant face to face and monitored the whole procedure. Participants who lived far away and were unable to meet with the researcher were sent all the materials with detailed instructions to ensure that they followed the experimental procedures step by step. Nine participants completed the study on their own, while 18 participants met the researcher individually to finish the study.

Three groups of participants were invited to participate in the study: ESL/EFL teachers who were Americans and had never been to China, labeled American raters with no cultural experience (ARNCE); Americans who had taught English as a foreign language in China for at least one year, labeled American raters with cultural experience (ARCE); and Chinese ESL/EFL teachers who were born in China and received their secondary education there, labeled Chinese raters (CR). The backgrounds of the three groups of raters varied in terms of their academic majors, teaching experience, and familiarity with the TOEFL rating rubrics for writing. All the participants filled out a background questionnaire (see Appendix A). In addition, all the raters were required to self-evaluate their experience with rating essays and with using the TOEFL writing scales as one of the following: new; having some experience; being expert-like.

Table 1 shows that the three groups of participants were comparable in their academic backgrounds. Except for the group of American raters with cultural experience, which had three raters with other majors, the raters in the other groups all held a degree related to ESL, English, or Education. Regarding teaching experience, the Chinese raters exceeded the other groups, with an average of 13.6 years of English teaching, while the other two groups were similar, with American raters without cultural experience averaging 4.4 years and American raters with cultural experience averaging 3.4 years. Regarding rating experience, the Chinese raters were the most experienced, with six raters considering themselves expert-like. American raters with cultural experience were the least experienced raters, with only one rater reporting some rating experience and the rest being new to it.
Regarding familiarity with the TOEFL rubrics, American raters with no cultural experience were the more experienced group, with five raters having some experience and one being expert-like. Generally, American raters with cultural experience had the least professional experience with rating and with using the TOEFL rubrics. When interpreting the findings, the researcher was aware of these mediating variables and treated them with caution; more details are given in the discussion and limitations sections of this paper.

Table 1
Rater Backgrounds

Variables          Categories               ARNCE (n=8)   ARCE (n=10)   CR (n=9)
Gender             Male                     1             4             2
                   Female                   7             6             7
Major              ESL/English/Education    8             7             9
                   Others                   -             3             -
Degrees            Bachelor                 0             5             3
                   Master or above          8             5             6
Teaching (yrs)     Group average            4.4           3.4           13.6
Experience with    New                      1             7             0
rating essays      Some experience          6             3             3
                   Expert                   1             0             6
Experience with    New                      2             9             4
TOEFL rubrics      Some experience          5             1             3
                   Expert                   1             0             1

Materials

The essay prompt used in this study was based on the TOEFL independent writing tasks. Students were asked to write an essay about whether they agreed or disagreed with a statement concerning the effects of globalization on their home culture (see Appendix B). They were told to develop an argument for their stance. The topic of globalization was selected intentionally to encourage test takers to write from their own cultural perspectives.

Nineteen essays were collected from students with various backgrounds. The essays were written by 16 Chinese international students and three international students from other countries (Iran, Korea, and Saudi Arabia). Sixteen students were from the Intensive English Program (IEP) in the English Language Center, and three were graduate students majoring in Communication, Journalism, and Engineering. Essays from other international students were included to distract the raters' attention from the focus of the research, that is, how they reacted to the scoring of Chinese students' essays; these essays were not included in the analysis. The lengths of the essays varied, ranging from 198 words to 444 words, with an average of 295 words. Seven additional essays, including five benchmark essays and two sample practice essays used in the training session, came from the TOEFL writing benchmark essays (ETS, 2005). These seven essays were written by TOEFL test takers, and their topic was telling the truth.

In the rating experiment, a score sheet was used, and each rater put all of their responses on the score sheet (see Appendix C). The holistic rating rubric of the TOEFL independent writing task (ETS, 2004) was used in the scoring procedure (see Appendix D). The rationale for choosing this rubric is that most performance tests have adopted a similar holistic rating scale because it is easy and fast to use and mimics the natural reading process (Weigle, 2002). Additionally, the writing prompt was based on the current structure of TOEFL writing prompts, so it was appropriate to have raters use a standard TOEFL rating rubric to rate the essays.

Procedure

Collecting and preparing students' essays. The students were presented with the writing prompt and instructions (see Appendix B). They first read the prompt and then were given 40 minutes to write the essay in class. After that, the researcher typed the essays, with all spelling and punctuation errors left uncorrected. The essays were typed to avoid any influence of handwriting on the rating process.
The order of the 19 essays was randomized before they were given to the raters, and all raters received the essays in the same order.

The rating experiment. Raters first filled out a biographic questionnaire that inquired about their professional, cultural, and linguistic backgrounds. After that, they went through a rater training session. They were given the TOEFL rating rubric and TOEFL benchmark essays from the different levels (1 to 5), which were about a different topic from the one in this study. The benchmark essays and the sample essays were written by TOEFL test takers. The raters read these materials to become familiar with the rating scale and with the features of essays at different levels. The raters were then given two extra sample essays on that topic to practice using the TOEFL rating rubric. The raters rated an essay and were then provided with the correct rating. The trainer went over the reasons why the correct rating was correct, and the trainer and the raters discussed discrepancies between their ratings and the correct rating. This process was repeated with the second sample essay. For the participants who could not be present at the experiment, the training session was completed on their own; the researcher provided the reasons for the correct ratings of the two practice essays, with the expectation that these participants could recognize discrepancies between their ratings and the correct ratings and make adjustments.

After the training session, the raters began to rate the 19 essays. They were told that they could refer to the rating rubric at any time. There were three main tasks for the raters during the rating procedure: give a score (from 1 to 5) to each essay, provide a comment of two or three sentences on the strengths and weaknesses of each essay, and mark the best and worst feature they perceived in each essay. The features of an essay included Content, Organization, Coherence, and Language. The researcher did not provide any supporting materials to explain these four features but rather left their interpretation to the participants. Finally, the raters filled out a follow-up questionnaire that inquired into the importance they attached to different essay features (see Appendix E). They were also asked whether they agreed or disagreed with a statement that the adequacy or inadequacy of their Chinese cultural knowledge and Chinese language ability played a role in their rating process.

Analysis

Quantitative data. The quantitative data consisted of the essay scores given by the three groups of raters. These score assignments served as the primary variable for estimating any differences in score assignment among the rater groups. The research question was whether the three groups of raters would assign different scores to the essays written by the Chinese students. A repeated measures analysis of variance (ANOVA)¹ was used, with rater category as the between-subjects variable and the 16 Chinese students' essay scores as the repeated levels, to find out whether there was any significant difference among the groups. Post hoc tests were conducted to find the source of group differences.

Qualitative data. Two types of qualitative data existed in this study: first, the best and worst feature circled by each rater, and second, the raters' comments. I coded and analyzed the best and worst features to find out whether the three groups of raters approached the essays in the same way, because even when the same score was given for an essay, the raters may have had different reasons for assigning that score.
When analyzing the best and worst feature of an essay, I found that a few raters circled more than one best or worst feature; for example, when a rater thought content and language were equally good in an essay, they circled both features. When analyzing the data, I counted them all. In addition, some raters did not circle any best feature because they thought nothing was good in the essay; because of this, I added a "none" category. While coding the data for each group, only when more than half of the raters agreed on an answer did I treat it as a valid answer for that group. Otherwise, I assumed that no agreed-upon answer had been reached and marked the response as n/a.

To code the raters' comments concerning the students' essays, I typed all the comments into a spreadsheet. To make the comments codable, I first segmented them for coding. For example, if one rater wrote "good organization and good language," I segmented it into two items. After segmentation, there were 960 comments in total. To code the data, I borrowed a coding scheme from Shi (2001) (see Appendix F). Shi divided the comments in her study into five major categories: general, content, organization, language, and length. Under the major categories of content, organization, and language, Shi had further subcategories. I tried to code the comments into these subcategories but changed some of them to fit this data set better. In total, I used twelve categories in this study (see Figure 1): (1) general (general); (2) content (general); (3) content (ideas); (4) content (arguments); (5) organization (general); (6) organization (paragraphs); (7) organization (coherence); (8) language (general); (9) language (accuracy); (10) language (intelligibility); (11) language (fluency); and (12) length.

Before coding all of the data, I first coded 25% of the data, and a second rater also coded the same 25%. Inter-rater reliability was calculated at that point, after we had each coded this portion. Using a percentage-matching test, I calculated inter-rater reliability in this manner: if the second rater and I agreed on both the main category and the subcategory of a comment, agreement was marked as 100%; if we agreed only on the main category and not the subcategory, agreement was marked as 50%; and agreement was marked as 25% if the subcategories matched but not the main categories, which was possible when comments addressed something general, for example, codings such as organization (general) and content (general). The initial inter-rater reliability was 70%. After that, we discussed the 30% of the data that we had not coded in the same manner and, through consensus, reached agreement on the correct coding of those data. I then coded the rest of the data on my own.

As Shi (2001) did in coding her data, I used key words, usually noun phrases, to identify the categories of the comments. The adjectives in a comment usually denoted whether it was a positive or negative comment, and the content word gave the clue to the comment's category. I first identified the positive or negative nature of the comment and then categorized it based on the key word. After coding and organizing the comments into the 12 categories, I calculated simple statistics for the negative and positive comments for each group to get an overview of the leniency and harshness of the raters' comments.
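To make the percentage-matching rule above concrete, the short sketch below shows one way the agreement figure could be computed. This is a minimal illustration in Python rather than the procedure actually used for the study, and the example codes at the bottom are hypothetical.

def pairwise_agreement(code_a, code_b):
    """Return the agreement credit for one comment coded by two coders.

    Each code is a (main_category, subcategory) pair, e.g. ("organization",
    "general"). Credit follows the rule described above: 1.0 when both levels
    match, 0.5 when only the main category matches, 0.25 when only the
    subcategory matches, and 0.0 otherwise.
    """
    main_match = code_a[0] == code_b[0]
    sub_match = code_a[1] == code_b[1]
    if main_match and sub_match:
        return 1.0
    if main_match:
        return 0.5
    if sub_match:
        return 0.25
    return 0.0

def overall_agreement(coder_a, coder_b):
    """Average the per-comment credits over the double-coded sample."""
    credits = [pairwise_agreement(a, b) for a, b in zip(coder_a, coder_b)]
    return sum(credits) / len(credits)

# Hypothetical double-coded comments, for illustration only.
coder_a = [("organization", "general"), ("language", "accuracy"), ("content", "ideas")]
coder_b = [("organization", "coherence"), ("language", "accuracy"), ("general", "general")]
print(f"Inter-coder agreement: {overall_agreement(coder_a, coder_b):.0%}")  # prints 50%

Averaging these per-comment credits over the double-coded 25% of the data is what yields a single agreement percentage such as the 70% reported above.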
Next, I calculated the counts of comments in the 12 categories for each rater group. This analysis can reveal which aspects of the essays the raters judged positively or negatively during the scoring procedure, and comparing the data among the three rater groups shows group differences in the raters' positive and negative judgments across the 12 categories of essay aspects.

Figure 1. Categories of raters' comments.

CHAPTER 4

Findings and Results

Reliability

I conducted a test of inter-rater reliability to ensure that all rater groups were rating reliably. If reliability was high enough, the differences between the groups' score assignments would not be attributable to unreliable raters. I used Cronbach's alpha to estimate reliability. The reliability for the raters in total was .784; the reliability for American raters without cultural experience was .784, for American raters with cultural experience .829, and for Chinese raters .787. The degree of reliability for the three groups of raters was almost at the same level, indicating that the raters were very similar in terms of reliability.

Quantitative Data

A one-way repeated measures ANOVA with rater category as the between-subjects variable was calculated to measure the differences among the scores given by the three groups of raters. Table 2 shows that the scores were significantly different among the three groups of raters (F(2, 24) = 6.943, p < .05). As shown in Table 3, post hoc comparisons using the Scheffe test indicated that the mean scores were significantly different between the American raters without cultural experience and the Chinese raters (mean difference = 0.50, SE = 0.164, p < .05). Significant differences were also found between the means of the American raters with cultural experience and the Chinese raters (mean difference = 0.53, SE = 0.155, p < .05). There was no significant difference between the two American rater groups (mean difference = 0.03, SE = 0.160, p = .983).

Table 2
Tests of Between-Subjects Effects

Source           Sum of Squares   df   Mean Square   F          Sig.
Intercept        2970.531         1    2970.531      1623.114   .000*
Rater Category   25.415           2    12.707        6.943      .004*
Error            43.923           24   1.830

Table 3
Post Hoc Test for Multiple Comparisons (Scheffe)

                                   Mean                              95% Confidence Interval
Rater Category   Rater Category    Difference   Std. Error   Sig.    Lower Bound   Upper Bound
ARNCE            ARCE              -.03         .160         .983    -.45          .39
                 CR                .50*         .164         .021*   .07           .93
ARCE             ARNCE             .03          .160         .983    -.39          .45
                 CR                .53*         .155         .009*   .12           .93
CR               ARNCE             -.50*        .164         .021*   -.93          -.07
                 ARCE              -.53*        .155         .009*   -.93          -.12

Figure 2. Estimated marginal means of essay scores (essays 1-16) by rater category (ARNCE, ARCE, CR).

As can be seen from the multiple-comparison post hoc results in Table 3, the Chinese ESL/EFL teachers scored significantly differently from the two groups of American raters. Figure 2 demonstrates that the Chinese raters were harsher than the other groups of raters in scoring the Chinese students' essays: the Chinese raters tended to give lower scores than American raters with no cultural experience (mean difference = 0.50, SE = 0.164, p < .05) and American raters with cultural experience (mean difference = 0.53, SE = 0.155, p < .05).
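For readers who wish to see how figures of this kind are obtained, the following is a minimal sketch, not the analysis script used in this study, of a Cronbach's alpha estimate and a mixed between-within ANOVA on rating data arranged in long format. The file name and column names are hypothetical, the pandas and pingouin libraries are assumed to be available, and the post hoc step uses Tukey's HSD (built into pingouin) rather than the Scheffe test reported in Table 3.

import pandas as pd
import pingouin as pg

# Long-format data: one row per rater x essay, with the rater's group label.
# (Hypothetical file and column names.)
scores = pd.read_csv("ratings_long.csv")  # columns: rater, group, essay, score

# Cronbach's alpha across raters: pivot to essays (rows) x raters (columns).
wide = scores.pivot(index="essay", columns="rater", values="score")
k = wide.shape[1]
rater_var = wide.var(axis=0, ddof=1).sum()   # sum of per-rater score variances
total_var = wide.sum(axis=1).var(ddof=1)     # variance of essays' summed scores
alpha = k / (k - 1) * (1 - rater_var / total_var)
print(f"Cronbach's alpha (all raters): {alpha:.3f}")

# Mixed (between-within) ANOVA: rater group is the between-subjects factor,
# essay is the repeated (within-subjects) factor, and rater is the subject.
anova = pg.mixed_anova(data=scores, dv="score", within="essay",
                       subject="rater", between="group")
print(anova.round(3))

# Pairwise group comparisons on each rater's mean score (Tukey HSD as an
# approximation; the thesis reports Scheffe comparisons).
rater_means = scores.groupby(["rater", "group"], as_index=False)["score"].mean()
print(pg.pairwise_tukey(data=rater_means, dv="score", between="group").round(3))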
Among the American rater groups, there were no significant differences in score assignments; Table 3 shows that the American raters with cultural experience made score assignments that were only slightly higher than those of the American raters with no cultural experience (mean difference = 0.03, SE = 0.160, p = .983).

Qualitative Data

Best and worst features. All the raters circled the best and worst feature of each essay after assigning a score. The features were content, organization, coherence, and language. Table 4 shows the most frequently circled features by each group of raters. For the best feature, there were three essays for which all groups of raters had the same perception: essay No. 1 was judged best for organization, No. 3 for content, and No. 9 for organization. For the worst feature, the raters in the different groups agreed on four essays, in all cases because of the essay writers' poor language. Still, for most of the essays, raters across the groups had different perceptions concerning the best and worst features. Given that, it can be reasoned that the raters in the three groups did not approach the essays from the same aspects; in other words, the raters scored differently in part because they had diverse perceptions of the best or worst feature of each essay. Although perceptions varied a great deal, the data showed that content and organization were judged much more favorably than other aspects of the essays by all raters, and language was perceived as the worst feature by all raters.

Looking further into the different rater groups, American raters with no cultural experience chose content as the best feature for nine essays out of sixteen. This number was higher than for the other two groups, which might suggest that American raters with no cultural experience generally valued content more positively than the other raters. It was obvious that Chinese raters perceived language as the worst feature for ten essays out of sixteen, which may suggest that Chinese raters were stricter about the language of the essays than the other raters. No clear tendency was found for American raters with cultural experience, since their opinions were too diverse.

Table 4
Best Features and Worst Features Identified by Raters

         Best features                   Worst features
Essay    ARNCE    ARCE     CR            ARNCE    ARCE     CR
1        Orga     Orga     Orga          Lang     Lang     Lang
2        Orga     Cont     -             Lang     Lang     Lang
3        Cont     Cont     Cont          Lang     -        Lang
4        Lang     -        Cont          Cont     -        Lang
5        Lang     Cont     Cont          Cohe     Lang     Lang
6        Cont     -        None          -        -        Cont
7        Cont     -        Cont          Orga     Orga     -
8        Cont     Orga     None          Cohe     Cont     Cont
9        Orga     Orga     Orga          Lang     Lang     Lang
10       Lang     Orga     Orga          Cont     Cont     -
11       Cont     Cont     Orga          Lang     Cohe     Lang
12       Cont     Cont     Orga          Lang     Lang     Lang
13       Orga     Orga     Cont          Cont     Cohe     Cont
14       Cont     -        None          -        Cont     Lang
15       Cont     -        Cont          -        Orga     Lang
16       Cont     -        -             -        Lang     -

Note. The codes label the four features of the essays: Cont = content; Orga = organization; Cohe = coherence; Lang = language. Dashes indicate that no agreed answer was reached by that rater group.

Raters' comments. The raters' comments were first coded as negative or positive and then placed into the 12 categories of the coding scheme adapted from Shi's (2001) study. The total number of comments coded was 328 for American raters with no cultural experience, 393 for American raters with cultural experience, and 342 for Chinese raters. Table 5 shows that raters across the three groups generally provided more negative comments than positive comments, probably owing to the low proficiency of the essay writers.
Overall, American raters with cultural experience provided the largest number of positive comments. In contrast, Chinese raters provided the most negative comments in terms of both count and frequency. Still, the two groups of American raters were very close in their proportions of positive and negative comments; American raters with cultural experience commented a little more positively than American raters with no cultural experience, as indicated by percentages of 40% compared to 36%. It is important to remember that in the quantitative analysis there were no significant differences between the two groups of American raters. From this overview of the number and frequency of comments, the figures for the two groups are too close to demonstrate that American raters with cultural experience judged the essays more positively than those with no cultural experience; a more in-depth analysis of the nature of the comments is thus required.

Figure 3 shows the counts of positive comments coded into the 12 categories. My focus was still on finding differences, if any, between American raters with and without cultural experience. Figure 3 shows that American raters with experience gave about twice as many positive comments on the general aspect of the writing as the other raters, 24 items compared to 13 and 14 in the other groups. This might indicate that American raters with cultural experience were slightly more satisfied with essay writers at this level than the other groups of raters, but more evidence is needed; in particular, the negative comments from American raters with cultural experience should be taken into account. Regarding the comments about language, American raters with cultural experience offered many more comments on the general quality and intelligibility of the language in the essays than the other raters, but commented less positively on the accuracy of the language. A closer look at the Chinese raters shows that they offered many fewer positive comments on content (general) and content (ideas) than the other raters. Raters across all three groups provided almost equal numbers of comments on content (arguments) and organization (general).

Figure 4 shows the numbers of negative comments coded into the 12 categories. The first impression from the figure is that the negative comments from the Chinese raters outnumbered those of the other groups in several categories, for example, general (general), content (arguments), and especially language (accuracy), where their count was nearly double that of the other groups. American raters with cultural experience provided many more negative comments on content (general), language (general), and language (intelligibility), just as they did when providing positive comments. It is now clear that American raters with cultural experience tended to offer more comments, both positive and negative, in categories such as content (general), language (general), and language (intelligibility); thus there is still insufficient evidence that American raters with cultural experience judged the content and other aspects of the writing more positively than those without such experience. However, American raters with cultural experience commented less negatively on language (accuracy) than the other groups, as indicated by their 26 comments, compared to 38 by American raters without cultural experience and 64 by Chinese raters.
Table 5
An Overview of Raters' Comments

            ARNCE (n=8)           ARCE (n=10)           CR (n=9)
Comments    Count   Frequency     Count   Frequency     Count   Frequency
Positive    119     36%           157     40%           100     29%
Negative    209     64%           236     60%           242     71%
Total       328                   393                   342

Figure 3. Raters' positive comments.

Figure 4. Raters' negative comments.

CHAPTER 5

Discussion

In the area of second language writing assessment, a large number of studies have focused on rater variability as a potential source of measurement error. Human raters, used in scoring second language writing, may often introduce subjective factors into the scoring procedure, and raters' backgrounds or experience could influence their ways of approaching the essays. All of these factors may lead to rater variability, which may pose test fairness issues and reduce a test's reliability and validity. Johnson and Lim (2009) claim that it is of great importance to identify the sources of rater variability, and "if found to exist, to address them accordingly" (p. 486). The current study manipulates the amount of raters' cultural experience to study its influence on the rating procedure. It thereby promotes our understanding of possible sources of rater variability and sheds light on teachers' different perspectives when evaluating writing by second language learners.

It was found in this study that the Chinese raters were harsher than the two groups of American raters, given the statistically significant differences between the ratings assigned by the Chinese raters and the American raters. The Chinese raters were particularly strict concerning the L2 learners' language accuracy and grammar, as evidenced by their numerous negative comments on the language aspect of the essays. This result is not consistent with a previous study: Shi (2001) did not find any significant differences in the score assignments between two groups of raters, NES (native-English-speaking) raters and NNS (non-native-English-speaking) raters. This is probably due to the two studies' different experimental treatments. For example, Shi did not have rater training sessions or use any rating rubric when raters evaluated the students' essays, whereas this study involved rater training and a rating rubric. More importantly, Shi asked the raters to rank the 10 students' essays in order from 1 to 10, whereas raters in the current study used a 5-point scale (1-5) to assign scores.

The results for the qualitative part are also inconsistent with Shi's (2001) study. The major difference is that Shi found that NNS raters commented much less on the accuracy of the language of the writing because "being nonnative speakers, the Chinese teachers might shy away from making qualitative judgments or comments about the English language of their students" (p. 312). The current study, however, found that Chinese raters gave more negative comments than the two groups of American raters on the language, especially on its accuracy.

The fact that the Chinese raters in this study assigned lower scores than the American raters did is probably because the raters' past teaching and learning experiences played a role in the scoring procedure. More than half of the Chinese raters had taught college English in China, and the proficiency of the students they taught might have been higher than that of the students who participated in this study, who were taking courses in the English Language Center and had just graduated from high school.
It can be argued that because the Chinese raters had much more experience rating essays at a different level, they might have judged lower-level students more negatively than they should have when assigning scores. Their numerous negative comments on language accuracy are perhaps related to how grammatical accuracy is treated in English education in China: grammar is emphasized heavily when students learn English, and English teachers in China accordingly attach great importance to grammatical accuracy. As Eckes (2008) found in his study, experienced raters may fall into different types that weight the categories within rating rubrics to different degrees. Schaefer (2008) also found a similar rater bias pattern, in which one group of raters rated content and organization more severely while another group rated language and mechanics more severely. However, in terms of language assessment, raters who weight language more heavily than other aspects of the writing may decrease the validity of the test, because they should approach an essay with balanced attention to each aspect of the writing (Weigle, 2002). The holistic scores commonly used in writing assessment may increase the possibility of overemphasizing one aspect of the writing, especially for raters who have a tendency to overemphasize one aspect of the criteria being evaluated.

The current study did not find any significant differences between the two groups of American raters with and without cultural experience in China. However, American raters with experience showed a trend toward being more lenient than the raters without experience. The quantitative analysis identified a 0.03 higher mean score given by the raters with experience than by the raters without experience. The frequency of positive comments in general was also slightly higher for raters with experience: 40%, compared to 36% for raters without experience. This result is supported by a questionnaire item that asked whether the raters thought their cultural experience in China played a role in the rating procedure. About four of the American raters with experience in China said that they thought their experience had an effect on their evaluation of the essays, and the reasons they provided varied. For example, one rater said, "I probably gave the essays a higher rating if they seemed more truthful to my idea of how globalization has affected Chinese culture." Another rater responded, "I recognize some thoughts that may be poorly supported by heresy (hearsay) and also have an easier time understanding the writer's perspective." One rater even thought that, since she was aware that her knowledge of Chinese language and culture might lead her to give higher scores, she had to compensate by judging the essays more strictly. Among the raters who said "No" to this question, two stated that they could find some common and repeated "Chinglish" forms (expressions mistranslated from Chinese into English); however, they were not certain whether this would bias them toward the essay writers. Clearly, the American raters who had taught English in China were more likely than their peers who had never been to China to draw on their knowledge of Chinese culture and their teaching experience when rating the essays.
They adopted various resources and strategies when comprehending and judging the students' essays, for example, their cultural experience in China, their linguistic knowledge of Chinese, and their experience of reading many "Chinglish" errors. Some of them were even aware of an influence on their rating and adopted compensation strategies to overcome the perceived problems. The current study is consistent with Hamp-Lyons's (1989) study, which found that experience with other languages could change native speaker raters' reactions to the English writing of learners from those language communities. Although their thoughts while rating essays seemed exceptionally complicated, there is not enough evidence to prove that the raters with experience in China are more lenient raters in their marking of the features of essays. The results of this study show that this area is in need of further inquiry.

This study has implications for rater training. The inter-rater reliability was reasonably high even though the ESL/EFL teachers had diverse backgrounds in terms of teaching and rating experience. An easy-to-use rating rubric is essential to the rating task, because all the raters in the study referred to the rubric time and time again to clearly distinguish between a "2-point essay" and a "3-point essay." In other situations, however, some raters shared that they were quite ambivalent in deciding a final score, because "the language is a high 2" but "the content is a low 3." As Hamp-Lyons (1991) pointed out, second language writers may have developed unbalanced skills in different aspects of writing, and here the holistic scale again fails to represent the multi-trait nature of an essay. In rater training sessions, it is important to remind raters that they should expect such undecided moments and should look further into the essay for other evidence to assign a score rather than deciding arbitrarily: the score is not something between "2" and "3" but a "2" or a "3."

CHAPTER 6

Conclusion

The purpose of this study was to understand whether raters' cultural experience may affect the scores assigned to students' essays. The raters in this study were divided into three groups: American ESL/EFL teachers without teaching experience in China; American ESL/EFL teachers with teaching experience in China; and Chinese ESL/EFL teachers. All raters received rater training and performed rating tasks that included assigning a holistic score (1 to 5) to the essays. They also judged the best and worst feature of each essay and offered comments. The findings are that there were significant differences between the scores assigned by the Chinese raters and the two groups of American raters: the Chinese raters were harsh raters and gave lower scores than the other raters. American raters with teaching experience in China showed a trend toward being more lenient than American raters without teaching experience in China: American raters with experience assigned slightly higher scores and offered more frequent positive comments than American raters without experience.

This study is important because it addresses an area of rater variability that has received little attention in the literature. A trend was found in this study that raters with extensive experience with the test takers' culture are more lenient raters than those who do not have such experience. Clearly, the results from this study call for further inquiry in this area.
Considering raters' cultural experience and investigating its effect on the rating procedure should be part of any large-scale testing program. A larger, more finely tuned experimental study of these aspects of the rating process should be conducted in the future. In the current study, the majority of the raters with teaching experience in China had stayed in China for one or two years; if the raters had spent more than five years teaching in China, the results might have been quite different. Furthermore, when recruiting rater participants, future researchers should seek raters with similar professional backgrounds. To truly investigate the influence of cultural background, the amount and quality of the raters' teaching and rating experience should be the same, and their academic backgrounds and current employment should be comparable. Controlling these factors would make the results more robust and generalizable. In addition, the researcher is not sure whether the Chinese raters are harsher on Chinese essay writers or simply harsher raters overall; a comparison group of non-Chinese raters of comparable size would be needed in future studies to test this. Nonetheless, this study, despite its weaknesses, demonstrated that cultural background plays a role in rating and that rater training programs will need to address it.

Note 1. Repeated measures analysis of variance was used to analyze the score differences among the three groups in the study, because the design has a between-subjects variable (cultural experience) and a within-subjects variable (the 19 essays). The raters are independent, but the measurements across the essays are not, since the same raters were measured 19 times (Tabachnick & Fidell, 2007).
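To make the design described in Note 1 concrete, the following is a minimal sketch, not the thesis's original statistical output, of how such a mixed (repeated measures) ANOVA could be run. It assumes the pingouin library and the same hypothetical ratings.csv table and column names used in the earlier sketch.

```python
# A hedged sketch of the analysis described in Note 1: a mixed ANOVA with
# rater group as the between-subjects factor and essay as the repeated
# (within-subjects) factor. Data file and column names are hypothetical.
import pandas as pd
import pingouin as pg

df = pd.read_csv("ratings.csv")  # columns: rater, group, essay, score

aov = pg.mixed_anova(
    data=df,
    dv="score",       # the holistic score (1 to 5)
    within="essay",   # the 19 essays rated by every rater
    subject="rater",  # repeated measurements come from the same rater
    between="group",  # cultural-experience group
)
print(aov)
```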
APPENDIX A

Questionnaire of Demographic Information

1. Basic information
   Email:
   Gender (circle one): Male / Female
   Age (in years):
   Job position (if applicable):
   Major in undergraduate study:
   Major in graduate study (if applicable):

2. Language background
   What is your native language? (If you grew up with more than one language, please specify.)
   Can you speak another language or other languages? What is it or what are they?

3. Teaching and rating experience
   a) How many years have you taught English as a second or foreign language?
   b) Could you list all institutions where you have worked?
   c) Have you been an instructor for any writing courses? If yes, what was the level of the class and for how long did you teach it?
   d) Do you have experience in rating essays written by ESL/EFL learners? (If yes, please rate your experience by choosing one description below.)
      Yes / No
      i. I am new at it.
      ii. I've got some experience.
      iii. I am very experienced.
   e) Are you familiar with rating rubrics or scales for TOEFL writing? (If yes, please rate your experience by choosing one description below.)
      Yes / No
      i. I am new at it.
      ii. I've got some experience.
      iii. I am very experienced.
   f) Have you used TOEFL rating rubrics or scales in the writing courses you taught?

4. Cultural knowledge
   a) Have you ever been to China? If yes, can you make a list of when and for how long you travelled there?
   b) Can you describe the degree of your knowledge of Chinese culture by choosing one statement below?
      i. I know nothing about Chinese culture.
      ii. I know just a few things about Chinese culture.
      iii. I know enough about Chinese culture that I am confident I could work or study there without trouble.
      iv. I know Chinese culture almost at the same level as a native Chinese speaker does.
   c) Can you name all the Chinese traditional festivals you know and the days on which they fall? Use the Chinese lunar calendar if you can.
   d) Can you speak Chinese? If yes, can you rank your Chinese language ability?

                   Reading        Writing        Speaking    Listening
                   proficiency    proficiency    fluency     ability
      Very poor
      Fair
      Good
      Very good
      Native-like

   e) Where did you learn Chinese and what was the highest level of Chinese course you took?

APPENDIX B

Writing Prompt

The Writing Task
Directions: You will have 40 minutes to write an essay about the following topic. The length of the essay should be at least 300 words in order to be an effective essay.

Do you agree or disagree with the following statement? "Globalization is impacting the culture of some countries in negative ways." Use specific details and reasons to support your answer.

APPENDIX C

Scoring Sheet

Now you will begin the official rating session. You will rate 19 essays consecutively. It is a bit different from rating the sample essays. Here are the three steps you will follow:
1. You will assign a score to each essay.
2. You will provide a two-to-three-sentence comment about the essay.
3. You will circle the best and worst feature of each essay.

Please put all your responses, including the essay score, your comments, and your choice of the best and worst feature of each essay, on the score sheet, NEVER on the essays. You can refer to the writing rubric at any time during the rating procedure. Now you can begin. When you finish rating, please tell the researcher you are done.

Essay No.   Score   Best Feature     Worst Feature    Comments
                    Content          Content
                    Organization     Organization
                    Coherence        Coherence
                    Language         Language
(The row of feature options is repeated on the score sheet for each of the 19 essays.)

APPENDIX D

TOEFL Independent Writing Rating Rubrics

Score   Task Description

5   An essay at this level largely accomplishes all of the following:
    • effectively addresses the topic and task.
    • is well organized and well developed, using clearly appropriate explanations, exemplifications, and/or details.
    • displays unity, progression, and coherence.
    • displays consistent facility in the use of language, demonstrating syntactic variety, appropriate word choice, and idiomaticity, though it may have minor lexical or grammatical errors.

4   An essay at this level largely accomplishes all of the following:
    • addresses the topic and task, though some points may not be fully elaborated.
    • is generally well organized and well developed, using appropriate and sufficient explanations, exemplifications, and/or details.
    • displays unity, progression, and coherence, though it may contain occasional redundancy, digression, or unclear connections.
    • displays facility in the use of language, demonstrating syntactic variety and range of vocabulary, though it will probably have noticeable minor errors in structure, word form, or use of idiomatic language that do not interfere with meaning.

3   An essay at this level largely accomplishes all of the following:
    • addresses the topic and task using somewhat developed explanations, exemplifications, and/or details.
    • displays unity, progression, and coherence, though connections of ideas may be occasionally obscured.
    • may demonstrate inconsistent facility in sentence formation and word choice that may result in lack of clarity and occasionally obscure meaning.
    • may display accurate but limited range of syntactic structures and vocabulary.

2   An essay at this level may reveal one or more of the following weaknesses:
    • limited development in response to the topic and task.
    • inadequate organization or connection of ideas.
    • inappropriate or insufficient explanations, exemplifications, or details to support or illustrate generalizations in response to the task.
    • a noticeably inappropriate choice of words or word forms.
    • an accumulation of errors in sentence structure and/or usage.

1   An essay at this level is seriously flawed by one or more of the following weaknesses:
    • serious disorganization or underdevelopment.
    • little or no detail, or irrelevant specifics, or questionable responsiveness to the task.
    • serious and frequent errors in sentence structure or usage.

APPENDIX E

Follow-up Questionnaire of Rating Behavior

a. Could you order the importance of each aspect of writing when you rate an essay? (For example, 1 = most important and 4 = least important.)
   Grammar and Language use
   Content
   Organization and Development
   Coherence and Unity
b. Do you think your adequacy or lack of Chinese cultural knowledge plays a role in the way you rate the essays?
c. If you have any other thoughts after rating the essays, you can write them here.

APPENDIX F

Coding Scheme

(Original table columns: major categories, sub-categories, definitions, and examples of raters' positive/negative comments.)

General
  - General: General comments on the overall quality of writing. Examples: "This is an excellent essay" (ARNCE-2); "This is very well written" (ARNCE-6).

Content
  - General: General comments on content. Examples: "The writer makes good point" (ARCE-4); "The essay lacks enough content to prove the author's points" (ARCE-4).
  - Ideas: General or specific comments on ideas and thesis. Examples: "The ideas in this article were very clear" (ARCE-5); "The writer had great ideas" (ARCE-5); "Great clear statements" (ARCE-8).
  - Arguments: General or specific comments on aspects of arguments such as balance, use of comparison, counter-arguments, support, use of details or examples, clarity, unity, maturity, originality, relevance, logic, depth, objectivity, conciseness, development, and expression. Examples: "Use examples to address the topic" (CR-8); "Developed the topic to some extend but not enough" (CR-7); "Has some good examples" (ARCE-3); "Great examples" (ARNCE-5); "Inadequate development" (ARNCE-6); "Good arguments were made" (ARCE-5); "Very little examples" (ARCE-9).

Organization
  - General: General comments on organization. Examples: "Well organized" (ARCE-9); "It is organized well" (CR-4).
  - Paragraphs: Comments on the macro level concerning paragraphs, introductions, and conclusions. Examples: "But they should be broken down into subpoints and paragraphs" (ARCE-10); "Some paragraphs seem oddly short" (ARCE-3); "Needs to know how to group paragraphs together" (ARNCE-7); "This could be slightly more effective if some of the short paragraph were combined or expanded" (ARNCE-2).
  - Transitions: Comments on the micro level concerning transitions, coherence, and cohesion. Examples: "Coherence is a problem" (ARNCE-5); "Some problems with cohesion and connectives" (ARNCE-6); "Cohesive devices not effective" (ARCE-3).

Language
  - General: General comments on language. Examples: "Language is poor" (ARCE-8); "The use of language was not very good" (ARCE-5).
  - Intelligibility: Comments on whether the language is clear or easy to understand. Examples: "English usage is sometimes confusing" (ARCE-9); "The language usage at times the writer's opinion was initially unclear" (ARCE-5).
  - Accuracy: General comments on accuracy or specific comments on word use, grammar, and mechanics. Examples: "Serious and frequent errors in sentence structure or usage" (CR-4); "Serious errors in usage" (CR-5); "There is almost no error in sentence structure" (CR-4); "Many language errors in grammar and syntax" (ARCE-9).
  - Fluency: Comments on fluency, conciseness, maturity, naturalness, appropriateness, and vividness of language. Examples: "Too many words, he could've said it more concisely" (ARCE-7); "Sentence variety, command of complex vocabulary and structures" (ARCE-3); "Sometimes redundant" (ARCE-9); "Demonstrate range of vocabulary" (CR-2).

Length
  - Length: Comments on whether the writer has fulfilled the word limit. Examples: "It was quite short" (ARNCE-2).

BIBLIOGRAPHY

Ballard, B., & Clanchy, J. (1991). Assessment by misconception: Cultural influences and intellectual traditions. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 241-278). Norwood, NJ: Ablex.

Barkaoui, K. (2007). Participants, texts, and processes in ESL/EFL essay tests: A narrative review of literature. The Canadian Modern Language Review, 64, 99-134.

Broad, B. (2003). What we really value: Rubrics in teaching and assessing writing. Logan, UT: Utah State University Press.

Brown, J. D. (1991). Do English and ESL faculties rate writing samples differently? TESOL Quarterly, 25, 587-603.

Connor, U. (1996). Contrastive rhetoric: Cross-cultural aspects of second language writing. New York: Cambridge University Press.

Connor-Linton, J. (1995). Looking behind the curtain: What do L2 composition ratings really mean? TESOL Quarterly, 29, 762-765.

Cumming, A. (1997). The testing of writing in a second language. In C. Clapham & D. Corson (Eds.), Encyclopedia of language and education: Vol. 7. Language testing and assessment (pp. 51-63). Dordrecht, Netherlands: Kluwer.

Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7, 31-51.

Cumming, A., Kantor, R., & Powers, D. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. Modern Language Journal, 86, 67-96.

Deville, G., & Chalhoub-Deville, M. (2006). Old and new thoughts on test score variability: Implications for reliability and validity. In M. Chalhoub-Deville, C. A. Chapelle, & P. Duff (Eds.), Inference and generalizability in applied linguistics: Multiple perspectives (pp. 9-25). Amsterdam: Benjamins.

Eckes, T. (2008). Rater types in writing performance assessments: A classification approach to rater variability. Language Testing, 25, 155-185.

Educational Testing Service. (2004). iBT/Next generation of TOEFL Test independent writing rating rubrics (scoring standards). Princeton, NJ: Educational Testing Service.

Educational Testing Service. (2005). TOEFL iBT writing sample responses. Princeton, NJ: Educational Testing Service.

Erdosy, U. (2004). Exploring variability in judging writing ability in a second language: A study of four experienced raters of ESL compositions (TOEFL Report 70). Princeton, NJ: Educational Testing Service.

Hamp-Lyons, L. (1989). Raters respond to rhetoric in writing. In H. W. Dechert & M. Raupach (Eds.), Interlingual processes (pp. 229-244). Tübingen, Germany: Gunter Narr Verlag.

Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L.
Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 241-278). Norwood, NJ: Ablex.

Hinkel, E. (1994). Native and nonnative speakers' pragmatic interpretations of English texts. TESOL Quarterly, 28, 353-376.

Johnson, J., & Lim, G. (2009). The influence of rater language background on writing performance assessment. Language Testing, 26, 485-505.

Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19, 246-276.

Knoch, U. (2009). Diagnostic assessment of writing: A comparison of two rating scales. Language Testing, 26, 275-304.

Kondo-Brown, K. (2002). A facets analysis of rater bias in measuring Japanese second language writing performance. Language Testing, 19, 3-31.

Kobayashi, H., & Rinnert, C. (1996). Factors affecting composition evaluation in an EFL context: Cultural rhetorical pattern and readers' background. Language Learning, 46, 397-437.

Kobayashi, T. (1992). Native and nonnative reactions to ESL compositions. TESOL Quarterly, 26, 81-112.

Rinnert, C., & Kobayashi, H. (2001). Differing perceptions of EFL writing among readers in Japan. Modern Language Journal, 85, 189-209.

Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effect of raters' background and training on the reliability of direct writing tests. Modern Language Journal, 76, 27-33.

Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25, 465-493.

Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22, 1-30.

Shi, L. (2001). Native- and nonnative-speaking EFL teachers' evaluation of Chinese students' English writing. Language Testing, 18, 303-325.

Tabachnick, B., & Fidell, L. (2007). Using multivariate statistics (5th ed.). Toronto: Allyn & Bacon.

Wang, C. (2009). Chinese test-takers' perspectives on GRE and TOEFL writing tests. Paper presented at the Michigan TESOL conference, Grand Rapids, Michigan.

Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.

Weigle, S. C., Boldt, H., & Valsecchi, M. I. (2003). Effects of tasks and rater background on the evaluation of ESL student writing: A pilot study. TESOL Quarterly, 37(2), 345-354.