IMPROMPTU TIMED-WRITING AND PROCESS-BASED TIMED-WRITING EXAMS: TEST TAKERS' AND RATERS' PERCEPTIONS

By

Virginia David

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Second Language Studies - Doctor of Philosophy

2015

ABSTRACT

IMPROMPTU TIMED-WRITING AND PROCESS-BASED TIMED-WRITING EXAMS: TEST TAKERS' AND RATERS' PERCEPTIONS

By Virginia David

In this study I compare 81 ESL students' performance on two writing exams: an impromptu timed-writing (TW) exam and a process-based timed-writing (PBTW) exam. The students had 45 minutes to write an essay for the impromptu TW exam. For the PBTW exam, the same participants read an article and watched two short videos about a topic, discussed the topic in small groups, and planned their essays. This took approximately 45 minutes. After that, they had 45 minutes to write their essays. Thus, the PBTW exam lasted 90 minutes. After taking both exams, the participants answered a short post-writing questionnaire about their perceptions of the two exams. Eighteen participants were randomly selected or volunteered to participate in semi-structured group interviews, which provided more detailed information about their opinions regarding the exams. My secondary aim was to investigate what raters think of the two exams. Two raters scored the essays using an analytic rubric and participated in a semi-structured interview after they scored the essays. Furthermore, the raters participated in two training and norming sessions before they began rating the essays, both of which were audio-recorded to gather more information about their perceptions of the exams. To explore the results, I correlated the scores that the students received in the exams in SPSS and performed a t-test to determine whether there were significant differences between the scores. I also examined the essays to investigate accuracy, lexical and syntactic complexity, and fluency. I investigated the test takers' perceptions through their answers to the post-writing questionnaire and the transcripts of the semi-structured interviews. Finally, I analyzed the training and norming sessions, as well as the semi-structured interviews with the raters, to explore their perceptions of the exams.

Although the overall scores did not significantly differ in the two exams, the participants expressed a clear preference for the PBTW exam because they had time to learn about the topic through the article, videos, and discussion, they had time to plan their writing, and they could use the ideas in the source materials to support their opinions in the essays. The scores that the test takers received for content and punctuation were significantly higher in the PBTW exam, while the scores that they received for spelling were significantly higher in the TW exam. The participants also wrote significantly longer essays and significantly more words per minute in the PBTW exam. In addition, they used more sophisticated vocabulary and a wider variety of nouns in the PBTW exam. The essays did not vary in terms of syntactic complexity or grammatical accuracy, however. The scores that students received in the two exams correlated only moderately (.391), which suggests that the two exams measure different constructs, with the PBTW exam measuring reading, listening, and source integration, among other skills, in addition to writing. Both the learners and the raters mentioned that the learners had difficulties integrating sources, which could be a result of the negative washback of TW exams in their classes.
The inter-rater reliability coefficients for all of the exams were high, but the inter- and intra-rater reliability coefficients for the TW exam were higher. The results of this study, combined with the results of other similar studies and the skills that L2 learners need to succeed in college-level classes, suggest that the PBTW exam may be a better tool to evaluate the construct of academic writing.

I dedicate this work to my husband, Eric Timothy David, and to my son, Liam Felipe David.

ACKNOWLEDGEMENTS

First and foremost, I thank all of the members of my committee, Dr. Charlene Polio, Dr. Peter De Costa, and Dr. Susan Gass, who have provided wonderful help throughout the data collection, analysis, and writing processes. I especially thank my chair and advisor, Dr. Paula Winke, who has guided me throughout my PhD program and given me incredible help and support. I also thank Language Learning for awarding me the Language Learning Dissertation Grant. I thank all of my family members, who have always supported me and believed in me, especially my parents, Edgar Harckbart and Solange Correa Harckbart, my brothers Tadeu Harckbart and Gustavo Harckbart, and my in-laws, Joan Elizabeth David and Patrick Timothy David. I am grateful to the many wonderful colleagues that I have met during my PhD program and for the feedback that they have given me about this project and other projects. I have met wonderful instructors at the English Language Center and thank them for their wisdom and kindness, especially Carol Arnold and David Krise. I am so grateful to all of the students that I have had as a teaching assistant at Michigan State University. Teaching is my passion and I am lucky to have had the amazing students that I had. I am grateful to my friends in Brazil, in the United States, and elsewhere, who have always supported me. I especially thank Dr. Scott Sterling and his wife, Kara Sterling, for their support and for the fun we have had together. Thank you to my husband and best friend, Eric Timothy David, for your patience, understanding, and everlasting support. Thank you to my son, Liam Felipe David. Your smiles and laughter have given me much strength and happiness. I love you both.

TABLE OF CONTENTS

LIST OF TABLES
INTRODUCTION
CHAPTER 1: REVIEW OF THE LITERATURE
1.1 Writing task complexity
1.1.1 The effects of planning on L2 writing
1.1.2 Topic familiarity
1.1.3 Integrated tasks
1.1.4 Test takers' perceptions of writing tests
1.2 Measuring the components of writing
1.2.1 Measures of grammatical accuracy
1.2.2 Measures of lexical complexity
1.2.3 Measures of syntactic complexity
1.2.4 Measures of fluency
CHAPTER 2: THE PRESENT STUDY
2.1 Method
2.1.1 Participants
2.1.2 Procedure
2.1.3 TW exam
2.1.4 PBTW exam
2.1.5 Rating
2.1.6 Post-writing questionnaire
2.1.7 Semi-structured interviews
2.2 Analysis
CHAPTER 3: QUANTITATIVE RESULTS
3.1 RQ1: How do the scores that students receive in a TW exam correlate with their scores in a PBTW exam?
3.2 RQ2: How do the scores that students receive in a TW exam differ from their scores in a PBTW exam?
3.3 RQ3: How do the essays that students write in the two exams differ?
3.3.1 Accuracy
3.3.2 Lexical complexity
3.3.3 Syntactic complexity
3.3.4 Fluency
3.4 RQ4: What are the intra- and inter-rater reliability coefficients for each exam?
3.5 Summary of the quantitative findings
CHAPTER 4: QUALITATIVE RESULTS
4.1 Participants from the semi-structured interviews
4.2 RQ5: What are the test takers' perceptions of the two exams?
4.2.1 Post-writing questionnaire
4.2.2 Interviews
a) Difficulty incorporating sources
b) Using ideas from the source materials
c) Difficulty understanding the videos
d) Topic preference
e) Time constraints
f) Planning time
4.3 RQ6: What are the raters' perceptions of the two exams?
4.3.1 Norming sessions
a) Rubric
b) Source integration
c) Differences between the TW and PBTW exams
4.3.2 Interviews
a) Topics of the exams
b) Source integration
c) Rubric
d) Content validity
CHAPTER 5: DISCUSSION
5.1 Main findings of the study
5.2 Why PBTW exams are a better fit to evaluate ESL academic writing
5.4 Rubric design and use
5.5 Hurdles of implementing PBTW exams
CHAPTER 6: CONCLUSION
6.1 Pedagogical implications
6.2 Limitations
6.3 Future research
6.4 Summary
APPENDICES
APPENDIX A: Videos
APPENDIX B: Reading passages
APPENDIX C: Rubric
APPENDIX D: Post-writing questionnaire
APPENDIX E: Semi-structured interview questions
APPENDIX F: Guidelines for clauses
REFERENCES

LIST OF TABLES

Table 1 Participants
Table 2 Procedures for the TW and PBTW exams for each group
Table 3 Procedures for the PBTW exam for group 1
Table 4 Procedures for the PBTW exam for group 2
Table 5 Descriptive statistics: Average scores
Table 6 Spearman correlation: Average scores
Table 7 Descriptive statistics: Average analytic scores
Table 8 Spearman correlations: Average analytic scores
Table 9 T test: Average scores
Table 10 T tests: Average analytic scores
Table 11 Descriptive statistics: Percentage of error-free clauses
Table 12 T test: Accuracy
Table 13 Descriptive statistics: Lexical sophistication in TW exams and PBTW exams
Table 14 T test: Lexical sophistication
Table 15 Descriptive statistics: Lexical density and lexical variation
Table 16 T tests: Lexical density and lexical variation
Table 17 Descriptive statistics: Syntactic complexity
Table 18 T tests: Syntactic complexity
Table 19 Descriptive statistics: Number of words
Table 20 T test: Number of words
Table 21 Descriptive statistics: Number of minutes
Table 22 Descriptive statistics: Words per minute
Table 23 T test: Words per minute
Table 24 Inter-rater reliability: Analytic scores
Table 25 Intra-rater reliability: Analytic scores
Table 26 Correlation matrix for RM
Table 27 Correlation matrix for RK
Table 28 Interview groups
Table 29 The participants
Table 30 Answers to multiple-choice questions
Table 31 Q2: Which exam was easier and why?
Table 32 Q9: Why did you not use the ideas from the materials or discussion?
Table 33 Q10: What was difficult/easy about the TW exam?
Table 34 Q11: What was difficult/easy about the PBTW exam?
Table 35 Q13: Which exam did you prefer and why?

INTRODUCTION

Many English as a second language (ESL) writing programs all over the world use timed-writing exams for placement and achievement purposes. Placement exams are used to make decisions about which class or classes students should be placed in, and achievement exams assess whether students have mastered the goals and objectives of a course (Carr, 2011). Most timed-writing exams require students to write about a general topic for a specified amount of time, such as 30, 45, or 60 minutes.
The essays are then scored by trained raters, and the test administrators use the scores to make decisions regarding whether students have to take writing classes, what should be taught in the classroom, or whether students are ready to move on to the next proficiency level. Timed-writing exams are extremely cost-effective and easy to design, administer, and score. However, timed-writing exams also have some disadvantages when administered for achievement purposes. Below I review five of the problematic areas.

First, if the purpose of an ESL writing program is to prepare students to succeed in an academic environment, impromptu timed-writing exams may not be the best tool to evaluate students' readiness, because the majority of academic writing tasks that students have to do in their regular university courses do not include answering bare prompts (Cooper & Bikowski, 2007; Hale et al., 1996; Horowitz, 1986; Yigitoglu, 2008). In a review of 54 writing assignment sheets from 29 different university-level courses, Horowitz (1986) found that the majority of assignments that students have to do in their academic courses involve the incorporation of sources. The most common assignments were summary/critique, annotated bibliography, report on an experience, connecting theory and data, case study, synthesis of various sources, and research project. More recently, Cooper and Bikowski (2007), after examining 200 university course syllabi, found that 38% of the assignments that students had to do were research papers, and 20% were book reviews. Impromptu timed-writing exams, in contrast, usually require students to write about their personal experiences or general topics, without allowing them to incorporate sources. Hale et al. (1996) also investigated writing tasks that students have to perform at the university level. They analyzed writing tasks from 162 undergraduate and graduate courses in seven American universities and one Canadian university. Most of the in-class writing assignments in the undergraduate courses required students to write short essays of no longer than half a page. The most common out-of-class writing assignment was the research paper, multiple pages in length. In a similar study, Yigitoglu (2008) collected syllabi and handouts from her Midwestern university to analyze the types of writing tasks that the professors required their students to do. The syllabi and handouts came from courses in the Social Sciences, Sciences, and Humanities. She found that the most common assignments were research and reaction papers. Forty-three percent of the writing prompts were text-based prompts, meaning that the students had to use sources in their writing, while only 29% were bare prompts, with no use of sources. Because impromptu timed-writing exams do not provide students with readings, they do not allow students to incorporate sources in their writing. This practice seems to go against a very robust finding: most university professors require students to incorporate sources in their writing.

Second, as Weigle (2002) noted, most writing assignments that students have to complete in regular academic classes are untimed and completed outside of the classroom. She further explained that timed-writing exams are not authentic academic writing tasks because, when students write a paper for a course, the person who reads the essay is the professor, not a trained rater attending to, for example, grammatical accuracy, as a rater scoring timed-writing essays in an ESL course might be.
Impromptu timed-writing exams also lack content validity when it comes to what many ESL academic writing teachers do in their classes. Many ESL programs that aim at preparing students for university academic writing teach writing as a process that includes reading, discussing, planning, peer review, revisions, and so on. A good example of this practice comes from books commonly used by ESL programs to teach academic writing. Sourcework: Writing from Sources (Dollahite & Haun, 2011), for example, has students engaging in research, summary writing, source integration, planning, and other skills that are not valued in impromptu timed-writing exams. Assessing learners using impromptu timed-writing exams might not provide program administrators with accurate information about what students learned during the course of a semester because these types of exams do not engage the skills that students learn in their academic writing classes.

The third problem with impromptu timed-writing exams is the issue of topic familiarity. Research seems to suggest that when students write about a familiar topic they score higher (He & Shi, 2012; Tedick, 1990; Winfield-Barnes & Felfeli, 1982). When designing an impromptu timed-writing exam, it is very difficult to find a topic with which all students in a program, who come from many different cultural backgrounds, will be familiar. If students have to write about an unfamiliar topic, their scores could suffer.

Fourth, studies have also found that students perform better on writing tests if they are given time to plan what they will write (Ellis & Yuan, 2004; Kellogg, 1988; Worden, 2009). Most timed-writing exams require students to perform their best in a very limited amount of time, with little time for planning. In an attempt to solve the problems that arise with impromptu timed-writing exams, more standardized ESL tests and programs are using integrated writing tasks (tasks that integrate reading and/or listening with writing) instead of, or in addition to, impromptu timed-writing exams. The TOEFL iBT, for example, in addition to an independent writing task, now includes a task for which students have to read a passage and listen to a lecture to respond to a prompt (see www.toefl.org for more information). The ESL writing placement exam at the University of Illinois at Urbana-Champaign includes a mini-lecture, a reading passage, group discussions, and peer review (Cho, 2001). Research suggests that students score higher in integrated writing tasks than they do in independent writing tasks (Cumming et al., 2005; David, under review; Plakans, 2008). Integrated writing tasks provide students with the background information they need to write about the topic, even if they were initially unfamiliar with the topic.

The fifth disadvantage of impromptu timed-writing exams is that oftentimes students memorize phrases and even complete essays ahead of time, in their entirety or in part. In a study about the Test of Written English (TWE; see www.toefl.org), many participants admitted to memorizing entire essays to prepare for the test (He & Shi, 2008). Because the prompts designed for impromptu timed-writing exams have to be about more general topics, test-takers have a good chance of memorizing an essay for a prompt they might actually encounter in a future test.
Process-based and integrated exams could reduce the memorization, method, and practice effects associated with conventional impromptu exams. Integrated tasks allow for a wider range of topics because the necessary background information that test-takers need to perform the task will be given through readings and/or videos. Moreover, students are required to use the sources they read in their essays. These factors could decrease the chance that test takers benefit from memorized material.

To my knowledge, not many studies have investigated test takers' perceptions of writing exams. When designing exams for progress and achievement purposes, teachers and test designers should also take into consideration the perceptions of the test takers, who are the ones preparing for the exams, taking the exams, and bearing the consequences of their performance in the exams. Moreover, impromptu timed-writing exams do not mirror what students do in academic writing classes or other regular university classes, as described above. If the way that we assess learners in a program does not mirror what teachers actually do in the class, students may not feel invested in the exam and may not feel motivated to take the exam or perform well.

In a prior study, I investigated learners' perceptions of bare-prompt versus process-based writing tasks (David, under review). I found that students preferred process-based writing tasks over independent writing tasks. That study spurred me to delve into this topic more deeply. I started investigating the literature on L2 writing assessments more, and I found the five major problem areas in L2 writing assessment that I described above. The general questions I have as an applied linguistics researcher and as an L2-writing teacher are these: Do students perform differently in writing exams when they are given (a) background readings and videos, (b) the opportunity to discuss their ideas in groups, and (c) time to plan their essays? And what do test takers and raters think of these two types of writing tests? Before launching into the study at hand, I first review studies from researchers that have investigated these questions over the last 20 years. I review how task complexity, planning, topic familiarity, and integrated writing tasks have been examined from different angles. I also review how researchers have measured (quantitatively) writing success. This is important because to compare essays that are written under varying conditions, researchers need an objective way to measure the essays using the same proficiency-oriented scale. Student perceptions and their thoughts, I have found, are not enough information to adequately compare performances. L2-writing researchers need both qualitative (interview response data) and quantitative measures (test scores) to understand the full scope of differences across exam formats. Thus, in this study, I use a sequential mixed-methods design (Creswell & Plano Clark, 2011) to compare the scores students receive in an impromptu timed-writing (TW) exam with the scores they receive in an integrated writing exam, which I call a process-based timed-writing (PBTW) exam.

CHAPTER 1: REVIEW OF THE LITERATURE

1.1 Writing task complexity

L2-writing test designers have many options. They can allow for planning before the test takers write their essays; the writing topic can be familiar or unfamiliar; the writing task itself can be integrated (involve reading and/or listening) or not (bare prompt). In essence, test developers can manipulate the complexity of the writing task. Robinson (2001) explained that task design determines how cognitively demanding a task is for learners. He added that when a task is simpler, the learner will make fewer language errors, whereas when a task is more complex, the learner will be prone to making more language errors because the processing demands of the task are higher. According to Robinson (2001), there are different factors that affect the complexity of a task.
If the learner is required to simply provide information as opposed to using reason, for example, the task is less demanding. At the same time, allowing planning time also makes the task simpler. However, if the learner is required to perform another task in addition to the primary task, the task becomes more demanding. For instance, if the learner is required to read a text before writing, the task is more demanding, because the learner has to allocate resources to both tasks, not just one (Robinson, 2001). Background knowledge is another factor that can affect task complexity. If the learner has background knowledge for the task that he or she is completing, then the task is simpler. It is important to acknowledge that the effect of task complexity on written performance is a controversial issue, however. A meta-analysis conducted by Jackson and Suethanapornkul (2013) suggested that task complexity has only a slight influence on learners' written performance, and that its effects depend on the context, including the performance-scoring system. Below I describe the research that has been done to investigate the effects of some of the factors of task complexity, such as planning, topic familiarity, and integrated tasks.

1.1.1 The effects of planning on L2 writing

Studies on the effects of planning on writing have mixed findings. While some researchers have found that students write higher quality essays when they are given time to plan (Ellis & Yuan, 2004; Kellogg, 1988; Worden, 2009), others failed to find this correlation (Johnson, Mercado & Acevedo, 2012; Ong & Zhang, 2013; Shi, 1998). Ellis and Yuan (2004) divided 42 L2 learners into three groups: pre-task planning, on-line planning, and no planning. In the no planning condition, the participants had to write an essay with a minimum of 200 words in 17 minutes. The participants in the pre-task planning group were given 10 minutes to plan their writing in addition to the 17 minutes to write their essay. In the on-line planning condition, the participants were not given a time limit or word limit to write their essay. The authors found that the pre-task planning group wrote longer essays with more syntactic variety, whereas the on-line planning group wrote more accurate essays.

The other study that showed a positive relationship between planning and the quality of writing was that of Kellogg (1988). Kellogg assigned 18 college students to conditions in a two-by-two design: outline vs. no outline and rough draft vs. polished draft. The outline group was told to spend 5 to 10 minutes writing an outline for the writing task, whereas the no outline group was told to begin writing right away. Similarly, the rough draft group was told to write a rough draft of the essay, mainly to put their thoughts down on paper,
The researcher vels of pre - scores (p. 162). Participants who made global revisions to their essays, on the other hand, received lower scores. performance. S hi (1998) had the participants engage in pre - writing discussions before they wrote their essays, but found no significant differences in the overall quality of the essays who se participants participated in such discussions and those who did not. Ong and Zh ang (201 3 ) found a positive relationship between planning time and fluency and lexical complexity. The 108 participants were given 10 minutes to plan and 20 minutes to write (pre - task condition); 20 minutes to plan and 10 minutes to write (extended pre - tas k condition); or 30 minutes to write continuously without any type of planning (free - writing condition). The authors found that the participants in the free - writing condition wrote significantly longer essays and scored significantly higher in lexical comp lexity than the other two groups, indicating that planning time does not affect fluency and lexical complexity. 10 In a large - scale study to investigate the effects of planning on L2 writing, Johnson, Mercado , and Acevedo (2012) examined the essays of 968 ESL students who were divided into five groups: a control group, an idea generation group, an organization group, a goal setting group, and a goal setting plus organization group. The control group did a vocabulary activity before they wrote. The idea generat ion group was given 10 minutes to list as many ideas related to the topic of the essay as possible. The organization group was given an outline worksheet before they wrote. The goal setting and the goal setting plus organization groups listed the rhetorica l goals for the essay, but in addition to that, the latter also completed an outline worksheet before writing their essay. The results showed that planning had no effects on lexical complexity or grammatical complexity, but planning did result in longer es says, although the effect size was considerably small. One problem with this study was the fact that the researchers ESL course as a measure of proficienc y. Students were from four different Advanced levels. As Johnson et al. explained , to free cognitive resources for successful planning. They suggested the following: [There may be] a threshold of proficiency in the target language in order for working memory resources to be freed from the demands of the translation process of writing to such an extent that pre - task planning may have any xts. (p. 272). Although some of the studies mentioned above did not find any positive effects on scores, and sometimes more syntactic variety (Ellis & Yuan , 2004; Kellogg, 1988; Worden, 2009) . These results suggest that learners could benefit from being given time to plan what they 11 complexity of a task. 1.1.2 Topic familiarity Most studies that have investigated the effect of topic familiarity on writing have shown that L2 learners write more fluent essays and score higher when they write about topics with which they are familiar (He & Shi, 2012; Tedick, 1990; Winfi eld - Barnes & Felfeli, 1982). He and Shi (2012) asked 50 language learners to write two essays, one in response to a prompt about a general topic that required them to use personal experiences, and another in response to a specific topic. The prompt that th Be topic. 
The authors explained that the participants wrote shorter essays with weaker cohesion and coherence and more grammatical errors on the specific lacked idea development , and the participants did not explicitly state their position on the issue. Seeking to examine the effects of topic knowledge on writing performance, Tedick (1990) collected two writing samples from 105 ESL students. For one essay, the participants wrote about a general topic, and for the other essay, the participants wrote about a topic related to their fields of study. When they wrote about their field of study, the participants receiv ed higher holistic scores and made fewer grammatical mistakes. Tedick concluded that topic familiarity is related to better performance in writing. The same conclusion was reached in another study (Winfield - Barnes & Felfeli, 1982). 12 In Winfield - Barnes and Spanish speaking country and the other ten were from other non - Spanish speaking countries. All of the participants were asked to read two paragraphs about the book Don Quixote and a Japanese play named Noh . They were then asked to write two compositions about the two paragraphs. The authors noticed that the Spanish - speaking participants wrote longer and more accurate essays when they wrote about the Don Quixote g the dual cognitive processing load by having students deal with culturally familiar material The findings of the studies described above seem to indicate that when L2 learners are familiar with a topic, they write higher qual ity essays and score higher ( He & Shi, 2012; Tedick, 1990; Winfield - Barnes & Felfeli, 1982 theory. When learners write about familiar topics, they have more attentional resources to allot to other element s of writing, such as cohesion and grammar. 1. 1.3 Integrated tasks Many researchers have investigated tasks that integrate reading and/or listening with writing. However, the results are conflicting. Gebril (2010), for example, administered two independe nt writing tasks and two reading - to - write tasks to 115 English as a foreign language ( EFL ) highly and suggested that the two tests measured similar constructs. Ho wever, the findings of other studies suggest the opposite: I ntegrated writing tasks and independent writing tasks measure two different constructs (Cumming et al., 2005; David, under review ). 13 ndent and integrated writing tasks written for the TOEFL ( www.toefl.org ) and found that the discourse differed significantly in terms of text length, lexical complexity, syntactic complexity, and discourse orientation. The test takers wrote shorter essays, used a wider variety of words, and wrote longer clauses and more clauses in the integrated writing task. In addition, they tended to use less personal knowledge and more source information. However, the authors noticed that the test did not differ significantly in terms of grammatical accuracy. However, when the researchers analyzed the more advanced learners, they found somewh at different results. The more advanced learners wrote longer essays, used a wider range of words, wrote longer clauses and more clauses, were more grammatically accurate, and provided better arguments in the integrated writing tasks. It seems, then, that proficiency plays a part in the two types of tests; more advanced learners seem to take more advantage of integrated writing tasks , or it could be that they are able to better demonstrate their advanced proficiency through integrated - task work . 
Plakans (2008) investigated 10 test takers' performance in and perceptions of independent and reading-to-write writing tasks. Nine of the 10 test takers believed that they performed better in the integrated writing task, and they were indeed right. The writing processes that they used, however, differed. The participants planned more in the independent writing task. Furthermore, more experienced writers seemed to engage more with the text in the reading-to-write task.

A study of much relevance to this investigation is my own comparison of students' performance in and perceptions of a process-based timed-writing (PBTW) exam and an impromptu timed-writing (TW) exam (David, forthcoming). In my study, 40 students from academic writing classes in a large Midwestern university took the two exams within the same week. I counter-balanced the order of the administration of the exams. The PBTW exam required students to read two texts and watch two videos about the same topic. After that, the participants were presented with the essay prompt and given 10 minutes to discuss their ideas in groups. In addition, the participants had 10 minutes to plan and 45 minutes to write their essays. For the impromptu TW exam, the participants had to choose between two prompts and write their essays in 45 minutes. Two raters scored each essay using a holistic rubric. The results revealed that students scored higher in the PBTW exam. Although a t test did not reveal a significant difference between the scores in the two exams, the results approached significance (p = .06). Furthermore, the scores the students received in the two exams correlated only moderately (r = .417), indicating that the exams probably measure different constructs. The participants also wrote significantly longer essays in the PBTW exam and used more verb phrases per T-unit than in the impromptu TW exam. There were some limitations to this study, however. The participants had a choice of topic in the TW exam, and the two topics they could choose from in the exam might not have been comparable. One topic required students to write using personal experiences, while the other was a more academically oriented topic about the use of natural resources in the world. The rubric used to rate the PBTW exam was the same as the one used to rate the TW exam, with one exception: it included a component for source use. Adding another component to the rubric was a problem because some participants were penalized for not using the required sources. In addition, the raters only used a holistic rubric to rate the essays. An analytic rubric might provide more fine-grained and specific information about the test takers' performance.

Plakans (2008), Cumming et al. (2005), and David (forthcoming) found very similar results in their studies: The participants scored higher and wrote more syntactically complex essays when they were given reading and listening passages related to the topic of the essay. Cumming et al.'s results also suggested that integrated tasks may allow more advanced learners to write more accurate essays.

1.1.4 Test takers' perceptions of writing tests

Test designers need to consider more than just the dimensions of the writing test when they choose how to formulate and construct a writing test. They must also consider the population of test takers. What will the test takers think of the test? This is relevant because test takers are important stakeholders (perhaps the most important), but their views are among the most difficult to obtain (Rea-Dickins, 1997, p. 306).
As explained by McNamara and Roever (2006), the International Language Testing Association (ILTA) detailed in its Code of Ethics (ILTA, 2000) that a fundamental concern of any test developer should be the welfare of the test takers. At the same time, ILTA recognized within its Code of Ethics that test developers are often pulled between their need to assess in efficient and expedient manners and their need to be conscientious of the test takers and their value systems. As explained above, the test takers are the ones preparing for the exam, taking the exam, and bearing the consequences of the exam. If, for some reason, students are not invested in an exam, they may perform poorly due to lack of motivation (Lee & Coniam, 2013). However, to my knowledge, only a few studies have investigated test takers' perceptions of writing tests (David, under review; He & Shi, 2008; Lee, 2006; Powers & Fowles, 1998).

He and Shi (2008) investigated students' perceptions of the Test of Written English (TWE) (www.toefl.org) and the Language Proficiency Index (LPI) (see www.celpiptest.ca). The TWE includes an integrated writing task and an independent writing task. In addition to the TWE, many international students in Canada are required to take the LPI, a Canadian English proficiency test. He and Shi (2008) interviewed 16 Chinese students attending a Canadian university about how they prepared for the two tests and about their perceptions of both tests. Of the 16 participants, 13 failed or had difficulties passing the LPI. With the exception of two participants, all participants stated that they prepared for the TWE by memorizing sentences, and some even admitted to memorizing entire essays. To prepare for the LPI, the participants took a 3-month preparatory course at the university. The course focused on writing skills, and not memorization, and most students felt that the course did not help them pass the LPI. The authors attributed this to the way the participants had prepared for writing tests in their home countries: through memorization. Most of the participants believed that the LPI was more difficult than the TWE because they had to write about unfamiliar topics. Furthermore, they also felt that they needed to know more about the Canadian culture to write about the topics in the LPI. They thought that the TWE, on the other hand, provided them with more general topics. Finally, He and Shi reported that the participants felt more pressure to write grammatically accurate essays in the LPI than in the TWE. In the TWE, students had memorized sentences and essays, which helped them to write more accurate essays.

In another study regarding test takers' perceptions of writing exams, Powers and Fowles (1998) investigated test takers' opinions of essay prompts for the writing section of the Graduate Record Examination (GRE, see www.ets.org/gre). First, the researchers asked the participants to rate six prompts on a 7-point scale, on which 7 was extremely good and 1 was extremely poor. They were also asked
On the other hand, the participants rated prompts as difficult when: 1) the topic was not interesting; 2) they lacked topic familiarity; 3) th e prompt was not clearly explained; 4) the participants had negative feelings about the topic; and 5) they thought the topic was not pertinent. However, some participa nts rated a prompt as easy, while others rated the same prompt as difficult, indicating t hat participants have different opinions about the prompts. Finally, Powers and Fowles compared the essays that the participants wrote for prompts they rated as difficult to those they rated as easy. The authors found that the participants tended to do bet ter when they wrote essays for the prompts that they are related to their performance s . Lee (2006) investigated a daylong ESL writing placement exam at th e University of Illinois at Urbana - Champaign. The ESL learners watched a mini - lecture, read an article, participated in group discussions, had time to plan, write, and then peer review their essays. The rceptions of the placement exam at the end 18 participants complained that the exam was too long, others said the opposite. The participants also made positive comments about the group discussions and peer feedback . In the study that I conducted comparing two types of writing exams (David, under review took the two exams, they answered a short questionnaire that contained both multiple - choice and short - answer questions about the two exams. Seventy percent of the participants preferred taking the PBTW exam and 62% thought that it was easier than the impromptu TW exams. Six participants also said that they derived ideas from the group discussions in which they participated. Almost half of the participants (n = 18) said that they g leaned ideas from the articles they read and the videos they watched and used these ideas in thei r writing. Some participants (n = 8) mentioned that they liked the fact that they had time to plan their essays before they began writing. d that learners value the process of writing through wh ich process - based writing exams allow them to go. The participants in both studies reported that they liked having time to plan their essays and discuss their ideas in groups. Furthermore, the participants in my study reported that they learned ideas from the topic by reading and watching videos and used those ideas in their writing. Powers and Fowles may have an effect on their performance. If students do not like the exam that they are taking, they may not feel motivated to take it or do well (Lee & Coniam, 2013) . 19 1.2 Measuring the components of writing Above I reviewed writing task complexity, which i s important for test development, and In addition to determine which c omponents of writing to evaluate them and how to measure each of these components. The most commonly used holistic and analytic rubrics co ntain categories like content, organization, grammar, vocabulary, and so on. Jacobs , Zinkgraf, Wormuth, Hartfiel, and (1981) rubric, for example, includes content, organization, vocabulary, language use, includes cohesion. Task achievement, coherence and cohesion, lexical resource, and grammar The researchers mentioned above who have investigated the effects of planning, task complexity, and integrated tasks have comp ared tes t dimensions: lexical and syntactic complexity, fluency, and grammatical accuracy ( Cumming et al., 2005; Ellis & Yuan, 2004; Johnson et al., 2012; Shi, 1998). Other researchers have also ting using the same measures. 
Below I review how these researchers have analyzed grammatical accuracy, lexical and syntactic complexity, and fluency.

1.2.1 Measures of grammatical accuracy

Most of the researchers who have analyzed grammatical accuracy in L2 writing used measures such as error-free T-units or error-free clauses. T-units are independent clauses and their accompanying dependent clauses (Hunt, 1965). Researchers decide, on their own, what counts as an error when they code error-free T-units. For the guidelines used in this study to identify clauses and error-free units, see Appendix F.

Armstrong (2010), Kuiken, Mos, and Vedder (2005), and Wigglesworth and Storch (2009) measured grammatical accuracy by calculating the percentage of error-free T-units per total T-units. Armstrong (2010) investigated the effects of giving grades for compositions written by intermediate L2 Spanish learners on fluency, accuracy, and complexity. The author compared learners' performance on essays that were graded, essays that were ungraded, and written work done for an online discussion board. Armstrong measured accuracy by analyzing the number of error-free T-units, error-free T-units per T-unit, and errors per T-unit. She only found a significant difference for errors per T-unit between the graded and ungraded essays and for error-free T-units between the ungraded essays and the online discussion board. There were more error-free T-units in the online discussion board.

Kuiken et al. (2005) also used error-free T-units per total T-units as a measure of grammatical accuracy, in addition to dividing the errors into three categories by severity: minor errors, more serious errors, and errors that make the text incomprehensible. The authors investigated the effect of task complexity on syntactic complexity, lexical variation, and accuracy. The participants were 62 Dutch learners of Italian who performed two writing tasks, one of which was more cognitively complex than the other. The results of the study showed that the learners made significantly more errors in the more complex task.

Another study that measured grammatical accuracy was that of Wigglesworth and Storch (2009). The authors wanted to investigate whether advanced learners' essays differed in accuracy, fluency, and complexity when students worked in pairs or individually. In addition to error-free T-units per total T-units, they also calculated the percentage of error-free clauses. The results of the study revealed that the 144 participants produced significantly fewer error-free T-units when they wrote the essays in pairs.
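Both error-free T-unit and error-free clause measures reduce to the same proportion once the units have been coded by hand. The following is a minimal sketch in Python, for illustration only: it is not the procedure used in this dissertation, the counts are hypothetical, and the manual identification of units and errors is the real work (see Appendix F).

    # Illustration only: the two accuracy ratios discussed above, computed from
    # hand-annotated counts. Identifying T-units, clauses, and errors is the
    # manual step; what follows is simple proportion arithmetic.

    def percent_error_free(error_free_units: int, total_units: int) -> float:
        """Percentage of error-free T-units (or clauses) out of all units."""
        if total_units <= 0:
            raise ValueError("An essay must contain at least one unit.")
        return 100.0 * error_free_units / total_units

    # Hypothetical essay: 18 of its 25 T-units and 30 of its 41 clauses
    # contain no errors.
    print(percent_error_free(18, 25))            # 72.0 (% error-free T-units)
    print(round(percent_error_free(30, 41), 1))  # 73.2 (% error-free clauses)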
The authors analy zed the following measures of linguistic accuracy found in 35 different studies in an attempt to investigate their reliability: holistic scores of language use and vocabulary, error - free T - units per total T - units, error - free clauses per total clauses, weig hted error - free T - units per total T - units, number of errors per words, and number of verb phrase, preposition, article, and lexical errors per words. Polio and Shea used these measures of grammatical accuracy to analyze essays in the MSU data set . This dat a set includes essays written over the course of a semester by ESL students enrolled at the English Language Center at Michigan State University. The inter - rater reliability for error - free T - units per total T - units and error - free clauses per total clauses was .88; .85 for language holistic scores and .90 for vocabulary holistic scores; .84 for weighted error - free T - units; and .89 errors per total words. The other measures of linguistic accuracy all had inter - rater reliability lower than .80. Error - free T - units and error - free clauses seem to be among the most commonly used measures to investigate accuracy in L2 writing and researchers have found that these are both reliable measures to do so (Polio and Shea, 2014). 22 1.2.2 Measures of l exical complexity M , and there are quite a few lexical complexity measures from which to choose. Fritz and Ruegg (2013) investigated whether raters are sensitive to lexical accuracy, lexical soph istication, and lexical range when rating timed writing essays. Eight hundred and ninety - five EFL learners wrote an essay responding to a single prompt in 30 minutes. The researchers manipulated these essays in terms of lexical accuracy, sophistication, an d range. The authors included two types of errors in the essays: errors of word choice, that is, using the wrong word in a certain context, and errors of part of speech, that is, using the wrong part of speech of a word in a certain context. The manipulat ed essays contained three levels of accuracy: low accuracy, when the essay had errors in 32 content words; medium accuracy, when there were 16 errors in the 32 content words; and high accuracy, when all of the 32 content words were correct. The authors use d RANGE, developed by Nation (2005), to manipulate lexical sophistication. RANGE compares essays to three different word lists: the most common 1,000 words (1,000 word level), the second most common 1,000 words (2,000 word level), and the third most common 1,000 words (3,000 word level). Fritz and Ruegg created essays with three levels of lexical sophistication: low lexical sophistication (32 content words from the 1,000 word level), medium lexical sophistication (32 content words from the 2,000 word level) , and high lexical sophistication (32 content words from the 3,000 level). T he researchers created three categories when they manipulated lexical range : high lexical range, when the essay contained 32 different content words ; medium range, when the essay c ontained 25 different content words; and low range, when the essay had 18 different content words. Twenty - seven raters scored the manipulated essays using a four - point analytic 23 rubric . The results of the study revealed that raters seem to be sensitive to l exical accuracy, but not to lexical range or lexical sophistication. by Kormos (2011). Kormos investigated how task complexity influence d diversity and lexical complexity. complexity. 
The participants performed two writing tasks: a picture description task and a picture narration task. The main difference between the two tasks is that the la tter is said to be mor e to use their imagination and find a way to relate the pictures to one another and invent a story D - formula (2005) RANGE program to measure lexical range, in the same manner Fritz and Ruegg (2013) used the program. One more measure of lexical complexity was used in this study: the concreteness of t he words, that is, how concrete or abstract words are in a text. The results of the study show that task complexity did not influence lexical complexity at all. One reason for that might be that the two writing tasks were in the same genre, narratives. Kor mos did, however, find a significant difference in lexical variety and complexity between L1 and L2 writers. A ( 2005 ) , mentioned above. In addition to analyzing how task complexity influenced accuracy in L2 writing, the authors also investigated lexical variation. They measured lexical variation as the number of word types divided by the total number of word tokens and word types per square root of two times the total number of word tokens. Similarly to K complexity has no effect on lexical variation. 24 judgments spoken language, Lu (2012) created the Lexical Complexity Analyzer. The Lexical Complexity Analyzer generates 25 different measures of lexical complexity that measure three different dimensions of lexical complexity: lexical density, lexical sophistication, and lexical variation. Lexical d refers to the rat io of the number of lexical (as opposed to gramm atical) words to the total num ber o 191). The author included the following categories in his definition of lexical words: no uns, adjectives, verbs, and adverbs. For lexical sophistication, Lu analyzed five different measures: lexical sophistication 1, following Linnarud (1986) and Hyltenstam (1988) , who divided the total number of sophisticated lexical words by the total number of lexical words in a text ; lexical sophistication 2, following Laufer (1994) , who analyzed lexical sophistication in terms of the number of sophisticated words divided by the number of words in a text ; verb sophistication 1, following Harley and King (19 89) , which is the number of sophisticated verbs divided by the number of total verbs in a text ; verb sophistication 2, which is a modification of verb sophistication 1, following Chaudron and Parker (1990) ; and corrected verb sophistication 1, which is als o an adaptation of verb sophistication 1, following Wolfe - Quintero, Inagaki and Kim (1998) (See Lu, 2012 for a more detailed explanation of these measures). Finally, Lu (2012) used 20 different measures to investigate lexical variation. These measures are the following: number of d ifferent words, first 50 words , expected random 50 , expected seque nce 50, type - token r atio , m ean s egmental TTR (50) , c orrected TTR , r oot TTR , b ilogarithmic TTR , uber i ndex , D m easure , lexical word variation, verb variation 1, squa red verb variation 1, corrected verb variation 1, verb variation 2, noun variation, adjective variation, adverb variation, and modifier variation. The trained raters scored their 25 performanc e on these tasks . The results revealed that lexical density and lexical sophistication did not correlate very strongly with the scores that the participants received. 
However, lexical Two of the studies that investigated the effects of planning on writing mentioned above analyzed the effects of planning on lexical complexity (Johnson et al., 2012 ; Ong & Zha ng, 2013 ). In study, the researchers analyzed five different measures of l exical complexity: lexical diversity - Metrix) , the number of pronouns to noun phrases, the use of personal pronouns, how often content words appear in comparison to a corpus, and how often the four and five most fr equent word families appear in (2005) RANGE). As mentioned above, Johnson et al. found no significant differences in terms of lexical complexity in the different planning conditions. Ong and Zhang (2013 ) used the foll owing formula to calculate lexical complexity: WT 2 As mentioned previously, t he results revealed that the participants who in the pre - task (10 minutes to plan and 20 minutes to write) and free - writing (write nonstop for 30 minutes) conditions wrote more lexically complex essays than the participants in the extended pre - task (20 minutes for planning and 10 minutes for writing) and control conditions (write for 30 minutes) . So me common measures of lexical complexity used by researchers to analyze L2 writing or speech are: lexical sophistication ; lexical density ; and lexical variety. Of particular interest to this study are the two programs developed by Nation (2005) and Lu (201 analyzes lexical sophistication that compar es essays to three different word lists ; Lexical Complexity Analyzer that examines 25 different measures of lexical complexity, all of which include lexical density and variety measures . 26 1.2.3 Measures of s yntactic complexity Many of the studies that investigated integrated tasks and the effects of planning on L2 writing i ncluded measures of syntactic complexity ( Cumming et al ., 2005; David, under review; Ellis & Yuan, 2004 ; Kuiken et a l., 2005 ; Kuiken & Vedder, 2008 ). When comparing integrated writing tasks to independent writing tasks, Cumming et al. (2005) analyzed syntactic complexity in two ways: by counting the number of clauses per T - unit and the number of words per T - unit. The fi ndings of the study revealed that the participants wrote significantly less words per T - unit in the listening to write tasks than in the reading to write and independent writing tasks. Ellis and Yuan (2004) measured syntactic complexity the same way Cummin g et al. did: by counting the number of T - units per clauses. The findings of the analysis revealed that the participants who engaged in pre - task planning had significantly more syntactic variety than the no planning group. Another study that investigated t he effects of task complexity on syntactic co mplexity was that of K uiken et al. (2005). The authors used the following measures of syntactic complexity: number of clauses per T - unit and number of dependent clauses per clause. The results revealed no sign ificant differences in syntactic complexity in the two tasks. In a later study, Kuiken and Vedder (2008) used similar measures of syntactic complexity to investigate the effects of cognitive task complexity on Italian and French as a Foreign Language writi ng. They calculated the number of clauses per T - unit and the number of dependent clauses per clauses. The authors found no significant differences in syntactic complexity. 
One study of particular interest to the present study is that of Ai and Lu (2013). The authors investigated syntactic complexity in a large number of essays written by university-level native and nonnative speakers of English, and they used the L2 Syntactic Complexity Analyzer, developed by Lu (2010), to do so. Ai and Lu used ten out of the fourteen syntactic complexity measures generated by the L2 Syntactic Complexity Analyzer: mean length of clause, mean length of sentence, mean length of T-unit, dependent clauses per clause, dependent clauses per T-unit, coordinate phrases per clause, coordinate phrases per T-unit, T-units per sentence, complex nominals per clause, and complex nominals per T-unit. They found that the native and nonnative speakers' compositions differed significantly in terms of syntactic complexity. I also used the L2 Syntactic Complexity Analyzer in my earlier study (David, under review) to compare the essays written for the two types of exam. The results revealed that the participants wrote more verb phrases per T-unit in the PBTW exam, while they also wrote more coordinate phrases per clause in the TW exam.

Number of clauses per T-unit and number of words per clause or T-unit seem to be among the most common measures of syntactic complexity in L2 writing research. Lu's (2010) L2 Syntactic Complexity Analyzer includes these measures, as well as many others that are related to syntactic complexity, such as the number of dependent clauses per T-unit. The program has been found to be a reliable measure of syntactic complexity when compared to human coders (Lu, 2010).

1.2.4 Measures of fluency

Researchers analyzing fluency in L2 writing have measured the concept of fluency in different ways, and there is much controversy over what constitutes an accurate measure of fluency (Abdel Latif, 2012). Ellis and Yuan (2004), for instance, used two measures of fluency: syllables per minute and the number of words a participant reformulated (i.e., crossed out or changed) divided by the total number of words produced. The results revealed that the participants wrote more syllables per minute in the pre-task planning group than in the no planning group. Johnson et al. (2012), on the other hand, investigated fluency by calculating the total number of words and average sentence length. The authors only found a significant difference in the average sentence length. The learners in the control group wrote longer sentences than the learners in the pre-task planning condition. Similarly, Cumming et al. (2005) calculated the total number of words per essay, the only measure of fluency that the authors used. They found that the participants wrote more in the independent writing tasks.

The study that I conducted comparing PBTW exams and impromptu TW exams also included one measure of fluency (David, under review). In order to measure fluency, I divided the total number of words per essay by the total number of minutes that the participants were allowed to write. The results revealed that the participants wrote significantly longer essays in the PBTW exam when compared to the TW exam. This measure of fluency, however, was problematic because some participants finished writing before the 45 minutes were over, and the number of words per minute was possibly not a true measure of fluency for those participants who finished writing earlier. Another one of the studies mentioned above, Kellogg (1988), measured fluency in terms of the total number of words, total time spent on task, and words per minute, a much more complete and accurate set of fluency measures than the one used in my study (David, under review). The L2 learners in the outline condition wrote more words, spent more time on task, and wrote faster than the learners who did not write an outline for their essays. Ong and Zhang (2013) also investigated the effects of task complexity on fluency.
The authors calculated fluency by analyzing words per minute for the total number of minutes on task (fluency II) and words per minute for the amount of time the learners spent writing their essays (fluency I). The results of the study revealed that the learners scored significantly higher for fluency II in the free-writing condition than the participants in the pre-task and extended pre-task conditions. However, there were no significant differences among the planning conditions in terms of fluency I.

There are many problems with these measures of fluency, as explained by Abdel Latif (2012). Abdel Latif stated that one of the problems with using any of the above measures of fluency is that writers often pause while writing, and the pauses are not very consistent. In fact, Flower and Hayes (1981) said that more than half of the time that writers spend on task consists of pause time, not writing time. Total number of words per essay and words per minute do not take those pauses into consideration and therefore may not reflect fluency accurately. Another problem that Abdel Latif reported with the most commonly used measures of fluency is that, when writers pause, they pause for different reasons. They may pause to plan, to monitor language, to retrieve information, and so on. The author explained that pausing can either help or interfere with writing fluency.

CHAPTER 2: THE PRESENT STUDY

A need exists in the literature to develop a more thorough understanding of the effects of planning, topic familiarity, and integrated tasks on L2 writing, as well as of test takers' opinions of writing exams. I addressed these issues by collecting both quantitative and qualitative data, because each type of data provides a different view of them. According to Creswell and Plano Clark (2011), mixed methods research "focuses on collecting, analyzing, and mixing both quantitative and qualitative data in a single study or series of studies. Its central premise is that the use of quantitative and qualitative approaches, in combination, provides a better understanding of research problems than either approach alone" (Creswell & Plano Clark, 2011, p. 5). The quantitative results from this study will provide statistical evidence about the test takers' performance in the two exams, while the qualitative results may provide explanations and insights about their performance, which will ultimately enhance and add depth to the findings of the study.

The present study is a fixed mixed-methods study, rather than emergent and evolving, because the use of quantitative and qualitative methods for collecting data was planned before data collection began (Creswell & Plano Clark, 2011). The data collection for this study was sequential (Creswell & Plano Clark, 2011); that is, data collection for the quantitative portion of the study was done first, by administering tests, followed by the collection of the qualitative data, which consisted of questionnaires and interviews. In addition, the results of both the quantitative and qualitative portions of the study were combined during the interpretation phase, after all of the data were collected. Creswell and Plano Clark (2011) explained that there are multiple ways to design and conduct mixed methods research: the convergent parallel design, the explanatory sequential design, the exploratory sequential design, the embedded design, the transformative design, and the multiphase design. As I described, I followed the convergent parallel design, in which the researcher combines the results of the quantitative and qualitative data after the data for each portion of the study have been analyzed.
The reason why this mixed-methods design was chosen is that, as Morse observed, different methods answer different questions: measuring the test takers' performance in the two exams is best done using quantitative methods, but I also wanted to investigate the test takers' and raters' perceptions of the exams, which calls for qualitative methods. I also wanted to compare and contrast the quantitative and qualitative results to validate the findings of each and to gain a better understanding of what they mean (Creswell & Plano Clark, 2011).

In an attempt to expand on and combine the findings of the studies mentioned in the literature review, the purpose of this mixed methods study is to investigate how test takers perform in two timed writing exams: an impromptu TW exam and a PBTW exam. Another goal is to investigate the test takers' and raters' perceptions of the two exams. In the present mixed methods study I seek to investigate the following research questions:

1) How do students' scores on an impromptu TW exam correlate with their scores in a PBTW exam?
2) Do students' scores on an impromptu TW exam differ from their scores in a PBTW exam?
3) How do students' essays in the TW and PBTW exams differ in terms of:
a) Accuracy?
b) Lexical complexity?
c) Syntactic complexity?
d) Fluency?
4) What are the intra and inter-rater reliability coefficients for each exam? (Are they comparable?)
5) What are the test takers' perceptions of the two types of writing exams?
6) What are the raters' perceptions of the two types of writing exams?

The quantitative data in this study are the scores that the participants received on the two exams; the accuracy, lexical and syntactic complexity, and fluency measures derived from the essays; and the intra and inter-rater reliability coefficients for the two exams. The qualitative data consist of the questionnaire and interviews. The qualitative data were collected and analyzed borrowing from the case study tradition, in which a small number of research participants' behaviors, performance, knowledge, and/or perspectives are studied very closely and intensively, often over an extended period of time, to address timely questions about language acquisition and use (Duff, 2012, p. 95). In the case of the present study, the focus was on 18 participants, but data collection did not occur over an extended period of time. Instead, the participants were interviewed only once, after they took the two exams. Because one of the aims of the study was to investigate test takers' perceptions of the exams, a subset of the participants took part in semi-structured interviews. Duff explained that researchers can learn much about individuals by examining a smaller number of them instead of a large number of participants. The individuals that I investigated are 18 of the 81 participants and the two raters who participated in this study. These participants were either randomly selected by me (in the case of the two classes I taught), selected by their teachers, or they volunteered for the interviews (in the case of the teachers who collected data for me). This particular investigation of the 18 test takers and the two raters is interpretive in nature: I seek to understand the test takers' and raters' perceptions of the two exams through interviews. The themes that emerged in the interviews with the test takers are described within the context of the responses to the post-writing questionnaire, which contained both multiple-choice and open-ended questions. Instead of describing the data case by case, I describe the themes that emerged across cases for both the test takers and the raters.

2.1 Method

2.1.1 Participants

The participants of this study were 81 ESL learners who were taking an academic writing course at the English Language Center at Michigan State University at the time of data collection, which occurred in the summer and fall semesters of 2014. There are two academic writing courses at the English Language Center (ELC): one is a 6-credit-hour course that focuses on grammar and writing (ESL 220), and the other is a 3-credit-hour course that focuses only on writing (ESL 221). Students are placed in these courses based on the score that they receive on a placement exam upon arrival on campus.
Some students in ESL 220 and ESL 221 have taken other ESL courses at the ELC, while others are placed directly in these courses and have not taken other courses at the center. The participants in this study are from both ESL 220 and ESL 221. The reason why students from both courses participated is that both courses focus on academic writing, although ESL 221 does not focus on grammar instruction. Only eight of the participants were enrolled in ESL 221. I sent an email to the teachers who were teaching ESL 220 and ESL 221 during the summer and fall semesters in 2014 to ask them for help to collect data. Four teachers replied and agreed to allow me to collect data during class time. Two of the teachers agreed to do the data collection themselves because my schedule did not allow me to come into their classrooms to collect data. Two of the six classes in which the data were collected were taught during the summer semester by me. Some teachers decided to give extra credit to the students who agreed to participate in the study and others did not. Some teachers used the essays for grading, while others did not. The data were collected in five ESL 220 classes and one ESL 221 class.

Table 1 summarizes the information about the participants. Fifty males and 31 females participated in the study. The average length of residence (LoR) in the United States of the participants was 10 months, and the average number of years they had had formal English instruction (FI) was five and a half years. Approximately 38% (n=31) of the participants were from Brazil, 37% (n=30) were from China, and 9% (n=7) were from Saudi Arabia; the remaining participants were from Angola (n=5), South Korea (n=3), the United Arab Emirates (n=2), Kuwait (n=1), Libya (n=1), and Japan (n=1). The most common majors among the participants were Business (n=7), Finance (n=6), Electrical Engineering (n=6), Accounting (n=5), Medicine (n=5), Agricultural Sciences (n=5), Civil Engineering (n=4), and Forestry (n=4). Many of the participants were already taking classes in their majors alongside their ESL classes, but some were only taking ESL classes at the time of data collection. Only two of the participants had undecided majors. The great majority of the students were undergraduate students; five were graduate students. The median age of the participants was 21 years.

Table 1
Participants
Country                N    Mean age   Mean LoR    Mean FI
Brazil                 31   22         4 months    4 years
China                  30   20         15 months   7 years
Saudi Arabia           7    25         13 months   5 years
Angola                 5    20         9 months    1 year
South Korea            3    24         25 months   9 years
United Arab Emirates   2    18.5       9 months    2.5 years
Kuwait                 1    19         8 months    9 years
Libya                  1    28         7 months    7 months
Japan                  1    19         36 months   6 years
Total                  81   21         10 months   5.5 years

In order to recruit raters for this study, I posted a message on a Facebook page for graduate students and graduates of the MATESOL and SLS programs of Michigan State University. The criteria for scoring the essays included a minimum of two years of experience teaching ESL academic writing and prior experience scoring second language writing. I selected two respondents who met the criteria for scoring. The raters were two female native speakers of English in their early thirties. One rater, whom I shall call RM, majored in Linguistics and was a Ph.D. candidate in Second Language Acquisition at the time of data collection. RM had taught ESL for three years, both to K-8 and college-level students. The other rater, RK, had a BA in German.
She was a rater for a high-stakes ESL standardized test at the time of data collection. RK had taught ESL for six years, both in K-12 and at the college level.

2.1.2 Procedure

Following the procedures in my previous study (David, under review), the participants took two different writing exams: an impromptu TW exam and a PBTW exam. The order of the exams was counterbalanced, and the exams were administered within the same week to reduce any effect of instruction. The participants were randomly divided into two groups: one group wrote an impromptu TW exam about obesity and a PBTW exam about gun control, and the other group wrote an impromptu TW exam about gun control and a PBTW exam about obesity. As mentioned above, the order of the TW and PBTW exams was counterbalanced. Table 2 outlines the procedure for the administration of the two exams for the two groups.

Table 2
Procedures for the TW and PBTW exams for each group
Group     Order of the exams/topic
Group 1   Impromptu TW exam (obesity), then PBTW exam (gun control); or PBTW exam (gun control), then impromptu TW exam (obesity)
Group 2   Impromptu TW exam (gun control), then PBTW exam (obesity); or PBTW exam (obesity), then impromptu TW exam (gun control)

The prompts for both topics and both types of exams were exactly the same, with one exception. The participants did not read articles, watch videos, participate in class discussion, or have time to plan their essays for the impromptu TW exam. The procedure followed for each exam is described below.

2.1.3 TW exam

The participants were given a prompt about obesity or gun control, depending on the group to which they were assigned, and they were given 45 minutes to write an essay answering the prompt. Below are the prompts for each topic:

Obesity is a healthcare concern worldwide, but especially in the United States. Two solutions being proposed are: 1) to tax junk food to discourage people from buying it; and 2) to ban the sales of large sodas in some establishments. Do you believe these solutions would encourage people to reduce their consumption of unhealthy foods? Propose other solutions to the problem in the United States. Be sure to fully develop your essay by including logical supporting ideas, clear explanations, relevant examples, and specific details.

Gun control continues to be a problem in the United States and in other countries around the world. There are three main views on gun control in the United States: 1) restrict the sales of guns to people with no criminal background; 2) ban the sales of guns altogether; or 3) allow the sales of guns to anyone. What do you think? Be sure to fully develop your essay by including supporting ideas, clear explanations, relevant examples, and specific details.

The participants were allowed to ask any questions that they had about the language contained in the prompts or questions that clarified what they were expected to do for the writing task. The participants were not allowed to use dictionaries or any electronic devices during the exam.

2.1.4 PBTW exam

I created this exam for another study that I conducted (David, under review). The students watched two short videos (the links can be found in Appendix A) and read an article (the link can be found in Appendix B) about the same topic (obesity or gun control). They were then presented with the prompt and were given 10 minutes to discuss their ideas in groups.
After that, the students had 10 minutes to plan what they would write on a sheet of paper containing the prompt and blank space for pre-writing. No specific pre-writing technique was elicited, and the participants were not allowed to speak to one another during the planning stage. Finally, they had 45 minutes to write their essays. Again, the participants were not allowed to use dictionaries or electronic devices during data collection, but they were allowed to ask questions about the vocabulary or content of the prompt. Below are the prompts for each topic:

Obesity is a healthcare concern worldwide, but especially in the United States. Two solutions being proposed are: 1) to tax junk food to discourage people from buying it; and 2) to ban the sales of large sodas in some establishments. Do you believe these solutions would encourage people to reduce their consumption of unhealthy foods? Propose other solutions to the problem in the United States. Be sure to fully develop your essay by including logical supporting ideas, clear explanations, relevant examples, and specific details. Use ideas from the videos we watched and the article we read about the topic. Do not forget to give credit to the authors.

Gun control continues to be a problem in the United States and in other countries around the world. There are three main views on gun control in the United States: 1) restrict the sales of guns to people with no criminal background; 2) ban the sales of guns altogether; or 3) allow the sales of guns to anyone. What do you think? Be sure to fully develop your essay by including supporting ideas, clear explanations, relevant examples, and specific details. Use ideas from the videos we watched and the article we read about the topic. Do not forget to give credit to the authors.

Table 3, adapted from my earlier study (David, under review), has a detailed description of the procedures that were followed for the PBTW exam about obesity, and Table 4 outlines the procedures for the PBTW exam about gun control. The PBTW exam began with a five-minute introduction to the general topic of the prompt. When the other teachers and I proctored the exam, we began by asking the students questions to introduce the topic. After that, we told the participants that they would watch two videos and read one article related to the topic of the essay that they would have to write. We encouraged the participants to take notes while watching the videos and reading the article, and we said that the students would be allowed to use those notes while planning and writing their essays. After the participants watched the videos and read the article, we introduced the prompt to the students and told them that they would have ten minutes to discuss their ideas about the prompt in groups of four. Before the group discussion began, we asked if the students had any questions about the prompt. While the participants were discussing their ideas, the proctor walked around from group to group, asking questions to prompt further discussion and to help groups that were struggling to share their ideas. Next, the participants were given a blank sheet of paper with the prompt written at the top and ten minutes to plan their essay. The participants were told that they could use the planning sheet when writing. Finally, the participants had 45 minutes to write their essay. The proctor had a timer projected at the front of the class so that the participants could keep track of time.
Moreover, we instructed the participants to write down how many minutes were remaining when they finished writing their essay in order to measure fluency.

Table 3
Procedures for the PBTW exam for Group 1
Time frame | Activity | Procedures
5 minutes | Introducing the topic | 1) The proctor introduced the topic by asking students the following questions: What is obesity? What causes obesity? What are the consequences of obesity? How can we encourage people to eat more healthy foods?
5 minutes | Videos | 2) The proctor briefly explained what the videos were about and played them for students. The students were encouraged to take notes while watching the videos.
15 minutes | Reading | 3) The students read the article.
10 minutes | Group discussion | 4) The proctor read the essay prompt aloud to students and asked if they had any questions about it. 5) The students discussed the essay prompt in groups of four.
10 minutes | Planning | 6) Students planned the essay.
45 minutes | Writing | 7) Students wrote their essays.

Table 4
Procedures for the PBTW exam for Group 2
Time frame | Activity | Procedures
5 minutes | Introducing the topic | 1) The proctor introduced the topic by asking students the following questions: What do you know about gun control? What laws does your home country have about gun control? Should anyone be allowed to buy and carry guns?
5 minutes | Videos | 2) The proctor briefly explained what the videos were about and played them for students. The students were encouraged to take notes while watching the videos.
15 minutes | Reading | 3) The students read the article.
10 minutes | Group discussion | 4) The proctor read the essay prompt aloud to students and asked if they had any questions about it. 5) The students discussed the essay prompt in groups of four.
10 minutes | Planning | 6) Students planned the essay.
45 minutes | Writing | 7) Students wrote their essays.

2.1.5 Rating

Two experienced raters rated the exams using an analytic rubric. Analytic rubrics allow raters to assign scores to individual categories; the total score is the sum of the scores that the test taker received in each category (a minimal sketch of this summation appears below). The rubric used for this study was developed by Weir (1990) (see Appendix C). One reason why I chose this specific rubric is that Weigle (2002) presented it as a reliable choice when discussing analytic rubrics in her book. The other analytic rubric I could have chosen was Jacobs et al.'s (1981), which is also presented in Weigle's book, but it seemed very wordy. In addition, there is much controversy surrounding Jacobs et al.'s rubric (see Connor-Linton & Polio, 2014; Winke & Lim, 2015). I therefore chose Weir's rubric because of its simplicity and because it has not been controversial. The rubric had seven categories: content, organization, cohesion, vocabulary, grammar, punctuation, and spelling. Each category could be assigned a score between zero and three.

Before rating the essays, the raters participated in training and norming sessions to become familiar with the rubric and to calibrate their ratings. Training and norming sessions are a crucial part of the rating procedure, and many testing experts suggest that raters train before using a new rubric and norm by reading essays together to ensure that their scores are reliable (Carr, 2011; Weigle, 2002). In training sessions, raters read and discuss the rubric together. In norming sessions, raters read sample essays and score them. They then share their scores with the other raters and discuss why they assigned those scores.
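As a minimal illustration of how an analytic rubric of this kind yields a total score, the sketch below sums seven category scores, each between zero and three; the category names follow Weir's rubric, but the scores shown are hypothetical.

```python
# Hypothetical analytic scores (0-3) assigned by one rater to one essay,
# using the seven categories of Weir's (1990) rubric.
scores = {
    "content": 2, "organization": 2, "cohesion": 3, "vocabulary": 2,
    "grammar": 2, "punctuation": 3, "spelling": 2,
}

# The total analytic score is the sum over the categories,
# so each essay receives a total between 0 and 21.
total = sum(scores.values())
print(total)  # 16
```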
There were two training and norming sessions: one to train and norm for scoring the TW exams and another to norm for scoring the PBTW exams. At the beginning of the first training and norming session, the raters read and signed consent forms to participate in the study. After each session, the raters had three weeks to score the essays. After they scored all of the essays, they were given a random subset of 20 essays that they had already scored (ten TW essays and ten PBTW essays) to investigate intra-rater reliability.

During the first training and norming session, I first explained the two exams to the raters and showed them all of the materials that were used in the exams. Then we read the rubric and discussed how the raters interpreted each of its seven categories. We agreed that content was related to how well and clearly the participants answered the prompt. Because the prompts had multiple questions, we decided that students who answered only one or two of the questions from the prompt would not be penalized. For example, some participants discussed only one of the two proposals written in the prompt about obesity, and others did not suggest a proposal at all; these participants neglected to answer all of the questions in the prompt but were not penalized. After a long discussion about the difference between organization and cohesion, we agreed that organization was related to the essay as a whole and how it is organized from a macro-level perspective (introduction, body, conclusion, thesis statement, topic sentences, and so on), while cohesion was related to how the essay was organized from a micro-level, or sentence-level, perspective. We discussed that when scoring vocabulary the raters should pay attention to the complexity of the words used in the texts, word choice, and so on: if the students repeated certain words too often and used very simple vocabulary, their vocabulary score should suffer. For grammar, we agreed that the raters should pay attention to verb tense, subject and verb agreement, fragments, and the complexity of the grammatical structures used in the essays. We agreed that the raters should not penalize the participants who made punctuation mistakes around quotations, because investigating source incorporation was not one of the goals of this study.

One issue that arose in the training session was that of source integration. One of the raters asked how they should score essays in which test takers did not use sources at all, even though they were instructed to, or essays in which the test takers did not integrate sources very well. After some discussion, we agreed that the test takers should not be penalized if they failed to use sources or used sources inadequately. This decision was made to ensure that both types of essays were scored based on the same criteria and could, therefore, be compared to one another. Finally, after norming five essays and having difficulty agreeing with each other's spelling scores, we decided that if an essay had fewer than five spelling mistakes, the participant would receive a three for spelling; if it had more than five mistakes, the participant would receive a two; and if the essay had spelling mistakes in every other sentence, the participant would receive a one. If a student repeated the same spelling mistake throughout the essay, the mistake would be counted once.

After reading the rubric, the raters read and scored six essays. They read one essay quietly and assigned scores for each category without consulting one another.
Next, they shared their scores and discussed why they assigned the scores for each category. This procedure was repeated for each of the six essays that they scored. In the second norming session, the raters shared problems that they were having with the rubric and potential ways to solve them, and then they scored six essays following the same procedures as in the first norming session. The sessions were audio-recorded with an Olympus digital voice recorder.

2.1.6 Post-writing questionnaire

All learners completed a post-writing questionnaire that included multiple-choice and short-answer questions about the two exams (see Appendix D). The questions were related to what they liked and disliked about the exams, what they thought was difficult or easy about each exam, what they thought about the article, videos, group discussion, and so on. The participants took approximately ten minutes to answer the questionnaire.

2.1.7 Semi-structured interviews

I randomly selected four students from the two 220 classes that I taught to participate in a semi-structured interview in groups to discuss their attitudes about the two exams and obtain more detailed information about their perceptions of the two exams. The teachers of the other four classes in which data were collected asked for volunteers or selected students to participate in the semi-structured interviews for me. In other words, some students volunteered for the interviews in some classes, and other students were chosen by their teachers in other classes. There were six interviews, and each lasted approximately 15 to 20 minutes. In the first and second interviews, there were four participants each (ID07, ID09, ID13, and ID16; and ID21, ID22, ID24, and ID25); in the third and fourth interviews there were three participants each (ID42, ID43, and ID45; and ID49, ID50, and ID51); and there were two participants each in the fifth and sixth interviews (ID59 and ID60; and ID73 and ID74).

As I was introducing myself to the participants before data collection began, I identified myself to them as Brazilian and a graduate student. At the time of data collection, there were multiple Brazilian students enrolled at the English Language Center as part of a government program called Ciências Sem Fronteiras (Science Without Borders). All of the participants in the third and sixth interview groups were from Brazil, and before those interviews began I asked them whether they wanted to be interviewed in English or in Portuguese, since the latter is my native language. They expressed that they preferred that the interviews be conducted in Portuguese. For these interviews, I transcribed the data in Portuguese and provided English translations in italics. As I mentioned above, I collected data in two ESL 220 classes for which I was the instructor. Although there were many students from Brazil in those two classes, I did not give them the option to conduct the interview in Portuguese. I was their teacher at the time of data collection and did not want to encourage them to speak Portuguese to me in class; therefore, I never spoke to them in Portuguese at all. There were four different ways in which I identified myself to the students in this study that could have influenced the way that the participants interacted with me during the interviews: Brazilian, student, researcher, and teacher. I was a fellow Brazilian to some participants; a fellow student to others; a researcher to all; and a teacher to some.
I will discuss these identities and how they may have influenced the participants' comments about their perceptions of the exams.

The raters also participated in a semi-structured interview to discuss their thoughts about the exams and the rubric. The raters, however, unlike the students, were interviewed separately. The interviews were audio-recorded with an Olympus digital voice recorder. The questions for the semi-structured interviews are in Appendix E.

2.2 Analysis

To answer the first research question, which asked how students' scores on an impromptu TW exam correlate with their scores in a PBTW exam, I entered all scores into IBM SPSS version 21 and ran a Spearman correlation, because the scores each participant received should be considered a discrete (ordinal) variable. If the data were interval (scale) data, then a Pearson correlation should have been used, according to Field (2009). To answer the second research question, I ran a paired-samples t test to determine whether there were any significant differences between the scores in the two exams.

To investigate whether the participants' writing differed across exams in terms of grammatical accuracy, a research assistant first counted the number of clauses and then the number of error-free clauses in the TW and PBTW essays, following Ellis and Yuan (2004) and Wigglesworth and Storch (2009). Before she analyzed the essays for accuracy, the research assistant and I met and reviewed the coding guidelines (see Appendix F). We then read and analyzed two essays together, discussing any issues or disagreements we had. We divided the essays into clauses and then discussed whether each clause contained an error. As we did so, we agreed on the following guidelines:

a. Spelling errors that result in a completely different word are counted as word choice errors;
b. If there is a comma splice, the clause preceding the comma splice is the one to which the error is added;
c. In the case of the PBTW exam, quotations are not included as clauses.

We decided not to include quotations in the clause count for obvious reasons: quotations were grammatically accurate almost 100% of the time, and that accuracy was not a result of the test takers' ability, because they did not write the sentence. We also found that many of the Arabic-speaking participants made a recurring mistake that, following Polio (1997), we agreed should not be counted as a grammatical error. After we analyzed the two essays together, the research assistant started her analysis of accuracy in the two exams, and I analyzed 10% of the essays (or 17 essays) to check inter-rater reliability using Cronbach's alpha. I divided the number of error-free clauses by the total number of clauses in each essay and used the percentage of error-free clauses for the analysis.

I used two programs to analyze three aspects of lexical complexity: RANGE, developed by Nation (2005), and the Lexical Complexity Analyzer, developed by Lu (2012). The three aspects of lexical complexity that I investigated were lexical sophistication, lexical density, and lexical variation. Before entering the essays into the two programs, I typed them and saved them as .txt files, which is the file type that both programs require. As described above, RANGE compares texts to three different word lists: the most common 1,000 words, the second most common 1,000 words, and a list of university words. I used RANGE to investigate lexical sophistication, as Fritz and Luegg (2013) and Kormos (2011) did in their studies. The Lexical Complexity Analyzer generates 25 different measures of lexical complexity, which include lexical density and variation. However, I did not use all 25 measures in my analysis of lexical complexity.
Following Kormos (2013) and Lu (2010), I used Malvern and Richards's D-Measure as one measure of lexical variation. The other measures of lexical variation generated by the Lexical Complexity Analyzer were: lexical word variation, verb variation 1, noun variation, adjective variation, and adverb variation.

To investigate syntactic complexity, I ran the essays through the L2 Syntactic Complexity Analyzer (Lu, 2010). The L2 Syntactic Complexity Analyzer generates 14 different syntactic complexity measures: mean length of sentence, mean length of T-unit, mean length of clause, clauses per sentence, verb phrases per T-unit, clauses per T-unit, dependent clauses per clause, dependent clauses per T-unit, T-units per sentence, complex T-unit ratio, coordinate phrases per T-unit, coordinate phrases per clause, complex nominals per T-unit, and complex nominals per clause. Lu (2010) correlated the complexity scores generated by the program with scores generated by two human raters and found the correlations to be high, ranging from .845 to 1. The author concluded that the L2 Syntactic Complexity Analyzer is indeed a reliable measurement of complexity.

In order to investigate fluency, I used Microsoft Word to get word counts for each essay. In addition, I asked the participants to write down how many minutes were left before they turned in their essays, to measure how many words per minute they wrote. Following Kellogg (1988), I used three different numbers to compare fluency in the TW and PBTW exams: total number of words per essay, total time writing, and total number of words per minute. Although these measures of fluency may not be the most valid (Abdel Latif, 2012), as described above, the lack of consensus and research on which measures are more valid left me no choice but to choose ones that researchers have often used.

In order to determine the intra-rater reliability for each exam, I obtained a Cronbach's alpha coefficient for the twenty essays that the raters rated twice. Similarly, I obtained a Cronbach's alpha coefficient for the total scores that each rater assigned to each essay to determine inter-rater reliability. I also ran Spearman correlations between the raters' scores in each of the seven categories in the analytic rubric. Spearman correlations were also used to examine intra and inter-rater reliability.

To answer the fifth research question, which asked about the test takers' perceptions of the two different types of writing exams, I inputted the participants' answers to the post-writing questionnaire into SPSS and obtained a count of their answers to the multiple-choice questions. This set of data was more quantitative in nature and was converted into percentages. For example, for the first question, which asked the participants which exam they thought was easier, I counted how many participants circled each multiple-choice option (the TW exam, the PBTW exam, or both equally easy or difficult) and then calculated the percentage for each option. This procedure was followed for all of the multiple-choice questions. However, there were also questions that were open-ended and usually followed up on a multiple-choice question. I typed the participants' answers to the open-ended questions into a document. This set of data was more qualitative, and I therefore followed Baralt's (2012) guidelines for coding qualitative data. According to her, the first step in analyzing qualitative data is to read through the data a first time and start thinking of ways that the data could be coded. This is called open coding, when the researcher creates preliminary codes for the data (Baralt, 2012, p. 230). Instead of creating the codes myself, I chose to use a more emergent approach, what Baralt called in vivo coding. In vivo codes arise from the data and are not assigned by the researcher.
As I noticed codes emerging from the data, I highlighted any text that seemed to discuss the same code and then created a name for the code. For example, many participants wrote that the videos, articles, and discussions helped them to think of ideas to use in their essays. As I noticed that, I re-read the text and highlighted anything that the participants wrote that was related to that topic. Baralt suggests that the researcher go through more than one iteration with the data. She explained that after coding the data, the researcher should read through the data again and refine the themes to better understand what they relate to. Next, the researcher has to re-read all of the data coded under one theme and compare it to ensure that everything should indeed be coded under that particular theme. For example, many participants wrote about time. However, after re-reading the data, I noticed that some participants wrote about time to write the essay and others wrote about time to plan the essays. While both ideas are related to time, one is related to planning time, whereas the other is related to time to write, which are two different types of time. According to Baralt, I could choose to separate these into two themes or create one overarching theme with subcategories to combine them. I chose the first option and coded some of the responses as planning time and others as time (i.e., time to write). Finally, the researcher has to interpret the data.

The manner in which I approached the interview data with the test takers was somewhat different from the manner in which I dealt with the written qualitative data from the post-writing questionnaires. All of the participants answered the post-writing questionnaire, whereas only 18 participants took part in the semi-structured interviews. Since the interviews served as a complement to the responses in the post-writing questionnaire, I carefully listened to the interviews and transcribed only quotes that described the themes which emerged in the questionnaire. As Baralt suggested, and as I did with the written questionnaire data, I re-listened to the interviews multiple times and merged or separated codes as needed. Following Baralt's suggestions for coding qualitative data described above, I also started by carefully listening to and transcribing the interviews with the raters to answer the last research question, which asked about the raters' perceptions of the two exams. I listened to the interviews a first time to begin the open coding of the data and then re-listened multiple times to fine-tune the codes and create themes. I then listened to the two norming sessions and transcribed quotes that belonged to the same themes that emerged in the interviews. As with the interviews with the test takers, the norming sessions served as a supplement to the themes that arose in the interviews.

CHAPTER 3: QUANTITATIVE RESULTS

First, I describe the results of the first four research questions, all of which are quantitative questions. In chapter 4, I report on the results of the last two research questions, which are qualitative questions. Below are the results for each quantitative research question.

3.1 RQ1: How do students' scores on an impromptu TW exam correlate with their scores in a PBTW exam?

I calculated the average score for each category and for each participant by adding the scores that the two raters assigned and dividing them by two. For example, if one rater assigned a participant a score of 1 and the other rater assigned a score of 2 for spelling, the average score would be 1.5.
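The sketch below illustrates this averaging step and the subsequent Spearman correlation. It uses SciPy rather than SPSS, and the score arrays are hypothetical stand-ins for the 81 pairs of ratings.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical total scores from the two raters for five essays
# (the real data set contained 81 participants).
rater1_tw = np.array([14, 16, 12, 15, 17])
rater2_tw = np.array([15, 15, 13, 16, 18])
rater1_pbtw = np.array([15, 17, 13, 14, 18])
rater2_pbtw = np.array([14, 16, 12, 15, 19])

# Average the two raters' scores for each exam, as described above.
avg_tw = (rater1_tw + rater2_tw) / 2
avg_pbtw = (rater1_pbtw + rater2_pbtw) / 2

# Spearman correlation between the averaged TW and PBTW scores.
rho, p = spearmanr(avg_tw, avg_pbtw)
print(rho, p)
```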
Most of the scores that the raters assigned differed by one point (one rater assigned a 1 and the other a 2) or were exactly the same; only 4% of them differed by two points (one rater assigned a 1 and the other assigned a 3). After that, I entered all of the scores in IBM SPSS and ran a Spearman correlation. Table 5 shows the descriptive statistics for the average scores that the participants received in the TW and in the PBTW exams.

Table 5
Descriptive statistics: Average scores
            N    Minimum   Maximum   Mean    Std. Deviation
TW exam     81   9.5       19        14.88   2.17
PBTW exam   81   10.5      20.5      14.95   2.35

The results of the Spearman correlation revealed that the average scores that the participants received in the TW exam and in the PBTW exam correlated only moderately, according to the levels set by Cohen (1992). Cohen suggests that a small correlation is approximately .10, a medium correlation is .30, and a large correlation is .50. Table 6 shows the results of the Spearman correlation.

Table 6
Spearman correlation: Average scores
                    rho    Sig.
TW and PBTW exams   .391   .000

In addition to analyzing the correlation between the average scores in the TW and PBTW exams, I also investigated how the scores in the different categories of the analytic rubric correlated. The descriptive statistics for each score in the analytic rubric are in Table 7.

Table 7
Descriptive statistics: Average analytic scores
Scores              N    Minimum   Maximum   Mean   Std. Deviation
Content TW          81   1         3         2.19   .52
Content PBTW        81   1.5       3         2.40   .52
Organization TW     81   1         3         1.88   .50
Organization PBTW   81   1         3         1.87   .54
Cohesion TW         81   1         3         2.17   .52
Cohesion PBTW       81   1         3         2.14   .50
Vocabulary TW       81   1         3         2.14   .49
Vocabulary PBTW     81   1         3         2.06   .44
Grammar TW          81   1         3         1.90   .36
Grammar PBTW        81   1         2.5       1.80   .36
Punctuation TW      81   1.5       3         2.37   .42
Punctuation PBTW    81   1.5       3         2.56   .38
Spelling TW         81   1         3         2.29   .61
Spelling PBTW       81   1         3         2.03   .58

The results of the Spearman correlations for the analytic scores revealed moderate correlations between the scores the participants received for vocabulary, punctuation, and spelling, and weak correlations between the scores that the participants received for content, organization, cohesion, and grammar. The results of the correlations for the analytic scores are in Table 8.

Table 8
Spearman correlations: Average analytic scores
Scores         rho    Sig.
Content        .218   .050
Organization   .246   .027
Cohesion       .178   .112
Vocabulary     .302   .006
Grammar        .221   .048
Punctuation    .380   .000
Spelling       .355   .001

3.2 RQ2: Do students' scores on an impromptu TW exam differ from their scores in a PBTW exam?

As before, I calculated the average scores by adding the scores the two raters assigned to each participant and dividing them by two. I entered the scores in SPSS and ran a paired-samples t test with the average scores. Below are the results of the t test.

Table 9
T test: Average scores
                    Mean   Std. Deviation   Std. Error Mean   t       Sig.   Effect size
TW and PBTW exams   .07    2.48             .27               -.268   .789   .01

The t test did not reveal a significant difference between the scores that the participants received in the TW and PBTW exams. In order to examine the scores in more detail, I ran t tests for each of the average analytic scores that the participants received. Running too many t tests increases the chance that a Type I error will occur. A Type I error is a false positive; that is, a Type I error occurs when the results show a significant difference when there was none (Larson-Hall, 2009).
To reduce the risk of a Type I error, Larson-Hall (2009) suggested doing a Bonferroni adjustment: the researcher should divide the alpha level by the number of comparisons being done with the test. The desired alpha level for the purpose of my research is .05, and the number of comparisons done for the analytic scores is 7 (seven different analytic scores): .05 divided by 7 is .007. Anything above .007 will not be considered statistically significant. Table 10 shows the results of the t tests for the average analytic scores.

Table 10
T tests: Average analytic scores
Scores         Mean   Std. Deviation   Std. Error Mean   t       Sig.   Effect size
Content        .20    .66              .07               2.859   .005   .19
Organization   .01    .66              .07               .133    .867   .01
Cohesion       .03    .65              .07               .107    .613   .03
Vocabulary     .08    .55              .06               .042    .198   .08
Grammar        .09    .45              .05               .000    .052   .13
Punctuation    .19    .44              .04               .293    .000   .23
Spelling       .25    .67              .07               -.103   .001   .20

The results of the t tests revealed that there were significant differences in the scores that the participants received for the following categories in the analytic rubric: content, punctuation, and spelling. As can be seen in Table 10, the test takers scored significantly higher for content and punctuation in the PBTW exam, and they scored significantly higher for spelling in the TW exam.

3.3 RQ3: How do students' essays in the TW and PBTW exams differ in terms of accuracy, lexical complexity, syntactic complexity, and fluency?

3.3.1 Accuracy

Following Ellis and Yuan (2004) and Wigglesworth and Storch (2009), accuracy was measured in terms of error-free clauses. After the research assistant counted the number of clauses and error-free clauses in both the TW and PBTW exams, I entered the percentage of error-free clauses for each essay in IBM SPSS. The descriptive statistics for the percentage of error-free clauses in the TW and PBTW exams are in Table 11.

Table 11
Descriptive statistics: Percentage of error-free clauses
            N    Minimum   Maximum   Mean    Std. Deviation
TW exam     81   0         62.06     31.85   14.43
PBTW exam   81   2.77      76.47     33.08   14.09

In order to determine whether the accuracy scores differed in the two exams, I ran a paired-samples t test. The results of the t test are in Table 12. The results revealed that there were no significant differences between the accuracy scores in the TW exam and the PBTW exam. In order to measure rater reliability, I analyzed accuracy in approximately 10% of the essays; the inter-rater reliability coefficient obtained was .842, which is high and on par with essay-test reliabilities in large-scale and high-stakes tests like the TOEFL (www.toefl.org).

Table 12
T test: Accuracy
                     Mean   Std. Deviation   Std. Error Mean   t       Sig.   Effect size
Error-free clauses   1.23   16.37            1.81              -.677   .501   .04

3.3.2 Lexical complexity

I used the programs RANGE (Nation, 2005) and the Lexical Complexity Analyzer (Lu, 2012) to analyze the 162 essays for lexical complexity in three different dimensions: lexical sophistication, lexical density, and lexical variation. As I mentioned previously, RANGE compared the essays to three different word lists (the first and second most common 1,000 words and university words) to investigate lexical sophistication. The descriptive statistics for lexical sophistication are in Table 13.

Table 13
Descriptive statistics: Lexical sophistication in TW exams and PBTW exams
Word lists         N    Minimum   Maximum   Mean     Std. Deviation
Word list 1 TW     81   155       472       263.71   70.12
Word list 1 PBTW   81   154       463       290.18   74.82
Word list 2 TW     81   7         49        24.18    8.50
Word list 2 PBTW   81   8         54        28.64    10.97
Word list 3 TW     81   0         30        7.95     5.39
Word list 3 PBTW   81   0         25        10.04    5.80

To determine whether the two exams differed in terms of lexical sophistication, I ran three t tests and set the alpha level to .016 to avoid a Type I error, as suggested by Larson-Hall (2009). The desired alpha level for the purpose of my research is .05, and the number of comparisons done for lexical sophistication is three (three different word lists): .05 divided by 3 is .016. Anything above .016 will not be considered statistically significant. Table 14 shows the results of the t tests for the three word lists.

Table 14
T tests: Lexical sophistication
              Mean    Std. Deviation   Std. Error Mean   t       Sig.   Effect size
Word list 1   26.46   61.93            6.88              3.846   .000   .17
Word list 2   4.45    13.51            1.50              2.968   .004   .22
Word list 3   2.09    5.70             .63               3.310   .001   .18

The results of the t tests revealed a significant difference between the PBTW and TW exams for all three word lists. The participants used significantly more words from the first 1,000 most common words list, the second 1,000 most common words list, and the university words list in the PBTW exam.

The Lexical Complexity Analyzer generated 25 different measures of lexical density and lexical variation. For the purposes of this study, I only used the lexical density measure and the following measures of lexical variation: word type, lexical word variation, verb variation 1, noun variation, adjective variation, and adverb variation. The descriptive statistics for these measures of lexical density and variation are shown in Table 15.

Table 15
Descriptive statistics: Lexical density and lexical variation
Lexical measures           N    Minimum   Maximum   Mean     Std. Deviation
Word Type TW               81   90        210       129.13   25.21
Word Type PBTW             81   87        215       146.80   28.93
Lexical Density TW         81   .46       .64       .53      .03
Lexical Density PBTW       81   .48       .64       .53      .03
Lexical Variation TW       81   .37       .79       .56      .09
Lexical Variation PBTW     81   .38       .79       .56      .08
Verb Variation I TW        81   .37       1         .66      .12
Verb Variation I PBTW      81   .37       .83       .65      .10
Noun Variation TW          81   .32       .77       .49      .10
Noun Variation PBTW        81   .35       .75       .52      .08
Adjective Variation TW     81   .06       .18       .10      .02
Adjective Variation PBTW   81   .05       .21       .10      .03
Adverb Variation TW        81   .02       .09       .05      .01
Adverb Variation PBTW      81   .01       .11       .05      .01

Once again, I ran t tests to determine whether there were significant differences between the two exams regarding lexical density and lexical variation. The desired alpha level for the purpose of my research is .05, and the number of comparisons done for lexical density and lexical variation is seven: .05 divided by 7 is .007 (see the sketch at the end of this subsection). Anything above .007 will not be considered statistically significant. Table 16 shows the results of the t tests for lexical density and lexical variation.

Table 16
T tests: Lexical density and lexical variation
Lexical measures      Mean    Std. Deviation   Std. Error Mean   t        Sig.   Effect size
Word Type             17.66   22.52            2.50              -7.059   .000   .30
Lexical Density       .002    .04              .005              -.562    .576   .03
Lexical Variation     .009    .07              .008              -1.126   .264   .05
Verb Variation I      .01     .11              .01               .772     .442   .04
Noun Variation        .02     .08              .009              -2.994   .004   .14
Adjective Variation   .003    .03              .003              1.062    .291   .06
Adverb Variation      .001    .02              .002              -.815    .417   .04

The results of the t tests revealed that there was a significant difference in word type and noun variation between the TW and PBTW exams. The participants used significantly more word types and a wider variety of nouns in the PBTW exam when compared to the TW exam.
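A sketch of the paired-samples comparison with the Bonferroni-adjusted alpha level is shown below; the arrays are hypothetical, and SciPy's ttest_rel stands in for the SPSS procedure.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical paired values of one measure (e.g., noun variation)
# for five essays in each exam; the real analysis used 81 pairs.
tw = np.array([0.49, 0.52, 0.45, 0.50, 0.47])
pbtw = np.array([0.53, 0.55, 0.48, 0.52, 0.51])

# Bonferroni adjustment: divide the desired alpha level by the number
# of comparisons (seven lexical measures): .05 / 7 = .007.
alpha = 0.05 / 7

t, p = ttest_rel(tw, pbtw)
print(t, p, p < alpha)  # the difference counts only if p < .007
```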
3.3.3 Syntactic complexity

As mentioned above, in order to measure syntactic complexity, I used the L2 Syntactic Complexity Analyzer developed by Lu (2010). The L2 Syntactic Complexity Analyzer generated 14 different syntactic complexity measures: mean length of sentence (MLS), mean length of T-unit (MLT), mean length of clause (MLC), clauses per sentence (C/S), verb phrases per T-unit (VP/T), clauses per T-unit (C/T), dependent clauses per clause (DC/C), dependent clauses per T-unit (DC/T), T-units per sentence (T/S), complex T-unit ratio (CT/T), coordinate phrases per T-unit (CP/T), coordinate phrases per clause (CP/C), complex nominals per T-unit (CN/T), and complex nominals per clause (CN/C). Table 17 shows the descriptive statistics for the 14 measures of syntactic complexity for the PBTW and TW exams.

Table 17
Descriptive statistics: Syntactic complexity
Syntactic complexity measures   N    Minimum   Maximum   Mean    Std. Deviation
MLS PBTW    81   9.66    48.25   19.27   5.60
MLS TW      81   10.88   39.83   19.15   4.91
MLT PBTW    81   9.32    38.6    16.56   4.13
MLT TW      81   11.08   34.14   16.40   3.75
MLC PBTW    81   6.26    13.18   9.02    1.47
MLC TW      81   6.23    14.29   8.90    1.64
C/S PBTW    81   1.14    6.00    2.17    .73
C/S TW      81   1.13    4.11    2.19    .60
VP/T PBTW   81   1.46    5.1     2.50    .62
VP/T TW     81   1.70    4.85    2.52    .53
C/T PBTW    81   1.10    4.80    1.85    .50
C/T TW      81   1.13    2.88    1.86    .40
DC/C PBTW   81   .19     .60     .37     .08
DC/C TW     81   .11     .59     .36     .09
DC/T PBTW   81   .25     2.08    .72     .32
DC/T TW     81   .13     1.69    .71     .32
T/S PBTW    81   .96     1.50    1.15    .11
T/S TW      81   .96     1.61    1.16    .13
CT/T PBTW   81   .21     .80     .49     .14
CT/T TW     81   .13     .84     .49     .16
CP/T PBTW   81   0       1.10    .32     .18
CP/T TW     81   0       1.14    .32     .18
CP/C PBTW   81   0       .50     .17     .09
CP/C TW     81   0       .70     .17     .10
CN/T PBTW   81   .18     4.10    1.82    .63
CN/T TW     81   .82     3.88    1.85    .61
CN/C PBTW   81   .52     1.97    1.00    .28
CN/C TW     81   .48     1.84    1.00    .29

To determine whether the participants' essays differed in the measures of syntactic complexity above, I ran paired-samples t tests in SPSS. As I mentioned before, running too many t tests increases the chance that a Type I error will occur. In order to avoid a Type I error, I adjusted the alpha level as suggested by Larson-Hall (2009). The desired alpha level for the purpose of my research is .05, and the number of comparisons done for syntactic complexity is 14 (14 different syntactic complexity measures): .05 divided by 14 is .0035. Anything above .0035 will not be considered statistically significant. Table 18 shows the results of the t tests for the 14 different measures of syntactic complexity.

Table 18
T tests: Syntactic complexity
Syntactic complexity measures   Mean    Std. Deviation   Std. Error Mean   t       Sig.   Effect size
MLS    .119    4.64   .51    .231    .818   .01
MLT    .154    4.01   .44    .347    .729   .01
MLC    .123    1.82   .20    .608    .545   .03
C/S    -.017   .69    .76    -.233   .816   .01
VP/T   -.016   .60    .67    -.241   .810   .01
C/T    -.009   .52    .58    -.163   .871   .01
DC/C   .004    .11    .01    .321    .749   .02
DC/T   .003    .40    .04    .067    .947   .004
T/S    -.007   .14    .01    -.437   .663   .02
CT/T   -.011   .18    .02    -.570   .570   .03
CP/T   -.000   .20    .02    -.005   .996   .0002
CP/C   -.003   .12    .01    -.241   .810   .01
CN/T   -.031   .71    .07    -.402   .689   .02
CN/C   .002    .33    .03    .060    .952   .003

As can be seen in Table 18, there were no significant differences between the PBTW and TW exams on any of the 14 syntactic complexity measures.

3.3.4 Fluency

To measure fluency, I asked the participants to write down on the test sheet how many minutes were left on the timer when they finished writing their essays.
I entered into SPSS the number of words the participants wrote, the time it took them to write their essays, and the number of words they wrote per minute. As mentioned previously, the participants had 45 minutes to write both essays. However, they spent more time on task when they took the PBTW exam, because they watched two videos, read one article, had a group discussion, and planned their essays before they started writing. The mean number of words that the students wrote was 353.9 in the PBTW exam and 315.6 in the TW exam. Table 19 shows the descriptive statistics for the number of words in the TW and PBTW exams.

Table 19
Descriptive statistics: Number of words
            N    Minimum   Maximum   Mean    Std. Deviation
TW exam     81   194       550       315.6   88.1
PBTW exam   81   195       536       353.9   77.8

A t test revealed a significant difference for the total number of words between the TW exam and the PBTW exam: the participants wrote significantly longer essays in the PBTW exam. Table 20 has the results of the t test.

Table 20
T test: Number of words
                    Mean    Std. Deviation   Std. Error Mean   t       Sig.   Effect size
TW and PBTW exams   38.32   70.017           7.78              4.926   .000   .22

Furthermore, the mean number of minutes that students wrote was 41.3 minutes for the PBTW exam and 40.6 minutes for the TW exam. Table 21 shows the descriptive statistics for the number of minutes that students wrote for each exam. A t test revealed no significant difference between the number of minutes that the participants wrote for the two exams (t = 1.525, p = .131, r = .54).

Table 21
Descriptive statistics: Number of minutes
            N    Minimum   Maximum   Mean   Std. Deviation
TW exam     81   30        45        40.6   3.9
PBTW exam   81   22        45        41.3   4.9

The participants wrote an average of 8.6 words per minute in the PBTW exam and 7.9 words per minute in the TW exam. Table 22 shows the descriptive statistics for the number of words that the participants wrote per minute.

Table 22
Descriptive statistics: Words per minute
            N    Minimum   Maximum   Mean   Std. Deviation
TW exam     81   4.7       14.8      7.9    2.2
PBTW exam   81   4.3       14.8      8.6    2.3

A t test revealed that the participants wrote significantly more words per minute in the PBTW exam. Table 23 describes the results of the t test.

Table 23
T test: Words per minute
                    Mean   Std. Deviation   Std. Error Mean   t       Sig.   Effect size
TW and PBTW exams   .694   1.88             .20               3.323   .001   .15

3.4 RQ4: What are the intra and inter-rater reliability coefficients for each exam?

To measure intra- and inter-rater reliability, I calculated Cronbach's alpha coefficients, as described by Howell (2002). The inter-rater reliability coefficient obtained for the total scores for the TW exam was .728, and for the PBTW exam it was .643. Since the rubric that the raters used was analytic, I decided to investigate the inter-rater reliability for each of the categories of the rubric. Table 24 shows the results of the Cronbach's alpha analyses, and Table 25 shows the results of the Spearman correlations. The highest Cronbach's alpha and correlation coefficients were for spelling, for both the TW and the PBTW exams. The next highest coefficients for the TW exam were for cohesion, content, and organization, while the lowest for the TW exam were for grammar, followed by punctuation and vocabulary. The highest coefficients for the PBTW exam after spelling were for content and organization, whereas the lowest were for vocabulary, followed by punctuation, grammar, and cohesion. The Spearman correlation coefficient for the total scores was .618 for the TW exam and .555 for the PBTW exam.
Table 24
Inter-rater reliability: Analytic scores
Scores         Cronbach's alpha for TW   Cronbach's alpha for PBTW
Content        .512                      .595
Organization   .482                      .485
Cohesion       .557                      .416
Vocabulary     .479                      .242
Grammar        .012                      .352
Punctuation    .225                      .301
Spelling       .845                      .755

Table 25
Correlation matrix for the raters' scores
Scores         rho for TW   rho for PBTW
Content        .375         .464
Organization   .303         .325
Cohesion       .406         .276
Vocabulary     .302         .131
Grammar        .015         .205
Punctuation    .132         .196
Spelling       .723         .624
Total score    .618         .555

After the raters scored all 162 essays, they were given a random selection of 20 essays to score again to measure intra-rater reliability. Ten of these essays were TW exam essays and the other ten were PBTW exam essays. The Cronbach's alpha coefficient across both exams was .696 for RM and .794 for RK. RM's coefficient for the TW exams was .862, and for the PBTW exams it was .352. RK's coefficient for the PBTW exams was .552, and for the TW exams it was .645. Below are the correlation matrices for each category in the analytic rubric for each of the raters.

Table 26
Correlation matrix for RM
               TW correlations   PBTW correlations
Content        .544              .375
Organization   .000              .167
Cohesion       .559              .151
Vocabulary     .408              .283
Grammar        .559              .111
Punctuation    .408              .327
Spelling       .808              .377
Total score    .857              .000

Table 27
Correlation matrix for RK
               TW correlations   PBTW correlations
Content        .749              .717
Organization   .818              .574
Cohesion       .820              .791
Vocabulary     .745              .856
Grammar        .244              .258
Punctuation    .218              .645
Spelling       .249              .167
Total score    .611              .777

3.5 Summary of the quantitative findings

The first research question asked how the scores in the TW and PBTW exams correlated. The Spearman correlation between the total scores of the two exams was .391, a moderate correlation. In addition to correlating the total scores of both exams, I also correlated the scores that the participants received in each category of the analytic rubric. The correlations for each category were: content, .218; organization, .246; cohesion, .178; vocabulary, .302; grammar, .221; punctuation, .380; and spelling, .355. The correlations for content, organization, cohesion, and grammar were weak, according to the levels set by Cohen (1992), and the correlations for vocabulary, punctuation, and spelling were moderate.

The second research question asked whether there were significant differences between the scores that the participants received in the two exams. The results of the t test revealed that there were no significant differences between the total scores in the two exams. However, I also decided to investigate whether there were significant differences between the scores that the participants received for the different categories in the analytic rubric. As mentioned before, because of the increased chance of a Type I error caused by performing multiple t tests, I followed Larson-Hall's (2009) suggestions and lowered the alpha level for this analysis. The results of the t tests showed significant differences in the following categories: content, punctuation, and spelling. The participants received significantly higher scores for content and punctuation in the PBTW exam and significantly higher scores for spelling in the TW exam.

The third research question investigated how the TW and PBTW essays differed in terms of accuracy, lexical complexity, syntactic complexity, and fluency. In order to measure accuracy, a research assistant counted the number of clauses and then the number of error-free clauses per essay. After that, I divided the number of error-free clauses by the number of total clauses to obtain a percentage score. I ran a paired-samples t test to investigate whether the TW exam and the PBTW exam differed in terms of accuracy, and the results revealed no significant differences in the accuracy scores for the two exams.
I investigated three different dimensions of lexical complexity: lexical sophistication, lexical density, and lexical variation. To investigate lexical sophistication, I used RANGE, developed by Nation (2005), which compared the essays against three different word lists: the list of the most common 1,000 words; the list of the second most common 1,000 words; and the list of university words. The results of the t tests showed that the participants used significantly more words from all three word lists in the PBTW exam. I used the Lexical Complexity Analyzer (Lu, 2012) to analyze lexical density and lexical variation. This program analyzes essays for 25 different measures of lexical complexity. For the purpose of this study, I used only seven of these measures: lexical density, to measure lexical density, and six measures of lexical variation, namely word type, lexical, verb, noun, adjective, and adverb variation. As explained before, I limited the analysis to these measures because many of the 25 measures assess the same element of lexical complexity, and reducing the number of statistical tests reduces the risk of Type I error. Once again, to reduce the chance of Type I error, I lowered the alpha level for the t tests. The results of the t tests revealed a significant difference for only two measures of lexical variation: word type and noun variation. The participants used significantly more different word types and types of nouns in the PBTW exam.

To measure syntactic complexity, I used Lu's Syntactic Complexity Analyzer, which generated 14 different measures of syntactic complexity. The results of the t tests, however, revealed no significant differences between the TW and PBTW exams in any of these measures.

Finally, I used three different measures of fluency: the number of words per essay, the number of minutes spent writing the essay, and the number of words written per minute. The results of the t tests revealed significant differences for the total number of words per essay and the total number of words per minute: the participants wrote significantly longer essays and significantly more words per minute in the PBTW exam.

The fourth, and last, quantitative question asked about the intra- and inter-rater reliability for the two exams. I used two measures of reliability. The inter-rater reliability coefficient for the TW exam was .728 and the coefficient for the PBTW exam was .643. According to Carr (2011), a satisfactory coefficient [...]. The intra-rater reliability coefficients for the 20 essays that were scored twice were .696 for RM and .794 for RK. Besides investigating the coefficients for all 20 essays, I also investigated the coefficients for each of the two exams. The coefficients for the TW essays were .862 for RM and .645 for RK, and for the PBTW essays they were .352 for RM and .552 for RK, which suggests that the raters were more consistent when scoring TW exams. I now present information about the 18 participants who participated in the semi-structured interviews.

CHAPTER 4: QUALITATIVE RESULTS

In this chapter, I describe the results for the two qualitative research questions, that is, what are the students' and the raters' perceptions of the TW and PBTW exams? First, I give more detailed information about the students who participated in the semi-structured interviews. Then, I describe the results of the post-writing questionnaire that the students answered about their perceptions of the two exams. Next, I report on the main themes that emerged in the semi-structured interviews with the students. I end this chapter by describing the main themes that emerged in the norming sessions and semi-structured interviews with the raters.
4.1 Participants from the semi-structured interviews

Of the 81 test takers who participated in this study, 18 were randomly selected or volunteered to participate in the semi-structured interviews. The students were grouped according to the 220 or 221 class in which they were enrolled, and students from the same class were interviewed together. The interview groups ranged from two to four participants because some students did not come to the interview sessions. It may not be ideal to have differing numbers of students per interview: in a group of four people, the participants might influence one another's answers, while in a smaller group, the participants might be shyer and less willing to share what they really think, and one very talkative participant might lead a shier participant to stay quiet or agree with the more dominant talker. The interviewed participants were: ID07, ID09, ID13, ID16, ID21, ID22, ID24, ID25, ID42, ID43, ID45, ID49, ID50, ID51, ID59, ID60, ID73, and ID74. Table 28 illustrates the groups in which the participants were interviewed.

Table 28
Interview groups
Group 1: ID07, ID09, ID13, and ID16
Group 2: ID21, ID22, ID24, and ID25
Group 3: ID42, ID43, and ID45
Group 4: ID49, ID50, and ID51
Group 5: ID59 and ID60
Group 6: ID73 and ID74

Table 29 describes in detail the background information collected from each of the participants who took part in the semi-structured interviews. LoR stands for length of residence in the United States, in months; FI is the number of months the participants have had formal instruction in English, at home or in the United States; and WA stands for writing ability, for which the participants gave self-ratings on a scale of 1 to 5.

Table 29
The participants
Participant   Age   Gender   Nationality   LoR   FI    Major                    WA
ID07          28    Female   S. Korean     21    84    Music                    2
ID09          29    Female   S. Arabian    18    -     -                        3.5
ID13          19    Male     Brazilian     3     15    Agricultural Sciences    2
ID16          19    Female   Chinese       24    84    Accounting               2
ID21          20    Female   Brazilian     3     39    Biomedicine              2
ID22          23    Female   Brazilian     3     39    Biomedicine              3
ID24          21    Male     Brazilian     3     39    Animal Science           2
ID25          19    Female   Brazilian     3     36    Electrical Engineering   3
ID42          19    Female   Brazilian     3     60    Forestry                 3
ID43          22    Female   Brazilian     3     24    Civil Engineering        3
ID45          23    Female   Brazilian     3     24    Agricultural Sciences    3
ID49          22    Male     Chinese       1     120   Finance                  2
ID50          24    Male     Chinese       1     120   Urban Planning           2
ID51          20    Female   Chinese       1     96    Actuarial Science        2
ID59          21    Female   Brazilian     1     48    Biology                  2
ID60          21    Male     Brazilian     1     72    Computer Engineering     2
ID73          20    Male     Chinese       24    24    Business Management      2
ID74          20    Female   Chinese       24    144   Accounting               2

Below I present the findings for the fifth research question, which derive from the post-writing questionnaire and from the semi-structured interviews. The post-writing questionnaire included multiple-choice questions and open-ended questions. The questions asked the participants which exam they thought was easier and which exam they liked more, both of which were followed up by open-ended questions asking the participants to explain why. Other questions asked what they thought was easy and what they thought was difficult about each exam, what they thought about the article, videos, and discussion, and so on (see Appendix D for all of the questions in the post-writing questionnaire).
The questions in the semi-structured interviews asked the participants what they thought about the time limit, planning time, discussion, and source materials in the PBTW exam, what they thought about the time limit of the TW exam, and so on (see Appendix E for the list of questions asked in the semi-structured interviews).

4.2 RQ5: What are the students' perceptions of the TW and PBTW exams?

Two sets of data answered the fifth research question. The first data set originated from the post-writing questionnaire, which contained both multiple-choice and open-ended questions. The answers to the multiple-choice questions are quantitative and are presented first. The answers to the open-ended questions are qualitative and are presented second. In the participants' answers to the open-ended questions, I only corrected spelling mistakes, but left other grammar mistakes as they were in the original responses. The second set of data used to answer the fifth research question is the semi-structured interviews. These data are solely qualitative and the results are presented third.

4.2.1 Post-writing questionnaire

Table 30 shows the responses that the participants gave to the multiple-choice questions in the post-writing questionnaire.

Table 30
Answers to multiple-choice questions
Q1: Which exam was easier?
    PBTW exam = 58 (72%); TW exam = 11 (14%); Both = 12 (15%)
Q3: What did you think of the videos?
    Easy = 31 (38%); Difficult = 25 (31%)
    They helped think of ideas for the essay = 63 (78%); They did not help think of ideas for the essay = 12 (15%)
Q4: What did you think of the article?
    Easy = 46 (57%); Difficult = 9 (11%)
    It helped think of ideas for the essay = 66 (81%); It did not help think of ideas for the essay = 7 (9%)
Q5: What did you think of the group discussion?
    It helped think of ideas for the essay = 66 (81%); It did not help think of ideas for the essay = 10 (12%); It was not related to what I wrote = 5 (6%)
Q6: Did you use the ideas from the videos?
    Yes = 48 (59%); No = 33 (41%)
Q7: Did you use the ideas from the article?
    Yes = 67 (83%); No = 13 (16%)
Q8: Did you use the ideas from the group discussion?
    Yes = 56 (69%); No = 24 (30%)
Q12: Which exam did you prefer taking?
    PBTW exam = 62 (77%); TW exam = 18 (22%)

The first question in the post-writing questionnaire asked the participants which exam they thought was easier. Fifty-eight participants (72%) answered that the PBTW exam was easier, 11 (14%) answered that the TW exam was easier, and 12 (15%) answered that both exams were equally easy or difficult.
This question was followed by an open-ended question that asked why the participants thought that one exam was easier than the other. Four main themes emerged from the responses to this second question: ideas from sources and discussion, time, background information, and planning. Ideas from sources and discussion was mentioned by 33 different participants: 29 of them wrote that they thought the PBTW exam was easier because of the ideas from the videos, article, and discussion, while four had negative comments about the videos, article, and discussion. The answers related to time mostly explained, with two exceptions, that the PBTW exam was easier because the participants had more time to take this exam. However, one participant wrote: "Though there were lots of ideas and time seemed short in the shorter timed writing exam I could manage my time better." The comments about background information described that the videos, article, and group discussion in the PBTW exam gave the participants background information that they needed to understand the topic. Seven participants wrote that they thought the PBTW exam was easier because they had time to plan their essay. However, one participant, the same one who mentioned that he or she could manage his or her time better in the TW exam, wrote that he or she could plan the essay in the TW exam: "when I write the outline I can write very detail as main points, subpoints and how long the paragraph should have." Table 31 lists the five themes that emerged from the second question, which asked why the participants thought one exam was easier, how many participants mentioned each theme, and sample comments related to each theme.

Table 31
Q2: Which exam was easier and why?
Themes                                                  Times   Sample comments
Ideas from sources and discussion (positive comments)   29      "[...] watching the videos and reading the article, I got more [...]"
Ideas from sources and discussion (negative comments)   4       "[...] that when I only wrote about 75% of my list (outline), the time [...]"
Time                                                    23      [...]
Background information                                  22      "[...] could activate my background knowledge [...]"
Planning                                                7       "[...] better arguments and [...]"

The third, fourth, and fifth questions asked the participants what they thought about the videos, the article, and the discussion in which they participated. Thirty-one participants thought that the videos were easy, while 25 thought that they were difficult. Sixty-three participants answered that the videos helped them to think of ideas that they could use in their writing, whereas only 12 answered that the videos did not help them to think of ideas that they could use in their essays. Forty-six participants circled the answer that said the article was easy and 9 circled the answer that said the article was difficult. Sixty-six students thought that the article helped them to think of ideas, while only 7 thought that it did not. Sixty-six participants answered that the group discussion helped them to think of ideas for their essays and only 10 thought the opposite. Five participants answered that the discussion was not related to what they wanted to write.

Questions 6, 7, and 8 asked the participants whether they used the ideas from the videos, the article, and the group discussion in their essays. Forty-eight participants answered that they used the videos in their writing, while 33 said that they did not; one participant did not answer this question. Sixty-seven students said that they used the article in their essays and 13 said that they did not; two participants did not answer this question. Finally, 56 students said that they used the ideas they heard during the group discussion in their writing, but 24 said they did not; again, two participants did not answer this question.

Question 9 asked the participants why they chose not to use the videos, article, or discussion in their essays. There were three main themes for this question: opposing ideas, difficult videos or article, and time. The participants who mentioned opposing ideas all wrote that the ideas in the videos, article, or discussion were opposite to their own ideas and that is why they did not use them.
Some participants wrote that they thought the videos and/or article were too difficult to understand and that is why they chose not to use them. Finally, some participants described that they did not have time to include, or to think about including, the information from the videos, article, and discussion. Table 32 contains each theme, how many participants mentioned it in their answers, and sample answers.

Table 32
Q9: Why did you not use the ideas from the materials or discussion?
Themes                        Times   Sample comments
Opposing ideas                12      "the article and group discussion was not helpful because the ideas showed in those medias contradict [...]"; "[...] was different than the information [...]"
Difficult videos or article   9       [...]
Time                          4       [...]

Question 10 asked the students to write about what was easy and/or difficult about the TW exam. There were six main themes in the answers to this question: background knowledge, time, planning, topic, arguments, and ideas. When the participants talked about background knowledge, they wrote that they did not have background knowledge on the topic about which they were writing, with one exception: "[...] the knowledge I had it was enough to write a good essay." The students who discussed time were divided: fifteen wrote that they did not have much time to write, while ten wrote that time was not an issue in the TW exam. All of the students who talked about planning said that they did not have enough time to plan their essays. However, one participant said that she thought that it was easier to plan while taking the TW exam. She wrote: "In shorter timed writing exam is easy for students get ideas and write about easy [...]." The next most commonly discussed theme was topic: nine participants wrote that they thought the topic was difficult or that they did not like it, and five wrote that the topic of the TW exam was easier. Some participants also mentioned that they did not have enough arguments to write a good essay, and others said that they liked that they could use their own ideas in the essay as opposed to having to incorporate sources. Table 33 shows the themes, the number of times they were mentioned by the participants, and sample comments.

Table 33
Q10: What was difficult/easy about the TW exam?
Themes                 Times   Sample comments
Background knowledge   27      "[...] if I have enough background of the topic, shorter timed [...]"
Time                   25      "we do not have time to develop good/strong arguments [...]"; "sometimes I think I have no time to write whole [...]"; "the time! It runs quickly, and I [...]"
Planning               25      "[...] was difficult to organiz[e] [...]"; "it was difficult not have time enough to think about the [...]"
Topic                  14      [...]
Arguments              11      "[...] detailed" (ID12); "[...] support the main point [...]"; "[...] support what I was writing"
Ideas                  5       [...]

The next question asked the participants what was difficult or easy about the PBTW exam. The most commonly mentioned themes for question 11 were the following: incorporating sources, planning, topic, videos, and time. The most frequent theme in the responses was incorporating sources: half of these comments said that the participants did not know how to incorporate the sources in their writing, and the other half said that the sources helped them to write better essays. Ten participants wrote that one easy thing about the PBTW exam was that they had time to plan their essay before they had the 45 minutes to write it. Eight students wrote about the topic of the PBTW exam: five wrote that they liked the topic or that the topic was easy, and three wrote that they did not like the topic or that the topic was difficult.
Seven participants said that they had problems understanding the videos. Finally, six participants had negative comments about the amount of time that they had to take the PBTW exam. Table 34 contains the five themes, how many times they were mentioned, and sample comments from the participants.

Table 34
Q11: What was difficult/easy about the PBTW exam?
Themes                  Times   Sample comments
Incorporating sources   51      "[...] article and discussion [...] good for my essay [...]"; "[...] choose the information in easier way [...]"; "[...] me to support the main [...]"
Planning                10      "[...] organize our thoughts [...]"
Topic                   8       "[...] was harder because gun control has to much controversy [...]"
Videos                  7       "[...] problem [...]"
Time                    4       "[...] [the pieces of information from the sources] when I am [...]"

The twelfth question asked the participants which exam they preferred taking: the PBTW exam or the TW exam. Sixty-two participants (77%) answered that they preferred taking the PBTW exam, and 18 (22%) said that they preferred taking the TW exam. Two participants did not answer this question. The thirteenth and last question asked the participants to explain why they preferred taking one exam when compared to the other. Four main themes emerged from the responses to this question: ideas from sources and discussion, planning, background information, and test preparation. The participants who wrote about ideas from sources and discussion all said that the sources and group discussion helped them to think of ideas and arguments that they could include in their essays. Sixteen students wrote that they preferred the PBTW exam because they had time to plan their essay. Fourteen participants talked about the fact that the videos and article gave them background information that helped them to write better essays. Some participants also mentioned that the TW exam helped them to prepare for other exams that they have to take, like the TOEFL (www.toefl.org) and the TW exam that the students have to take in ESL 220 and 221 at the ELC. Table 35 shows the themes, how many times the participants wrote about them, and sample comments.

Table 35
Q13: Which exam did you prefer and why?
Themes                              Times   Sample comments
Ideas from sources and discussion   30      "[...] permit open more points of view about subject and choose what the best way to argument and convince [...]"
Planning                            16      "we can make a draft first which make us have enough ti[me] [...]"
Background information              14      "it gave me more information about the topic, so I was more aware about the issue before starting to write the [...]"; "[...] as more resources for me to understand [...]"
Test preparation                    5       [...]

In short, 72% of the participants thought that the PBTW exam was easier and 77% of the participants preferred taking the PBTW exam. The answers to the open-ended questions in the post-writing questionnaire revealed that the students thought that the PBTW exam was easier, and preferred taking it, for two main reasons: the article, videos, and discussion helped them think of ideas to include in their essays, and the source materials gave the test takers the background information that they needed to complete the writing task. Fifty-nine percent of the participants reported using ideas from the videos in their essays; 69% of them reported using the ideas that they heard in the group discussion in their writing; and 83% of the participants answered that they used ideas from the article in their essays.
Many participants wrote that they had difficulty integrating sources in the PBTW exam, partly because they could not understand the videos or article, or because they did not know how to cite the sources. On the other hand, a large number of the students wrote that the TW exam was difficult because they were not familiar with the topic or because they did not have enough time to write or plan the essay. The semi-structured interviews, which I report on next, revealed very similar findings to those of the post-writing questionnaire.

4.2.2 Interviews

Many of the issues discussed in the post-writing questionnaire also arose in the semi-structured interviews, such as difficulty incorporating sources in the PBTW exam, difficulty understanding the videos, topic preference, time constraints, and planning time.

a) Difficulty incorporating sources

Three of the eighteen students who participated in the interviews mentioned difficulty incorporating sources in the PBTW exam. Below are their comments about this theme.

Excerpt 1
ID07: How can combine the video, article, and my idea [...]
R: So you think that is difficult to do?
ID07: Yeah, the combine. How can combine the video and article. How can supporting, how can example in my essay. I think is more difficult than the shorter timed writing.
(...)
ID16: I like the idea of discuss and talk about the idea, but not to support uh, like, take uh=
ID07: [...]
ID16: From the video, to build my idea, to have the concept of what I want to write.
R: So you would not like to have to use the ideas in your essay? It would be easier if you [...].
ID16: Yeah, it help me to think [...]
ID13: Yeah, I think it should be optional. If you want to make your argument more strong you [...]

Excerpt 2
ID60: Na segunda redação eu achei que eu recebi muita informação e não consegui organizar (...) Eu tinha muita informação mais os vídeos, os textos de apoio que eu li e eu meio que me perdi no tempo.
In the second essay [the PBTW exam], I thought that I received too much information and I could not organize it (...) I had a lot of information plus the videos, the supporting texts that I read, and I kind of lost track of time.

In sum, three participants mentioned that they thought that combining the ideas from the article and videos with their own writing was difficult. Integrating the article and videos was not the only source of difficulty in the PBTW exam; some participants also described having difficulty understanding the two short videos that they watched.

b) Using ideas from the source materials

Although the participants thought that incorporating the source materials in their writing was difficult, some participants mentioned that the videos, article, and/or discussion helped them generate ideas that they could use in their essays.

Excerpt 23
ID16: I think the second one [the PBTW exam] help me to make a contrast or idea (...) When we discuss about uh the topic, that will help me to think about what I want to write and my idea.

Excerpt 24
R: What did you think of the two exams and which one did you prefer?
ID24: The long term.
R: Why?
ID24: Because is easy to take a thesis statement when you talk to another students and read the article, because I used some parts of the article to make my essay.
ID21: For me even the long - the longer essay the topic was more difficult, for me was more easy because uh with the videos and the articles I had more ideas to write in my essay. So even the topic was more difficult for me was more easy.
(...)
ID22: I prefer the first one, because I could build my essay with this kind of argument in the article.
Excerpt 25
ID45: Deu mais suporte, a segunda porque teve os artigos pra ver - os vídeos, então acho que foi bem mais fácil que a primeira (...) As discussão com os grupos acho que também ajudou bastante porque deu mais idéias, pode trocar idéia.
It gave us more support, the second one [the PBTW exam], because there were the articles to see - the videos, so I thought it was much easier than the first (...) The discussions with the group also helped a lot, I thought, because they gave us more ideas, we could exchange ideas.
ID42: Aí vc percebe (incomprehensible) e muda o que você tava pensando. É bom pra ter mais exemplo, alguma coisa assim.
Then you realize (incomprehensible) and you change what you were thinking. It [the PBTW exam] is good for having more examples, something like that.
(...)
ID43: Eu acho que deu um exemplo assim no que basear pra escrever.
I think that it [the PBTW exam] gave an example, like, on which to base our writing.
ID42: Pra mim foi como um complemento, eu acho que não mudou totalmente a minha idéia, mas serviu pra complementar as idéias e tornar mais consistente, talvez, assim, tendo alguma coisa como exemplo ou alguma coisa que possa complementar seu pensamento pra deixar ele mais forte.
For me it was like a complement; I do not think it changed my idea completely, but it served to complement my ideas and make them more consistent, maybe, like, having something as an example or something that can complement your thinking to make it stronger.

Excerpt 26
ID73: I would like to choose the first one, the long one, because we can watch the video and watch the article and talk about some information with the classmate so I can I think everyone can get more information and then to write a essay (...) I can use some (incomprehensible) from the article or the video to support my opinion in the essay.
ID49: The first exam [the TW exam] maybe you need to think about yourself and the second [the PBTW exam] you can discuss with other people so you can get ideas from other people and maybe it will give you new ideas and some support, something can support your idea.

All of the participants who mentioned using the source materials said that the article and videos gave them ideas that they used in their writing. In addition, some participants also had positive things to say about the group discussions; these students described that they used ideas that arose in the discussions in their essays.

c) Difficulty understanding the videos

Eight of the eighteen participants who were interviewed mentioned that they had difficulty understanding the videos. Their opinions about the videos are expressed below.

Excerpt 4
R: What about the videos, were they difficult?
ID07: The videos were so fast. The first guy is so fast so I just picking some ideas in my timed writing.
(...)
ID16: The video, the problem in the second exam, something i[...]
(...)
ID21: [...] article was very useful for me understand. I thought it was a little difficult, the videos.

Excerpt 5
ID51: The second one [the PBTW exam], I think the problem is understanding, because we [...] the videos, but the details was a little hard.
(...)
ID49: The video maybe a little bit difficult because people speak very quickly and maybe use [...]
(...)
ID50: [...]

Excerpt 6
ID74: The first one [the P...] what is he talking about, but I understood he agree with gun control. Or disagree, maybe? He thinks people should have gun, I already know this, but other points is hard for me to follow. And then the second video I understand most of the part.
(...)
ID73: For the video, uhm, I can, uh, get the main point, but I am not sure I can clear to get the detail information. Uhm, but at the essay [the article] is very easy. I can know the detail information.

The eight participants who reported having difficulty understanding the videos explained that the people in the videos spoke too quickly and that understanding the details of the videos was challenging, although they could understand the main ideas. One issue that emerged in the post-writing questionnaire was that of topic familiarity; the majority of the participants in the semi-structured interviews also discussed which topic they liked best.

d) Topic preference

Another theme that emerged in the semi-structured interviews was that of topic preference. Fourteen of the eighteen participants who were interviewed mentioned this theme. Four commented that they preferred the topic of gun control, while the remaining ten stated that they favored the topic of obesity. Below are the comments that the participants made about the two topics of the exams.

Excerpt 7
ID07: The topic is gun control, so is difficult to me (...) Change the topic [of the PBTW exam]. Topic is so heavy.
(...)
ID13: [...] [of the PBTW exam], that was gun control, because we spent three of four hours talking about this in the Speaking and Listening class so we already have a knowledge about this.
(...)
ID09: The topic [obesity] also very easy for me (...) It [gun control] is a big question for us, uh, because, uh, when I know this topic is about gun control I feel very scared. [...] very scared.

Excerpt 8
ID24: I thought that the short timed writing exam was easier for me because the topic was easy [...] pretty easy to think about gun control and write in 45 minutes. I think that the problem [...]
(...)
ID21: For me, even the longer essay was, the topic [gun control] was more difficult, for me [the PBTW exam] was more easy.
(...)
ID22: The second one, the topic was easier than the first one (...) because gun control is a little difficult to express.

Excerpt 9
ID45: No Brasil a gente não discute muito sobre gun control, agora obesidade é mais discutido. Por isso eu acho que o segundo foi mais fácil.
In Brazil we do not discuss gun control much; obesity is more discussed. That is why I think the second topic [obesity] was easier.
ID42: Pra mim também, porque inclusive a gente sempre ouve falar de obesidade daqui dos Estados Unidos, então é um assunto mais comum, então eu acho que é mais fácil.
For me that is true also, because we actually always hear about obesity in the United States, so the topic is more common, so I thought it was easier.
ID43: Eu também acho que obesidade foi mais fácil.
I also thought that obesity was easier.

Excerpt 10
ID59: Na primeira, eu tive muito mais dificuldade porque eu não tinha o que falar. Eu não, não era um tema que eu tinha a opinião formada nem em português. Tipo, se me mandasse escrever esse tema em português eu não saberia, eu ia ficar perdida também. Meus argumentos não tinha base, tipo, não tinha o que falar. Ficou ruim, tipo, eu sabia que eu tava escrevendo, mas tipo, eu não conseguia fazer melhor que aquilo porque eu não tinha o que falar sobre aquilo. A segunda não, a segunda, eu acho que, lógico, ter um preparo antes é muito melhor. Você, tipo, lê sobre aquilo, você tem mais embasamento pra falar. A segunda, acho que mesmo que se eu não tivesse, acho que teria ficado muito melhor que a primeira.
In the first one [the TW exam], I had much more difficulty because I did not have anything to say. It was not a topic that I had a formed opinion about, not even in Portuguese. Like, if someone told me to write about this topic in Portuguese I would not know how, I would be lost too. My arguments had no basis, like, I had nothing to say. It came out badly, like, I knew what I was writing, but, like, I could not do better than that because I had nothing to say about it. Not the second one [the PBTW exam]; in the second one, I think that, obviously, having preparation beforehand is much better. You, like, read about it, you have more grounding to talk about it. In the second one, I think that even if I had not [had the preparation], I think it would have turned out much better than the first.
The essay was bad, like, I anything to say about it. The second one [the PBTW exam], though, I think that, obviously having preparation before is much better . You, like, read about it, you have have still written a better essay than the first. (...) ID60 No meu caso, eu me lembro que no ensino mé dio eu fiz uma redaçã o parecida com a do desarmamento, no Brasil, aí eu tinha mais idé ias e eu consegui colocar no meio tempo. In my case, I remember writing a similar essay about gun control in high school, in Brazil, so I had more ideas and I could use them in the time I ha d. 90 Excerpt 11 ID74 gun control before I took the timed writing. But the one about junk food, I think that (...) ID73 From the topic, I think that gun control I can write more information than the junk food because gun control is a heat discussion in the United States and I studied gun control in the Speaking class also, so I can write more information in gun control. Excerpt 12 ID 51 I like junk food more. This topic, I have a lot of things to write, because is a harder than the second one. R So you thought gun contr ol was more difficult? ID51 Yeah. ID49 one. R Right. ID50 I think the both topics is both okay, because if you explore this topic you will find that both topics are no t just simple topic. Only two of the eighteen interviewees said that they preferred gun control. Ten participants said they did not like the topic of gun control because it is a difficult topic or because they are not familiar with the topic. Nine partici pants mentioned that they liked the topic about obesity. 91 e ) Time constraints Five participants mentioned the issue of time. Most of them believed that they did not have enough time to write their essays during the TW exam. Below are the comments they made about the issue of time. Excerpt 13 ID13 enough to develop ideas good and teachers who collect our exam uh they want they require a high levels of our arguments and strong argument in 4 a good essay, a strong essay with goo d arguments. And we are learning a new language. would be a necessary choice. Excerpt 14 R What did you think about the time limit? ID 24 ime to write, but a bad time to revise your essay. You can write minutes. Probably my essay will be stuffed with a lot of grammar mistakes. ID25 ust write and revise never. ID13 92 ID25 it. Excerpt 15 ID42 Eu sempre tive problema com o tempo, então fazer uma redação em 45 minutos pra mim é quase impossível porque meu cérebro pára e eu não consigo pensar em nada. I always had problems with time, so writing an essay in 45 minutes to me is almost (...) ID 45 Quarenta e cinco minutos não dá pra você mostrar tudo que você sabe fazer, você só joga a idéia no papel , então acho que se tivesse mais tempo e se tivesse essas discussõ es, tipo, discutir o tema, acho que seria bem melhor. Forty five minutes is not enough to show what you know, you only throw your ideas on the paper, so I think that if we had more time and if we had more discussions, like, discussing the topic, I think i t would be a lot better. (...) ID42 Eu acho que o maior problema seria o tempo també m (...) Eu acho que com o tempo você sente muita pressão e você acaba não escrevendo o que você sabe. I also think that the big gest problem would be the time (...) I thin k that with the time limit you feel a lot of pressure and you end up not writing what you know. Excerpt 16 ID 51 I think the time is a problem because we have to uh think about the topic and writing in fi fty minutes, fifty - five minutes (...) 
I write very slowly.
ID50: So do I, same question, the time problem. In the first uh exam [the TW exam] the information given in the article is limited so you should spend a lot of time to organize your thought. I think at last I have many thought I want to write them all down, but the time is limited. Uhm so if I have more uh time or more information uh maybe my writing is better.

All of the participants who mentioned the issue of time in the semi-structured interviews believed that 45 minutes was not enough time to write a well-developed essay. No one disagreed.

f) Planning time

Another theme that emerged in the semi-structured interviews was planning time. While some participants complained that they had no time to plan in the TW exam, others mentioned that the ten minutes given to them in the PBTW exam was also not enough time to plan. The excerpts below illustrate these opinions.

Excerpt 17
ID16: My problem I want to write uh I always spend time to think what I want to write and make idea and think about uh the question (...) The first exam [the TW exam] it take a long time for me to think about what I want to write to find my idea.

Excerpt 18
ID22: [...] essay and I prefer always do first the outline.

Excerpt 19
ID43: A segunda teve mais tempo pra gente pensar o que a gente ía escrever, então achei mais fácil. A primeira deu o tópico e a gente já tinha que fazer o brainstorming, aí ficou mais difícil.
In the second one [the PBTW exam] we had more time to think about what we were going to write, so I thought it was easier. The first one [the TW exam] gave us the topic and we had to start brainstorming right away, so it was more difficult.

Excerpt 20
ID59: A segunda eu acho que se eu tivesse mais tempo eu teria escrito mais, eu poderia ter pensado melhor em como eu teria dividido aquele texto, mas, tipo, nos 45 minutos eu fiz o que deu pra fazer.
In the second one [the PBTW exam], I think that if I had had more time I would have written more, I could have thought more about how I would have organized the text, but, like, in the 45 minutes I did what I could.
(...)
R: O que vocês acharam do tempo que vocês tiveram pra planejar?
What did you think of the time that you had to plan the essay [in the PBTW exam]?
ID60: Os 10 minutos?
The 10 minutes?
R: É.
Yeah.
ID60: Eu acho que foi pouco pra mim. Eu tinha muita idéia e não consegui formar tudo.
I think it was too little for me. I had a lot of ideas and I could not organize everything.
ID59: Pra mim também foi pouco.
ID74 Uhm, maybe we need some time to think about, cause the gun control you give us time to write not information , but our guideline in the paper, so when I start writing the paper I 96 know in the introduction what I want to write and the body or conclusion, what do I need R Yeah. Okay. Good. ID73 Between the gun control and junk food writing, I think gun control I have lots of time to plan and to write and the junk food just 45 minutes, I only use 45 minutes to plan how to write and what should be write in the essay, so I think the gun control the time is better Most of the participants who commented about the ten minutes that they were given to plan their essays agreed that they liked the time that they had to plan. However, two students thought that ten minutes was not enough to plan their essays. Below is a summary of the main themes tha t emerged in the post - writing questionnaire and the semi - structured interviews: 1) The students like the planning time in the PBTW exam; 2) they believe that the article, videos, and group discussion are useful and give them ideas and background information that they can use in their writing ; 3) some of them had problems integrating the ideas from the article and videos in their writing ; 4) Some students found the videos difficult to understand; 5) most students liked the topic about obesity more than the topic about gun control; 6) they think that they cannot write a good essay in 45 minutes; 97 think about the PBTW and TW exams. 4. 3 nt exams? The two raters participated in two training and norming sessions, both of which were audio recorded. The first session lasted approximately two hours and the second one lasted one hour and thirteen minutes. The two sessions occurred three weeks a part. After the raters scored all 182 essays, I interviewed them individually , approximately three weeks after the second norming session . First, I report on the common themes that emerged in the training and norming sessions, which were rubric, source int egration, and differences between the TW and PBTW essays. I then go on to report the themes that emerged in their interviews: the topics of the exams, source integration, the rubric, and the content validity of the exams. 4. 3 .1 Norming sessions In the fi rst session, both RK and RM shared their thoughts about the rubric , source integration and the differences between the TW and PBTW essays . a) Rubric RM did not like that the rubric assigned the same amount of points to each category, because she believed th at content and organization should be more valued than spelling and punctuation. She said: 98 n give more RM mentioned this same issue in the second session as well. She said: ge RK mentioned the fact that the rubric only had three levels, although there were actually four levels and one could assign a zero, a one, a two , or a three. She explained that, since no one would receive a zero, the rubric really only had essay. - going to get a RM agreed with RK, and said the following in the very beginning of the second norming session: enough spread so I like, probably a lot of spread and equality of twos across the board enough room in each category to say, like, like you basically have good, bad or in between and there needs to be more spread in the in between part. 
In the middle of the session, she again brought up the lack of spread in the rubric: lated and this is where spread would help cause if there were - if there was more spread in the middle, you could say well this is - for instance, this one is much 99 better organized than the one we just read right before it because it, like, has an introducti on, it has body paragraphs, so you could give credit for having topic sentences, RK agreed with RM and suggested that the rubric might be better if it had half scores, for instance, if one could assign a score of 2.5. In the second session, RK also mentioned that she had difficulty with the wording of the rubric. She said: Both RK and RM described that they had difficulty assigning scores for cohesion. RM said: - - partly the rubric is not very helpful with cohesion, like, satisfactory cohesion, and we talked about that grammar that, like - RK interrupted RM and added the follow ing: - - kind of relatively high cohesion scores just because even if you have a lot of grammar can, like, RM also said that assigning scores for vocabulary was not easy, particularly deciding between a two and a three. She said: 100 hree because, like, so, like, he - for instance, he misuses principle in here, which I would argue u know, word class RK added: RK revealed that organization was also a difficult category for her to assign a grade. She said: thought that the rubric should have a wider range o f scores and clearer descriptors for each score and category. b) Source integration I asked the raters what they thought about the PBTW exam, which they had just finished scoring. RM discussed that the participants could not incorporate sources very well. Sh e said: problem - so this is a problem with the essays more generally is like, a lot of people just - 101 have a hard time with context. RK responded: Both raters agreed that the test takers lack ability to integrate sources in their writing. RM felt very strongly about this issue. c) Differences between the TW and PBTW exams At the end of the seco nd norming session, I asked the raters if they noticed any differences between the TW essays and the PBTW essays. RM answered: I kind of hate to say it, I think in some ways these [the TW essays] are better because - like, the other ones if you could incorporate sources well it helped, worse, whereas this is like just purely opinion - y actually know how Both RK and RM thought t hat the TW essays that they read in the norming session were longer than the PBTW essays. RK : RK These are longer, I think. RM These do seem to be longer. RK 102 RM explained that she preferred reading the TW essays mainly because it did not require dis tracting. Both RK and RM thought that the TW essays were longer than the PBTW essays. 4. 3 .2 Interviews T he raters were interviewed separately one day after they had finished scoring the essays. I first interviewed RK on Skype and then RM in person. I ask ed the raters the same questions and below I report on the common themes that emerged in both interviews , two of which were also mentioned in the norming sessions : the topics of the exams , source integration, the rubric , and the content validity of the exa ms . a) Topics of the exams One of the questions asked the raters if they believed that the two topics were comparable. Both of them seemed to think that they were indeed comparable. Nevertheless, they thought that gun control was a bit more complex than obe sity. Below is what RK had to say about the topics: Yeah , know. 
Compared to the obesit y thing, where they can just kind of say what they think you want to hear or whatever. People have stronger opinions about guns RM answered the qu estion in a very similar manner, but she elaborated more on why she thought obesity was an easier topic: 103 I think , overall , I guess they were comparable. I do think, though, that the obesity topic was easier for people to just talk from their own experience than gun control, so you e , like , kind of opinion statements, or experience - based statements in the obesity stuff, in the obesity papers, than the gun control paper. So for instance , there were a handful of gun control papers that talke d about Chinese kids saying their Chinese fam ilies had - were concerned about living in the U.S. because of guns, but that was like two or three in the whole lot, whereas the obesity papers tend - you tend to have a lot more like, you know, I was surprised when I came to the U.S. or American stores do this, necessarily need - knowledge so to speak The raters agreed that the topic of gun control was more complex for international students than the topic of obesity. b) Source integration Both raters mentioned the difficulty that the participants in this study seemed to have integrating sources in their writing. RK said the following: support a good argument and then you also have to throw in this other aspect [source RM said: 104 I do think that for the PBTW exams people who were able t o integrate information probably did better in terms of content because it gave them something to talk about, but end up helping, I guess, because you get, like, a random fact, and no discussion of it While both raters thought that the participants in this study could not integrate sources well, RM believed that the few participants who could integrate sources performed better in the PBTW exam. c) Rubric Both raters stated that they had problems with the rubric , an issue that first arose in the second norming session , which occurred three weeks before the interviews . Once again, RK and RM shared that they thought that the descriptors in the rubric did not distinguish themselves well from one another. RK said: frequent, some, almost no inadequacies : distinguishing like , it needed to have more of a range because I think I felt like I ended up with a lot of stuff in the two terrible, but things would be rated twos for different reasons (...) More precise descriptors - of ways you can deal wi 105 necessarily mean well - organized, but I tended to give, like, priority to that because at RK als o discussed which categories were easy and difficult to score with the analytic rubric. She thought that grammar, punctuation, and vocabulary were the most difficult, and spelling and organization were easier. This is what she said about the categories: pelling was the easiest. So the categories that I found I put off when I was going through each essay, like the three categories that were the last ones for me to decide the . It much. Something like organization or content kind of jumps out from the beginning, for me at least (...) And punctuation I thought was h a - kind of difficult bec ause you really - ons in - does that drop it down to a two? I feel like there the prompt, but it was just so Once again the raters mentioned that the descriptors in the rubric should be clearer. d) Content validity RK and RM also mentioned the issue of content validity. Carr (2011) defined content is teaching. 
The two raters believed that th e PBTW exam was a good tool to evaluate students in an academic course in which source integration is one of the objectives. RK stated : 106 f that is one of the main objectives of that class , to be able to do that [integrate sources] , then I would say that i t is an advantage to see whether they can. So for me , I guess , and if so, then I guess that would be a preferred format Below is what RM said about content validity : eah, so I , like, I actually like the idea of the PBTW exam because I think - I think it has the potential to measure kind of higher level synthesis writing skills in a way that the just ity to like write from actually a better - you have to do because you have to pull information fr The raters thought that the PBTW exam is a good tool to evaluate students when the course objectives include teaching them to integrate sources in their writing. A gathered in the norming and training sessions and in the interviews were the following: 1) The rubric lacked range and clear descriptors; 2) the students d id not know how to integrate sources in their writing; 3) the TW essays were longer than the PBTW essays; 4) gun control was a more complex topic than obesity; 5) and the PBTW exam is valid when the course objectives include source integration. Now that I have presented the results of the quantitative and qualitative research questions, I jointly interpret and discu ss the results of the quantitative and qualitative 107 data, following the convergent parallel design described by Creswell and Plano Clark (2011). The reason for discussing the results of the study after the quantitative and qualitative data have been present ed is to allow me to gather different pieces of information about the TW and PBTW exams, such as scores, the lexical and syntactic complexity, as well as the grammatical accuracy and fluency of the essays, and the xams, all of which, combined, paint a more in - depth picture of the two exams in question. 108 CHAPTER 5 : DISCUSSION In this chapter, I discuss the results of the quantitative and qualitative research questions . I do this while following the mixed - methods des ign that Creswell and Plano Clark ( 2011) described as the convergent parallel design, in which the researcher combines the quantitative and qualitative results in the interpretation phase of the study. First, I briefly summarize the main findings of the qu antitative and qualitative questions . Second , I argue why PBTW exams are the best way to evaluate ESL academic writing based on the results of my study. T hird, I discuss the and perceptions of the two exams. Fifth and finally , I finish this chapter by discussing the difficulties of implementing PBTW exams in ESL academic writing courses. 5.1 Main findings of the study The foremost finding of this study is that PBTW exams have many benefits over TW exams. For example , when st udents are given time to familiarize themselves with the topic of a writing test through video - watching, readings, and discussions , and when they are also allowed time to plan, they write significantly longer essays and significantly more words per minute . These r esults were also found by Cumming et al. (2005), David (under review), Ellis and Yuan (2004), Kellog g (1988), and Ong and Zhang (2013 ). Longer essays are good because they have been found to be strongly correlated with higher scores, as reported by Guo, Crossley, and McNamara (2013). Guo et al. 
(2013) found that text length was a strong predictor of the quality of an essay for both independent and integrated writing tasks. In addition, the participants used significantly more sophisticated vocabulary and more different word types and types of nouns in the PBTW exam. This benefit was also found by Ong and Zhang (2013) and Cumming et al. (2005), who stated that learners used a wider variety of words in their study on the effects of task complexity. These findings contradicted work by Johnson et al. (2012), Kormos (2011), and Kuiken et al. (2005). Perhaps the exposure to the source materials and the discussions with their classmates allowed the participants in this study to use new vocabulary that they encountered in the reading, the videos, and the discussions. Indeed, Cumming et al. (2005) hypothesized that the learners in their study used more different words in the integrated task because they borrowed words from the source text. Moreover, the planning time might have given the students the opportunity to think more carefully about the vocabulary that they were going to use in their essays.

The participants in this study also scored significantly higher in content and punctuation in the PBTW exam. This could be explained by the greater amount of time the students spent on task in the PBTW exam and by the many ideas learned from the source materials and brainstormed during the discussion and planning time. Before the students wrote their essays in the PBTW exam, they spent 45 minutes learning about and discussing the topic. The high level of engagement with the task in the PBTW exam, combined with the extra planning time, probably prepared the test takers better for the writing task and allowed them to brainstorm more ideas to include in their writing. The extra planning time also gave the test takers the opportunity to think about how to organize their essays and to focus solely on writing during the 45 minutes that they had to write. In contrast, the learners had to plan, organize, and write within 45 minutes in the TW exam. Last, but not least, using ideas from the source materials in the PBTW exam may have been the reason why the participants scored higher in content for this exam.
However, the results of this study suggest that this is not the case when learners are writing across the same genre, but about different topics and under different con ditions (TW exam versus PBTW exam). It seems as though what helps learners write more syntactically complex essays is not the condition in which they write, but in which genre they are writing. Many impromptu timed - writing exams, as I suggested above, elic it essays about personal experiences to ensure that the participants are not given a prompt with which they are not familiar. Writing about personal experiences might generate essays that are similar to narratives and, as research suggests, learners seem t o use simpler syntax when writing narratives. 111 Although there were no significant differences between the scores that the participants received on both exams, some participants scored much higher in the PBTW exam while others scored considerably higher in the TW exam. Therefore I decided to select the participants who scored much higher in one exam to investigate why that was the case. Participants ID12 (Brazilian) , ID39 (Brazilian) , and ID52 (Chinese) all scored relatively higher in the PBTW exam. They sco red 5.5, 4.5, and 4.5 points more in the PBTW exam, consecutively. After comparing the scores that they received for each category, I observed that all three of them scored 1 point or higher for content. ID39 and ID52 received scores twice as high for cont ent in the PBTW exam. These two participants also scored 1 point or higher for cohesion in the PBTW exam. ID12 scored twice as much on punctuation and 1 point more on vocabulary in the PBTW exam and ID 39 scored twice as much for organization. All of the o ther scores that the participants received were either the same or half a point higher. These three participants received much higher scores for content in the PBTW exam , findings that the results that the t test s also indicate. Participants ID31 (Brazili an) , ID44 (Angolan) , and ID 72 (Chinese) score d 4.5, 5.5, and 7.5 points higher in the TW exam, consecutively . All three participants scored 1 or 1.5 points more for vocabulary and 1, 1.5, and 2 points more for spelling. Indeed the results of the t test s r evealed that the participants scored significantly higher for spelling in the TW exam. However, it is surprising that these three participants received higher scores for vocabulary, because the results of the t tests revealed a different picture: the parti cipants used significantly more types of the Lexical Syntactic Analy zer (Lu, 2012) use to measure lexical complexity. Two participants, ID44 and ID 72, scored 1 point and 1.5 points higher for cohesion and ID31 and ID72 scored 1 112 point or 1.5 points higher for content. A trend seems to emerge from the analyses of these six participants: most of them received higher scores for content and cohesion. These two constructs are perhaps linked to higher quality essays. the exams. Two of the th ree participants who scored considerably higher in the PBTW exam answered that they preferred the PBTW exam when compared to the TW exam. The participant who preferred the TW exam, however, answered that the PBTW exam was easier. Similarly, two of the thre e participants who scored much higher in the TW exam said that they preferred the TW exam. performance. In fact, 73% of the participants who scored higher in the PBTW exam reported that they preferred the PBTW exam. 
The students in this particular study may have performed better on spelling in the TW exam because they had more resources available for paying attention to such mechanical forms. As Robinson (2001) explained, a task can become more complex when learners are required to complete other tasks related to the main task. During the PBTW exam, participants in this study had to discuss, watch videos, and read an article before they began writing. All of these tasks may have made the PBTW exam more complex than the TW exam, and this complexity may have made it feel different in the eyes of the test takers. The need to integrate sources and use source information may have made the test takers think less about spelling. Rather than focusing on the minute mechanics of writing, in the PBTW exam test takers may have focused on getting their ideas across and on using source materials. In taking time to plan and formulate their ideas, they may have concomitantly focused more on content and punctuation (as the data suggest) because these may correspond with structuring an essay well. Baralt, Gilabert, and Robinson (2014) suggested that repeated attempts to perform complex tasks prompt the use of more complex language (p. 19). Perhaps the students' unfamiliarity with the task type had an effect on their performance, and if they have more opportunities to practice this type of process-based task, they will use more complex language.

There is more evidence that the PBTW exam and the TW exam do not tap into the exact same construct. The scores that the participants received in the two exams correlated only moderately, demonstrating that the exams have low concurrent validity (they do not measure the same thing). This finding is similar to the one in my earlier study (David, under review). Moreover, the correlation coefficients for the categories of vocabulary, punctuation, and spelling were also moderate, while the coefficients for content, organization, cohesion, and grammar were small. The impromptu TW exam may only measure writing (or may measure it more narrowly), while the PBTW exam likely measures a broader construct of writing that additionally taps into reading, listening, and source integration. However, this integration of tasks may be new to some of the test takers. Being a good reader (being able to comprehend a text) and being a good writer does not necessarily translate into being able to use information from a reading passage while writing. The integration of different tasks is difficult and must be practiced if one wants to be good at it. Such findings have also been reported for reading-to-write tasks: good readers still have to learn to write about what they have read. Thus, not only are the two test types potentially measuring differing constructs, but one (the TW exam) may be familiar to the students in this study, while the other (the PBTW exam) may be unfamiliar. This further complicates a clear view of the differences between the two exams.
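As a concrete illustration of the concurrent-validity analysis described above, the snippet below reproduces, in miniature, the two statistics involved: a Pearson correlation between the two sets of exam totals and a paired t test on the same totals. The actual analyses were run in SPSS; the arrays here are illustrative placeholders, not the study's data.

    # Correlation and paired t test between TW and PBTW exam totals.
    # The values are hypothetical stand-ins for participants' scores.
    from scipy import stats

    tw_totals = [12.5, 14.0, 11.0, 15.5, 13.0, 12.0]    # placeholder data
    pbtw_totals = [15.0, 13.5, 14.0, 16.0, 12.5, 14.5]  # placeholder data

    r, r_p = stats.pearsonr(tw_totals, pbtw_totals)
    t, t_p = stats.ttest_rel(tw_totals, pbtw_totals)

    # A moderate r (around .4, as reported above) would indicate that the
    # exams do not rank test takers the same way, i.e., low concurrent
    # validity; the paired t test asks whether the mean totals differ.
    print(f"Pearson r = {r:.3f} (p = {r_p:.3f})")
    print(f"paired t  = {t:.3f} (p = {t_p:.3f})")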
Another finding of this study is that the overall intra- and inter-rater reliability for the PBTW exam was slightly lower than the reliability for the TW exam. When analyzing the inter-rater reliability for each of the analytic categories, I found that grammar and punctuation had the lowest reliability in the TW exam, and vocabulary and punctuation had the lowest reliability in the PBTW exam. On the other hand, spelling and cohesion had the highest reliability in the TW exam, and spelling and content had the highest reliability in the PBTW exam. The reason for these findings may be related to the rubric, which I discuss later. But another reason, again, may be task familiarity. The raters in this study were more used to rating TW exams, and their unfamiliarity with scoring PBTW exams may have led to lower agreement when scoring them. After all, raters get better with practice (Weigle, 2002).

Asking participants what they thought also tended to point to the superiority of the PBTW exam. The great majority of the participants in this study thought that the PBTW exam was easier (72% of the participants said so in the post-writing questionnaire), and they preferred the PBTW exam in comparison with the TW exam (77% of the participants said so in the post-writing questionnaire). The reasons for this preference varied. The most mentioned reasons were: 1) the article and videos gave the participants many ideas that they could use in their writing (a theme that also emerged in the semi-structured interviews); 2) they had more time to engage with the writing task; and 3) they were more familiar with the topic because of the article, videos, and discussions. Eight of the eighteen students interviewed thought that the videos and article helped them to think of ideas that they could use in their essays. However, some participants explained that the article, videos, and group discussion were not very helpful when their own ideas opposed the ideas presented in the sources. In addition, some participants thought that the videos or article were difficult to understand. This theme also emerged in the semi-structured interviews: eight of the eighteen participants who were interviewed said that the videos were very difficult to understand. Again, this comment tends to show the true integration of the task, because when asked about the PBTW exam, test takers talked about all aspects of the exam, including the videos. When asked about the TW exam, the participants mainly complained about the fact that they were not familiar with the topic and did not have time to plan their writing, as well as about the short amount of time that they had to write the essay. As reviewed at the beginning of this paper, researchers know that students perform better on TW exams if they are given time to plan (Ellis & Yuan, 2004; Kellogg, 1988; Worden, 2009). And researchers have described how students score higher when they are tasked with writing about something that is familiar to them, as opposed to something unfamiliar (He & Shi, 2012; Tedick, 1990; Winfield-Barnes & Felfeli, 1982). Thus the students' complaints are valid. Five of the eighteen students interviewed said that they did not have enough time to write the essay in the TW exam, and ten explained that they did not have enough time to plan their essay in either the TW exam or the PBTW exam, even though they were given 10 minutes for planning in the latter.

The raters also had a clear preference for the PBTW exam. They indicated that they thought that the PBTW exam had more content validity if one of the objectives of a course is to teach students to integrate sources. At the same time, the raters complained that the majority of the participants did not know how to integrate sources, and one of the raters explained that the participants who could integrate sources did it extremely well and received a higher score for content.
However, she felt that this was rare, and in most cases the lack of ability to integrate sources was distracting. This may show again that the test takers were unfamiliar with the PBTW exam, and the raters' attention to the integration of sources may also be one of the sources of their lower reliability when scoring the PBTW exams.

5.2 Why PBTW exams are a better fit to evaluate ESL academic writing

The construct of academic writing is not one that is easy to define. While there are many models that attempt to explain the writing process, not many attempt to define academic writing. However, all of the models of the writing process, including the Hayes-Flower (1980) model and the knowledge-telling model, include elements such as background information, planning, drafting, and revising. The ACTFL can-do statements for advanced academic writing include some of these skills as well (see http://www.actfl.org/sites/default/files/pdfs/Can-Do_Statements_2015.pdf for the complete list of can-do statements for all proficiency levels), in addition to other skills. Below are some of the can-do statements for advanced learners (these range from low-advanced to high-advanced): I can revise class or meeting notes that I have taken for distribution; I can draft and revise an essay or composition as part of a school assignment; I can write a research paper on a topic related to my studies or area of specialization; I can write a position paper on an issue I have researched or related to my field of expertise.

It is clear from these can-do statements that skills such as planning, revision, and research are ones that advanced L2 writers should master. The CSU Expository Reading and Writing Task Force put at the top of their list of the writing skills that college-level students must have to succeed in regular academic classes the following two skills: synthesizing ideas from different sources and integrating quotations and citations (as cited in Ferris, 2009). Moreover, Weigle (2002) noted that academic writing in higher education is very often based on a reading or listening text and that most assignments that college students complete require them to incorporate sources, which can be the course textbook or readings the students have researched. These requirements align with the elements of the writing process discussed above. Research on the types of writing tasks that students do in tertiary non-ESL academic classes shows that they often include research papers, critiques, summaries, and so on (Cooper & Bikowski, 2007; Hale et al., 1996; Horowitz, 1986; Yigitoglu, 2008). In addition, advanced university L2 writers have to know how to analyze, interpret, create, and summarize information, persuade others of their opinions, and conduct and write research projects (Grabe & Kaplan, 1996). When suggesting activities that advanced L2 writers should engage in, Grabe and Kaplan listed planning, using multiple sources, reading critically, engaging in guided discussion, writing outlines, and being exposed to multiple types of writing genres, none of which are promoted or encouraged by impromptu TW exams. Cumming (2013) called for a redefinition of the writing construct for English for Academic Purposes classes that includes source integration.
Most ESL writing textbooks, such as Sourcework: Academic Writing from Sources (Dollahite & Haun, 2007) (used at the English Language Center), teach students pre-writing strategies and source integration, and many writing teachers encourage and might even require their students to plan, write an outline, and do research on a topic before they begin writing the first draft of a paper. With the exception of in-class exams, many professors assign papers that students have to write outside of class. These papers allow students plenty of time to plan and do research. A definition of the construct of academic writing, based on the information above, then, should include the following elements, among others: planning, drafting, revising, researching, synthesizing ideas, thinking critically, and integrating sources. TW exams clearly do not support the use of these writing skills.

The results of this study suggest that the TW and PBTW exams possibly measure two different constructs, as my earlier studies (David, forthcoming; David, under review) also revealed. Indeed, it is not difficult to conceptualize why integrated and process-based exams do not measure the same constructs as impromptu TW exams. The impromptu TW exam measures writing, while the integrated and process-based exams measure the following constructs in addition to writing: reading (the participants had to read and understand the article); listening (the participants had to watch two videos and understand the information in them); and source integration (the participants had to integrate information from the article and videos in their writing). These are all skills that are required of students in college-level classes, according to the definition of academic writing that I proposed above. However, these were not the only skills the participants had to use in the PBTW exam. They were encouraged to take notes while watching the two videos, and not every student is a good note taker. One participant wrote in the post-writing questionnaire that the videos were "too fast to write down and hard to understand what they said," which suggests that not every student was able to take notes while watching the videos. Some participants did not even take notes, although the teachers who helped to collect data and I always encouraged them to do so; one participant, for example, said that she did not take notes. In addition, the participants were given time to plan. If students are not accustomed to planning and are not familiar with pre-writing strategies, they may not be efficient planners. The participants in this study may not value pre-writing during writing exams because of the nature of the impromptu TW exam to which they are accustomed. Such exams put emphasis on the writing product, not the writing process. Worden (2009), for example, found that learners who engaged in high levels of pre-writing performed significantly better, suggesting that knowing how to apply pre-writing techniques and engaging in higher levels of pre-writing can improve the quality of learners' essays. As explained above, many models of the writing process and many ESL writing textbooks include pre-writing as an important part of the process of writing. While some students may engage in planning even in impromptu TW exams, most likely will not do so because of time constraints. Indeed, many students in this study complained that they did not have time to plan during the impromptu TW exam; one explained that in that exam you do not "have enough time to plan and to think what you will write." Some participants even felt that the 10 minutes of planning in the PBTW exam was not enough.
Engaging in group discussion before writing can also be seen as a pre-writing activity because it gives students the opportunity to exchange ideas and think of new ideas. One key component of the PBTW exam is source integration. The participants had to read an article, watch videos, and synthesize and integrate the information that they learned from the source materials in their writing. Most experts in the field of L1 and L2 writing agree that these skills are crucial for students, native or nonnative, to succeed in their non-ESL academic classes, as described above. More than half of the participants mentioned in the post-writing questionnaire that they had difficulty integrating the information from the sources in their essays. This same issue arose in the semi-structured interviews. ID07, for example, said that she thought it was difficult to combine the videos and article with her own ideas. The students were not the only ones who mentioned their difficulty integrating sources. The raters also noticed that the participants did not know how to integrate sources in their writing. RM mentioned that most of the participants could not successfully incorporate sources in the essays that she read. RK agreed and added that the participants seemed to all use the same information from the sources. Second language writing researchers have reported similar difficulties (Cumming et al., 2005; Gebril & Plakans, 2013; Sawaki, Quinlan & Lee, 2013). Sawaki et al. (2013) noted that integrating information from various sources into written discourse requires a complex coordination of skills, and they explained that test takers who cannot integrate sources successfully may struggle for two reasons: they do not understand the source materials, which may have been the case in this study, or they have problems choosing relevant information from the source materials and organizing it in their own writing, which may also have been the case for the participants here. While some might argue that the test takers' difficulty incorporating sources is a valid argument for not including integrated or process-based writing tasks in progress or achievement exams, others might argue the exact opposite: if a program's objectives include teaching students to engage in the writing process by reading, discussing, and planning, and if the curriculum includes synthesis and source integration, then to obtain an accurate picture of whether students have met the course objectives, the impromptu timed-writing exam is not appropriate. If these skills are part of the curriculum, then the students should be assessed accordingly. If a program teaches writing as a process, and teaches other skills related to academic writing, such as synthesis and source integration, then the students should be tested on these very same constructs, and the process-based writing exam provides teachers with a better picture of whether students have mastered these skills. If programs continue to administer impromptu timed-writing exams, teachers will continue to train students to take such exams. This, in turn, creates a negative washback in L2 writing classrooms, which forces teachers to ignore (at least for part of the course) the construct of academic writing on which they should be focusing. Washback, as defined by Carr (2011), is how a test affects classroom teaching and learning. While teachers could be teaching students all of the writing skills listed above, such as planning, synthesis, source integration, critical thinking, and so on, they are wasting precious classroom time by training their students to take impromptu TW exams.
Weigle (2004) investigated a new placement test that integrates reading and writing and found that the test had already created a positive washback in the classroom, with teachers now focusing more on critical thinking skills and text analysis. Carr (2011) warned that independent tasks can lead teachers to "focus on developing discrete skills in isolation so as to better prepare students for their tests" (p. 17), but writing is not a skill separate from reading, listening, and speaking. Weigle (2002) added that, especially for classroom assessment, impromptu TW exams give students a false impression of what academic writing involves, as their only concern becomes performing well on the test. The fact that the students in this study were not comfortable integrating sources in their writing can be evidence of the negative washback in their academic writing classes. The students who participated in this study, in particular, have to take two timed writing exams per semester, which combined are worth 20% of their final grade. Teachers and students alike want to ensure success in these exams and value in-class timed writing activities as a result. The Conference on College Composition and Communication has made a similar point in its statement about writing assessment (as cited in Deane et al., 2008, p. 66). If teachers and language testers start implementing PBTW exams, academic writing teachers will start teaching these crucial writing skills that students so desperately need in order to succeed in their non-ESL academic courses, and they will therefore improve their teaching. Impromptu TW exams clearly do not mirror the skills that L2 learners will need to succeed in their ESL academic writing classes or in the regular academic classes that they will take, nor do they mirror the elements of academic writing that I discussed above, such as planning, synthesis, and source integration. While PBTW exams do not allow learners to revise, they give them the opportunity to engage much more in the writing process than impromptu TW exams do, by allowing learners to learn about and discuss the topic and to plan their writing. In addition, these exams require learners to read and think critically about the topic, which are also crucial skills for regular academic classes. Last, but not least, process-based writing exams allow learners to practice and demonstrate their source integration skills, skills which are much valued both in the ESL academic classroom and in other academic courses.

5.3 Test takers' and teachers' perceptions of the exams

Few studies have investigated test takers' perceptions of writing exams (David, under review; He & Shi, 2008; Lee, 2006; Powers & Fowles, 1999), even though their perceptions are extremely important because, according to Rea-Dickins (1997), test takers are one of the main stakeholders. In addition, if learners do not like an exam (if they feel the exam is inauthentic in any way; see Carr, 2011, p. 160), the learners may not feel invested in it, and their lack of motivation to prepare for the exam and take the exam may affect their performance (Lee & Coniam, 2013). This issue of how test takers and teachers perceive tests is called face validity (Hughes, 2003), and face validity is related to the perceived authenticity of a test (Carr, 2011). That perception can be self-evolved or adopted through the expressed beliefs of others (other students, the teacher, or other stakeholders). Students may singly or collectively have a negative view of a test if they believe that it does not measure what they are learning or if they perceive that others do not value it or trust its scores.
The information gathered in the post-writing questionnaire and in the semi-structured interviews suggests that the PBTW exam appears to have more face validity than the TW exam, which aligns with results from David (forthcoming) and Lee (2006). The participants in this study had a preference for the PBTW exam for the following reasons: (a) they learned background information and heard new ideas, both of which they could use in their writing; (b) they had time to plan their writing; and (c) they had more time to engage with the topic of the writing exam.

One way in which the source materials helped learners was by giving them background information about the topic and providing them with ideas to use in their writing. Many participants said that they liked the PBTW exam better because it gave them background information through the reading and the videos. ID05, for instance, wrote the following in the post-writing questionnaire: "I will be able to have background about the topic because some topics I might never hear or read about them." Furthermore, many participants agreed that the source materials helped them to think of ideas and arguments to use in their essays. This was evidenced by the fact that more than 75% of the participants answered that the article and videos helped them to think of ideas for their writing. Many of them also brought up this issue in the open-ended questions and in the semi-structured interviews. ID03 wrote in the post-writing questionnaire: "in the second exam I had more material to base on. This way I could give more effective arguments." Another participant said in the semi-structured interview that she preferred "the second one [the PBTW exam] because there were the articles to see" and the videos, "so I thought it was much easier than the first." Research has suggested that topic familiarity may affect test takers' performance (Ellis & Yuan, 2004; Kellogg, 1988; Worden, 2009). When students who are unfamiliar with a topic are given texts to read and/or videos to watch, they can learn the necessary background information that they need to perform well. These source materials can also help them to write essays with higher quality content, as the results of this study suggest, because of the ideas that they can borrow from the videos and reading.

Another important component of the PBTW exam was planning time. The participants had ten minutes to plan their essays. While some participants complained that ten minutes is not enough to plan an essay, the majority of the participants thought that allowing them to plan their essays before they began writing was a good idea. ID26 wrote in her response to an open-ended question in the post-writing questionnaire: "The longer timed writing was easier because I had more time to think about the subject and plan my essay." The same theme emerged in the semi-structured interviews, with one participant explaining that planning is important: "I always hard to study but when I have to write, when I have planning, I can write quickly. The first topic, gun control, we have less time to plan." As mentioned before, providing learners with planning time may affect the overall quality of their writing, as many studies have found (Ellis & Yuan, 2004; Kellogg, 1988; Worden, 2009).

Group discussions are a huge part of communicative ESL classes and other non-ESL academic classes in the United States.
Group work is so important for academic success that the Common Core Standards say the following about speaking and listening: "To build a foundation for college and career readiness, students must have ample opportunities to take part in a variety of rich, structured conversations as part of a whole class, in small groups, and with a partner" (National Governors Association, 2010). Even though Shi (1998) did not find significant differences between the group who participated in pre-writing discussion and the group who did not, group discussions allow students to exchange ideas, to formulate their own ideas and opinions, and to become aware of opposing ideas, all of which can be beneficial to writing. More than 81% of the participants answered that the group discussion helped them to think of ideas to write in their essays. ID62 wrote in one answer to an open-ended question that "after a discussion I will have more ideas on writing." Another participant commented on the lack of discussion time in the TW exam, in which test takers simply have to "throw your ideas on the paper," adding that more time and more discussion would have helped. Similarly, in the study of a process-based writing placement exam mentioned above, Lee (2006) discovered that many students found the ten-minute group discussion extremely useful. They mentioned things very similar to the participants in this study, such as the benefit of being exposed to new ideas and brainstorming new ideas to include in their writing. The same results were found in my study comparing PBTW and TW exams (David, under review) mentioned above: seventy percent of the participants preferred the PBTW exam, and six of the forty participants mentioned that they especially liked the group discussion because it gave them new ideas and allowed them even more time to brainstorm ideas. The results of this study and of these previous studies suggest that students clearly believe that group discussions and planning time have a positive impact on their writing. Group discussions also mirror what happens in communicative ESL classes and other non-ESL academic classes in American colleges and universities. Program and test administrators might be concerned that if students discuss their ideas before they write, they might borrow ideas from their classmates, so that their writing might not reflect exactly their own ideas. However, such borrowing is part of the normal social aspect of writing, as described by Prior (1998). Weigle (2002), for example, argued that the lack of a discussion component makes a test less authentic because discussing a topic before writing is a normal part of most academic classes. Writing is a process in which ideas are generated and not just transcribed as writers think through and organize their thoughts, and discussing ideas in groups is one of the ways learners can generate ideas.

The PBTW exam was not without drawbacks, however. Some participants complained that the videos were too difficult and that the people in the videos spoke too fast. Thirty-one percent of the participants answered that the videos were difficult in the post-writing questionnaire, and, when explaining why they did not use ideas from the videos or article, eight of the participants wrote about their difficulty understanding the source materials. Others mentioned that the videos were hard to understand when they discussed what was difficult about the PBTW exam in the post-writing questionnaire. Moreover, eight of the eighteen participants who were interviewed brought up the difficulty of the videos. These results are not surprising. In their study of integrated TOEFL tasks (www.toefl.org), Cumming et al. (2005) found that the test takers used the reading passage much more often than the video lectures.
They hypothesized that this was due to the fact that the learners had access to the reading passage while they were writing, but they could not go back to the video lectures and had to rely solely on their memory. Because reading is self-paced and learners can go back to sentences they cannot comprehend, reading may be an easier skill than listening. Some participants even mentioned that they did not use the ideas in the videos because they could not remember them or had not taken notes. Another negative aspect of the PBTW exam that the participants mentioned was their difficulty integrating sources, as discussed above.

Most of the participants who were interviewed thought that obesity was a less complex topic than gun control. Three of the four participants were from Science-related majors, such as Biology and Biomedicine, and therefore might have been more familiar with a health-related topic such as obesity than with gun control. However, even students from other majors, such as Accounting or Engineering, also agreed that gun control was more complex. Perhaps the issue was more related to cultural backgrounds than to majors. Controversial topics such as gun control are heavily influenced by culture. Gun control is a topic that is much discussed in the United States, but not so much in other countries where gun laws have been in place for a long time. In Saudi Arabia and Brazil, for example, guns are banned, so there is not much controversy on the subject. Obesity is more of a worldwide problem, and people might have more background information about it than about gun control. What was interesting was that the students who took the PBTW exam on gun control and the TW exam on obesity and the students who took the PBTW exam on obesity and the TW exam on gun control both agreed that gun control was more difficult, even after they received background information about the topic. This could be further evidence that the participants found gun control more complex because of their cultural backgrounds.

As mentioned above, Rea-Dickins (1997) acknowledged that stakeholders' perceptions of exams are difficult to investigate and to put into practice. However, when the results are this clear, and when other studies support these same findings (David, under review; Lee, 2006), it does not seem too difficult to make use of their perceptions. If students complain that they do not have enough time to plan their essays in the TW exam or even in the PBTW exam, and if they believe that planning is an important step in the writing process, which is the case for the participants in this study, why not give them time to plan their writing in writing exams? Perhaps it might even be useful to include pre-writing techniques in the exam so that the test takers can choose the technique they will use; that way, the test takers will be exposed to different pre-writing techniques. Another perception that is simple to address is the fact that the test takers valued and made use of source materials. If the test designers or teachers have to account for time constraints, instead of providing students with one article and two videos, they can give students one shorter article or show one short video. They can even ask the test takers to read or watch the source materials at home, before the exam. When students take exams in non-ESL academic classes, this is exactly what they do: they study for the exam by reading their textbooks and lecture notes. Group discussions are also not challenging to include in a writing exam.
A short ten- or five-minute group discussion could get students thinking about the topic and exchanging ideas, and this could even be done one class period before the exam. Finally, the participants in this study said that spending more time on task was beneficial to their writing process, another element of the PBTW exam that is not impossible to achieve if test designers and teachers include source materials, group discussions, and planning time in writing exams, all of which allow learners to engage more with the exam.

The students are not the only ones whose perceptions matter when it comes to writing exams; teachers' perceptions of the exams used in their writing courses are also important (Cumming, Grant, Mulcahy-Ernt & Powers, 2004). Although few studies have investigated teachers' perceptions of writing exams, Weigle (2002) noted that teachers are worried about test usefulness. They are concerned, for example, about whether the test being used in their class accurately tells them if their students have reached the goals of the course, how the results of the test will help students to become better writers, and whether students are interested in the prompts created for their writing exams. If the goal in an academic writing class is to determine whether students see writing as a process, whether they can apply this process to writing, and whether they can integrate sources in their writing (all of which many writing experts agree are academic writing skills, as seen above), then the impromptu TW exam may not be telling the teacher whether the students have reached the goals of the class. In this particular study, the participants' overall scores did not differ significantly in the TW and PBTW exams. However, their preference for the PBTW exam was evident based on their responses in the post-writing questionnaire and in the semi-structured interviews. There is no doubt that the participants value the writing process through which the PBTW exam allows them to go. They may value this process because of its higher content validity. In other words, the process-based writing exam mirrors more closely what students do in their ESL academic writing classes. It may seem unfair to students that their teachers (including myself, as I was the teacher of more than thirty of the participants) encourage them to see writing as a process and to engage in the writing process in each assignment that they must complete, but do not allow them to go through this process during the two timed-writing exams that they have to take. Finally, the process that the students go through in the PBTW exam may be the reason why the participants in this study scored higher for content in the PBTW exam.

It is important to acknowledge that my role and identity as teacher to some of the participants and as compatriot to all of the Brazilian participants may have played a part in the way that they interacted with me and responded to my questions in the post-writing questionnaire, but especially in the semi-structured interviews. Richards (2003) warned qualitative researchers that identity can have an effect on how an interview unfolds. My ESL 220 students might have been more inclined to say what they thought I wanted to hear, as opposed to how they really felt; interviewees shaping their answers to what they believe the researcher expects is a pattern commonly noted in qualitative research, and its effects can be noticeable. I did not share my feelings about the two exams with my students, but that does not mean they might not have been sensitive to them. In addition, the data were collected at the beginning of the semester, and the students were going to continue classes with me for at least two or three more months.
It is undeniable that I held a position of more power than they did, which could have contributed to them saying what they thought I wanted to hear. As compatriots, the Brazilian participants, especially the ones who were interviewed in Portuguese (and who were not my students), might have been more inclined to say how they really felt because of our shared experiences as Brazilian students in the United States and because I was not their teacher. I was an outsider with shared cultural beliefs, experiences, and background. Indeed, I found that one particular group of Brazilian participants who were interviewed together gave lengthier responses than most participants. Furthermore, even though we did not know each other before data collection, all of the Brazilian participants who were not my students asked me questions about my personal life, my life in the United States, my plans for the future, and so on, before and after the interviews. This might be evidence that they were more comfortable talking to me than other participants were. Other participants from countries other than Brazil, including the ones who were my students, did not behave the same way. The Brazilians who were not my students might have seen me as somewhat equal to them because they knew that I was a student at the university and they were students as well. This may have allowed them to share more of their opinions. Finally, the students in this study seemed much more interested in the PBTW exam, and test takers' interest in an exam is related to their motivations for writing, which in turn could potentially have an effect on their performance (Weigle, 2002).

5.4 Rubric design and use

One important issue that I did not originally plan on investigating was rubric design and use. However, because the raters constantly mentioned the rubric during the norming and training sessions and during the interviews, it is difficult to ignore it. The raters who participated in this study complained that the rubric did not allow them to assign many different scores, even though the scores can range from zero to three. Not one of the test takers in this study received a zero for any category, which means that in practice the rubric only ranged from one to three; as one rater put it, "I ended up with a lot of stuff in the two category." For these reasons, the rubric effectively gave raters only a narrow band of possible scores to work with. The rubric for the Georgia State Test of English Proficiency (GSTEP), described in Weigle (2004), is an example of a rubric with multiple levels of scores: raters can give scores that range from one to ten. The same is true of the TOEFL writing scores, which can go from zero to five (see www.toefl.org for more information).

Another problem that the raters mentioned about the rubric was that it lacked clear descriptors. One rater explained: "And I had trouble with the wording, like some or frequent, like, we decided on kind of numbers for spelling, but grammar" remained difficult to pin down. Consider, for example, the descriptors for organization: 1) Very little organization of content. Underlying structure not sufficiently controlled; 2) Some organizational skills in evidence, but not adequately controlled; and 3) Overall shape and internal pattern clear. Organizational skills adequately controlled. Words such as "little" and "some" are very vague and do not give raters a clear picture of what is expected of the test takers. The rubric for the GSTEP includes more specific information about what each score means; one descriptor, for example, specifies that "connections between and within paragraphs are made through effective and varied" devices (Weigle, 2004, p. 50). The lack of range and of clear descriptors in the rubric used in this study might have influenced inter- and intra-rater reliability.

It is difficult to ignore that RM seemed to play a much more dominant role during the training and norming sessions when compared to RK.
RK was much quieter and many times simply agreed with RM without elaborating much on her own opinion. Perhaps RM's dominant role led RK to share less of her opinions or to agree with RM to avoid confrontation. It was impossible, however, to have separate training and norming sessions, because the objective of these sessions is to increase rater reliability and agreement. RK did, however, share more of her opinions during the interview, which was conducted separately. Her perceptions of the rubric and the exam during the interview seemed to match what she said during the training and norming sessions.

The inter-rater reliability coefficient for both exams was somewhat high. Carr (2011) recommended an alpha of .800 or higher as acceptable for high stakes exams and an alpha of .700 or higher for low stakes exams. However, the coefficient for the TW exam was higher (.728) when compared to the one for the PBTW exam (.643), which is not acceptable even for low stakes exams, according to Carr. Similarly, the coefficients for intra-rater reliability were higher for the TW exam: RM had coefficients of .352 for the PBTW exam and .862 for the TW exam, and RK had coefficients of .552 for the PBTW exam and .642 for the TW exam. Weigle (2004), however, found the opposite: the integrated task in her study proved to generate more reliable scores than the independent task. Nevertheless, she might have obtained these results because the raters could only assign two scores: pass or fail. The TW exam may also have been simpler to score because the raters did not have to deal with source integration.
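For readers who want to see how coefficients of this kind are obtained, the sketch below computes a Cronbach's alpha in which each rater is treated as an "item" and each essay as a case. This mirrors one common way of estimating inter-rater reliability; the study's coefficients were computed in SPSS, and the scores below are illustrative placeholders.

    # Cronbach's alpha with raters as "items" and essays as cases.
    import numpy as np

    def cronbach_alpha(ratings):
        # ratings: 2-D array, rows = essays, columns = raters.
        ratings = np.asarray(ratings, dtype=float)
        k = ratings.shape[1]                          # number of raters
        item_var = ratings.var(axis=0, ddof=1).sum()  # sum of per-rater variances
        total_var = ratings.sum(axis=1).var(ddof=1)   # variance of summed scores
        return (k / (k - 1)) * (1 - item_var / total_var)

    rm = [2.0, 2.5, 1.5, 3.0, 2.0, 2.5]  # hypothetical scores from rater RM
    rk = [2.0, 2.0, 1.5, 2.5, 2.5, 3.0]  # hypothetical scores from rater RK
    print(round(cronbach_alpha(np.column_stack([rm, rk])), 3))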
Although the rubric did not contain any categories for source integration, it seemed that the raters attended to it anyway. RM said, for example, "I do think that for the PBTW exams people who were able to integrate information probably did better in terms of content," and she complained about essays that offered "like, a random fact, and no discussion of it." These comments point to RM's attention to the test takers' inability to integrate sources, which could have affected the way she scored the essays and could explain the lower inter- and intra-rater reliability coefficients for the PBTW exam. The raters' attention to how test takers integrated sources could have affected the way that they scored the PBTW essays. Perhaps the raters were trying to compensate for the lack of a source-integration category by assigning lower scores to other categories in the rubric, such as content or organization. RM herself mentioned that the students who could integrate sources successfully scored higher in content, which could be evidence that the ones who could not received lower scores for content, even though the descriptors in the rubric did not say anything about source integration. In the training session, the two raters and I discussed how the raters should attend to source integration, and we agreed that test takers who could not integrate sources, or who chose not to use any of the source materials, should not be penalized. As the raters' comments suggest, however, this did not seem to be the case.

It was quite surprising that the inter-rater reliability coefficient for grammar for the TW exam was extremely low (.012). Other studies have found low reliability coefficients for grammar as well. Winke (2013) found a reliability coefficient of .49 for grammar in the context of group oral exams. She argued that this coefficient is low and that it is not an effect of the test, but a problem that the raters had when using the rubric. The raters in her study reported having difficulty focusing on grammar while also having to focus on fluency, vocabulary, and overall communication skills. They also reported not thinking that grammar was of particular importance to the tasks that the learners performed. The researcher suggested that eliminating the grammar category might be a solution. Although her recommendations are for oral exams, this may be a good idea for writing exams as well, especially because professors who teach non-ESL academic classes may care less about students' grammar mistakes than about their ability to demonstrate content knowledge. Furthermore, the interlanguage of a learner may develop slowly. It can take time for learners to improve grammatical accuracy, and many learners may even plateau. As Gass and Selinker (2008) explained, sometimes even if learners are frequently exposed to the L2, their interlanguages may still plateau, which further complicates the development of interlanguage grammars and the discussion of how important it is to evaluate grammar in academic tasks.

Grammar was not the only category with a low reliability coefficient in this study. The intra-rater coefficient for organization for the TW exam for RM was .000, and the coefficient for spelling for RK for the PBTW exam was .167. It is difficult to understand the reason for the low spelling coefficient, especially because the raters and I established limits for spelling mistakes that fit each of the three possible scores: we decided that if the participants made fewer than five spelling mistakes, they would receive a 3; if they made more than five mistakes, they would receive a 2; and if they made a spelling mistake in every other sentence, they would receive a 1. RK herself seemed to think that rating spelling was not difficult. RM mentioned in one of the norming sessions that she had difficulty distinguishing organization from cohesion, so we had a discussion with RK and decided that organization was global and cohesion was local. However, it seems that even after we agreed on a more specific definition of the two categories, RM still had difficulty assigning a score for organization. Both raters have had experience scoring essays, but RK has had more experience scoring listening tasks. When scoring such tasks, the rater does not have to pay attention to spelling, and this may be one reason why the intra-rater reliability coefficient for spelling was so low.

Perhaps the main reason why the intra-rater reliability coefficients were not very satisfactory was that we should have spent more time training and norming. We trained and normed for over three hours, but that may not have been sufficient time for the raters to become familiar with the rubric. Studies about rater reliability have reported training sessions that lasted approximately half a day, considerably longer than the two training and norming sessions in this study combined. The other issue was that there were only two training and norming sessions, but the raters took approximately six weeks to score all 182 essays, which could have affected their consistency in rating the essays. Indeed, Lunz and Stahl (1990), Lumley and McNamara (1995), and Congdon and McQueen (2000) found that rater severity can change from one rating occasion to another. Weigle (2002) suggested that raters be given a set of essays for calibration at the beginning of each rating session when rating occurs over the course of more than one day. Perhaps with a longer training session, more regular norming sessions, and daily essays for calibration, the intra-rater reliability might have been higher.

5.5 Hurdles of implementing PBTW exams

Implementing PBTW exams in ESL programs is not an easy task, and that is why many programs still use the impromptu TW exam to assess students' writing and monitor their achievement. The first obstacle in the implementation of PBTW exams is designing the test itself.
To design a PBTW exam, one must not only select a topic and write a prompt, but also find readings and/or videos that give test takers background information about the topic and ideas that they can use in their writing. In addition, the readings and videos have to be level- and age-appropriate and short enough to be read or watched in a limited time frame. If the test is to be similar to the one used in this study, with a brief discussion at the beginning to activate background knowledge, the test designer must also create discussion questions. If one wishes to administer two exams per semester, as the English Language Center does, this entire process has to be done twice. To make things even more complicated, the test designer would probably have to create multiple, equated exam forms so that he or she has different forms of the exam to choose from for each administration and can ensure that students do not know which form of the exam and which prompt they will answer. This is a way to avoid plagiarism and the memorization of entire essays, a practice that He and Shi (2008) found was common among the 16 Chinese students that they interviewed in their study about the Test of Written English (see www.toefl.org for more information on the test). In addition to all of these steps, the test designer has to go through all sorts of recommended procedures for creating an assessment tool, such as writing test specifications, pre-testing prompts, choosing a rubric, and so on (Weigle, 2002).

Designing the test is not the only problem in the implementation of the PBTW exam. The other hurdle is scoring the exam and interpreting students' scores, as Carr (2011) explained. He noted that, with tests that integrate reading and writing, when a student does well, that probably means that he or she is a good reader and a good writer. However, what happens when he or she does poorly? Does he or she have poor writing skills, poor reading skills, or both? In the case of this particular PBTW exam, which requires the participants to know how to use pre-writing techniques and how to integrate sources, and to have good listening and note-taking skills, the issue is even more complex than Carr described. Cumming (2013) responded to this kind of concern by noting that worrying about the effect of other skills on test takers' performance on the main task (writing) only makes sense when one assumes that writing is an independent skill. Studies on reading-to-write tasks suggest that reading and writing are extremely interrelated. In fact, it is impossible to see any of the four language skills as an isolated skill. When learners do a listening activity in class, they may take notes to remember the details or main points of the listening passage, combining listening and writing into one task. When learners are speaking to their classmates or teacher, they also have to use their listening skills to understand them and respond appropriately. When students are writing a paper, they may read their paper twice or more to revise and edit, or they may do research and read about the subject of their paper, using their reading skills in addition to their writing skills. In their regular, non-ESL, academic classes, students read book chapters and articles and then take exams in which they most likely have to answer questions in writing about the content of the textbooks. Moreover, most writing assignments that professors require their students to do combine reading and writing skills, such as writing a summary, a critique, or a research paper.
In short, as Cumming (2013) explained, students' ability to integrate information in their writing is exactly the kind of skill that ESL academic writing courses need to measure. One further concern is whether test takers' proficiency level is high enough that they will actually be able to read and listen to the source materials and understand them, which, according to Cumming (2013), is crucial for students to succeed in process-based writing tasks. The fact that the students in this particular study did not perform differently overall in the TW and PBTW exams may reflect this challenge. Perhaps the students in this study have not reached a proficiency level that allows them to successfully read and listen to source materials and successfully integrate them in their writing. In fact, some students brought up this exact issue in the post-writing questionnaire and semi-structured interviews. Thirty-one percent of the participants answered that the videos were difficult in the post-writing questionnaire, and others mentioned this same issue in the semi-structured interviews. As I already mentioned, Cumming et al. (2005) found that test takers use reading passages more often than listening passages because they can refer back to the article, but not to the video. Perhaps test designers should choose reading passages over listening passages when creating integrated tasks. Almost 61% of the participants mentioned that they had difficulty integrating sources in their writing, and three of the eighteen participants who were interviewed also had something to say about their difficulty with source integration. Furthermore, both raters voiced their concerns about this issue in their interviews, as mentioned before. RM, in particular, described the test takers' inability to integrate information, and she mentioned that even native speakers of English have problems incorporating sources. The students who participated in this study had all been taking 220 or 221 for less than a month at the time of data collection. One reason why they might not have done well integrating sources in their writing could be that the teacher had not yet covered source integration in their classes. Had the data been collected at the end of the semester, the results might have been different.

Ensuring that all test forms are equivalent in terms of difficulty level is a clear challenge for creating process-based and integrated writing tasks, as Weigle (2004) noted. She warned test developers that it is difficult to always choose reading passages of the same linguistic difficulty. Perhaps one way to avoid choosing reading passages of different proficiency levels is to compare the passages to a corpus to make sure that the passages are similar in terms of the number of words and word families that they contain; RANGE (Nation, 2005) would be a good tool for that purpose, and a simplified version of this kind of comparison is sketched below. Another issue with designing integrated tasks that Weigle (2004) mentioned is choosing topics that are equivalent. A solution to this, according to the author, is to write detailed test specifications that can be easily followed.
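The sketch below illustrates, in simplified form, the kind of profile RANGE produces. It only checks single word forms against one frequency list rather than full word families, and the file names are hypothetical: a plain-text wordlist (one word per line) and the two candidate reading passages.

    # A simplified, RANGE-style profile of candidate reading passages:
    # token count, type count, and coverage by a high-frequency wordlist.
    import re

    def tokens(path):
        with open(path, encoding="utf-8") as f:
            return re.findall(r"[a-z']+", f.read().lower())

    def profile(path, wordlist):
        toks = tokens(path)
        covered = sum(1 for t in toks if t in wordlist)
        return {"tokens": len(toks), "types": len(set(toks)),
                "coverage": round(covered / len(toks), 3)}

    # Hypothetical files: a first-1,000-words list and the two passages.
    with open("first_1000_words.txt", encoding="utf-8") as f:
        high_freq = {line.strip().lower() for line in f if line.strip()}

    for passage in ["gun_control_reading.txt", "obesity_reading.txt"]:
        print(passage, profile(passage, high_freq))

Two passages with similar token counts, type counts, and high-frequency coverage would, under this rough proxy, be closer in lexical difficulty.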
Despite these hurdles, test takers' preference for PBTW exams and the fact that such exams have more content validity when it comes to the goals of academic writing classes are alone good reasons to implement them for progress and achievement purposes. Regarding content validity, both raters agreed that PBTW exams are a valuable tool to evaluate writing when course objectives include source integration. RM, for example, said: "I actually like the idea of the PBTW exam because I think - I think it has the potential to measure kind of higher level synthesis writing skills in a way that the just regular one" does not, adding that it is a "better task because it more naturally mimics the type of stuff you have to do because you have to pull information from a variety of different sources." One of the raters also explained that "if that is one of the main objectives of that class, to be able to do that [integrate sources], then I would say that it is an advantage to see whether they can," concluding that the choice comes down to whether source integration was taught beforehand, "and if so, then I guess that would be a preferred format." Many test designers would agree (McNamara & Roever, 2006; Weigle, 2002). If an ESL academic writing course teaches students that writing is a process that includes reading, discussing, planning, and so on, it is only fair that the exams used to check progress during that course assess these same skills. During the semi-structured interviews with the eighteen test takers, I asked which exam they thought was more similar to the things that their instructors were doing in their 220 or 221 classes. All of the participants agreed that the PBTW exam reflected what they did in class more than the TW exam did. Other reasons include the fact that the participants in this study and in other studies used more sophisticated vocabulary and wrote longer essays in the PBTW exam, not to mention that they scored higher for content.

CHAPTER 6: CONCLUSION

The findings of this study revealed many interesting aspects of the PBTW exam and its apparent advantages over the impromptu TW exam, especially in the case of achievement testing after students have taken academic writing classes. The participants in this study wrote significantly longer essays and used significantly more sophisticated words in the PBTW exam. They also used significantly more word types and significantly more different types of nouns in the PBTW exam. The high level of engagement with the exam and the addition of source materials may have contributed to these results. The participants most likely borrowed ideas and words from the reading passage and the videos, which could have resulted in longer essays with a more sophisticated and varied lexicon. In addition, they received higher scores for content in the process-based exam, which again could be explained by the source materials. The videos and article may have contributed background information and ideas that the test takers could use in their writing, which may explain the superior content of their essays. Another finding of this study was that the participants thought that the PBTW exam was easier, and they clearly displayed their preference for this type of exam over the impromptu TW exam. Reasons for this preference included the fact that they could use the ideas in the source materials and discussion as background information and supporting points for their arguments, as well as the extra time that they were given for planning. The raters also displayed their preference for the process-based writing exam when the goal of the course is to teach students to integrate sources. The process-based exam has more content validity for evaluating the construct of academic writing described above, which includes planning, research, and source integration, among other elements. However, both the participants and the raters mentioned the difficulty that the test takers had integrating sources in their essays.
Source integration is a complex skill that requires practice, but at the same time it is an important skill to measure in academic writing classes. The raters expressed their dislike for the rubric. They believed that the rubric lacked range and specific descriptors for each category and level. While the scale technically went from 0 to 3, in practice the rubric only allowed raters to assign scores that ranged from 1 to 3, and the descriptors for each category included vague wording. The inter-rater reliability coefficients for each of the categories in the analytic rubric were not very high, with one exception: the coefficient for spelling was higher than .700 for both exams. The coefficients ranged from .400 to .595 for content, organization, and cohesion, and from .120 to .352 for punctuation, grammar, and vocabulary. The coefficient for grammar was particularly low (.012). The low rater reliability coefficients may be a result of the few hours that the raters and I spent training and norming with the rubric, as well as the lack of calibration before each scoring session. RM had a considerably higher intra-rater reliability coefficient for the TW exam (.862) than for the PBTW exam (.352), and RK showed the same trend (.552 for the PBTW and .642 for the TW exam). The raters' unfamiliarity and lack of experience rating process-based exams may explain the lower reliability coefficients for the PBTW exam. Another reason for the lower reliability could be the raters' attention to source integration: although the rubric did not include a category for source integration, it seemed as though the raters attended to it when scoring. Finally, the scores that the participants received in the TW and PBTW exams correlated only moderately, suggesting that the exams measure different constructs. The PBTW exam measures, in addition to writing, reading, listening, and note-taking skills, as well as the ability to integrate sources, all of which are important academic writing skills that students need to succeed in their classes.

6.1 Pedagogical implications

The main pedagogical implication of this study is that the process-based exam, when compared to impromptu timed-writing exams, might be a better fit for ESL academic writing programs that teach students planning strategies, synthesis, critical thinking, and source integration, because this exam supports and validates these skills, unlike impromptu timed-writing exams. It is extremely important to consider content validity when designing a test (Carr, 2011; McNamara & Roever, 2006; Weigle, 2002). A test, whether its purpose is placement or achievement, should measure the curriculum upon which it is based (Carr, 2011). If the purpose of the writing exam is to check student achievement, and the goal of the course is to prepare students for regular university classes, then there is no doubt that the process-based writing exam has more content validity, because impromptu timed-writing exams do not allow learners to use skills such as planning, synthesis, and source integration, all of which are skills that writing experts, many ESL textbooks, and research seem to support as crucial for university students.

Two other findings of this study that bear important pedagogical implications are the test takers' and the raters' perceptions of process-based writing exams. The participants seemed to value the process through which the PBTW exam allows them to go. Many participants mentioned that they liked having planning time and learning about the topic through the source materials. Test takers' opinions should be taken into consideration when designing tests because, as one of the main stakeholders, test takers may be more invested and motivated when they like the test that they are taking.
Although the raters seemed to have difficulties scoring the PBTW exams, and although the rater reliability was lower for these exams, the raters also agreed that process-based writing exams have more content validity for ESL academic writing classes. Test designers should, however, ensure that they select rubrics with a wide range of possible scores and precise descriptors for each score level and category in order to obtain high rater reliability. In addition, test designers should apply rigorous rater training and norming, and provide sample essays for calibration.

Finally, one important aspect of classroom assessment is the issue of washback. If impromptu timed-writing exams continue to be administered in academic ESL writing classes, teachers will continue to teach students strategies to take such exams and spend class time training students for timed writing instead of focusing on preparing students to succeed in regular academic classes. Considering washback is important because teachers tend to teach to the test. If process-based exams are implemented, teachers will focus more on teaching writing as a process and on teaching students planning strategies, synthesis, and source integration, among other skills that are part of the academic writing construct discussed above.

6.2 Limitations

The present study has limitations, including the population of the study and the sample size. As with any other study, the population in this study is not exactly the same as the population in all other ESL academic writing classes in the United States. The great majority of the participants were either from Brazil or China. Other ESL programs may have a different student population whose performance on, or perceptions of, the TW and PBTW exams may not have been the same. Chinese students are used to preparing for and taking impromptu timed-writing exams because of the university entrance exam. The Brazilian participants, on the other hand, are quite used to doing integrated reading and writing tasks, as they spend three years of high school preparing for the reading-to-write task that they take in the university entrance exams. This in turn may have affected the results. Although there were no significant differences in how the Brazilians performed in the two exams, there seemed to be a trend that indicated higher scores for the PBTW exam. The opposite was the case for the Chinese participants: there seemed to be a trend of higher scores for the TW exam, but no significant differences were found.

The participants' number of years of formal instruction in English varied considerably, from seven months to seven years. In addition, their length of residence in the United States also varied greatly, from four months to three years. Although the participants took a placement exam and were placed in academic writing classes, their proficiency levels may have varied markedly because of the differences in years of formal instruction and in length of residence. Unfortunately, I was unable to obtain an independent proficiency measure to confirm that the participants were at a similar proficiency level. Even if I had asked the participants for their TOEFL scores, some participants had taken the exam months prior to data collection, and their scores would not have been an accurate proficiency measure.

Another limitation of this study was the sample size. Although the present study had more participants than some of the studies reported in the literature review, such as Ellis and Yuan (2004) and David (under review), eighty-one participants is still not enough to make broad generalizations.
Some aspects of the TW and PBTW exams were difficult to control and could have affected the results. During one of the semi-structured interviews, I learned that some of the participants who were taking listening and speaking classes had been discussing the issue of gun control in their classes. They most likely had much more knowledge of the topic than the students who were not taking the listening and speaking course, and, as discussed above, topic familiarity can affect writing performance.

Larson-Hall (2009) and other statisticians have suggested using the Bonferroni adjustment when performing multiple t tests, and that is what I did with my data. However, some statisticians argue that using the Bonferroni adjustment might be too conservative and instead suggest other types of corrections (Herrington, 2002). If that is indeed the case, then some of the results may have been different, which is another limitation of this study. O'Keefe (2003) warned that the Bonferroni adjustment reduces statistical power dramatically and that researchers do not apply such alpha level corrections very consistently. He concluded that alpha level corrections should not be considered when researchers perform multiple t tests on the same set of data. However, he did suggest the use of alpha level adjustments if the results of the study are used to make important decisions. He explained that if that is the case, researchers might decide to use alpha level adjustments to decrease the probability of the results occurring by chance. If decisions have to be made about implementing a process-based writing exam in an ESL academic writing course, perhaps a lower alpha level might be justifiable (an illustrative sketch of the adjustment appears at the end of this section).

The rubric was a limitation of this study, a fact which is evident from the raters' comments about the rubric and the low intra-rater reliability coefficients. A rubric with more range and clearer descriptors would have been more appropriate for this study. There may not have been enough norming sessions with the raters, which is another possible explanation for the low intra-rater reliability. The raters would also have benefitted from reading some essays for the purpose of calibration before each scoring session. Moreover, it was impossible to ensure that the raters did not know which test type (TW or PBTW exam) they were scoring, because the test takers used ideas from the videos and article and cited them, as they were instructed to do. Knowing which test type they were scoring could have affected how the raters scored each exam. It was clear, for example, that the raters were attending to source integration even though the rubric did not include any category for how the test takers used sources. One example of this was RM's comment about the test takers' lack of ability to integrate sources and about how distracting it was when they could not integrate sources well. Perhaps RM was compensating for the fact that the rubric did not include source integration by punishing the test takers in other categories, such as content or organization. If that was indeed the case, then some test takers might have received higher scores for content or organization had they been able to integrate the source materials successfully.

Finally, it is a common practice to use a third rater when the scores that two raters assign are more than 1 or 2 points apart. However, instead of using a third rater, I decided to average the scores, because only 4% of the scores assigned for each category differed by more than 2 points. Had I asked a third rater to score the essays whose scores differed by more than 2 points, the results might have been different.
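As flagged above, here is a minimal sketch of the Bonferroni logic discussed in this section, written in Python rather than SPSS. The score vectors and the number of comparisons are invented for illustration; they are not the study's data.

# Illustrative sketch only: a paired t test evaluated against a
# Bonferroni-adjusted alpha. All numbers below are invented.
from scipy.stats import ttest_rel

tw_totals   = [14, 12, 15, 11, 13, 16, 12, 14, 10, 15]  # hypothetical TW scores
pbtw_totals = [15, 13, 15, 13, 14, 17, 12, 15, 12, 16]  # hypothetical PBTW scores

m = 7                      # assumed number of t tests, e.g., one per rubric category
alpha_adjusted = 0.05 / m  # Bonferroni: familywise .05 becomes .05/7, about .0071

t_stat, p_value = ttest_rel(tw_totals, pbtw_totals)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, adjusted alpha = {alpha_adjusted:.4f}")
print("significant at the adjusted level" if p_value < alpha_adjusted
      else "not significant at the adjusted level")

As O'Keefe's (2003) critique suggests, the choice of m drives the adjustment: the more comparisons one corrects for, the harder it becomes for any single test to reach significance.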
6.3 Future research

Integrated tasks and process-based exams offer many promising research areas because there is still much that researchers do not know about them. One area of research that deserves attention is how learners at different proficiency levels perform in process-based exams. Many researchers claim that learners have to be more advanced in order to manage the complexities of such tasks (e.g., Cumming, 2013; Gebril & Plakans, 2013; Johnson et al., 2012). However, to my knowledge, no one to date has investigated this issue with process-based exams. One study that investigated how learners of different levels perform in independent and integrated tasks was Cumming et al.'s (2005), but they mainly investigated how learners of different proficiency levels deal with source materials. Gebril and Plakans (2013), for example, investigated how students of different levels differed in terms of fluency, syntactic and lexical complexity, grammatical accuracy, and source use when doing integrated tasks. In a study of TOEFL integrated writing tasks (www.toefl.org), Sawaki et al. (2013) teased apart three skills that differentiate learners below and above the level of proficiency required for university admission. However, these studies did not compare how lower-level and higher-level students perform in impromptu TW exams with how they perform in integrated tasks. In addition, the tasks used in these studies did not include group discussions or planning time.

More research is also needed to gather information about test takers' perceptions of the exams that they are required to take, as they are important stakeholders who often spend months preparing for a test and whose futures depend on how well they perform in them. Indeed, the ILTA Code of Ethics (2000) mandated such reflection and research because testers have a responsibility to understand the consequences of their tests for all stakeholders (see pages 6 and 7). Test developers may need to understand whether integrated tasks and process-based exams differentially affect students from different linguistic and cultural backgrounds. Do students from different cultures and native languages perform differently when they take integrated tests or PBTW exams? Brazilian students, for example, are most likely used to reading-to-write tasks because of the university entrance exams. The reading-to-write task in the Brazilian university entrance exams carries much weight, and students spend three years of high school preparing for it. It would be interesting to investigate the processes that L2 writers from different backgrounds go through when they take process-based writing exams and how they view and tackle each task involved in the exam.

Teachers are also important stakeholders for some tests, especially classroom-based assessments. They are often taken for granted and not included in the test design process, yet their perceptions of tests oftentimes affect the way they teach and what they teach in their classrooms.

6.4 Summary

Process-based writing exams may be a better way to evaluate academic writing when compared to impromptu TW exams because process-based writing may better match the construct of academic writing than TW exams do. But implementing process-based writing exams might not be easy: they take more time to administer.
On the other hand, process-based writing exams encourage students to view writing as a process, not a product, because the exams provide learners with the opportunity to discuss and learn about the topic, as well as additional time to plan their writing. Process-based exams may also help students write essays with higher-quality content and more sophisticated vocabulary because the test takers can use ideas from the readings, videos, and discussions, and they can use vocabulary from the source materials. In addition, process-based writing exams may have more content validity for academic writing classes in which source integration plays a big role. Impromptu TW exams may not allow students to demonstrate the skills acquired through academic writing classes. In such classes students learn planning, synthesis, source integration, and so on, all of which are skills that should be part of the construct of academic writing. Another reason why process-based exams should be used in place of impromptu TW exams is the fact that students enjoy the process through which they go when they take such exams. They value the discussion, source materials, and planning time, and they seem to believe that these contribute to their success in the writing task.

When implementing process-based exams, however, test designers must give careful consideration to the rubric that will be used for rating. Test designers must use rubrics that have a wide range of possible scores and clear descriptors for each score band and category. Raters must be highly trained, and a robust process of double rating must be implemented (McNamara & Roever, 2006, p. 27). Such measures will help control the effects that the complex, process-based task components or the rater may have on scores.

APPENDICES

APPENDIX A: Videos

Videos for obesity prompt:
http://abcnews.go.com/Health/video/large-sugary-drink-ban-passes-new-york-city-17227911
http://abcnews.go.com/Nightline/video/mcdonalds-calorie-counts-nyc-big-soda-ban-17232674

Videos for gun control prompt:
http://www.youtube.com/watch?v=vtAAI4xnmzE
http://abcnews.go.com/WNT/video/aurora-colorado-shooting-gun-control-laws-16829309

APPENDIX B: Reading passages

Article for obesity prompt:
http://www.nytimes.com/2011/07/24/opinion/sunday/24bittman.html

Article for gun control prompt:
http://www.nytimes.com/2013/12/15/opinion/sunday/kristof-the-killer-who-supports-gun-control.html

APPENDIX C: Rubric

Rubric (Weir, 1990)

A. Relevance and adequacy of content
0. The answer bears almost no relation to the task set. Totally inadequate answer.
1. Answer of limited relevance to the task set. Possibly major gaps in treatment of topic and/or pointless repetition.
2. For the most part answers the tasks set, though there may be some gaps or redundant information.
3. Relevant and adequate answer to the task set.

B. Compositional organization
0. No apparent organization of content.
1. Very little organization of content. Underlying structure not sufficiently controlled.
2. Some organizational skills in evidence, but not adequately controlled.
3. Overall shape and internal pattern clear. Organizational skills adequately controlled.

C. Cohesion
0. Cohesion almost totally absent. Writing so fragmentary that comprehension of the intended communication is virtually impossible.
1. Unsatisfactory cohesion may cause difficulty in comprehension of most of the intended communication.
2. For the most part satisfactory cohesion, although occasional deficiencies may mean that certain parts of the communication are not always effective.
3. Satisfactory use of cohesion resulting in effective communication.

D. Adequacy of vocabulary for purpose
0. Vocabulary inadequate even for the most basic parts of the intended communication.
1. Frequent inadequacies in vocabulary for the task. Perhaps frequent lexical inappropriacies and/or repetition.
2. Some inadequacies in vocabulary for the task. Perhaps some lexical inappropriacies and/or circumlocution.
3. Almost no inadequacies in vocabulary for the task. Only rare inappropriacies and/or circumlocution.

E. Grammar
0. Almost all grammatical patterns inaccurate.
1. Frequent grammatical inaccuracies.
2. Some grammatical inaccuracies.
3. Almost no grammatical inaccuracies.

F. Mechanical accuracy I (punctuation)
0. Ignorance of conventions of punctuation.
1. Low standard of accuracy in punctuation.
2. Some inaccuracies in punctuation.
3. Almost no inaccuracies in punctuation.

G. Mechanical accuracy II (spelling)
0. Almost all spelling inaccurate.
1. Low standard of accuracy in spelling.
2. Some inaccuracies in spelling.
3. Almost no inaccuracies in spelling.

APPENDIX D: Post-writing questionnaire

Post-writing Questionnaire    Participant ID _____

For this research project, you took two different types of timed writing exams. One was shorter and required you to write an essay in 45 minutes. The other was longer and required you to watch two short videos, read one article, discuss the topic with your classmates, plan your essay, and then write it in 45 minutes. Please answer the questions about the two exams.

1. Which exam do you think was easier? (circle one)
a) The shorter timed writing exam
b) The longer timed writing exam with the videos, lectures and discussion
c) They were equally easy/difficult

2. Why did you think one exam was easier than the other? If you think they were equally easy/difficult, skip this question.

3. What did you think about the videos that you watched? You can choose more than one answer for this question.
a) They were easy to understand
b) They were difficult to understand
c) They helped me think of ideas for the essay
d) They did not help me think of ideas for the essay

4. What did you think about the article that you read? You can choose more than one answer for this question.
a) It was easy to read
b) It was difficult to read
c) It helped me think of ideas for the essay
d) It did not help me think of ideas for the essay

5. What did you think about the group discussion?
a) It helped me think of ideas for the essay
b) It did not help me think of ideas for the essay
c) It was not related to what I wrote in my essay

6. Did you use the videos to help you support your ideas in the essay?
a) Yes
b) No

7. Did you use the article to help you support your ideas in the essay?
a) Yes
b) No

8. Did you use the ideas that you discussed in your group in the essay?
a) Yes
b) No

9. If you did not use the videos, article or ideas in the group discussion, why did you choose not to do so?

10. What was difficult and/or easy about doing the shorter timed writing exam?

11. What was difficult and/or easy about doing the longer timed writing exam with the videos, article and discussion?

12. Which exam did you prefer taking?
a) The shorter timed writing exam with two topics
b) The longer timed writing exam with the videos, lectures and discussion

13. Why did you prefer taking that exam?
APPENDIX E: Semi-structured interview questions

Questions for the students:
1. Tell me about your experience with the shorter exam. What about the longer exam?
- Prompts for this question include: What did you think of the time limit? What did you think of the topics? What did you think of the videos and article? What did you think of the group discussion? What did you think of the time for planning?
2. What did you like about the two exams?
3. What did you dislike about the two exams?
4. Did you think you did better at one exam than the other? Which exam? Why?
5. What are some of the problems that you face when taking a timed writing exam?
6. If you could choose a way to be evaluated for your writing skills, what evaluation method would you choose?
7. How can you prepare for taking a timed writing exam?
8. What can you learn from taking timed writing exams?

Questions for the raters:
1. What is your overall impression of the two exams?
2. What are the advantages and disadvantages of each exam?
3. What was it like to rate the essays for the two exams?
4. Were there any difficulties scoring the exams?
5. As both raters and ESL teachers, how representative of what you do in the classroom are the two exams?

APPENDIX F: Guidelines for clauses

Guidelines for Clauses (Polio, 1997)
a. A clause equals an overt subject and a finite verb. The following are only one clause each: He left the house and drove away. He wanted John to leave the house.
b. Only an imperative does not require a subject to be considered a clause.
c. In a sentence that has a subject with only an auxiliary verb, do not count that subject and verb as a separate clause (e.g., John likes to ski and Mary does too; John likes to ski, [...]).

Error Guidelines
a. [...]
b. [...] clauses or after prepositional phrases. Comma errors related to restrictive/non-restrictive relative clauses should be counted. Extraneous commas should also be considered errors.
c. Base tense/reference errors on preceding discourse; do not look at the sentence in isolation.
d. [...] as plural).
e. Be lenient about article errors from translations of proper nouns.
f. [...]
g. Count errors that could be made by native speakers (e.g., between you and I).
h. Do not count register errors related to lexical choices (e.g., lots, kids).
i. Disregard an unfinished sentence at the end of the essay.

REFERENCES

Abdel Latif, M. M. (2013). What do we mean by writing fluency and how can it be validly measured? Applied Linguistics, 34(1), 99-105.

Ai, H., & Lu, X. (2013). A corpus-based comparison of syntactic complexity in NNS and NS university students' writing. Automatic Treatment and Analysis of Learner Corpus Data, 59.

Armstrong, K. M. (2010). Fluency, accuracy, and complexity in graded and ungraded writing. Foreign Language Annals, 43(4), 690-702.

Baralt, M. (2012). Coding qualitative data. In A. Mackey & S. M. Gass (Eds.), Research methods in second language acquisition: A practical guide (pp. 95-116). Chichester, UK: John Wiley & Sons, Ltd.

Baralt, M., Gilabert, R., & Robinson, P. (2014). Task sequencing and instructed second language learning. London: Bloomsbury.

Barkaoui, K. (2010). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54-74.

Bereiter, C., & Scardamalia, M. (1987). The psychology of written composition. Hillsdale, NJ: L. Erlbaum Associates.

Carr, N. T. (2011). Designing and analyzing language tests. Oxford: Oxford University Press.
Chaudron, C., & Parker, K. (1990). Discourse markedness and structural markedness: The acquisition of English noun phrases. Studies in Second Language Acquisition, 12, 43-64.

Cho, Y. (2001). Examining a process-oriented writing assessment in a large-scale ESL testing context. (Unpublished doctoral dissertation). University of Illinois at Urbana-Champaign.

Chung, T. (2003). A corpus-comparison approach for term extraction. Terminology, 9(2), 221-246.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.

Congdon, P. J., & McQueen, J. (2000). The stability of rater severity in large-scale assessment programs. Journal of Educational Measurement, 37(2), 163-178.

Connor-Linton, J., & Polio, C. (2014). Comparing perspectives on L2 writing: Multiple analyses of a common corpus: Introduction. Journal of Second Language Writing, 23, 1-9.

Cooper, A., & Bikowski, D. (2007). Writing at the graduate level: What tasks do professors actually require? Journal of English for Academic Purposes, 6(3), 206-221.

Creswell, J. W., & Plano Clark, V. L. (2011). Designing and conducting mixed methods research. Thousand Oaks, CA: Sage.

Cumming, A. (2013). Assessing integrated writing tasks for academic purposes: Promises and perils. Language Assessment Quarterly, 10(1), 1-8.

Cumming, A., Grant, L., Mulcahy-Ernt, P., & Powers, D. E. (2004). A teacher-verification study of speaking and writing prototype tasks for a new TOEFL. Language Testing, 2, 107-145.

Cumming, A., Kantor, R., Baba, K., Erdosy, U., Eouanzoui, K., & James, M. (2005). Differences in written discourse in independent and integrated prototype tasks for next generation TOEFL. Assessing Writing, 10(1), 5-43.

David, V. (under review). A comparison of two methods of assessing L2 writing: Process-based and impromptu timed writing exams. Manuscript submitted for publication.

Deane, P., Odendahl, N., Quinlan, T., Fowles, M., Welsh, C., & Bivens-Tatum, J. (2008). Cognitive models of writing: Writing proficiency as a complex integrated skill. ETS Research Report Series, 2008(2), i-36.

Delaney, Y. A. (2008). Investigating the reading-to-write construct. Journal of English for Academic Purposes, 7, 140-150.

Dollahite, N. E., & Haun, J. (2007). Sourcework: Academic writing from sources. Thomson/Heinle.

Duff, P. (2012). How to carry out case study research. In A. Mackey & S. M. Gass (Eds.), Research methods in second language acquisition: A practical guide (pp. 95-116). Chichester, UK: John Wiley & Sons, Ltd.

Ellis, R., & Yuan, F. (2004). The effects of planning on fluency, complexity, and accuracy in second language narrative writing. Studies in Second Language Acquisition, 26(1), 59-84.

Engber, C. A. (1995). The relationship of lexical proficiency to the quality of ESL compositions. Journal of Second Language Writing, 4, 139-155.

Esmaeili, H. (2002). Integrated reading and writing tasks and ESL students' reading and writing performance in an English language test. The Canadian Modern Language Review, 58, 599-622.

Ferris, D. (2009). Teaching college writing to diverse student populations. Ann Arbor: University of Michigan Press.

Field, A. (2009). Discovering statistics using SPSS. Thousand Oaks, CA: Sage Publications.

[...] Research in the Teaching of English, 15(3), 229-243.

Fritz, E., & Ruegg, R. (2013). Rater sensitivity to lexical accuracy, sophistication and range when assessing writing. Assessing Writing, 18(2), 173-181.

Gass, S. M., & Selinker, L. (2008). Second language acquisition: An introductory course (3rd ed.). New York: Routledge/Taylor and Francis Group.
Gebril, A. (2010). Bringing reading-to-write and writing-only assessment tasks together: A generalizability analysis. Assessing Writing, 15(2), 100-117.

Gebril, A., & Plakans, L. (2013). Toward a transparent construct of reading-to-write tasks: The interface between discourse features and proficiency. Language Assessment Quarterly, 10(1), 9-27.

Grabe, W., & Kaplan, R. B. (1996). Theory and practice of writing: An applied linguistic perspective. New York; London: Longman.

Guo, L., Crossley, S. A., & McNamara, D. S. (2013). Predicting human judgements of essay quality in both integrated and independent second language writing samples: A comparison study. Assessing Writing, 18, 218-238.

Hale, G., Taylor, C., Bridgeman, B., Carson, J., Kroll, B., & Kantor, R. (1996). A study of writing tasks assigned in academic degree programs. Princeton, NJ: Educational Testing Service.

Harley, B., & King, M. L. (1989). Verb lexis in the written compositions of young L2 learners. Studies in Second Language Acquisition, 11, 415-440.

Hayes, J. R. (1996). A new framework for understanding cognition and affect in writing. In C. M. Levy & S. Ransdell (Eds.), The science of writing: Theories, methods, individual differences, and applications (pp. 1-27).

Hayes, J. R., & Flower, L. (1980). Identifying the organization of writing processes. In L. W. Gregg & E. R. Steinberg (Eds.), Cognitive processes in writing (pp. 31-50). Hillsdale, NJ: Lawrence Erlbaum Associates.

[...] writing tests. Assessing Writing, 13(2), 130-149.

He, L., & Shi, L. (2012). Topical knowledge and ESL writing. Language Testing, 29(3), 443-464.

Heatley, A., Nation, I. S. P., & Coxhead, A. (2002). RANGE and FREQUENCY programs. http://www.vuw.ac.nz/lals/staff/Paul_Nation

Herrington, R. (2002). Controlling the false discovery rate in multiple hypothesis testing. Retrieved from www.unt.edu/benchmarks/archives/2002/april02.rss.htm. Research and Statistical Support website, University of North Texas, Denton.

Horowitz, D. M. (1986). What professors actually require: Academic tasks for the ESL classroom. TESOL Quarterly, 20(3), 445-462.

Howell, D. C. (2002). Statistical methods for psychology. Pacific Grove, CA: Duxbury/Thomson Learning.

Hughes, A. (2003). Testing for language teachers. Cambridge; New York: Cambridge University Press.

Hunt, K. W. (1965). Grammatical structures written at three grade levels. NCTE Research Report No. 3.

Hyltenstam, K. (1988). Lexical characteristics of near-native second-language learners of Swedish. Journal of Multilingual and Multicultural Development, 9, 67-84.

ILTA. (2000). ILTA code of ethics. Available at http://www.iltaonline.com/images/pdfs/ilta_code.pdf

Jackson, D. O., & Suethanapornkul, S. (2013). The cognition hypothesis: A synthesis and meta-analysis of research on second language task complexity. Language Learning, 63(2), 330-367.

Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL composition: A practical approach. Rowley, MA: Newbury House.

Johnson, M. D., Mercado, L., & Acevedo, A. (2012). The effect of planning sub-processes on L2 writing fluency, grammatical complexity, and lexical complexity. Journal of Second Language Writing, 21(3), 264-282.

Kellogg, R. T. (1988). Attentional overload and writing performance: Effects of rough draft and outline strategies. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(2), 355-365.
Kormos, J. (2011). Task complexity and linguistic and discourse features of narrative writing performance. Journal of Second Language Writing, 20(2), 148-161.

Knoch, U. (2009). Diagnostic assessment of writing: A comparison of two rating scales. Language Testing, 26(2), 275-304.

Kuiken, F., Mos, M., & Vedder, I. (2005). Cognitive task complexity and second language writing performance. EUROSLA Yearbook, 5(1), 195-222.

Kuiken, F., & Vedder, I. (2008). Cognitive task complexity and written output in Italian and French as a foreign language. Journal of Second Language Writing, 17(1), 48-60.

Larson-Hall, J. (2009). A guide to doing statistics in second language research using SPSS. Routledge.

Laufer, B. (1994). The lexical profile of second language writing: Does it change over time? RELC Journal, 25, 21-33.

Lee, Y. J. (2006). The process-oriented ESL writing assessment: Promises and challenges. Journal of Second Language Writing, 15(4), 307-330.

Lee, I., & Coniam, D. (2013). Introducing assessment for learning for EFL writing in an assessment of learning examination-driven system in Hong Kong. Journal of Second Language Writing, 22(1), 34-50.

Leki, I. (1991). A new approach to advanced ESL placement testing. Writing Program Administration, 14(3), 53-68.

Linnarud, M. (1986). Lexis in composition: A performance analysis of Swedish learners' written English. Lund, Sweden: CWK Gleerup.

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474-496.

Lu, X. (2011). A corpus-based evaluation of syntactic complexity measures as indices of college-level ESL writers' language development. TESOL Quarterly, 45(1), 36-62.

Lu, X. (2012). The relationship of lexical richness to the quality of ESL learners' oral narratives. Modern Language Journal, 96(2), 190-208.

Lumley, T., & McNamara, T. F. (1993). Rater characteristics and rater bias: Implications for training.

Lunz, M. E., & Stahl, J. A. (1990). Judge consistency and severity across grading periods. Evaluation & the Health Professions, 13(4), 425-444.

Malvern, D. D., & Richards, B. J. (1997). A new measure of lexical diversity. In A. Ryan & A. Wray (Eds.), Evolving models of language (pp. 58-71). Clevedon: Multilingual Matters.

McCarthy, P. M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD) (Unpublished doctoral dissertation). University of Memphis, Memphis, TN.

McNamara, D. S., Louwerse, M. M., Cai, Z., & Graesser, A. (2005, January 1). Coh-Metrix version 1.4.

McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Malden, MA: Blackwell Publishing.

Morse, J. M. (1991). Approaches to qualitative-quantitative methodological triangulation. Nursing Research, 40, 120-123.

Nation, I. S. P. (2005). Range and frequency: Programs for Windows based PCs [Computer software and manual]. Retrieved from http://www.victoria.ac.nz/lals/staff/paul-nation/nation.aspx

National Governors Association Center for Best Practices & Council of Chief State School Officers. (2010). Common Core State Standards. Washington, DC: Authors.

O'Keefe, D. J. (2003). Colloquy: Should familywise alpha be adjusted? Against familywise alpha adjustment. Human Communication Research, 29(3), 431-447.

Ong, J., & Zhang, L. J. (2013). Effects of task complexity on the fluency and lexical complexity of EFL students' argumentative writing. Journal of Second Language Writing, 19(4), 218-233.

Plakans, L. (2008). Comparing composing processes in writing-only and reading-to-write test tasks. Assessing Writing, 13(2), 111-129.
Polio, C. (1997). Measures of linguistic accuracy in second language writing research. Language Learning, 47, 101-143.

Polio, C., & Shea, M. (2014). Another look at accuracy in second language writing development. Journal of Second Language Writing, 23, 10-27.

Polio, C., & Yoon, H. J. (2014). A longitudinal study of written language development in two genres. Second Language Writing Symposium, Arizona State University, Tempe, AZ. November 2014.

Prior, P. (1998). Writing/disciplinarity: A sociohistoric account of literate activity in the academy. Mahwah, NJ: Lawrence Erlbaum.

Powers, D. E., & Fowles, M. E. (1999). Test-takers' judgments of essay prompts: Perceptions and performance. Educational Assessment, 6(1), 3-22.

Rea-Dickins, P. (1997). So, why do we need relationships with stakeholders in language testing? A view from the UK. Language Testing, 14(3), 304-314.

Rezaei, A. R., & Lovorn, M. (2010). Reliability and validity of rubrics for assessment through writing. Assessing Writing, 15(1), 18-39.

Richards, K. (2003). Qualitative inquiry in TESOL. Basingstoke: Palgrave Macmillan.

Robinson, P. (2001). Task complexity, task difficulty, and task production: Exploring interactions in a componential framework. Applied Linguistics, 22(1), 27-57.

Sawaki, Y., Quinlan, T., & Lee, Y. W. (2013). Understanding learner strengths and weaknesses: Assessing performance on an integrated writing task. Language Assessment Quarterly, 10(1), 73-95.

Shi, L. (1998). Effects of prewriting discussions on adult ESL students' compositions. Journal of Second Language Writing, 7(3), 319-345.

Tedick, D. J. (1990). ESL writing assessment: Subject-matter knowledge and its impact on performance. English for Specific Purposes, 9(2), 123-143.

Way, P., Joiner, E. G., & Seaman, M. (2000). Writing in the secondary foreign language classroom: The effects of prompts and tasks on novice learners of French. Modern Language Journal, 84(2), 171-184.

Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287.

Weigle, S. C. (2002). Assessing writing. Cambridge: Cambridge University Press.

Weigle, S. C. (2004). Integrating reading and writing in a competency test for non-native speakers of English. Assessing Writing, 9, 27-55.

Weir, C. J. (1990). Communicative language testing. NJ: Prentice Hall Regents.

Wigglesworth, G., & Storch, N. (2009). Pair versus individual writing: Effects on fluency, complexity and accuracy. Language Testing, 26(3), 445-466.

Winfield, F. E., & Barnes-Felfeli, P. (1982). The effects of familiar and unfamiliar cultural context on foreign language composition. The Modern Language Journal, 66(4), 373-378.

Winke, P. (2013). The effectiveness of interactive group orals for placement testing. In K. McDonough & A. Mackey (Eds.), Second language interaction in diverse educational contexts (pp. 247-268). John Benjamins Publishing Company.

Winke, P., & Lim, H. (2015). ESL essay raters' cognitive processes in applying the Jacobs et al. rubric: An eye-movement study. Assessing Writing, 25(2), 37-53.

Wolfe-Quintero, K., Inagaki, S., & Kim, H. Y. (1998). Second language development in writing: Measures of fluency, accuracy, and complexity (Report No. 17). Honolulu: University of Hawai'i, Second Language Teaching and Curriculum Center.

Worden, D. L. (2009). Finding process in product: Prewriting and revision in timed essay responses. Assessing Writing, 14(3), 157-177.
Yigitoglu, N. (2008). A pathway between academic and ESL classes: Academic tasks and their potential impact on teaching and testing writing. (Unpublished master's thesis). Michigan State University.