EFFECTIVE PLANNING IN REAL-TIME SPEAKING TEST TASKS

By

Shinhye Lee

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Second Language Studies—Doctor of Philosophy

2018

ABSTRACT

EFFECTIVE PLANNING IN REAL-TIME SPEAKING TEST TASKS

By

Shinhye Lee

In this dissertation, I documented the effectiveness of a particular test-taking condition in current task-based performance testing (e.g., TOEFL iBT, IELTS, OPI): planning time, the time given to test takers to plan their responses before actually performing. In assessment contexts, the construct of planning addresses both language and test-related theories; in terms of the latter, it is associated with the components constituting test qualities (test validity, authenticity, and fairness; Wigglesworth & Elder, 2010). This is because planning time is already a critical test accommodation, or task implementation condition, as it is termed in the task-based research paradigm. Indeed, researchers and test developers suggest planning time as a major component in determining task difficulty, in that varying planning conditions can appropriately represent items' different cognitive-demand levels (Norris, 2009; Robinson, 2001). Therefore, I explored whether test takers' efficient use of varying planning times is contingent upon test-task characteristics in the context of the TOEFL iBT Speaking test, in which varying degrees of task conditions, such as planning, and test-task types co-exist.

Ninety-nine Korean university students took three speaking tests, each consisting of one independent task (an impromptu task) and two integrated tasks (a reading-listening and a listening-only task). As in operational testing, independent tasks were given 15 seconds of planning time, while the two integrated tasks were given 30 and 20 seconds, respectively. For each test set, participants performed under a specific planning condition: Unguided planning (planning without specific instructions), Guided planning with writing (planning with instructions to write while planning), and Guided planning with silent thinking (planning with instructions to think or outline silently while planning). After each test, test takers took part in a series of surveys and interviews to reflect on the appropriateness of each task's planning time. Subsequently, I applied multiple methods (quantitative and qualitative) to the collected data. Three independent raters scored a total of 891 speech samples according to the TOEFL iBT speaking rubric. Two trained coders coded the speech samples for the three discourse quality measures pertaining to complexity, accuracy, and fluency. I thematically analyzed survey and interview responses through NVivo.

Findings indicated that participants' performance and perceptions were driven by test-task characteristics regardless of the planning activities they made use of. Their test performance and speech quality varied from independent tasks to integrated tasks: they generally scored lower and demonstrated a slower speech rate, more lexical errors, and simplified language on independent tasks. In addition, test takers believed that extended planning time was unnecessary for integrated tasks, for the reading and listening sources were readily applicable to actual responses; yet 15 seconds did not suffice for them to familiarize themselves with the given prompt in the independent tasks.
I discuss the study results and conclude the dissertation by making connections between speech-production planning theories in task-based research (Robinson, 2001; Skehan, 1998) and planning time practices in second language assessment (Elder & Iwashita, 2005; Wigglesworth & Elder, 2010).

Copyright by
SHINHYE LEE
2018

ACKNOWLEDGMENTS

Completion of this dissertation would not have been possible if it were not for the tremendous support of a number of people. I would like to take this opportunity to extend my thanks and appreciation to them. First of all, I want to express my sincerest gratitude to my advisor and dissertation chair, Dr. Paula Winke, who took me under her wing from the very beginning to the end of my Ph.D. journey in the Second Language Studies program. I am grateful for her support and encouragement, which helped me believe in myself and try out more. Thanks to her, I was able to grasp a number of exciting opportunities during my Ph.D. studies that immensely helped me shape my career path early on. I give my deepest thanks to my dissertation committee members, Drs. Susan Gass, Daniel Reed, and Koen Van Gorp, for their critical and insightful comments from the earlier phases of my dissertation project to the final manuscript. Special thanks go to Dr. Patti Spinner for devoting her time to support my job search this past academic year. I am also grateful to my advisor at Ewha Womans University, Dr. Sang-Keun Shin, for his continuous support, from my master's studies in Korea until this day. Special thanks also go to Hima Rawal, Joshua Smith, Chad Bousley, Chris Bartoluzzi, Laura Bowman, Erin Degerman, Brandon Jung, and Aaron Ohlrogge, for putting in a substantial amount of time and effort to help me with coding, rating, and organizing my dataset. If it were not for their help, I would not have been able to complete any part of the main data analysis for my dissertation. I want to deeply thank my friends and colleagues at Michigan State University: Myeongeun Son, Jongbong Lee, Xiaowan Zhang, Melody Ma, and Michael Wang. They have demonstrated true friendship and have stood by me in both good and bad times. I feel very lucky to call them my friends and cannot wait to see them thrive as the academics that they hope to become. Last but not least, I am forever grateful to my Mom and Dad, Sun-Rye Lim and Yoon-Jae Lee, for their unconditional love and patience. I want to congratulate them on finally witnessing their daughter graduate and start a career, after all these seemingly endless years of schooling.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
INTRODUCTION
CHAPTER 1: Literature Review
1.1 Task-based research and theoretical underpinning
1.1.1 Modeling and researching speech production and the effects of pre-task planning
1.1.2 Modeling and researching task-based oral test performance: Skehan's (1998) expanded model on Kenyon-McNamara (McNamara, 1995)
1.2 Effects of task condition and task characteristics: Planning conditions and task features
1.2.1 Measuring the effects of planning and task characteristics
1.2.1.1 Conventional Task-based Research on planning and task characteristics: Complexity, Fluency, and Accuracy
1.2.1.2 Conventional Task-based Research on planning and task characteristics: Competition of Performance Constructs
1.2.1.3 Language assessment literature on planning: Towards a triangulated approach
1.2.2 Research studies on the effects of planning time and planning conditions
1.2.2.1 Length of planning time
1.2.2.2 Planning activities
1.2.3 Task characteristics mediating the effects of planning
1.3 The study
CHAPTER 2. THE CURRENT STUDY
2.1 Research questions and study variables
2.2 Methods
2.2.1 Participants
2.2.2 Materials
2.2.2.1 Background questionnaire
2.2.2.2 Elicited imitation task
2.2.2.3 Test tasks
2.2.2.4 Post questionnaire
2.2.2.5 Interview
2.2.3 Study Design
2.2.4 Procedure
2.3 Measures
2.3.1 Fluency measures
2.3.2 Complexity measures
2.3.3 Accuracy measures
2.4 Analysis
2.4.1 Research question 1: Does the type of planning (guided versus unguided) affect the test scores of test candidates? If yes, what are the influences of the different task types?
2.4.1.1 Subjective ratings of the Elicited Imitation task responses
2.4.1.2 Subjective ratings on spoken responses
2.4.1.3 Subjective ratings on language use of spoken responses
2.4.1.4 Statistical analysis
2.4.2 Research question 2: Does the type of planning (guided versus unguided) affect the discourse quality of test candidates? If yes, what are the influences of the different task types?
2.4.2.1 Transcription of spoken responses
2.4.2.2 Coding spoken responses according to CAF measures
2.4.2.3 Subjective ratings on Iwashita and Elder's (2005) CAF rubric
2.4.2.4 Statistical analysis
2.4.3 Research question 3: How do test candidates use their planning times?
2.4.4 Research question 4: How do test candidates perceive the given planning times?
CHAPTER 3: RESULTS
3.1 Research question 1: Speaking test scores
3.1.1 Descriptive statistics for the speaking test scores
3.1.2 FACETS analysis
3.1.2.1 Test Set A
3.1.2.2 Test Set B
3.1.2.3 Test Set C
3.1.3 Follow-up repeated measures ANOVA
3.2 Research question 2: Speech quality
3.2.1 Inter-coder reliability
3.2.2 Factor analysis
3.2.3 Descriptive statistics and comparison analysis
3.2.3.1 Fluency
3.2.3.1.1 Speed fluency
3.2.3.1.2 Breakdown fluency
3.2.3.1.3 Repair fluency
3.2.3.2 Accuracy
3.2.3.3 Complexity
3.2.3.3.1 Syntactic complexity
3.2.3.3.2 Lexical diversity
3.2.4 Relationship between test scores and speech quality
3.3 Research Question 3: Test-takers' survey responses
3.3.1 Confidence in performance
3.3.2 Appropriateness of planning time
3.3.3 Effectiveness of types of planning
3.3.4 Perceptual differences by planning and task type
3.4 Research question 4: Test-takers' interview responses
3.4.1 Interview question 1: Which type of planning was helpful for you when responding?
3.4.2 Interview question 2: Were the times allotted for planning sufficient to you?
CHAPTER 4: DISCUSSION
4.1 Research question 1
4.2 Research question 2
4.3 Research question 3
4.4 Research question 4
CHAPTER 5: CONCLUSION
5.1 Implication
5.2 Limitation and future research
APPENDICES
Appendix A Language learning and test-taking background questionnaire (in English)
Appendix B Elicited imitation task
Appendix C Test tasks
Appendix D Post questionnaire
Appendix E One-on-one interview questions
Appendix F Scoring rubric for speaking tasks
Appendix G Elder and Iwashita's (2005) rating scales on fluency, accuracy, and complexity
Appendix H Basic descriptive statistics for the raw coding data
REFERENCES

LIST OF TABLES

Table 1 Research design
Table 2 Speed fluency measures
Table 3 Breakdown fluency measures
Table 4 Repair fluency measures
Table 5 Complexity measures
Table 6 Accuracy measures
Table 7 Descriptive statistics for the EI test ratings given by rater 1 and rater 2
Table 8 Descriptive statistics of the EI test according to the proficiency sub-groups
Table 9 Partially-balanced, incomplete rating block design
Table 10 Six facets specified for the MFRM analyses
Table 11 Transcription codes used in the present study
Table 12 Descriptive statistics for participants' speaking test scores
Table 13 Model-fit statistics summary for Test Sets A, B, and C
Table 14 Summary statistics of planning condition for Test Set A
Table 15 Summary statistics of test-task types for Test Set A
Table 16 Summary statistics of planning conditions for Test Set B
Table 17 Summary statistics of test-task types for Test Set B
Table 18 Summary statistics of planning conditions for Test Set C
Table 19 Summary statistics of test-task types for Test Set C
Table 20 Summary statistics of pairwise comparisons for Test Sets A, B, and C
Table 21 Intra-class correlation coefficients for fluency measures
Table 22 Intra-class correlation coefficients for accuracy measures
Table 23 Intra-class correlation coefficients for complexity measures
Table 24 Intra-class correlation coefficients for CAF ratings
Table 25 Factor analysis for IT-L task under UG planning condition
Table 26 Descriptive statistics for speed fluency by planning conditions and test tasks
Table 27 Descriptive statistics for breakdown fluency by planning conditions and test tasks
Table 28 Descriptive statistics for repair fluency by planning conditions and test tasks
Table 29 Descriptive statistics for accuracy by planning conditions and test tasks
Table 30 Descriptive statistics for syntactic complexity by planning conditions and test tasks
Table 31 Descriptive statistics for lexical diversity by planning conditions and test tasks
Table 32 Summary of GEE statistics for fluency measures
Table 33 Summary of GEE statistics for accuracy measures
Table 34 Summary of GEE statistics for complexity measures
Table 35 Frequency of comments on helpful planning conditions
Table 36 Properties of test-task characteristics in relations to planning time

LIST OF FIGURES

Figure 1 Skehan's (1998, 2001) expanded model of oral test performance
Figure 2 Boxplots representing the distribution of the test scores from Test Set A by test tasks
Figure 3 Boxplots representing the distribution of the test scores from Test Set B by test tasks
Figure 4 Boxplots representing the distribution of the test scores from Test Set C by test tasks
Figure 5 Wright map for Test Set A
Figure 6 Wright map for Test Set B
Figure 7 Wright map for Test Set C
Figure 8 Mean speaking test scores by test-task type and planning conditions
Figure 9 Speed fluency measures
Figure 10 Breakdown fluency measures
Figure 11 Accuracy measures
Figure 12 Interaction effect on lexical error per 100 words
Figure 13 Syntactic complexity measures
Figure 14 Interaction effect on subordinate clauses
Figure 15 Lexical diversity measures
Figure 16 Interaction effect on sentence linking devices
Figure 17 Confidence ratings for IP task
Figure 18 Confidence ratings for IT-RL task
Figure 19 Confidence ratings for IT-L task
Figure 20 Appropriateness of planning time in IP task
Figure 21 Appropriateness of planning time in IT-RL task
Figure 22 Appropriateness of planning time in IT-L task
Figure 23 Effectiveness of planning for IP task
Figure 24 Effectiveness of planning for IT-RL task
Figure 25 Effectiveness of planning for IT-L task
Figure 26 Frequency of responses: Sufficiency and lack of planning time across test tasks
Figure 27 Reasons given for the sufficiency of planning time in integrated tasks
Figure 28 Reasons given for the lack of planning time in IP tasks

INTRODUCTION

In performance-based testing, L2 oral test performance refers to one's production of language on a series of test tasks. As opposed to discrete-point item types, on these assessments test takers are expected to act upon a given test task (Davies, Brown, Elder, Hill, Lumley, & McNamara, 1999); that is, they are to provide oral responses on the basis of how they process and undertake the presented test tasks. Within such contexts, test developers wish to have test takers perform on a variety of test tasks that, when taken as a whole, broadly represent and tap into the target language ability, or the test construct (Bachman & Palmer, 1996). More precisely, test tasks are designed to closely resemble the identified language outcomes, yet they differ from one another in the conditions of task implementation, so that each test task uniquely reflects an array of real-world counterparts (Elder, Iwashita, & McNamara, 2002). For instance, the TOEFL iBT Speaking section takes a format in which test takers advance through varied task types, each comprising different parameters of task conditions. Test takers may first take an impromptu test task, then make their way to subsequent task types, which require integrating language skills other than speaking for task completion. Yet taken altogether, the different task types are designed to make up a body of academic speaking, which is the core construct of the TOEFL iBT Speaking test.

The elements differentiating each test task are held to have theoretical underpinnings, drawn particularly from task-based research (Robinson, 2001; Skehan, 1998). The premise of this line of work is that certain task characteristics and performance conditions add to the complexity and/or the difficulty of task types. For instance, whether speakers are given the option of planning their subsequent responses in advance constitutes a major component of the inherent complexity of tasks. The consensus in the field is that the provision of planning lessens one's online processing load during real-time production and leads to better performance (e.g., Crookes, 1989; Ellis, 2005, 2009).
The option of planning may also contribute to how speakers perceive certain tasks to be more difficult than others (Robinson, 2001); tasks that include planning may nurture higher confidence or motivation for task completion amongst speakers (Tavakoli & Skehan, 2005). When applying this thinking to performance-based, task-based language testing settings, further understanding of the different factors underlying task complexity and difficulty can illuminate both (a) test-takers' levels of test performance and (b) their perceptions toward the corresponding test-task conditions. As a consequence, it would contribute to informed decisions in selecting a suitable range of tasks for assessing oral language and reveal how well those tasks meet test developers' hunches in task design (Elder et al., 2002).

In this dissertation, I apply insights from task-based research to explore the test-task characteristics of the TOEFL iBT Speaking test. Particularly, I aim to examine the quantitative and qualitative differences in test performance between the two TOEFL iBT task types, namely, independent and integrated tasks, in light of their inherent planning conditions. I specifically attend to the planning conditions in the TOEFL iBT oral tasks because their differentiated employment across test-task types is based on ad hoc decisions (Elder & Iwashita, 2005), which lack clear articulation of (1) the link between a precise planning condition and a test-task type (Butler, Eignor, McNamara, Jones, & Suomi, 2000) and (2) how the theoretical underpinnings of speech production (which have guided task-based research) might shed light on precise test-task design (Elder et al., 2002; Mislevy, Steinberg, & Almond, 2003; Wigglesworth & Frost, 2017). Unanswered questions regarding the TOEFL iBT task conditions are: (1) Why are the planning times given differently across test-task types? and (2) Would such differences in allotment per test-task type matter in how test takers perform on and react to certain test-task types? Exploring how a specific task condition inherent in test-task types affects subsequent test-takers' performance and perceptions could render insights into the appropriateness of the test-task design and could further inform related test qualities (e.g., construct representativeness, authenticity, and practicality; Elder & Iwashita, 2005; Wigglesworth & Elder, 2010).

In addition, this study was set up to promote greater ecological validity in language testing research on pre-task planning. To date, there has been a lack of contextualized language-testing studies on planning time that adopt the actual planning conditions in language tests (Xi, 2010). The few existing studies follow the steps of laboratory-based SLA studies by providing a longer stretch of planning time that is not plausible in testing settings, thereby improperly reflecting the timed nature of the testing situation (Ellis, 2009). Another parameter of interest is the provision of guidance in planning. Current test directions often give little or no information on how planning should be done, while test publishers require test candidates to use their preparation time as effectively as possible (Elder & Iwashita, 2005). Considering that tests should be biased for the best (Swain, 1984), it is of great interest whether guiding test takers to engage in certain planning activities leads to meaningful test taking.
The central aim of my dissertation, therefore, is to conduct a de facto documentation of the interplay between planning-related variables and test-task types in the TOEFL iBT. Given that planning time is pervasive in real testing, the ultimate goal of this research inquiry is not to argue for the removal of such planning practice; the overall goal is to seek and suggest ways to promote meaningful planning conditions within this particular testing environment.

CHAPTER 1: Literature Review

In this chapter, I first outline the theoretical models of speech production adopted by task-based research and make connections to language testing research on the effects of task characteristics, especially pre-task planning. Then, I present both testing and non-testing research studies on task characteristics that highlight the role of pre-task planning in oral performance. To illuminate the effects of planning-related variables on oral performance, in this sub-section I first delineate how oral performance has been defined and commonly measured in conventional task-based research (e.g., the complexity, accuracy, and fluency dimensions), followed by what needs to be adapted in relevant language testing research. In the subsequent sub-sections, I then introduce research studies on the effectiveness of varied planning parameters on oral performance (which are the core interest of the present study), such as the length of planning time, planning types (pre-task and within-task planning), and speaker-directed planning activities. I also touch on research studies on the TOEFL iBT Speaking test-task types, but only briefly, given the scarcity of relevant research accounts. At the end of the chapter, I provide the overall purpose and rationale of the study by building upon the previous line of research and addressing the gap in the literature.

1.1 Task-based research and theoretical underpinning

1.1.1 Modeling and researching speech production and the effects of pre-task planning

While task-based research on oral performance has drawn from a number of theoretical perspectives, Levelt's (1989) model of speech production has served as the major basis for accounting for the effects of pre-task planning on oral performance (Ellis, 2009). The model specifies a three-step procedure of speech production: (1) conceptualization; (2) formulation; and (3) articulation. In the conceptualizing stage, speakers go through three sub-phases of selecting the information that serves their intended communication goals. They first set their goals for articulation (i.e., macro-planning of speech), which is followed by retrieving the information necessary for meeting those goals (e.g., determining relevant speech acts) and finally making use of the selected information to achieve the set goals (i.e., micro-planning of speech, conceptualizing a pre-verbal message). Based on the general plan of what information they will draw on, speakers go through a formulation stage in which they establish a linguistic representation of the pre-verbal message. At the very least, they retrieve relevant lexical items from their mental lexicon, which serve as the building blocks of further grammatical encoding. Finally, such linguistic information is processed by a phonological encoder, which shapes internal speech (Levelt, 1989), the speakers' internal representation of how the message should be articulated.
In the final articulation stage, speakers transform their internal representation into overt speech. The entire process is very much driven by speakers' self-monitoring practices; Levelt claims that they particularly attend to three sub-processes: (1) matching the internal speech to the identified communication goal, (2) scrutinizing the internal speech before it is articulated, and (3) inspecting the overt speech that has been generated. Yet Levelt also asserted that some sub-components of the model operate automatically, not requiring conscious effort or controlled processing. This particularly pertains to the formulation and articulation of speech; that is, once the blueprint has been established, speakers are able to generate speech largely in an automatic and smooth fashion (Kormos, 2011). Because of this point, SLA researchers made subsequent adaptations to the model to better account for the discrepancy between L1 and L2 speech production (De Bot, 1992; Ellis, 2009; Kormos, 2011). While L2 speakers also go through a similar order of the three stages of speech production (Kormos, 2011), for them, many of the linguistic properties of the L2 are not automatized, or are limited, and thus benefit more from close inspection or monitoring within and across the stages. The advantageous nature of pre-task planning for L2 speech production lies precisely here: it supports the limited resources L2 speakers might have for retrieving linguistic information, forming internal speech, and finally articulating overt speech. However, other accounts (Batstone, 2005; Ellis, 2009) have extended Levelt's model further by incorporating the role played by individual differences in speech production. That is, how learners perform will result not only from their on-going monitoring of speech, but also from how they orient to a task (Batstone, 2005; Tajima, 2003). Ellis (2009) supported this view in explaining why language testing research on planning has demonstrated null effects of planning, which are discrepant from the findings of conventional laboratory-based research. It seems critical, therefore, to inspect the effect of planning from multiple data points, including not only quantitative accounts but also the qualitative orientations of speakers, to better ascertain the effects of pre-task planning.

1.1.2 Modeling and researching task-based oral test performance: Skehan's (1998) expanded model on Kenyon-McNamara (McNamara, 1995)

While language testing research on task characteristics is rooted in task-based research, how different properties of test-tasks influence subsequent test performance has not been precisely streamlined through an organized framework. Skehan (1998), in this sense, proposed a working model that specifically pertains to explaining how test performance differs based on different test-task parameters.

Skehan (1998) devised a framework drawing on the initial Kenyon-McNamara model (McNamara, 1995), which essentially illustrated the intertwined relationships amongst different factors influencing task-based test performance (e.g., the underlying competence of the test candidate, the interlocutor with whom the test candidate speaks, etc.). Skehan built upon the Kenyon-McNamara model as it takes into account that the task functions as the central unit within a testing context. As shown in Figure 1, Skehan added three new factors to the initial model, namely, ability for use (dual-coding), task characteristics, and task conditions.
The premise of including an element such as ability for use (dual-coding) is to account for the way processing is adapted to performance conditions. Skehan claimed that previous models of language competence (e.g., Bachman's strategic competence; Canale and Swain's communicative competence framework) conceptualized language competence as a static, generalized entity. However, real-time performance is in effect the consequence of fluctuating communicative demands, to which test takers adjust by allocating their attentional capacities in appropriate ways. That is, test takers need to make use of different channels of their attentional capacities according to what is demanded by the given test-tasks. Skehan coined the term processing competence for test-takers' ability to handle the different demands imposed by tasks by flexibly making use of the appropriate processing resources available. For instance, the linguistic modes that test takers make use of would differ between a task requiring precise communication or an emphasis on form and a task placing greater emphasis on effective, elaborated communication. In this sense, Skehan subscribed to the competitive nature of performance qualities such as complexity, accuracy, and fluency, in that speakers set different processing goals for their performance on different tasks (see the next section for Skehan's and Robinson's differing accounts of how the three dimensions function within performance). Skehan further made a connection between the different processing goals test takers subscribe to and the scoring decisions reflecting the performance aspects prioritized by raters and rating scales.

Figure 1 Skehan's (1998, 2001) expanded model of oral test performance (components: the candidate's underlying competence and ability for use/dual-coding; task characteristics and task conditions; interactants, including examiners and other candidates; the performance; the rater and scoring criteria; and the resulting score)

Therefore, it becomes vital to understand test-takers' oral test performance in conjunction with the conditions of task implementation (task conditions) and the inherent structures of tasks (task characteristics). Task conditions refer to task-external features that test takers are able to make use of; they describe the manner in which the test-task is done (Skehan, 1998). For instance, in the [+ planning] condition, test takers are able to shape their responses using the planning time given to them, whereas in the [- planning] condition, they are required to make immediate responses. On the other hand, task characteristics are the inherent entities that make up particular test-tasks; these elements are related to the content of test-tasks. For instance, integrated test-tasks place dual or triple demands on language abilities for task completion; test takers need to carry out reading and listening in addition to speaking to extract the information necessary for task completion. Independent tasks, on the other hand, do not provide contextual information that test takers can refer to; hence, test takers do not need to make use of other language skills for task completion.
All in all, Skehan's model addressed two points: (1) the need to interpret test performance in terms of the design of test-tasks (the elements that make up both task conditions, such as the provision of planning, and task characteristics, the intrinsic factors differentiating test tasks) and how this design influences test-takers' real-time processing orientations; and (2) the need to consider both test-tasks and test-takers' processing, which informs the sampling of test-tasks necessary for generalization.

1.2 Effects of task condition and task characteristics: Planning conditions and task features

Before I present research studies on the effects of planning conditions and task characteristics in detail, I introduce the constructs used in task-based research to measure language production. I provide this examination to draw clearer juxtapositions within and across the planning-related variables commonly researched in current scholarship. I also do so to review the appropriate measures and methodologies to be applied in language testing research on planning conditions and task features.

1.2.1 Measuring the effects of planning and task characteristics

1.2.1.1 Conventional Task-based Research on planning and task characteristics: Complexity, Fluency, and Accuracy

Acknowledging the multidimensionality of language production (Tavakoli & Skehan, 2005; Housen & Kuiken, 2009), the effects of planning and task characteristics in task-based research have been comprehensively captured with regard to three aspects of language production, namely, complexity, accuracy, and fluency (henceforth CAF; Ellis, 2005, 2009; Skehan, 1998; Skehan & Foster, 1999). While the three terms have each been elaborated from multiple perspectives in the field, relevant researchers have commonly subscribed to Skehan's (1996) and Skehan and Foster's (1999) operational definitions as the starting point of conceptualization. In addition, by taking up such a three-way distinction, researchers have conceptualized each construct as consisting of a number of inter-related facets that establish the units of data analysis for explaining the effects of planning and task characteristics.

Fluency, as defined by Skehan and Foster (1999), refers to "the capacity to use language in real time, to emphasize meanings, possibly drawing on more lexicalized systems" (p. 96). More precisely, it is a speaker's ability to produce real-time language "without undue pausing or hesitation" (Skehan, 1996, p. 22). In this sense, the notion of fluency encompasses the standpoints of both the speaker and the listener; it is the extent to which a speaker can spontaneously come up with meaningful language units in real-time contexts, as well as his or her ability to fluidly deliver the language, ultimately facilitating the listener's comprehension (e.g., fewer signs of inappropriate and excessive pausing) (Trofimovich & Baker, 2006). Therefore, fluency is commonly broken down into three specific components, namely: (1) speed fluency (e.g., the number of syllables/words produced in a given time); (2) breakdown fluency (e.g., the number of silent pauses or the number of fillers); and (3) repair fluency (e.g., speech phenomena pertaining to repetition, self-corrections, etc.).
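To make these three fluency components concrete, the sketch below shows how ratio-style speed, breakdown, and repair indices might be computed once a response has been transcribed and its pauses timed. It is a minimal illustration under stated assumptions: the filler inventory, the 0.25-second silent-pause threshold, and the per-minute and per-100-words normalizations are illustrative choices, not the exact operationalizations adopted in this study (those are specified in Tables 2-4).

```python
# Minimal sketch (illustrative assumptions, not this study's exact operationalization)
# of common speed, breakdown, and repair fluency indices.

FILLERS = {"uh", "um", "er", "mm"}   # assumed filler inventory
PAUSE_THRESHOLD = 0.25               # assumed silent-pause cutoff in seconds

def fluency_measures(tokens, syllables, pauses, repairs, total_time):
    """tokens: words in the response; syllables: total syllable count;
    pauses: silent-pause durations in seconds; repairs: count of
    repetitions/self-corrections; total_time: response length in seconds."""
    pruned = [t for t in tokens if t.lower() not in FILLERS]
    silent = [p for p in pauses if p >= PAUSE_THRESHOLD]
    minutes = total_time / 60
    return {
        # Speed fluency: amount of speech produced per unit time
        "syllables_per_minute": syllables / minutes,
        "pruned_words_per_minute": len(pruned) / minutes,
        # Breakdown fluency: frequency and length of hesitations
        "silent_pauses_per_minute": len(silent) / minutes,
        "mean_pause_length": sum(silent) / len(silent) if silent else 0.0,
        "fillers_per_100_words": 100 * (len(tokens) - len(pruned)) / len(tokens),
        # Repair fluency: repetitions and self-corrections
        "repairs_per_100_words": 100 * repairs / len(pruned),
    }

# Example: a hypothetical 45-second independent-task response
print(fluency_measures(tokens=["well", "uh", "I", "think", "students", "should"] * 15,
                       syllables=120, pauses=[0.4, 0.9, 0.2, 1.3], repairs=3,
                       total_time=45))
```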
Accuracy is defined as "the ability to avoid error in performance, possibly reflecting higher levels of control in the language as well as a conservative orientation, that is, avoidance of challenging structures that might provoke error" (Skehan & Foster, 1999, p. 96). While the concept of accuracy is relatively straightforward (i.e., error-free speech; Tavakoli & Skehan, 2005), there has been much debate about what constitutes errors and how one might measure them (Polio, 1997), in addition to the comparative fallacy (i.e., the native-speaker fallacy; Bley-Vroman, 1983) that the field of SLA has often fallen into. As a result, researchers have preferred to use a mixture of both general and task-specific measures (Housen & Kuiken, 2009; Norris & Ortega, 2009) to account for the multi-faceted nature of accuracy. These include: (1) generalized, global measures, such as the percentage of error-free clauses in the given speech; and (2) specific grammatical features (e.g., morpheme usage such as past tense -ed) or lexical features (i.e., appropriate use of words).

Lastly, complexity is broadly operationalized as "the capacity to use more advanced language" (Skehan & Foster, 1999, p. 97). More precisely, it involves "a greater willingness [of speakers] to take risks, and use fewer controlled language subsystems" (Skehan & Foster, 1999, p. 97), such that the resulting language demonstrates "the size, elaborateness, richness, and diversity of the learner's linguistic L2 system" (Housen & Kuiken, 2009, p. 464). Because complexity can refer to various language subsystems such as vocabulary, morphology, and syntax, it has spurred much controversy in task-based research as to its dimensionality and measurement methods (e.g., Bulté & Housen, 2012; Housen & Kuiken, 2009). Yet essentially, as in the case of accuracy, researchers have resorted to the analysis of (1) subordination (e.g., the number of clauses per syntactic unit) and (2) lexical complexity (e.g., type-token ratio or specific measures of speech cohesion).
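The global accuracy and complexity indices named above can be illustrated in the same way. In the sketch below, clause boundaries, syntactic units (treated here as AS-units, an assumption for illustration), and error annotations are taken as already supplied by human coders, as in the present study; the function merely derives the ratio-style indices (percentage of error-free clauses, clauses per syntactic unit, and type-token ratio) from such coded data. The names and the worked example are hypothetical and do not reproduce the coding scheme of this dissertation (the study's own measures are listed in Tables 5 and 6).

```python
# Minimal sketch of ratio-style accuracy and complexity indices, assuming clauses,
# syntactic (AS) units, and errors have already been hand-coded by trained coders.

def accuracy_complexity_measures(tokens, clauses, clauses_with_errors, as_units):
    """tokens: words in the response; clauses: total number of clauses;
    clauses_with_errors: clauses containing at least one error;
    as_units: number of syntactic (AS) units."""
    types = {t.lower() for t in tokens}
    return {
        # Global accuracy: proportion of clauses containing no errors
        "pct_error_free_clauses": 100 * (clauses - clauses_with_errors) / clauses,
        # Syntactic complexity: degree of subordination
        "clauses_per_syntactic_unit": clauses / as_units,
        # Lexical diversity: type-token ratio (note: sensitive to response length)
        "type_token_ratio": len(types) / len(tokens),
    }

# Example: a hypothetical coded 60-word response with 12 clauses
# (4 of them containing errors) spread over 7 syntactic units
sample_tokens = ("the lecture explains that students who plan ahead "
                 "usually perform better because they organize ideas").split() * 4
print(accuracy_complexity_measures(sample_tokens, clauses=12,
                                   clauses_with_errors=4, as_units=7))
```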
1.2.1.2 Conventional Task-based Research on planning and task characteristics: Competition of Performance Constructs

As with the variability in conceptualizing the CAF constructs, there has been much debate on how to interpret the mechanisms behind the functioning of each dimension under varied task conditions. Two prevailing accounts are of particular interest in the present study: the Limited Capacity Hypothesis (Skehan, 1998; Bygate, 1996; Skehan & Foster, 1997, 1999) and the Cognition Hypothesis (Robinson, 2001, 2003, 2005, 2011).

The tenets of the Limited Capacity Hypothesis hold that speakers' limited attentional mechanisms and processing capacity have collateral effects on language production. Speakers may not be able to attend to the three CAF dimensions in a parallel manner, but rather selectively concentrate on one or two aspects owing to constraints on attentional allocation. More specifically, researchers point out two levels of competition across planning and task characteristics: (1) fluency in competition with accuracy for attentional resources; and (2) accuracy and complexity facilitated at the expense of one another. The distinction between fluency and accuracy basically takes up the earlier discussion of focus on meaning versus focus on form (VanPatten, 1990); it is essentially a tension between getting the task completed (represented by the fluency/meaning dimension) and a commitment of attention to form, the latter further entailing a concern for both complexity and accuracy. Skehan and Foster (1997) (and task-based research generally) maintain that while fluency remains relatively stable under the [+ planning] condition, complexity and accuracy do not go hand in hand. For instance, tasks that are inherently structured and concrete (e.g., picture-description tasks) would facilitate accurate but less complex speech, since the information sources depicted by the task free up attentional space otherwise needed for idea elaboration in planning, hence facilitating more focus on language forms. On the other hand, tasks that are less structured and abstract in nature (e.g., decision-making tasks) would have speakers devote their attention to idea generation and elaboration during planning, while giving little room to attend to accuracy.

Robinson conversely proposed the Cognition Hypothesis, which essentially rejects the idea of competition amongst performance constructs. He asserted that learners can simultaneously allocate attentional resources to both accuracy and complexity in speech. Essentially, he conceptualized tasks along resource-directing and resource-dispersing dimensions, holding that higher complexity in these two types of task dimensions contributes to both accurate and complex speech. In resource-directing tasks (e.g., those requiring temporal reference and/or spatial reasoning), speakers are able to simultaneously allocate their attention to aspects of the language code as well as to elaboration. For instance, simple narrative tasks about an event happening now would impose a lesser cognitive demand on speakers, yet tasks requiring speakers to elaborate on past events would encourage their control of morphology (e.g., past tense verb forms) as well as syntax (e.g., sentence connectors and adverbials of time and location) depicting what had happened in the past. Similarly, accuracy and complexity can be attended to simultaneously in resource-dispersing tasks, for which complexity across relevant tasks is increased by the inclusion or exclusion of a number of task elements. For instance, as opposed to simple two-factor comparison tasks, tasks involving comparisons of multiple factors may facilitate speakers' control of a wide range of deictic expressions, pronouns, or relative clauses.

The theoretical underpinnings of task-based research on the effects of planning and task characteristics serve as the basis of testing-linked task-based research (Tavakoli & Skehan, 2005; see also Ellis, 2009, pp. 478-490, for a synthesis of task-based and language testing research on planning effects).
Little is known about how the mechanisms apply to a unique research setting of language testing in which speakers are essentially imposed with added real-time pressures of performing, and whether the debate between the two hypotheses stand valid in those contexts as well (Ellis, 2009). For instance, thus far, task-based researchers have generally supported the positive role of [+ planning] on fluency while rendering mixed results for accuracy and complexity (e.g., Crookes, 1989; Foster & Skehan, 1996; Gilabert, 2007; Skehan & Foster, 2005; Tavokoli & Skehan, 2005; Wigglesworth, 1997; Yuan & Ellis, 2003). Yet the few relevant language testing research basically found the reverse: very weak effects of planning conditions and/or concrete task features as measured via three performance measures (e.g., Wigglesworth, 2001) (see next section for the review of these research studies). Implicit in this line of research is that there could be an underlying potential 14 discrepancy in study contexts on the effects of planning as well as how the three dimensions operate (Elder & Iwashita, 2005; Ellis, 2009). Particularly, the three constructs may be captured in an inherently different way in light of varied planning conditions between classroom-oriented contexts and testing environment. Accordingly, few language testing researchers harmonized different data sources in addition to the three-dimension analysis such as (1) subjective ratings of speech (Elder & Iwashita, 2005; Elder et al., 2002; Wigglesworth & Elder, 2010) and test-taker perceptions (Elder & Iwashita, 2005), and (2) discourse-level qualitative differences in speech (Nitta & Nakatsuhara, 2014). Such a triangulated approach seems to be appropriate in finding evidence as to discerning the difference between the study contexts, but ultimately, it would exert more significance in undertaking the nature of validity argument inherent in language assessment research. Yet to date, only few studies have taken such an approach in the contexts of operational testing settings. Relevant researchers mostly simulated the study contexts of laboratory, task- based studies, but only to employ a shortened version of the conventional planning conditions (by reducing the length of planning/responding time) (Ellis, 2009). There is still more room for language testing research on planning and task characteristics that undertakes the conceptual framework posed from the conventional task-based research while taking a contextualized approach for studying the real-time effects of such task-related variables on test performance. 1.2.2 Research studies on the effects of planning time and planning conditions As mentioned, previous researchers who investigated pre-task planning have acknowledged the discrepancy between the findings of classroom and language testing research for which a number of planning parameters have contributed (Elder & Iwashita, 2005; Ellis, 2005, 2009). In this section, I further present the potential mediating role accounting for such 15 discrepancy through reviewing combined factors such as the length of planning time (Wigglesworth & Elder, 2010) and the provision of detailed guidance for planning (Skehan & Foster, 1997). 1.2.2.1 Length of planning time Of primary concern in what separates the language testing research on planning from the findings from conventional task-based research is the time variable (Ellis, 2009); namely, the amount of time allocated for planning. 
In his synthetized review of the effects of planning time, Ellis (2009) reported that the majority of classroom and laboratory-based studies that took the planning versus no-planning approach provided 5 to 10 minutes prior to eliciting oral output, with 10 minutes being the standard planning time among relevant literature (e.g.,Gilabert, 2007; Kawauchi, 2005; Mochizuki & Ortega, 2008; Skehan & Foster, 1997, 2005; Tajima, 2003; Yuan & Ellis, 2003). On the other hand, the few testing studies on pre-planning allocated a much shorter time, conceivably prompted from reflecting on the demands of the timed condition of testing (Elder & Iwashita, 2005; Elder, Iwashita, & McNamara, 2002; Nitta & Nakatsuhara, 2014; Xi, 2005; Wigglesworth, 1997, 2000; Wigglesworth & Elder, 2010). In these studies, the effect of planning appeared inconsistent, being less likely to be positive in terms of test performance while demonstrating mixed extent of impact on CAF measures. For instance, Wigglesworth (1997) allowed one minute of planning time for adult ESL learners taking a tape-mediated oral test. Generally, only a minute of planning was reported to have effects on fluency (e.g., number of self-repairs), accuracy (e.g., article usage), and complexity (e.g., subordination) for the planning group; this was particularly true to those with higher language proficiency performing for a relatively more difficult task (e.g., summary of conversation). Yet Wigglesworth (1997) found 16 no substantial score differences between planners and no-planners based on an analytic rubric. In her subsequent study, Wigglesworth (2000) only focused on test scores with the provision of the same 1-minute planning; a consistent finding of insignificant score differences were found for unstructured tasks. More recently, Wigglesworth and Elder (2010) conducted a study comparing three conditions of no planning, 1-minute of planning, and 2-minutes of planning on ninety test candidates on an IELTS-type oral test format. Again, the researchers did not find any significant differences in oral performance according to the amount of planning time provided neither on analysis of the scores nor the employed CAF measures. Elder et al. (2002) and Elder and Iwashita (2005) both provided 3 minutes of planning time (with additional 75 seconds to read instructions, which was also provided for the no- planning group) for adult test takers prior to a series of narrative tasks extracted from the Test of Spoken English (TSE). Similar to the research conducted by Wigglesworth and her colleagues, both studies failed to demonstrate significant effects of planning on test performance as depicted by the Multi-Faceted Rasch Modeling, as well as further statistical analyses. The researchers also noted hardly any difference between [+ planning] and [- planning] in terms of different data points related to CAF dimensions: namely, the subjective ratings resulting from the holistic rating scale of each CAF constructs that they had devised as well as the manual coding undertaken for individual CAF measures. The researchers also did not find a positive effect of [+ planning] on test-takers’ affective orientation towards the planning task condition. A recent study of Nitta and Nakatsuhara (2014) also provided 3 minutes of planning time in the context of paired oral testing. The researchers had 32 students perform two decision- making tasks in pairs, both under planned and unplanned conditions. 
The researchers conceptualized test performance by applying a range of CAF measures and conversation analysis (CA). They found conflicting evidence regarding the sub-components of fluency; while the [+ planning] condition contributed to only a slight increase in breakdown fluency measures, it was detrimental to speed fluency measures. The CA analysis, however, indicated that in the [+ planning] condition test takers attempted to produce longer turns while recalling what they had planned, which in turn reduced their intra-run speech rate. In the [- planning] condition, by contrast, test takers initiated more animated, spontaneous, shorter turns that subsequently enabled them to talk faster, despite the increased cognitive demands of online planning under the unplanned condition. One rare study by Xi (2005), which was not essentially a planning-oriented study, found that only a minute of planning time for a more structured test task (graph description) increased the oral performance of college-level test takers of the Speaking Proficiency English Assessment Kit (SPEAK) exam. The researcher reported that the provision of planning time served to mediate test candidates' unfamiliarity with the given complex test task. The otherwise consistently neutral findings invite a number of explanations (see Elder & Iwashita, 2005); importantly, the limited amount of time given (a minimum of 1 minute and a maximum of 3 minutes) may have been insufficient to contribute to significant score discrimination (Mehnert, 1998). This is a critical matter in that the actual time allotted for test tasks that are more structured and complex (e.g., academic-skills-oriented test tasks) is even less than a minute (Elder & Iwashita, 2005). In this sense, the aforementioned studies lack ecological validity, often not specifying justifications for the amount of planning time provided. As Xi (2005) noted, the time allotted to planning in these studies is not contextualized within existing testing contexts. This has only recently been recognized: Wigglesworth and Elder (2010) mentioned that their use of 1 minute of planning was based on the current instructions for IELTS Part 2, which state that "one to two minutes to prepare" should be provided for test candidates. Still, little attention is directed to the need to contextualize the planning condition within the actual testing environment, an essential research inquiry for evaluating the authenticity and validity of the corresponding tests. 1.2.2.2 Planning activities Another important planning parameter that is still under-researched in the domain of language testing is the planning activities of which test candidates could make use. In contrast to the planning literature in the task-based framework (Ellis, 2009), language testing research has directed most of its attention to the effectiveness of the provision of planning itself, drawing comparisons between test performance under planning as opposed to non-planning conditions (e.g., Elder & Iwashita, 2005; Elder et al., 2002; Wigglesworth, 1997, 2000; Wigglesworth & Elder, 2010). The under-representation of the effects of different planning types may stem from the fact that informing test candidates about such activities is not yet a common practice in actual oral testing. In effect, examinees are simply told that they have a certain amount of time before responding to a test question.
Moreover, the test preparation kits of the current major test publishers rarely provide concrete information on how to make use of the offered planning time. Yet one could argue for the urgency of initiating this discussion of planning activities in testing, considering the extremely limited amount of time allowed for planning in existing tests. Elder and Iwashita (2005) contended that the ineffectiveness of planning on test performance could be due to test candidates being (a) unfamiliar with such a short amount of planning time (which does not reflect normal classroom-based oral task conditions) and (b) unaware of ways to use such time efficiently in test situations. The latter explanation was prompted by the researchers' finding that participants responded favorably to the planning time yet failed to produce quality responses. Accordingly, the researchers speculated that "whatever action they were taking to improve their performance was ineffective" (Elder & Iwashita, 2005, p. 233) within the 3-minute time frame, with such ineffectiveness intensified by the lack of concrete guidance (i.e., explicit instruction before presenting the tasks) or training on planning strategies. Quoting Rutherford (2001), Elder and Iwashita stressed the importance of investigating how the provision of planning guidelines or strategy training makes a difference in test environments. Given the above results, it could be assumed that test candidates may struggle even more with the shorter planning times of current oral tests such as the TOEFL iBT. If this is true, reviewing and exploring the effectiveness of providing guidance on planning is a timely matter. Previous task-based research has drawn a distinction between guided and unguided planning with regard to the type of planning (e.g., Foster & Skehan, 1996; Gilabert, 2007; Kawauchi, 2005; Mochizuki & Ortega, 2008; Sangarun, 2005; Wendel, 1997; Yuan & Ellis, 2003). According to Ellis (2005, 2009), learners in the unguided condition are simply allowed a certain amount of time before the task and asked to use this time freely to prepare their responses; this mostly resembles current planning practices in oral testing. Under the guided condition, on the other hand, learners are instructed to focus on specific aspects of planning; that is, they are asked to engage in particular planning activities (e.g., taking notes, verbal rehearsal) prior to responding. While the precise circumstances under which guided or unguided planning facilitates speech production remain empirically unanswered (Ellis, 2009), the guided condition is considered to be theoretically supported by Swain's (1993) Output Hypothesis. The premise is that such pre-planning pushes learners to "make use of their resources" to "reflect on their output and consider ways of modifying it" (p. 161) prior to constructing their responses. Various types of guided planning have thus been devised by researchers, but of greatest relevance to the current study are those of Foster and Skehan (1996) and Kawauchi (2005). Although neither was conducted in a test environment, both studies are significant for testing the intertwined effects of the provision of guidance in planning and other task-specific features. Foster and Skehan (1996), for instance, explored a possible interaction amongst variables such as planning time, guidance in planning, and task type.
Three study groups each performed three task types under distinct planning conditions: Group 1 was not given any planning time, Group 2 was given 10 minutes of planning time, and Group 3 was given guidance on planning in addition to the 10 minutes of planning time. The guidance included "suggestions as to how language might be planned, and also suggestions to develop ideas relevant to completing each test tasks" (Skehan, 1998, p. 70). The study tasks were a personal information exchange task, a narrative task, and a decision-making task. The result was that speakers in Group 2 ([+ planning], [- guidance]) produced more accurate speech under the unguided planning condition for the narrative task, while Group 3 ([+ planning], [+ guidance]) speakers produced more subordination in their speech for the decision-making task. The researchers proposed two main interpretations of the result. First, when guidance is provided that directs speakers' attention to task content (or the message to be expressed), accuracy suffers and the complexity of speech increases; on the other hand, when simply given time to plan, speakers prioritize planning the language they will use, resulting in enhanced accuracy of speech. Second, the effects of planning conditions are mediated by task characteristics: in unstructured, input-depleted tasks such as the decision-making task, speakers may primarily direct their attentional resources to generating elaborated ideas, yet in narrative tasks in which reference information is provided, speakers would prioritize the form of language. Kawauchi (2005) further employed planning activities that are not only recommended in current test preparation materials (e.g., note taking, rehearsing) but also based on learners' actual pre-task planning strategies (Ortega, 1999). In a within-subjects design, Japanese EFL learners performed the same picture-description narrative tasks first without planning and subsequently with planning (with an interval in between for answering a background questionnaire). Kawauchi employed three types of planning; namely, writing (i.e., taking notes of what they wanted to say), rehearsing (i.e., saying it out loud), and reading (i.e., reading a model passage of the picture story). Each participant performed three sets of narrative tasks within a three-week window under differing planning conditions (see Kawauchi, 2005, p. 149, for a detailed outline of the study procedure). While Kawauchi did not find a quantitative advantage for any one type of planning, a qualitative analysis of language use suggested differing effects of the guided planning activities depending on learners' proficiency levels. Low-level learners benefitted mostly from reading, making use of the L2 input (vocabulary) under the planned condition; high-level learners, on the other hand, did not show a preference for one activity, but stated that they benefitted from all activities as a means of organizing their thoughts. 1.2.3 Task characteristics mediating the effects of planning Generally, task-based researchers have found mixed results for the planning conditions as well as for the provision of guidance in planning. Several studies reported positive effects of both guided and unguided planning on complexity and accuracy (e.g., Foster & Skehan, 1996; Wendel, 1997), while others reported favorable findings for only one condition (e.g., Gilabert, 2007; Yuan & Ellis, 2003) or even neutral results (e.g., Kawauchi, 2005; Sangarun, 2005).
Language testing research, on the other hand, has demonstrated rather consistently the limited effects of planning across different planning conditions. Alternatively, researchers have pointed to the mediating effects of certain task-specific features (e.g., narrative tasks versus decision-making tasks) to explain such mixed findings (Elder et al., 2002; Foster & Skehan, 1996; Kawauchi, 2005; Tavakoli & Skehan, 2005). The Foster and Skehan (1996) study mentioned above is a case in point. This line of thought has been established by researchers drawing on the earlier task-characteristics frameworks (Robinson, 2001; Skehan, 1998). Both Robinson (2001) and Skehan (1998), for instance, synthesized a number of task characteristics researched by task-based researchers. These include a broader categorization of task features, but I list four categories that are of interest to the current study and that concern the manipulation of task information:
(1) Type of given information: Concrete/immediate versus abstract/remote information (Brown et al., 1984; Foster & Skehan, 1996; Robinson, 1995; Skehan & Foster, 1997)
(2) Organization of given information: Structured tasks versus unstructured tasks (Foster & Skehan, 1996; Skehan & Foster, 1997)
(3) Familiarity of information: Information with background knowledge versus unfamiliar information (Foster & Skehan, 1996; Robinson, 2001)
(4) Operations of information: Retrieval of given information versus transformation (manipulation) of given information (Foster & Skehan, 1996; Prabhu, 1987; Skehan & Foster, 1997)
The idea is that each parameter of task characteristics has a differential impact on subsequent oral performance (more precisely, on the CAF dimensions). For instance, speakers benefit from concrete information given for task completion because it frees up information-processing load in planning, thereby releasing attention for accuracy and fluency; a clear storyline presented in provided pictures is one example. On the other hand, information that is abstract in nature imposes more processing load, as there is less external evidence for speakers to refer to; hence, the accuracy of planned speech may be reduced while complexity is enhanced. How the information is organized is also known to affect oral performance. Tasks that are highly structured provide speakers with a clearer macrostructure of the information, hence speeding up the conceptualization of task contents during planning. These may include information progressing in a sequential order (as in narrative texts) or in a salient pattern (as in argumentative texts that flow from introduction to main body to conclusion). Speakers also benefit during planning from task contents with which they are familiar, as familiarity expedites processing and understanding of the given information while allowing more conscious effort to go into accuracy and fluency. Lastly, tasks requiring the manipulation of given material push speakers to try out more and use complex language (increased complexity). On the other hand, tasks asking for the simple delivery or repetition of task contents may grant easy access to idea generation, hence reducing information-processing load during planning. The task characteristics presented above, specific to the provision of test-task contents, are of particular interest as they relate to the TOEFL iBT speaking test tasks employed in the current study, and thus can shed light on the influence of planning conditions in light of task-specific characteristics.
This is because the primary difference between the independent and integrated test tasks lies in the information processing required of test takers; more precisely, in whether additional information (written and aural source material) is provided in the test tasks, which draws not only on test takers' speaking ability but also on their reading and listening abilities for completing the task. Furthermore, given the focus of testing, studying the effects of task properties would be useful for designing tasks of predictable levels of difficulty, which can be manipulated to elicit appropriate performances across candidates (Bachman, 2002; Wigglesworth & Foster, 2017). To date, such attempts to illuminate the impact of task characteristics have not been extensively demonstrated for oral task performance (Wigglesworth & Foster, 2017), particularly within TOEFL iBT speaking contexts. Researchers thus far have explored the difference in task properties between independent and integrated tasks of the TOEFL iBT in terms of (1) test takers' strategic behaviors (e.g., Barkaoui, Brooks, Swain, & Lapkin, 2013; Huang & Hung, 2010; Hong, Huang, & Hung, 2016); (2) rater orientations to the test tasks (Brown, Iwashita, & McNamara, 2005; Lee, 2006); and (3) the way in which test takers incorporate source materials into spoken performances (Brown et al., 2005; Crossley, Clevinger, & Kim, 2014; Frost, Elder, & Wigglesworth, 2012; Kyle, Crossley, & McNamara, 2015). However, none of the researchers of the studies listed here focused on illuminating the role of planning in conjunction with the TOEFL iBT test-task types to explain possible qualitative and/or quantitative differences in test performance; at most, they only indirectly explored the precise aspects of the test tasks that inform their inherent design. Given the varied time allotments for planning within and across test-task types in the TOEFL iBT, one can only speculate as to the rationale behind such task implementation conditions and, in effect, whether they affect subsequent test performance. 1.3 The study In the present study, I take into account both task-based and language-testing research on planning and task characteristics in the testing context of TOEFL iBT speaking. This research is needed because understanding what test takers are required to do, and what test tasks impose on them, fundamentally influences the nature of the performance obtained in performance-based oral testing (Skehan, 2016). Fundamentally, exploring planning and task-specific features of the TOEFL iBT could further address the construct validity of the speaking tasks included in the test (Kyle et al., 2015). I revisit the construct of planning in the context of the TOEFL iBT speaking test, with an additional interest in the interplay of planning and test-task characteristics. Specifically, I focus on the effects of the differing amounts of planning time provided across tasks as well as the provision of specific instructions (i.e., guidance) for planning. In so doing, I investigate the quantitative (test scores) and qualitative (discourse quality) differences in test takers' speaking performance under specific task and planning conditions. I also explore test takers' use of and perceptions of the planning times employed across the two task types, and whether differences are contingent upon test tasks.
CHAPTER 2: THE CURRENT STUDY 2.1 Research questions and study variables In this study, the unguided planning condition is equivalent to the kind of planning practiced in current academic oral testing contexts; its effects are compared to a condition in which the test takers are given specific guidance on planning. This research inquiry is essentially an attempt to (a) document the actual planning actions and perceptions of test takers in this specific testing condition and (b) explore the optimal planning condition for facilitating academic speech performance within the current testing environment. Thus, to paint a comprehensive picture of current planning practices, I investigate test takers' planning actions as well as their perceptions. I also investigate the potential mediating effect of task characteristics on the planning conditions (Kawauchi, 2005; Wigglesworth & Elder, 2010). As the existing test procedure and its effectiveness on test performance are the subject of the current study, the results can be taken to speak to the authenticity and validity of the corresponding oral tests. Most importantly, the findings could augment the ecological validity of the current planning conditions in language testing. The following questions guided the current study:
1. Does the type of planning (guided versus unguided) affect the test scores of test candidates? If yes, what are the influences of the different task types?
2. Does the type of planning (guided versus unguided) affect the discourse quality of test candidates? If yes, what are the influences of the different task types?
3. How do test candidates use their planning times?
4. How do test candidates perceive the given planning times?
The variables of primary interest, therefore, are (a) the type of planning (guided versus unguided) and (b) the type of guided planning (writing versus silently thinking). In terms of the type of planning, guided planning in the present study differs from the notion coined by Mochizuki and Ortega (2008), whose definition refers to directing learners to attend to specific language aspects (e.g., "the syntax, lexis, content, and organization of what they would say", p. 307); instead, guided planning here comprises interventions that suggest to the test takers which activities to engage in (Kawauchi, 2005). I chose two planning activities for the guided planning condition; namely, a regular period of planning spent silently thinking, and a suggestion to write during the same amount of planning time. The latter could be considered note taking (Ortega, 1999; Wendel, 1997), requiring the act of writing or jotting down whatever comes to mind. The former is an act of thinking about one's ideas or the sentences and phrases that could be said in the actual response. Previous accounts have suggested that learners frequently engage in these activities when they are left to their own devices to use time to prepare for academically oriented oral tasks (e.g., presentations, role-plays) (Ellis, 1987; Kawauchi, 2005; Ortega, 1999). The difference in this study is that in the second condition, with writing, test takers are explicitly encouraged to write, whereas in the first condition they are not (though they may write if they would like, as is normal in current tests where there is time to plan). An additional variable employed in the present study is a covariate: the test-task type. In particular, the task types are similar to those assessed on the TOEFL iBT speaking test; namely, the independent and the integrated tasks.
Researchers and test developers have suggested that planning time is a major component in determining task difficulty, in that the varying planning conditions represent the items' different cognitive-demand levels (Norris, 2009; Robinson, 2009). In the TOEFL iBT speaking test, the lengths of planning time differ across task types. Researchers need to document the extent to which such varied time allotments for planning per specific task type affect test takers' performance or their perceptions of the planning conditions. In addition, such an analysis is essentially (although, through the present study, somewhat indirectly) a response to the call for more research on the test-task differences in the TOEFL iBT speaking test (Kyle et al., 2015). Finally, I took into account participants' proficiency levels within each experimental group. In other words, each group consisted of a balanced number of high- and low-level learners. This grouping was primarily intended to illuminate a potential interaction effect of proficiency level and planning condition (Kawauchi, 2005). 2.2. Methods 2.2.1. Participants Initially, a total of 120 college students participated in the beginning phase of the study. Yet for various reasons (disqualification from participation, n = 14; discontinued participation during the course of data collection, n = 3; missing audio files, n = 3), ninety-nine participants remained for the final analysis. In the summer of 2016, I recruited the participants from four large universities (Universities A, B, C, and D) located in Seoul, South Korea. I advertised the recruitment by posting an electronic version of a flyer on each university's online forums. Upon being contacted by interested individuals, I only followed up with those who met the qualifying criteria, which were as follows: (a) test takers with unexpired (valid) test scores from select standardized English language proficiency tests (e.g., IELTS, TOEIC, TOEIC Speaking, and TOEFL iBT); (b) speakers at intermediate-high to advanced English proficiency as evidenced by their reported test scores (threshold score ranges: IELTS 6.5 to 7.5; TOEIC 880 and above; TOEIC Speaking 130 to 180; TOEFL iBT 87 to 110); (c) test takers with minimal test-taking experience with the TOEFL iBT (having taken the test once or twice, at most); and (d) full-time university students. The reasons for establishing each criterion were the following: (a) to capture the participants' most up-to-date English language proficiency; (b) to ensure participants would be able to produce speech in English to the extent of yielding a meaningful amount of data for CAF analysis; (c) to ensure practice effects were minimal, given that the main instruments of the current study came from the TOEFL iBT practice test1; and (d) to control for the occupational status of the participants. Among the four universities, University B was the only all-women's college; the others were coed. All of the participants were regularly enrolled college students at one of the universities: 36 students (18 males; 18 females) from University A, 33 students (all females) from University B, 20 students from University C (4 males; 16 females), and 10 students from University D (4 males; 6 females). The participants were primarily in their early to mid-twenties (M = 24.39; SD = 2.26; Max. = 31; Min. = 19), with more than half of the participants being seniors (N = 43) or beyond (N = 17).
A relatively small number of participants were in their early years of college (freshmen: N = 5; sophomores: N = 8; juniors: N = 18) or pursuing a master's degree (N = 8). The participants came from a variety of academic backgrounds, yet approximately 58% of them were pursuing social sciences (N = 52); the remaining disciplines included humanities and languages (N = 26), science (N = 15), and education (N = 6).

1 Those with extensive test-taking experience with the TOEFL iBT would have already internalized a certain degree of test-wiseness regarding test tasks and testing conditions. This is particularly true in the Korean context, where extensive test-taking experience is inevitably tied to exposure to intensive test preparation practices. It is not an overstatement to say that those intending to prepare for this test begin their test-taking journey by enrolling in test-preparation classes at private institutions, which are well known for their intensive training (coaching) in test-taking strategies (Choi, 2008). Planning strategies are among those the programs are likely to teach.

With the exception of 11 participants who reported living in English-speaking countries in their early years, most of the participants had learned English as a foreign language. They began learning English at a relatively early age (M = 7.92 years; SD = 2.34; Max. = 15; Min. = 6), having been exposed to English through both regular school instruction and private tutoring. When asked to self-assess their English language skills on a 6-point Likert scale (1 being very poor and 6 being advanced), the participants indicated higher confidence in their receptive skills (Reading: M = 4.96, SD = 0.72; Listening: M = 4.64, SD = 1.10) than in their productive skills (Speaking: M = 3.89, SD = 1.14; Writing: M = 3.98, SD = 1.14). Notably, the participants tended to give the lowest ratings to their speaking skills. Of the participants, 62 reported that they had studied a foreign language besides English. In terms of test-taking experience, 22 participants had taken the TOEFL iBT once, while the vast majority of the participants had taken the TOEIC (N = 87) due to its significant use for employment and admission purposes in South Korea. The participants, in the end, were assumed to be quite homogeneous in educational background, occupational status, age, and English-learning experience. I randomly assigned 33 participants to each group, resulting in three experimental groups for the present study. Through the administration of a pre-test (described in detail in the following section), participants were further divided into two subgroups, representing high and low English proficiency (relative to each other). Participants performed three test sets under one of the three following conditions, with corresponding instructions (adapted from the TOEFL official guide, ETS, 2012, and Ortega, 1999). Unguided Planning (UG): In this condition, participants were left to plan their responses at will. They received the following instruction before responding: "Begin preparing your response after the beep." Guided Planning-Writing (GW): In this condition, participants were provided with a blank sheet of paper, which they were required to use during planning time. Before planning, they received the following instruction: "On the provided piece of paper, please write out what you wish to say.
It doesn't have to include everything in detail or in full sentences." Guided Planning-Silently thinking (GT): In this condition, participants were asked to silently think about what they would say in the actual response. They were not allowed to write or make notes during planning time. Before planning, they received the following instruction: "Please think silently inside your head about whatever you wish to say. You will not be allowed to take notes while you plan your responses." 2.2.2. Materials 2.2.2.1. Background questionnaire I administered a pre-experiment questionnaire (written in Korean) to elicit participants' language learning background and test-taking experiences prior to their participation in the main study (see Appendix A for a complete list of questions). I administered the survey online using Qualtrics (https://www.qualtrics.com). This additionally served to screen out participants with extensive test-taking experience with the TOEFL iBT. 2.2.2.2. Elicited imitation task I used an elicited imitation (EI) task (Ortega, Iwashita, Norris, & Rabie, 2002) (see Appendix B for the practice and main test sentences) as a pre-measure for three purposes: (a) to ensure homogeneity in grouping participants into the current study's experimental conditions; (b) to divide the participants into high and low English proficiency levels; and (c) to minimize participants' exposure to typical oral-testing conditions, for which reason EI was used instead of a different type of speaking test. The decision to use EI rested on the fact that not only is it an authentic task reflective of everyday conversation, but it has also been attested as a global measure of oral proficiency in the L2 (Tracy-Ventura, McManus, Norris, & Ortega, 2013). More specifically, researchers have favored EI for directing learners to attend to the form and meaning of the target structure while requiring both comprehension and production (Bowles, 2011; Cox, Bown, & Burdis, 2015; Van Moere, 2012). Based on the EI test results (discussed further in the next section), I balanced the testing conditions so that they had the same average score and standard deviation. In the current EI task, the participants were asked to first listen to English sentences (each read once) and then verbally repeat them. The full test consisted of 30 test sentences, preceded by six practice sentences for the purpose of familiarizing participants with the test procedures. These practice sentences were given in the participants' L1, Korean. Participants' responses were recorded with Audacity (http://web.audacityteam.org). 2.2.2.3. Test tasks The main oral test tasks in the present study come from the three practice test sets (Sets A, B, and C) in the TOEFL iBT® official guide, which are electronically accessible (ETS, 2012). From each test set, I further selected three test tasks, which are as follows: (a) an independent task (providing one's opinions about a given topic) (henceforth IP task), (b) an integrated task with listening and reading (e.g., providing responses by combining relevant information from reading and listening to two sources) (henceforth IT-RL task), and (c) an integrated task with listening only (henceforth IT-L task) (see Appendix C for complete sets of questions). I chose to use sample TOEFL iBT test items, and these specific test-task types, for their academic orientation toward speaking skills as well as for the planning conditions of interest.
Planning times differed across tasks: (a) 15 seconds for the IP task, (b) 30 seconds for the IT-RL task, and (c) 20 seconds for the IT-L task. The response time for the IP task was 30 seconds, while 60 seconds were given for the IT-RL and IT-L tasks. The electronic version of the practice sets simulates the actual testing screen of the TOEFL iBT speaking section. To enable a natural flow of the experimental procedures of the current study (described in the next section), I used Camtasia (www.camtasia.com) to make screen-captured video clips of each test task. I then uploaded the nine video clips (3 test tasks × 3 test sets) online via YouTube (www.youtube.com) due to its compatibility with Qualtrics (www.qualtrics.com), an online platform used to construct surveys and assessment tools. In Qualtrics, I constructed a web-based assessment on which participants were able to read the guided planning instructions, respond to the test tasks, and complete the post-survey questions at their own pace. To respond to the test tasks, the participants were instructed to press the play button on the test-task videos. In each video, a speaker guided the participants through the test procedures. The speaker gave general instructions for the corresponding test task, which were simultaneously displayed in written form on the test screen. The speaker then prompted participants to read the test question, followed by an indication of the amount of planning and responding time for the test task at hand. Subsequently, a time-running bar at the bottom of the test screen was activated, while a countdown timer displayed the remaining time. After the planning time was completed, the response time was given. The guiding speaker instructed the test takers when to begin planning and responding. 2.2.2.4. Post questionnaire I gave a post questionnaire (written in Korean) using Qualtrics (it appeared immediately after the last test task in each test set) to gauge the participants' perceptions of the current planning practice (see Appendix D). The questions (adapted from Rutherford, 2001; Wigglesworth & Elder, 2010) elicited (a) participants' self-assessments of how they performed on each test task; (b) participants' perceptions of the effectiveness of the length of planning time provided (15 seconds, 20 seconds, and 30 seconds) for each test task; (c) participants' engagement in other types of planning activities (a question asked of those who reported that they did not make extensive use of the examined planning activity in each planning condition); and (d) participants' evaluation of the effectiveness of the type of planning (guided versus unguided). 2.2.2.5. Interview To further probe the participants' perspectives on the effects of planning, I conducted a brief interview with participants after testing. The participants were able to respond in Korean or English. I adapted the interview questions from Wigglesworth and Elder (2010) (see Appendix E). Questions concerned (a) the appropriateness of the planning conditions (in terms of the length of time and type of planning), (b) the use of the planning time in the differing planning conditions as well as in the differing test tasks, (c) the strengths and weaknesses of the planning activities, and (d) further suggestions, if any, for improving the current planning practice.
2.2.3. Study Design As seen in Table 1, the order of the three planning conditions (UG, GW, GT), test sets (Sets A, B, and C), and test tasks (IP task, IT-RL task, and IT-L task) was counterbalanced with a Latin Square design. For instance, one group of participants employed a certain planning activity for a test set while the other two groups of participants took the remaining test sets, each under a different planning condition. In the end, all test tasks were performed under all three planning conditions. Such a study design controls for the possible intervening effects of the sequences of the planning conditions, test sets, and test tasks (Wigglesworth & Elder, 2010).

Table 1
Research Design
Group 1 (N = 30) | Session 1: UG-Set A (IP task, IT-RL task, IT-L task) | Session 2: GW-Set B (IP task, IT-RL task, IT-L task) | Session 3: GT-Set C (IP task, IT-RL task, IT-L task)
Group 2 (N = 30) | Session 1: GT-Set B (IT-L task, IP task, IT-RL task) | Session 2: UG-Set C (IP task, IT-L task, IT-RL task) | Session 3: GW-Set A (IT-L task, IP task, IT-RL task)
Group 3 (N = 30) | Session 1: GW-Set C (IT-RL task, IT-L task, IP task) | Session 2: GT-Set A (IT-RL task, IP task, IT-L task) | Session 3: UG-Set B (IT-RL task, IT-L task, IP task)
Note. UG, GW, and GT indicate Unguided Planning, Guided Planning-Writing, and Guided Planning-Silently thinking, respectively. IP, IT-RL, and IT-L indicate the Independent, Integrated-Reading and Listening, and Integrated-Listening only tasks, respectively. Tasks are listed in the order performed within each session.

2.2.4. Procedure Prior to data collection, I conducted a pilot study with 25 college students (of varying English proficiency levels) in South Korea. I tested out the testing procedures as well as the post-survey items. In so doing, I replaced Kawauchi's (2005) speaking-out-loud condition with the regular timed planning (silently thinking) condition. This was because the former condition was practically difficult for the majority of learners to carry out in the actual testing context; in other words, the act of verbally rehearsing within a brief amount of planning time appeared to be disruptive to organizing their thoughts and preparing their responses. Silent planning was the condition that most of the students reported as easy to engage in during cognitively demanding situations such as test taking. For the main experiment, I sent a link to the online pre-experiment questionnaire to the students who were eligible to participate. I then randomly assigned each participant to one of the three study groups. Upon assignment, I asked each participant to schedule three separate testing days (with at least a one-day interval in between; consecutive test dates were avoided to prevent immediate practice effects). I asked the participants to attend the three testing days in a private study room I reserved at each of the four universities. In all sessions, I had them sit in front of a laptop computer, and I provided a headset for listening and speaking. As soon as I gave them brief instructions about the procedures in general, they were left alone in the room. On the first testing day, participants first signed the consent form. They then took the EI test for 20 minutes and moved on to the first test set corresponding to their assigned group. On the second testing day, the participants came in to take the second test set, followed by the post-questionnaire. On the third testing day, the participants took the third test set and partook in the post-hoc interview session. As in operational testing, participants were able to make notes while taking the integrated tasks.
These notes were differentiated from the written plans participants produced in the guided-writing planning condition. In the unguided planning condition, participants were instructed to use the time at will. All participants were monetarily compensated ($30 worth of a gift card) for their time. 2.3. Measures Following the previous literature on planning time, I analyzed each participant's responses with regard to the discourse quality measures of complexity, accuracy, and fluency (henceforth CAF measures) (e.g., Elder & Iwashita, 2005; Kawauchi, 2005; Wigglesworth, 1997; Wigglesworth & Elder, 2010). The three quality measures were further broken down into specific features. In the following sections, I provide the definitions and subcomponents related to each quality measure. 2.3.1. Fluency measures Following previous researchers (Housen & Kuiken, 2009; Kawauchi, 2005; Skehan, 2009; Wigglesworth & Elder, 2010), I followed Segalowitz (2010) in conceptualizing fluency in terms of objective acoustic measures of an utterance (see Segalowitz, 2010, for a broader definition of L2 fluency that additionally includes cognitive and perceived fluency). More specifically, I defined fluency along three subdimensions; namely, speed fluency, breakdown fluency, and repair fluency. First, I indexed speed fluency by (1) identifying the total number of syllables produced in each response per minute, using the online software Syllable Counter (SyllableCount.com, n.d.); (2) calculating the speech rate of all responses (Freed, 2000) by dividing the total number of produced syllables in a given speech by the total amount of utterance time (Tavakoli & Skehan, 2005); and (3) identifying the silent duration, or the time spent before articulation, that each participant took before actually articulating his or her speech. I initially derived time spent before articulation to better isolate the actual phonation time for calculating speech rate, so that the measure would encompass only production time. Yet I decided to also include it as a separate speed fluency measure after inspecting participants' interview responses. As will be reported in the next chapter, a majority of participants consistently indicated in the interviews a possible relationship between fluidly delivering a speech and the silent buffering time taken before articulation. It seemed plausible that such buffering time could have influenced how the subsequent speech was delivered. I defined the measure as the duration of silence between the initiating beep sound (which prompted participants to speak) and the first articulation point.
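To make the three speed fluency indices concrete, the sketch below shows how they could be computed for a single response. It is a minimal Python illustration, not the scripts actually used in the study; the function name, inputs, and example values are hypothetical, and the syllable count is assumed to have already been obtained (e.g., via the syllable counter mentioned above).

```python
# A minimal sketch (not the study's actual analysis) of the three speed fluency
# measures defined above, for one spoken response. All inputs are hypothetical.

def speed_fluency(n_syllables: int,
                  response_duration_s: float,
                  time_before_articulation_s: float) -> dict:
    """Return the three speed fluency indices for a single response."""
    # (1) Syllables per minute over the whole response window.
    syllables_per_minute = n_syllables / (response_duration_s / 60)
    # (2) Speech rate: syllables divided by utterance time, excluding the
    #     silent buffer before the first articulation point (per second).
    speaking_window = response_duration_s - time_before_articulation_s
    speech_rate = n_syllables / speaking_window
    # (3) Time spent before articulation, reported as-is in seconds.
    return {
        "syllables_per_minute": syllables_per_minute,
        "speech_rate": speech_rate,
        "time_before_articulation": time_before_articulation_s,
    }

# Example with made-up numbers: 95 intelligible syllables in a 60-second
# response window, with 2.4 seconds of silence before the first word.
print(speed_fluency(95, 60.0, 2.4))
```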
After data collection, I applied the following specific criteria to qualify the data: (1) silent pauses pertained to inaudible intervals equal to or longer than 250 milliseconds (see De Jong, Groenhout, Schoonen, & Hulstijn, 2013, for a discussion of threshold values for identifying silent pauses); (2) the total number of syllables included only intelligible language; that is, I excluded filled pauses (uhs and ums) as well as disfluency features (e.g., repetitions, hesitations, and false starts, marked with hyphens in the transcriptions) when counting the number of syllables in the entire speech; (3) speech rate pertained to actual phonation time, excluding the silent time spent before articulation; and (4) time spent before articulation excluded any verbal articulation produced by participants; for instance, participants' production of lengthened filled pauses (e.g., uhhhh, ummm) was not included in the measure. Table 2 presents the measures used for indexing speed fluency.

Table 2
Speed fluency measures
Measures | Description | Values
Total number of syllables per minute | Total number of syllables in each response, per minute | Quantified value
Speech rate | Number of syllables / Total duration of utterance (excluding time spent before articulation) | Quantified value
Time spent before articulation | Total duration of silence between the prompt to speak and the first articulation point (in seconds) | Quantified value

Second, I indexed breakdown fluency by (1) calculating the number of filled and unfilled pauses per minute (Elder & Iwashita, 2005; Kormos, 2006) and (2) calculating the mean length of run (De Jong et al., 2012). Table 3 summarizes the breakdown fluency measures employed in the present study.

Table 3
Breakdown fluency measures
Measures | Description | Values
Number of filled pauses per minute | (Total number of filled pauses / Duration of utterance) * 60 | Quantified value
Number of unfilled pauses per minute | (Total number of unfilled pauses / Duration of utterance) * 60 | Quantified value
Mean length of run | Total number of syllables / (Total number of unfilled pauses + 1) | Quantified value

Finally, for repair fluency, I identified the number of disfluency features, particularly those representing repetitions, replacements, reformulations, hesitations, and false starts (Bygate, 1996; Foster & Skehan, 1996; Kormos, 2006; Riggenbach, 1991). I classified these qualitative features according to Bygate's (1996) sub-dimensions: verbatim repetition (which "occurs when hesitating, creating time to find an appropriate word", p. 141) and substitutive repetition ("employed when correcting a word or grammatical feature", p. 141). According to these conceptualizations, I included repetitions and replacements in the former component, and reformulations, hesitations, and false starts in the latter. Table 4 depicts the repair fluency measures with precise definitions and examples from the data.
Table 4
Repair fluency measures
Measures | Sub-category | Description | Example from the data | Values
Verbatim repetition | Repetitions | Words, phrases, or clauses that are repeated with no modification whatsoever to syntax, morphology, or word order | Uh, for this problem- problem | Quantified value
Verbatim repetition | Replacements | Lexical items that are immediately substituted for another | the awa- ceremony (the award -> the ceremony) | Quantified value
Substitutive repetition | Reformulations | Phrases or clauses that are repeated with some modification to syntax, morphology, or word order | I believe that working together it- this makes synergy | Quantified value
Substitutive repetition | Hesitations | Initial phoneme or syllable(s) uttered one or more times before the complete word is spoken | to write a paper for inste- instead | Quantified value
Substitutive repetition | False starts | Utterances that are abandoned before completion and that may or may not be followed by a reformulation | So, if she, if student considered, so, she think that um (if clause abandoned and not repeated before being modified to SV structure) | Quantified value

As in Kawauchi (2005), I examined the proportion of each repair fluency measure per task performance to account for the varied amounts of speech from individual participants. This was accomplished by dividing the total count of each repair fluency measure by the total duration of speech time. 2.3.2. Complexity measures In this study, I followed Kawauchi (2005) in analyzing syntactic complexity, and I also followed Kawauchi in using lexical diversity to measure complexity. I analyzed syntactic complexity by exploring the following measures: the Analysis of Speech Unit (AS-unit) and subordinate clauses (Ferrari, 2012; Frost, Elder, & Wigglesworth, 2011). An AS-unit is defined as "a single speaker's utterance consisting of an independent clause, or sub-clausal unit, together with any subordinate clause(s) associated with either" (Foster, Tonkyn, & Wigglesworth, 2000, p. 365). While an AS-unit is a syntactic unit encompassing the definition of a typical T-unit (see Hunt, 1965), it differs from the T-unit by taking into account the specific features of spoken data. According to Foster et al. (2000), AS-units reflect the nature of oral speech by (a) including the independent sub-clausal units that are very common in speech and are defined as minor utterances (i.e., irregular sentences or non-sentences; yes, thank you very much, oh how wonderful, p. 366); (b) including disfluency features such as false starts, repetitions, and reformulations within the same unit; and (c) using intonation and pausing phenomena to distinguish syntactic boundaries. In terms of the last point, such principled use of temporal features applies to specific cases of discerning whether coordinated phrases are included in the previously occurring AS-unit. In the case of T-units, coordinated phrases are generally considered part of the same T-unit (Hunt, 1965). However, coordinated units can be independent AS-units when a pause longer than 0.5 seconds precedes them and the conjunctive markers (e.g., and, but) themselves are articulated with either falling or rising intonation (Foster et al., 2000). Foster et al. (2000) asserted that AS-units are as useful as T-units for analyzing monologic discourse (i.e., non-interactive speech); thus, I used AS-units for analyzing discourse in the current study. In addition to this broader syntactic boundary, I explored the number of subordinate clauses. Following Foster et al.
(2000), I defined an independent clause as a clause including a finite verb, whereas a subordinate clause is "a clause consisting of a finite or nonfinite verbal element with at least one other clausal element such as a subject, object, complement, or adverbial" (p. 365). Taken together, the two syntactic units were meant to quantify the participants' abilities to produce both macro- and micro-units for establishing a more complex message. In the end, with these indices, I used two global quantitative measures of syntactic complexity: (1) AS-unit length, to account for the density of the syntactic unit (Ferrari, 2012), calculated by dividing the total number of produced syllables by the identified number of AS-units (Bulté & Housen, 2012); and (2) the number of subordinate clauses to AS-units, as an indication of subordination (Foster et al., 2000). See Table 5 for a summary of the measures of syntactic complexity. The following excerpt displays the identification of AS-units and subordinate clauses. Double forward slashes (//) were placed at AS-unit boundaries, while double colons (::) indicate the locations of subordinate clauses. Numbers in parentheses are the durations of unfilled (silent) pauses. Note that the coordinated phrase starting with "so" in the last sentence of the excerpt (the segment beginning So I, I meet my friends) was treated as a separate AS-unit, as it was preceded by a long silent pause (of 1.14 seconds) and was articulated with a rising tone.

Um, I often go to the baseball stadium with my friends :: because I like to watch Korean baseball game. // It's my like, kind of like ori- original life. // So, I always watch the movie :: even though I can't go to the stadium but I try to go to the stadium :: when I have time. (1.14) // So I, I meet my friends and we, enthusiastically, um, enjoy the game. //

For lexical diversity (see Table 5), I used the Tool for the Automatic Analysis of Cohesion (TAACO; Crossley, Kyle, & McNamara, 2016) to analyze a number of indices of lexical sophistication and complexity. TAACO (available at http://www.kristopherkyle.com/taaco.html) is an automated text analysis tool that specifically concerns measures of lexical cohesion at the word, sentence, and paragraph levels. From TAACO, I identified both global and refined measures of lexical diversity. First, I explored the type-token ratio (TTR), which is the ratio, in percent, between the different lexemes in a text (e.g., nouns, pronouns, verbs, adjectives, articles, adverbs, prepositions, and conjunctions) and the total number of lexemes (Laufer, 1991; Laufer & Nation, 1995; Ortega, 1999; Robinson, 2001). Yet information supplemental to TTR is needed, as the value itself is strongly affected by text length (Skehan, 2009); in other words, TTR could yield unstable results for shorter stretches of discourse (as is expected to be elicited from the task types employed in the current study) as opposed to longer stretches of utterances (Laufer & Nation, 1995; Malvern & Richards, 2002). Therefore, I also looked at the number of unique words (i.e., types) as well as the cohesive devices used in the speech samples, such as conjunctions (and, but) and sentence-linking words (e.g., nonetheless, therefore, although).
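As an illustration only (not the study's actual analysis, which relied on manual coding and TAACO), the following Python sketch shows how the global syntactic complexity indices and the simple lexical diversity indices described above could be computed once AS-units, clauses, and syllables have been hand-coded. All function and variable names, the example counts, and the small list of cohesive devices are hypothetical.

```python
# A minimal sketch, under stated assumptions, of the complexity indices above.
# AS-unit and clause boundaries are identified by hand, so the function takes
# the resulting counts plus a tokenized transcript.

from typing import List, Set

def complexity_measures(n_syllables: int,
                        n_as_units: int,
                        n_subordinate_clauses: int,
                        n_total_clauses: int,
                        tokens: List[str],
                        cohesive_devices: Set[str]) -> dict:
    words = [t.lower() for t in tokens if t.isalpha()]  # intelligible words only
    types = set(words)
    return {
        # Mean AS-unit length: syllables per AS-unit (density of the unit).
        "as_unit_length": n_syllables / n_as_units,
        # Subordination: subordinate clauses over all clauses.
        "subordinate_clause_ratio": n_subordinate_clauses / n_total_clauses,
        # Lexical diversity: unique word types and type-token ratio.
        "n_word_types": len(types),
        "type_token_ratio": len(types) / len(words),
        # Proportion of cohesive devices (conjunctions, linking words).
        "cohesive_device_ratio": sum(w in cohesive_devices for w in words) / len(words),
    }

# Hypothetical use on one AS-unit from the excerpt above (counts illustrative only).
sample = "so I always watch the movie even though I go to the stadium".split()
print(complexity_measures(n_syllables=16, n_as_units=1, n_subordinate_clauses=1,
                          n_total_clauses=2, tokens=sample,
                          cohesive_devices={"so", "and", "but", "therefore", "although"}))
```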
Table 5
Complexity measures
Measures | Sub-category | Description | Values
Syntactic complexity | Number of AS-units | The total number of AS-units in each utterance | Frequency counts
Syntactic complexity | Number of subordinate clauses | The total number of subordinate clauses in each utterance | Frequency counts
Syntactic complexity | AS-unit length | Total number of syllables / Total number of AS-units | Quantified value
Syntactic complexity | Subordinate clause ratio | Total number of subordinate clauses / Total number of clauses | Quantified value
Lexical diversity | Number of word types | Unique lexical words in each utterance | Frequency counts
Lexical diversity | Type-token ratio | Total number of unique words (types) / Total number of words (tokens) | Quantified value
Lexical diversity | Cohesive devices | Total number of conjunctions and sentence-linking words / Total number of words | Quantified value

2.3.3 Accuracy measures I measured overall accuracy by employing a broad accuracy measure of error-free clauses and a refined index of lexical errors (Kuiken & Vedder, 2012). Adopting the latter element complemented the incomplete picture of errors given by global accuracy measures associated with syntactic units (e.g., the number of error-free units, the number of errors per unit) (Kuiken & Vedder, 2012). Following Tavakoli and Skehan (2005), I defined error-free clauses as those with no grammatical errors in terms of syntax, morphology, and word order. To be more specific, for a clause to be error free, it should contain correctly used verb forms (e.g., tense, aspect, voice, modality, and subject-verb agreement; Wendel, 1997), articles, and plural -s. I did not include errors in stress, intonation patterns, or the pronunciation of words and utterances. Lexical errors, on the other hand, comprised deviations in pronunciation, meaning, grammatical form, word order, collocation, idioms, and awkward phrasing that may interfere with the overall comprehensibility of the speech (e.g., It could happen a lot of troubles, where words such as trigger or cause are more appropriate) (Mehnert, 1998). With these two indices, I calculated two quantitative measures of accuracy: (1) the number of error-free clauses relative to the total number of clauses (Tavakoli & Skehan, 2005); and (2) lexical errors per 100 words (Mehnert, 1998). I used the latter measure to normalize the raw error counts. This was to account for the impact of speech length on skewing the occurrence of errors (a higher word count creates more opportunities for errors; Plakans & Gebril, 2016). Table 6 reports further examples from the data coded for the two accuracy measures.

Table 6
Accuracy measures
Measures | Description | Values
Error-free clause ratio | Total number of error-free clauses / Total number of clauses | Quantified value
Lexical errors per 100 words | (Total number of lexical errors / Total number of syllables) * 100 | Quantified value

2.4 Analysis 2.4.1 Research question 1: Does the type of planning (guided versus unguided) affect the test scores of test candidates? If yes, what are the influences of the different task types? The first research question pertains to whether the type of planning (guided versus unguided), in conjunction with the test-task type, affects test takers' test performance. In the following section, I describe the procedures and statistical analyses undertaken to address this inquiry. 2.4.1.1 Subjective ratings of the Elicited Imitation task responses Prior to answering research question 1, I examined participants' EI test performance for the purpose of creating two subgroups within each study group representing high and low oral proficiency.
I hired a female rater (an M.A. student in TESOL at a large Mid-western university) to evaluate participants' EI test performance. I participated as the second rater to establish inter-rater reliability with the first rater. The first rater and I met for multiple sessions to train ourselves on the EI rubric (Park, 2015; Ortega et al., 2002; see Appendix B). After we reached consensus in our ratings on the sample items, we each scored all 99 participants' responses. Based on the rubric, we scored each response on a 4-point scale. We met multiple times during the course of our ratings to discuss any discrepancies and concerns we had when rating particular responses. In the end, our inter-rater agreement on all 30 EI items for all participants was estimated at .88 (Cronbach's alpha), which was interpreted as moderately high. In addition, the EI test appeared to have high overall reliability, estimated at .93 (Cronbach's alpha). It should be noted that the scoring method for the EI responses employed in the current study (based on Ortega et al., 2002) differed from the conventional approaches that primarily concern the completion and/or grammatical accuracy of repetitions (e.g., Erlam, 2006; West, 2012). Instead, the responses were evaluated for their retention of key idea units in addition to the language forms (see Appendix B). Ortega and her colleagues' rationale for such evaluation comes from evidence that, after listening to a sentence, the utterance's meaning is stored for a significantly longer time than its linguistic form and specific wording (Sachs, 1967); hence, it becomes important for assessment of the responses to take into account the degree to which speakers are able to maintain both the form and the meaning of an input sentence (Wu & Ortega, 2013). Aside from this, the integrity of the meaning in the repetition was deemed important given its potential link to the construct measured by the TOEFL iBT speaking section. Table 7 reports the average scores given by the first rater (rater 1) and me (henceforth rater 2). Although the mean score for Group 1 participants was the lowest (from both raters), a one-way ANOVA on the averaged mean scores of rater 1 and rater 2 (see the fifth column in Table 7) indicated a statistically nonsignificant difference among the three groups (F = 0.147, p = .086, ηp2 = 0.21). The results suggest that participants in each group were comparable in terms of the oral proficiency measured through the EI test.

Table 7
Descriptive statistics for the EI test ratings given by rater 1 and rater 2
Group | N | Rater 1 M (SD) | Rater 2 M (SD) | Average mean (Rater 1 + Rater 2)
Group 1 | 33 | 81.30 (15.61) | 81.04 (14.20) | 80.67 (14.91)
Group 2 | 33 | 82.30 (18.33) | 82.04 (17.66) | 82.17 (18.00)
Group 3 | 33 | 82.70 (11.94) | 82.51 (12.32) | 82.60 (12.13)
Note. The maximum possible total score a participant can receive is 120 (30 items * highest rating of 4).

Following Ortega (2000), I averaged the three combined mean scores from rater 1 and rater 2 (i.e., the scores reported in the fifth column of Table 7). I then derived an arbitrary cut-off point of 81.8 as a benchmark for dividing participants into two subgroups (low and high proficiency). Those who scored above 81.8 on the EI test were considered speakers with relatively higher oral proficiency than those in the lower group. As shown in Table 8, the two subgroups within each of the three experiment groups were balanced in number.
In addition, the mean EI test scores of all High-Proficiency (henceforth HP) and Low-Proficiency (henceforth LP) groups were comparable across the three experiment groups.

Table 8
Descriptive statistics of the EI test according to the proficiency sub-groups

Group     Sub-group   N    Averaged M (SD)   Max.   Min.
Group 1   High        16   91.24 (7.91)      109    84
          Low         17   65.69 (7.87)      77     52
Group 2   High        17   93.65 (7.55)      105    82
          Low         16   66.69 (10.76)     78     46
Group 3   High        16   92.94 (8.73)      110    83
          Low         17   69.82 (9.86)      79     40

Two separate one-way ANOVAs, conducted once with the three HP groups and once with the three LP groups, confirmed that participants in each proficiency band performed similarly (HP: F(2, 47) = 0.400, p = .673, ηp2 = 0.36; LP: F(2, 46) = 0.841, p = .438, ηp2 = 0.58). From these results, it can be inferred that the distribution of participants (at least with respect to their EI test scores) was relatively even across the three experiment groups. In addition, a series of three independent-samples t tests revealed that the magnitude of the difference between the HP and LP participants within each experiment group was significant (Group 1: t = 9.261, p < .001; Group 2: t = 8.374, p < .001; Group 3: t = 7.114, p < .001).

2.4.1.2 Subjective ratings on spoken responses
To address research question 1, I hired three raters to score the speech samples according to the TOEFL iBT speaking rubric (see Appendix F). These individuals (two males and one female) were all native speakers of English who, at the time of scoring, were pursuing their master's degrees in Linguistics and TESOL at a large Mid-western university. While all three raters had varying levels of experience in teaching English, they were novices at rating speech samples produced in a language assessment context. As such, prior to moving on to the main phase of rating, they completed an intensive training session on rating with a rater-training expert affiliated with the same university. The rater-trainer had extensive experience in rating, as well as in training raters, for high-stakes language proficiency tests. Prior to the training session, the rater-trainer took 30 speech samples from the current dataset to derive benchmark speech samples (with respect to the TOEFL iBT Speaking rubric) and to categorize the characteristics of oral responses pertaining to specific score bands. The actual training session lasted about two hours, during which the raters familiarized themselves with the rubric as well as the benchmark responses. There was an additional intervention session during the individual rating period in which the rater-trainer and all three raters came together to discuss possible concerns and inquiries about rating. In terms of the rating design, I adopted an incomplete, connected block design of rating (Eckes, 2009; Fleiss, 1981). In this approach, raters do not score each and every speech response; rather, as pairs, they jointly rate a fixed number of responses (Fleiss, 1981). As opposed to fully crossed, complete block designs (i.e., every rater rates every test response), incomplete block designs have been commonly employed in research studies and large-scale rating projects due to their cost- and time-efficient nature (Myford & Wolfe, 2004). I employed the incomplete design precisely for this reason, while acknowledging that it yields a sparse data set with missing observations (Myford & Wolfe, 2004).
To minimize such a limitation of the rating design, researchers have suggested randomizing the assignment of participants (or observations) to the raters (Bechger, Maris, & Hsiao, 2010; Eckes, 2009). Following this suggestion, I devised 52 a partially randomized rating assignment as displayed in Table 9. More specifically, I divided the speech samples into three large sets (each consisted of 33 participants’ speech samples). I randomized the order of participants as well as the test-task orders (IP task, 2, and 3) in each data set, and took care to balance out the number of participants from each experiment group within each set. The order of raters (rater A, B, and C) were balanced across each set of speech samples. Furthermore, I ensured the rating design adopts connectedness amongst the individual raters as well as test takers. A connected design is critical in performing Rasch modeling (which will be dealt further in the next section) as it fulfills the unidimensionality of Rasch modeling (Linacre & Wright, 2002); in other words, it makes it possible to calibrate all measurement factors (e.g., planning conditions, test-task types) onto the same scale in terms of score variation (Rasch modeling will be dealt in the next section in more detail) (Eckes, 2009). As shown in Table 9, individual raters are linked to one another (e.g., Raters A and B) through common ratings of the same examinees (e.g., Participant D), while each examinee (e.g., Participants D and E) is linked to one another through common ratings by at least a pair of raters (e.g., Raters A and B). As stated above, the raters scored using the TOEFL iBT Speaking rubric (https://www.ets.org/s/toefl/pdf/toefl_speaking_rubrics.pdf). The rubric is based on a holistic scale, which specifically focused on three descriptors: delivery (i.e., fluidity of the response), language use (i.e., complexity of the language), and topic development (i.e., coherence and relevance of the ideas to the topic). The TOEFL iBT Speaking rubric has all responses rated on a 4-point scale (i.e., possible maximum score: 4). 53 Table 9 Partially-balanced, incomplete rating block design Set 1 Participant D Task IP task IT-RL task IT-L task Participant E IT-RL task IT-L task IP task Participant F IT-L task IP task IT-RL task … Task … Task … … Set 2 Participant G Participant H Participant I … Set 3 Participant J Participant K Participant L … Rater A Rater B Rater C X X X X X X Rater B X X X X X X X X … Rater C X X … X X X X X X Rater A X X Rater C Rater A Rater B X X X X X X 2.4.1.3 Subjective ratings on language use of spoken responses Given the differential task conditions among the three test tasks, I collected additional 54 information to enlighten a possible task effect and its interaction with the planning conditions under which a specific task was performed. More specifically, this analysis concerned how the raters perceived the extent to which participants benefitted from a particular task type or not. The same three raters were asked to rate on a 4-point scale on how much of the specific language features (e.g., phrasal verbs, collocations, and vocabulary) in the input sources (e.g., reading texts in IT-RL task, and listening dialogue in IT-L task) were directly utilized in each response. With regard to the 4-point scale, 1 indicated a complete originality of the response, while 4 represented that a good amount of the language was extracted (or repeated) from the sources. 
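To illustrate the logic of the connected, partially randomized rating assignment described in section 2.4.1.2, the sketch below generates one possible scheme in which rater pairs rotate across the three sets and task order rotates across examinees. It is a hypothetical illustration of the design principle, not the exact assignment shown in Table 9.

    import random

    tasks = ["IP", "IT-RL", "IT-L"]
    participants = [f"P{i:02d}" for i in range(1, 100)]      # hypothetical IDs

    # Rater pairs rotate across sets so that every rater shares commonly rated
    # examinees with every other rater, which keeps the design connected.
    rater_pairs = [("A", "B"), ("B", "C"), ("C", "A")]

    random.seed(1)
    random.shuffle(participants)                     # randomize examinee-to-set assignment
    sets = [participants[i::3] for i in range(3)]    # three sets of 33 examinees each

    assignment = []
    for set_index, (pair, members) in enumerate(zip(rater_pairs, sets), start=1):
        for k, person in enumerate(members):
            order = tasks[k % 3:] + tasks[:k % 3]    # rotate task order across examinees
            for task in order:
                assignment.append({"set": set_index, "participant": person,
                                   "task": task, "raters": pair})

    print(assignment[:3])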
2.4.1.4 Statistical analysis
I carried out Multi-Faceted Rasch Measurement (MFRM) (Linacre, 1989) to discern how the raters' subjective scorings differed according to the testing variables of interest. I used FACETS 3.80.3 (Linacre, 2017) for the MFRM analyses. While testing programs and researchers have commonly utilized this approach for assessing rater severity and establishing test quality control (Bachman, 2004; Eckes, 2009; Weigle, 1998), the emphasis in the present study was placed on observing the impact of diverse measurement factors (e.g., raters, examinees, test tasks), also known as facets, on quantitative test outcomes (Bonk & Ockey, 2003; Papageorgiou, Stevens, & Goodwin, 2012). With the MFRM model, the parameters of each facet can be estimated independently of the rest of the facets and are calibrated onto a single linear scale (i.e., the logit scale) (Myford & Wolfe, 2004). This makes it possible to carry out comparative interpretations among the specified facets and to gauge the additive contribution of each facet to score variation. Among the facets, I put primary emphasis on the planning conditions and test-task types employed in the present study. Thus, I utilized MFRM modeling in response to the question, "To what extent do the core facets of interest – the three planning conditions and the test-task types – modeled in the current testing condition contribute to score variance?" MFRM modeling using FACETS was also critical for the current analyses because of its robustness in modeling missing data (Bond & Fox, 2007; Eckes, 2009). FACETS accounts for missing observations through Joint Maximum Likelihood Estimation (JMLE) (Fisher, 1922), an iterative procedure used in FACETS to calculate the estimates for each facet and the elements defined in the analysis. JMLE operates under almost all data conditions and can work around an incomplete data set when making estimations. This is because estimates are based only on the active data, or the data that have been observed (Eckes, 2009; Linacre, 2017). Any randomness exerted on the data by missing observations is treated as "well-behaved" (Linacre, 2017, p. 15). In the end, as long as the data set sufficiently fits the specified Rasch model, there is no bias in the produced measure estimates and fit statistics caused by the presence of missing data. This feature of FACETS and MFRM analysis was significant for the current analyses: due to the incomplete block rating design incorporated in the present study, test takers did not receive ratings from all three raters (see Table 9). Therefore, in the data file prepared for the MFRM analyses, I coded the missing observations as "m" and identified the code in the FACETS model specification. Entering the code for missing data ensured that the corresponding data points were ignored and bypassed in the estimation. Following Bonk and Ockey (2003), I conducted three separate MFRM analyses in FACETS, one for each administered test set (Test Sets A, B, and C). For all three analyses, I used a Rating Scale Model (RSM) as opposed to a Partial Credit Model (PCM), owing to the fact that the same holistic rating scale (i.e., the TOEFL iBT Speaking rubric) was used across the board for rating each task (Winke, Gass, & Myford, 2012). In the present study, I specified six facets and the corresponding elements to define each facet (see Table 10).
Table 10
Six facets specified for the MFRM analyses

Facets                                  Elements                                       Description
Raters                                  3                                              Rater A, B, and C
Examinees                               98 for Set A, 99 for Set B, 97 for Set C       Varied by the number of participants with complete responses for all three test tasks in the test sets
Participant groups                      3                                              Group 1, 2, and 3
Participants' oral proficiency level    2                                              High and low oral proficiency as determined by the EI scores
Planning conditions                     3                                              Unguided (UG), Guided-Writing (GW), Guided-Thinking (GT)
Test-task types                         3                                              IP (Independent), IT-RL (Integrated-Reading & Listening), IT-L (Integrated-Listening)

In all three analyses, I specified the following MFRM model: ?, ?, ?, ?, ?, ?, R4, where each ? indicates one of the six facets of interest (raters, examinees, participant groups, participants' EI-based oral proficiency levels, planning conditions, and test-task types), and R4 indicates that the highest possible score awarded to a test taker is 4 on the current rating scale. Each ? controls the selection of elements in the corresponding facet (first, second, and so on). Specifying R4 controls for data-entry errors in that any data point greater than 4 would not be treated as valid data for estimation. Overall, the model "?, ?, ?, ?, ?, ?, R4" means: any element of facet 1 adds to any element of facet 2, which adds to any element of facet 3, which adds to any element of facet 4, which adds to any element of facet 5, which adds to any element of facet 6, producing an observation on a rating scale whose highest category is 4 or less. In the end, I made use of a number of outputs from FACETS in the current analyses. These included: (1) estimates from the fit statistics (e.g., infit and outfit mean square values) for assessing the overall fit of the data to the Rasch model; (2) the reliability of separation index for assessing the extent to which the elements specified in each facet are separated (e.g., how distant the test scores awarded under the UG condition are from those under the other planning conditions); and (3) the Wright map for visual inspection of the data. As a follow-up on the MFRM analyses, I conducted a series of descriptive as well as inferential statistics for the test scores across the groups. The primary dependent variable was each participant's total test score across the nine test tasks from the three test sets, which ranged from 0 to 33 (a maximum of 12 points was assigned to a test set, with a maximum of 4 points possible for each of the three test tasks). The independent variables of interest were (a) the three planning conditions, (b) participants' proficiency levels, and (c) task types (independent and integrated tasks). To illustrate the effect of the planning condition, I performed a repeated measures ANOVA (using IBM SPSS) with the total test score as the dependent variable and the task types and test sets as the within-subjects variables; additionally, proficiency levels and planning conditions were entered as the between-subjects variables. I investigated possible main and interaction effects of the type of planning condition and participants' proficiency levels (as well as the effect of the task types). As a final follow-up analysis, I conducted Generalized Estimating Equations (GEE) in SPSS to further explore the extent to which planning conditions and task types each contribute to better test performance. GEE extends regular regression modeling in that it accommodates repeated observations in the data, thereby reducing the overestimation of significance statistics (Type I error) (Ghisletta & Spini, 2004).
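The GEE analyses in this study were run in SPSS. Purely as an illustration of the model form, an analogous specification in Python with statsmodels might look as follows; the file name, column names, and reference categories shown here are hypothetical placeholders.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    # Hypothetical long-format data: one row per participant x task,
    # with a binary outcome (1 = high scorer) and categorical predictors.
    df = pd.read_csv("speaking_scores_long.csv")

    model = smf.gee(
        "high_scorer ~ C(planning, Treatment(reference='UG')) + C(task, Treatment(reference='IP'))",
        groups="participant",                      # repeated observations per examinee
        data=df,
        family=sm.families.Binomial(),             # logistic link for the binary outcome
        cov_struct=sm.cov_struct.Exchangeable(),   # working correlation for repeated measures
    )
    print(model.fit().summary())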
For the analysis, I created a binary dependent variable by recoding each participant as either "0" (low scorer) or "1" (high scorer) to be entered into the model. I summed each participant's average scores (i.e., the average of the two raters' scores) from all nine tasks to derive a single total score for each individual. Based on the median of this total score, I divided the participants into two sub-groups of low- (n = 49) and high-scorers (n = 50) on the speaking test. The independent variables were planning condition (with the Unguided planning condition being the reference, or "dummy," variable) and test-task type (with the IP task being the reference variable).

2.4.2 Research question 2: Does the type of planning (guided versus unguided) affect the discourse quality of test candidates? If yes, what are the influences of the different task types?
The second research question was raised to investigate whether the type of planning (guided versus unguided) had effects on the quality of the spoken responses. In the following section, I provide the procedures and statistical analyses taken to address this inquiry.
2.4.2.1 Transcription of spoken responses
Prior to conducting the analyses, I transcribed all verbal responses verbatim. This yielded 889 transcribed texts from the 99 participants (each participant responded to nine test tasks, with two responses missing). The transcriptions only encompassed responses produced within the given responding time; that is, I did not transcribe language produced beyond the given responding time (see Weigle, 2004, for a discussion of scoring incomplete responses in different scoring contexts). As it was beyond the scope of the current study, I did not adopt a more rigorous transcription convention (cf. Lazaraton, 2002, for the application of Conversation Analysis conventions to spoken data from a language assessment perspective). However, during the course of transcribing, I developed three specific codes to mark certain utterance phenomena in order to support further qualitative analysis. As shown in Table 11, these codes helped in (a) identifying the occurrences of filled pauses in the data; (b) indicating the existence of utterance phenomena pertaining to specific fluency measures in the transcriptions; and (c) specifying unintelligible language; it should be noted that 9 of the 889 transcriptions (approximately 1% of the whole dataset) contained such unintelligible language. Upon completion of the transcribing, I had two native speakers of English (both undergraduate students pursuing bachelor's degrees in Linguistics at a large Mid-western university) cross-examine the accuracy of the transcriptions. I assigned approximately 50% of the data to each of these individuals. The students read through the transcriptions while listening to the assigned audio files. In so doing, they each noted transcription errors, which primarily pertained to minor mechanical errors (e.g., spelling, word-level, and punctuation errors). I then went through the transcribed texts again and revised the identified transcription errors accordingly.

Table 11
Transcription codes used in the present study

Code/Mark: uh, um
Description: Filled pauses
Example from the data: …I will uh, have my own stuff to do and um, therefore, I can build up my responsibility.

Code/Mark: Dash (-)
Description: A cut-off, usually a glottal stop. This is used to indicate (1) breakdown fluency phenomena such as repetitions, replacements, reformulations, hesitations, and false starts, and (2) an incomplete response (end of responding time).
Examples from the data: …the woman is kin- uh, sort of worried about (indication of a replacement); …um, well was the, was about the um, saving money – (end of responding time)

Code/Mark: Square brackets 'xxx'
Description: Transcription doubt or uncertainty; words within square brackets are uncertain or unintelligible.
Example from the data: …the decision of university is ridiculous and she is um, [xxx].

2.4.2.2 Coding spoken responses according to CAF measures
I hired two external coders to code and rate the verbal responses in terms of the CAF measures for the main analysis of spoken quality. Both coders were graduate students at the same Mid-western university and had extensive experience teaching English both in the U.S. and in foreign countries. Coder A was a native speaker of English pursuing his master's degree in TESOL. Coder B was a non-native speaker of English and a doctoral student in Applied Linguistics, with extensive experience interacting with L2 learners and assessing their productive language in classroom contexts. The coding and rating processes were rigorous and detail-oriented and lasted about four months until completion. The coding procedures can be broadly summarized as follows: (a) a training phase on coding/rating using the devised coding scheme and Elder and Iwashita's (2005) CAF rubric (described in detail in the next section); (b) a "playing around" phase, with the two coders freely spending a set amount of time familiarizing themselves with the dataset and coding scheme; (c) a number of interim meetings (between the coders, and among the coders and me) to resolve concerns and questions prior to the main coding/rating phase; (d) a joint coding/rating phase on a subset of the data; (e) interim sessions (again, between the coders, and among the coders and me) to revisit the coding scheme and discuss emerging discrepancies in the dataset; (f) recoding of specific parts of the dataset to meet a certain level of agreement; and (g) independent coding/rating of the rest of the dataset. The purpose of the joint coding/rating was to establish reliable coding/rating; that is, to obtain inter-coder reliability, which in turn can ensure the reliability of the qualitative results. For this, I assigned approximately 30% of the whole dataset to both coders in common, which consisted of 30 participants' spoken responses. More specifically, I extracted 10 participants from each of the three experiment groups. In the end, the coders analyzed the same set of 270 speech samples (30 participants each responding to 9 test tasks) in this joint coding phase. Another project assistant (an undergraduate student at the same university) helped with various phases of coding, particularly with the mechanical aspects (e.g., deriving the total number of words and syllables in each speech sample).
2.4.2.3 Subjective ratings on Elder and Iwashita's (2005) CAF rubric
The same two coders holistically rated the participants' spoken performance using Elder and Iwashita's (2005) CAF rubric (see Appendix G). Such holistic ratings were essential for complementing the quantified CAF measures from the manual coding phase (Elder & Iwashita, 2005). Each of the three CAF constructs was examined on a 5-point scale. The descriptors for each construct represented the sub-measures employed in the present study.
The complexity descriptors concerned the trade-off between complexity and accuracy; higher ratings were given to responses in which complex meanings (expressed through the use of a variety of verb forms and syntactic units) were conveyed, even at the expense of grammaticality. The accuracy descriptors referred to participants' linguistic control over correct language forms. For fluency, speech rate as well as measures representing breakdown fluency were considered.
2.4.2.4 Statistical analysis
Prior to conducting the main statistical analysis, I went through the coding and rating results and obtained inter-coder reliability. I computed intra-class correlation coefficients (ICCs) using SPSS. The ICC is an extension of the conventional Pearson correlation coefficient in that it reflects both the degree of correlation and the agreement between assessors (the Pearson correlation coefficient is only a measure of correlation; McGraw & Wong, 1996). In addition, following Tavakoli and Skehan (2005), I conducted a series of exploratory factor analyses to inspect whether the qualitative measures each represented the larger CAF dimension to which they were assigned. For the main analysis, I treated each measure from the three discourse-quality dimensions as a dependent variable, with planning condition as the independent variable. I performed seven repeated measures MANOVAs (RM MANOVAs), one for each of the seven sub-categories of the CAF dimensions. Subsequently, I carried out a series of post-hoc pairwise comparisons and one-way ANOVAs to address the interaction effects found in the RM MANOVA analyses. Finally, I conducted a GEE analysis to explore how the speech quality measured under each planning condition contributed to overall test performance. I collapsed the speech-quality data pertaining to each planning condition and treated them as independent variables. I entered the same binary dependent variable of test performance (low- and high-scorers) into the model.
2.4.3 Research question 3: How do test candidates use their planning times?
To address research question 3, I downloaded the post-survey responses from Qualtrics and formatted them into spreadsheets in Microsoft Excel. In Excel, I re-coded the responses into numerical values (e.g., responses to Likert-scale items) to further derive raw frequencies for each category of survey questions. Following previous researchers (Elder & Iwashita, 2005; Tavakoli & Skehan, 2005), I used SPSS to perform RM MANOVAs comparing the survey responses by planning condition and test-task type. General Linear Models (such as RM MANOVA) are fairly robust for ordinal data with small sample sizes (e.g., 40-60) (see Stiger, Kosinski, Barnhart, & Kleinbaum, 1998, for a detailed discussion).
2.4.4 Research question 4: How do test candidates perceive the given planning times?
To answer research question 4, I had the project assistant extract the precise segments of the interview responses from the full audio file produced for each participant (the entire testing session was audio-recorded for every participant). With the exclusion of 9 missing interview segments from 9 participants (due to poor audio-recording quality and the practicalities of the data collection procedures), I imported a total of 90 audio segments into NVivo (version 10). In NVivo, I created separate individualized nodes (i.e., coding categories) for each participant and designated attribute codes (e.g., background variables) that represented each participant's assigned group and proficiency level.
I then took an emergent and grounded-theory methods (Strauss & Corbin, 1998) while adopting the following coding procedures: (a) initial, open coding phase (Friedman, 2012) for developing a preliminary set of coding schemes and (b) axial and selective phases (Strauss & Corbin, 1998) for refining the coding schemes and gauging specific patterns in responses. 65 CHAPTER 3: RESULTS In this chapter, I present results in the order of the four research questions that guided the current study. Therefore, four subsections largely constitute this chapter. In the first section, I report on participants’ speaking test performance through the ratings given by the three raters. With the rating data, I first illuminate whether participants’ test scores vary across testing conditions, which differed in terms of planning as well as task types. Next, I report on findings pertaining to the predictive relationship that planning conditions and task types each has with how participants performed on the speaking test. In the second section, I report on the extent to which participants’ speech quality differs across testing conditions as indexed through CAF dimensions. In the third section, I turn to participants’ survey responses and illustrate how the participants used and perceived the given planning times and conditions. In the final section, I provide more detailed responses from participants through their retrospective data. 3.1 Research question 1: Speaking test scores 3.1.1 Descriptive statistics for the speaking test scores To answer the first research question, I first obtained descriptive statistics for participants’ test performance across the three test sets. I broke down the test scores for each test set, and further by the three planning conditions as well as the three test tasks. Table 12 summarizes the mean test scores and the standard deviations (SD) pertaining to each testing condition. It should be noted that the mean test scores for each testing condition showcased in Table 12 are essentially the means of the averaged scores each individual received from the two raters. 66 As described in Table 12, it was apparent that participants’ mean test scores did not differ to a substantial extent across the three test sets. While the mean scores did seem to slightly increase in Test Set C, the score ranges from Test Set A to Test Set C were not heavily dispersed. In particular, the mean test scores especially had marginal differences amongst the three planning conditions within and across the three test sets. For instance, on average, participants maintained in the 2.50 to 2.56 score range for the IP task regardless of the differing planning conditions within and across test sets. For the IT-RL and IT-L tasks, the mean test scores had minimal differences across the three planning conditions in every test sets as well (while as noted above, there were increases in the mean scores for these two tasks in Test Set C). This seem to suggest that participants performed similarly regardless of the differing test sets and the types of planning activity they engaged in. Interestingly, a subtle yet noticeable pattern could be seen in the mean test scores for the three test tasks. Within and across test sets, and particularly regardless of the planning conditions, participants generally scored the least for the IP task, while scoring higher for the two integrated tasks. 
Between the IT-RL and IT-L tasks, participants generally scored higher on the former; however, the score difference between the two integrated tasks was less pronounced. In all test sets, participants' mean scores for the IT-RL task stayed within the 2.80 to 2.90 range, while for the IP task the range was 2.50 to 2.56. The mean test scores for the IT-L task, on the other hand, fell in between those for the IP and IT-RL tasks.

Table 12
Descriptive statistics for participants' speaking test scores

          Test Set A                                  Test Set B                                  Test Set C
          UG           GW           GT               UG           GW           GT               UG           GW           GT
          (N = 32)     (N = 33)     (N = 33)         (N = 33)     (N = 33)     (N = 33)         (N = 32)     (N = 33)     (N = 32)
IP        2.45 (0.62)  2.50 (0.67)  2.56 (0.61)      2.50 (0.66)  2.49 (0.58)  2.52 (0.63)      2.53 (0.69)  2.56 (0.74)  2.54 (0.51)
IT-RL     2.80 (0.66)  2.82 (0.66)  2.79 (0.69)      2.78 (0.42)  2.80 (0.49)  2.83 (0.67)      2.88 (0.57)  2.89 (0.66)  2.90 (0.49)
IT-L      2.71 (0.60)  2.76 (0.72)  2.76 (0.60)      2.71 (0.45)  2.73 (0.52)  2.72 (0.65)      2.82 (0.66)  2.80 (0.61)  2.77 (0.41)
Note. UG, GW, and GT indicate Unguided Planning, Guided-Writing Planning, and Guided-Thinking (Silently) Planning, respectively. IP, IT-RL, and IT-L indicate the Independent task, the Integrated-Reading and Listening task, and the Integrated-Listening only task, respectively. Standard deviations are in parentheses.
Note. In Test Set A, one participant's responses for all test tasks were missing. In Test Set C, two participants' responses for all test tasks were missing.

Figure 2 Boxplots representing the distribution of the test scores from Test Set A by test tasks
Figure 3 Boxplots representing the distribution of the test scores from Test Set B by test tasks
Figure 4 Boxplots representing the distribution of the test scores from Test Set C by test tasks

Taken altogether, the trend emerging in Table 12 was that participants' test scores did not vary to a noticeable extent according to the type of planning. The differences in the mean scores lay instead among the test-task types; that is, the scores were generally higher for the two integrated tasks and lower for the IP task. This trend is additionally illustrated in Figures 2, 3, and 4 through the boxplots generated for the overall mean scores by test task within each test set for all participants (i.e., the boxplots are not broken down by planning condition). For the IP task in all three test sets, participants consistently had lower medians (which were generally close to the mean scores) than for the two integrated tasks. In addition, in all three test sets, participants' scores were relatively more spread out for the IP task, whereas for the two integrated tasks, participants' scores varied from one another to a lesser extent.

3.1.2 FACETS analysis
In this section, I report on how the raw test-score data reported in the previous section are interpreted through Rasch modeling. I specifically focus on comparatively reporting how each factor, or facet, specified in the analyses explains the variance in the test scores. Because I analyzed each test administration (i.e., test set) independently of the others, I present three MFRM analyses corresponding to the three test sets administered. Prior to exploring the outputs for the individual facets in detail, I first inspected the fit statistics generated by FACETS to confirm whether the data fit the specified Rasch models.
When the data set shows sufficient fit to a particular Rasch model, invariance among specified measurement factors are verified (Eckes, 2009). Measurement invariance (Bond & Fox, 2007) in Rasch modeling is particularly important as it entails (a) test scores (i.e., observation points entered in the model) contain sufficient information required for estimation; and (b) the 72 unidimensionality of the test (i.e., the test items are measuring the same latent construct; in the current analyses, the latent construct is the speaking test performance) (Eckes, 2009). Table 13 summarizes the fit indices generated for the three Rasch models. It should be noted that the reported values are composite values from averaging the fit indices generated for the individual elements within each facet; for instance, the fit values for raters are the means of the corresponding fit values from the three individual raters. Model Standard Error of Measurement (henceforth Model S.E.) and the Infit and the Outfit Mean Square statistics each represent measurement precision (i.e., consistency of estimation) and measurement accuracy (i.e., correctness of estimation) in the models (Linacre, 2017). As in Table 13 (see columns 2, 5, and 8) in all three test sets, the Model S.E.s for the six facets were all small, mostly clustering around 0; this indicates that the corresponding facet in the models were measured with relatively high precision (Harvill, 1991). In terms of the Infit and Outfit statistics (see columns 3, 4, 6, 7, 9, and 10 in Table 13), the indices all fell within the range of the lowest of 0.89 to the highest of 1.05, which correspond to the conventional range of acceptable fit (i.e., values located between the range of 0.7 or 0.8 to 1.2 or 1.3, Linacre, 1999) (See Linacre, 2000 for the discussion of a broader range of model fit and how different ranges apply differently depending on the assessment purpose and data size). Standardized Z (Zstd) values provide additional information on good model-fit in the data. Values closer to 0 demonstrate that the data did fit the model sufficiently. Table 13 suggests that all Zstd values were close to 0. Taken altogether, the fit values reported in Table 13 did not flag extreme tendencies of either misfit or overfit of the data to the models. 73 Table 13 Model-fit statistics Summary for Test Sets A, B, and C Test Set A Test Set B Test Set C Facets Model SE Infit mean square (Zstd) Outfit mean square (Zstd) Model SE Infit mean square (Zstd) Outfit mean square (Zstd) Model SE Infit mean square (Zstd) Outfit mean square (Zstd) Raters Examinee Group Proficiency level Planning condition Task Types 0.16 0.95 0.13 0.14 0.13 0.16 0.97 (-0.30) 1.04 (0.30) 0.99 (0.00) 1.05 (0.00) 0.98 (-0.20) 1.04 (0.30) 1.00 (-0.30) 1.05 (-0.10) 0.98 (-0.20) 1.04 (0.30) 0.97 (-0.20) 1.04 (0.40) 0.17 0.98 0.17 0.14 0.17 0.17 1.00 (0.00) 0.94 (-0.40) 0.96 (-0.10) 0.94 (-0.10) 1.00 (0.00) 0.94 (-0.40) 0.99 (-0.10) 0.94 (-0.50) 1.00 (0.00) 0.94 (-0.40) 0.99 (-0.10) 0.94 (-0.40) 0.18 0.97 0.18 0.15 0.18 0.18 0.98 (-0.30) 0.89 (-0.60) 0.89 (0.00) 0.89 (0.00) 0.98 (-0.20) 0.89 (-0.50) 0.98 (-0.20) 0.89 (-0.70) 0.98 (-0.20) 0.89 (-0.50) 0.98 (-0.20) 0.89 (-0.50) Note. Zstd indicates Standardized fit statistics. 74 3.1.2.1 Test Set A In this section, I present the variable map and the corresponding outputs from FACETS that further explain the visual information. 
As can be seen in Figure 5, the variable map (also known as the “Wright Map”) displays comprehensive information on how all the facets entered in the model are represented in a single frame of reference; for MRFM modeling, this is the logit scale. The “+” and “-” in front of the facet headings indicate whether the corresponding facet measures were positively or negatively oriented. Positively oriented facets in the ruler mean that the data points in the higher positions have higher measures. In reverse, negatively oriented facets represent that highly positioned data points in the column have lower measures. In the first column in the map, measr (measure) depicts the logit scale that positions all measures of facets; the scale range for Test Set A spanned from 6 logits to -6 logits. In the second column, judge provides information on the level of severity or leniency of the three raters (A, B, and C) when they assessed participants’ speaking test scores for Test Set A. As the facet is positively oriented, more sever raters are to appear lower in the column, while more lenient raters are to be positioned higher. With the relatively tight clustering of rater A, B, and C in this column, it seems that the variability across raters in terms of their level of severity was not substantial. Rater A (appearing highest in the column) had a severity measure of 0.90 logits while rater B and C each had severity measures of 0.63 and 0.55 logits, respectively. This essentially corresponded to less than 1-logit spread amongst the raters. The raters, even with a relatively short period of time to be accustomed to the rubric, did not have extreme differences in rating behavior. The Model S.E.s and the Infit statistics for the three raters were all within the acceptable range (Model S.E.: 0.16 to 0.17; Infit: 0.84 to 0.94), demonstrating that raters were relatively consistent in giving scores. 75 Figure 5 Variable map for Test Set A The third column, examinees, depicts the estimates of participants’ proficiency on the speaking test (and particularly on Test Set A). Here, each star refers to one examinee. As the facet is positively oriented, participants who scored higher appear in the higher end of the column. Estimates for participants ranged from -6.61 logits to 5.63 logits. While the spread seems to be relatively wide, the examinee separation value (i.e., the spread of test taker proficiency that displays the separation among test takers as defined by their performance; Stone & Wright, 1988) of 2.24 denoted that participants were measurably separable into (at least) 2 76 levels of ability. The reliability of the separation index of .83 suggested that the separation levels were fairly acceptable. This may suggest that participants were likely to cluster around certain rating categories, with few participants scoring in the extreme ends of the scoring scale (i.e., 0 or 4). The fourth column, group, compared the three experimental groups in terms of their overall performance on the speaking test. Estimates for groups spanned from -0.13 logits to 0.07 logit; the spread was narrow. Indeed, the separation value was 0 with a reliability of .90, indicating that the groups performed similarly. The fixed chi-square value provides additional information on whether the identified subgroups within a facet differ from one another to a statistically significant extent. There was a statistically non-significant chi-square value of 0.9 (df = 2, p = .62), which suggested that the three subgroups performed similarly. 
The fifth column, proficiency, compared the two oral proficiency levels (high and low) previously assigned based on the EI scores. With the facet being positively oriented, participants with a high oral proficiency level appeared in the higher part of the column, which was indicative of their higher test performance. Participants designated as having a low oral proficiency level appeared in the lower part of the column, which represented their lower test performance relative to the high-level participants. Visually, the logit difference appears substantial; the range was -1.01 logits (high level) to 1.01 logits (low level). The separation value between these two subgroups was 2.42 with a reliability of .98, which corroborated the large spread of logits. The fixed chi-square value of 112.1 (df = 1) was statistically significant (p = .000). Overall, it seems that the raters rated participants in line with their oral proficiency level as evidenced by the EI test. The two pre-determined levels appeared to align with how these participants performed on the actual speaking test.
The sixth column, planning condition, displayed the comparison of test performance across the three planning conditions. In alignment with the descriptive results in Table 12, the ratings did not differ to a significant extent across the planning conditions. Table 14 further reports the summary statistics related to planning condition. The observed averages for the three conditions, which represent the raw observed scores, showed only marginal differences. On the logit scale (and with the negative orientation of the column), test performance under GT was the highest at -0.13 logits (SE = .16), followed by UG (SE = .17) and GW (SE = .17). Indeed, the separation index among the three conditions was 0 with a reliability of .90. The fixed chi-square value for planning condition was 0.6 (df = 2), which was not statistically significant (p = .62); this indicated that the sub-conditions did not significantly differ from one another.

Table 14
Summary statistics of planning condition for Test Set A

Planning condition   Observed average   Fair (M) average   Measure   Model SE   Infit mean square   Outfit mean square
UG                   2.65               2.73               0.07      0.16       0.95                0.91
GW                   2.69               2.77               0.05      0.17       1.01                1.05
GT                   2.70               2.78               -0.13     0.16       0.96                0.98

The seventh column, tasks, demonstrated test performance relative to the three test-task types. With the facet being positively oriented, tasks that received higher performance appeared in the higher part of the column. In alignment with Table 12, Table 15 shows that the two integrated tasks received higher ratings than the IP task. The IT-RL and IT-L tasks had values of -0.10 logits (SE = .16) and 0.03 logits (SE = .16), respectively. The IP task had a value of 0.59 logits (SE = .16). While the differences in logits do not seem large, the separation index for the sub-categories was 2.33 with a reliability value of .84. This suggested that there were at least two distinct levels of categories that measurably separated participants in relation to their performance on the three test tasks. There was a statistically significant fixed chi-square value of 19.3 (df = 2, p = .00), which was also indicative of the possibility that the raters gave distinguishable ratings across tasks.
Table 15
Summary statistics of test-task types for Test Set A

Test-task type   Observed average   Fair (M) average   Measure   Model SE   Infit mean square   Outfit mean square
IP               2.50               2.60               0.59      0.16       0.94                1.04
IT-RL            2.80               2.90               -0.10     0.16       1.01                1.03
IT-L             2.74               2.84               0.03      0.16       0.98                1.05

Overall, the findings for Test Set A supported the descriptive statistics reported in Table 12: test performance on Test Set A varied only marginally by planning condition while demonstrating evidence of an influence of test-task type. I report the follow-up analyses on pairwise comparisons among planning conditions and test-task types in section 3.1.3.

3.1.2.2 Test Set B
As can be seen in Figure 6, the variable map for Test Set B was quite similar to that of Test Set A (see Figure 5). Again, all six facets were positively oriented; the subcategory with higher ratings was positioned in the higher part of each corresponding column. Because of this similarity, I focus on reporting the results for the two facets of interest, planning condition and test-task type. From the second column, judge, it is noticeable that the three raters did not show deviating rating patterns when they scored the speaking test performances. The differences in their levels of severity were minimal, with a marginal spread in logits among the raters (rater A = 0.75 logits, rater B = 0.58 logits, rater C = 0.98 logits). The raters also seemed to conform consistently to the rating scale, as evidenced by their Model S.E.s and Infit statistics, which fell within an acceptable range (Model S.E.: 0.16; Infit: 0.81 to 1.02). The estimates for examinees in the third column spanned from -5.78 logits to 4.30 logits. As in Test Set A, the separation value of 2.37 with a reliability of .88 showed that there were at least two levels of categories distinguishing participants in the dataset. This was supported by the statistically significant fixed chi-square value of 475 (df = 98, p = .00).

Figure 6 Wright map for Test Set B

As shown in the fourth column, group, there were no significant differences in performance among the three experimental groups, with the highest and the lowest performing groups being 0.20 logits apart. As in Test Set A, the separation index for group was 0 with a reliability of .90, confirming that there were no distinct levels that distinguished the three experimental groups in terms of test performance. From the fifth column, proficiency, the differences between participants with high and low oral proficiency were visually clear. The two sub-levels of oral proficiency were 1.48 logits apart from one another, with participants of high oral proficiency being scored higher than participants with lower oral proficiency. These results were further supported by a separation index of 2.00 with a reliability of .96. From the sixth column, planning condition, it was apparent that test performance on Test Set B did not differ across the planning conditions. As shown in Table 16, the observed values had no significant differences relative to the planning conditions. The differences were also marginal on the logit scale; the highest was GT at 0.08 logits (SE = 0.17), followed by GW at 0.05 logits (SE = 0.16) and UG at -0.12 logits (SE = 0.16). In addition, there was a separation value of 0 with a reliability of .90, which confirmed that test performance in relation to the differing planning conditions could not be separated into distinct levels.
That is, the raters did not exercise distinguishable patterns of ratings relative to the three planning conditions. This was further corroborated by the non-significant fixed chi-square value of 0.9 (df = 2, p = .65).

Table 16
Summary statistics of planning condition for Test Set B

Planning condition   Observed average   Fair (M) average   Measure   Model SE   Infit mean square   Outfit mean square
UG                   2.66               2.72               -0.12     0.16       0.91                0.88
GW                   2.67               2.77               0.05      0.16       1.01                1.06
GT                   2.69               2.77               0.08      0.17       0.98                0.88

Finally, the seventh column in the variable map provides similar information as in Test Set A. As further depicted in Table 17, test performance generally differed across the independent and integrated tasks. Test performance was the highest for the IT-RL task, at -0.15 logits (SE = 0.16), and the lowest for the IP task, at 0.67 logits (SE = 0.17). The IT-L task fell in between these two tasks, at 0.02 logits (SE = 0.17). While the differences in logits between the IP and the integrated tasks do not seem large, there was a noticeable separation index of 2.59 with a reliability of .90. This suggested that there were at least two discernable levels of test performance in relation to test-task type in the dataset. This result was further confirmed by a statistically significant fixed chi-square value of 25.3 (df = 2, p = .00).

Table 17
Summary statistics of test-task type for Test Set B

Test-task type   Observed average   Fair (M) average   Measure   Model SE   Infit mean square   Outfit mean square
IP               2.50               2.60               0.67      0.17       1.01                1.02
IT-RL            2.80               2.89               -0.15     0.16       0.90                0.88
IT-L             2.72               2.82               0.02      0.16       0.86                0.76

As noted above, findings similar to those for Test Set A could be drawn for Test Set B. Score variation was minimal with regard to which planning activities participants utilized before responding. Test-task types brought differences in test performance; participants were rated higher on the integrated tasks than on the IP task.

3.1.2.3 Test Set C
For Test Set C, the variable map displayed in Figure 7 presents patterns of results similar to those of the previous test sets. Thus, I briefly touch on the results pertaining to the facets judge, examinee, group, and proficiency, and elaborate more on planning condition and tasks.

Figure 7 Wright map for Test Set C

The three raters, again, demonstrated fairly consistent ratings relative to one another. Their severity measures spanned from a low of 0.48 (Rater A) to a high of 0.78 (Rater C). Consistency in their levels of severity was suggested by their Model S.E.s and Infit indices; all of the values remained within the acceptable range of the scale. For both examinees and proficiency, the separation indices similarly suggested two levels of test performance for each facet (examinees: separation index = 2.51, reliability = .84; proficiency: separation index = 2.82, reliability = .96). On the other hand, FACETS indicated comparability across the three sub-groups (separation index = 0, reliability = 0.94). In terms of planning condition, no significant differences were noticeable in the variable map. The summary statistics for planning condition reported in Table 18 verified this comparability across the planning activities. The observed average values as well as the logit measures maintained a narrow range regardless of the planning condition.
The separation index of 0 with a reliability of .90 further supported the finding that test performance did not differ to a great extent in terms of the type of planning participants utilized. In addition, the fixed chi-square value of 0.1 was not statistically significant (df = 2, p = .96), which confirmed that variation in participants' test performance was kept to a minimum.

Table 18
Summary statistics of planning condition for Test Set C

Planning condition   Observed average   Fair (M) average   Measure   Model SE   Infit mean square   Outfit mean square
UG                   2.74               2.77               0.01      0.19       0.96                0.95
GW                   2.75               2.83               0.03      0.19       1.01                0.81
GT                   2.74               2.81               0.01      0.17       0.97                0.91

The variable map as well as the summary statistics in Table 19 also showed a similar pattern of differences in test performance relative to the test tasks. The observed averages for the three tasks demonstrated a slight increase in ratings for Test Set C compared with the previous two test sets. In terms of logits, the IP task had the lowest value, 0.43 (SE = 0.18), while the IT-RL task had the highest value, -0.45 (SE = 0.18). The difference in logits was 0.85, the largest value among the three test sets. The separation index was 2.79 with a reliability of .84, implying that participants' test performance in terms of test-task type could be divided measurably into at least two distinct levels. This separation of test performance was supported by a significant fixed chi-square value of 19.3 (df = 2, p = .00).

Table 19
Summary statistics of test-task type for Test Set C

Test-task type   Observed average   Fair (M) average   Measure   Model SE   Infit mean square   Outfit mean square
IP               2.54               2.63               0.43      0.18       0.76                0.67
IT-RL            2.90               2.94               -0.45     0.18       1.08                0.90
IT-L             2.80               2.83               -0.32     0.18       1.10                1.11

Taken altogether, the three MFRM models applied to the three datasets pointed to the same finding: participants' speaking test scores varied according to the type of test task, while the type of planning they had employed had minimal impact. There were no significant differences in test performance depending on whether participants performed under unguided or guided planning conditions. Among the three test-task types, participants were likely to be rated lower on the IP task relative to the two integrated tasks.

3.1.3 Follow-up repeated measures ANOVA
To further elaborate on the MFRM analyses, I conducted a two-way repeated measures ANOVA (henceforth RM ANOVA). In this model, I entered Test Set (A, B, and C) and Test-task type (IP, IT-RL, and IT-L tasks) as the within-subjects variables, and Planning Condition (UG, GW, and GT) and Proficiency Level (oral proficiency levels designated from the EI test results) as the between-subjects variables. I specified the dependent variables as the speaking test scores from each test set and task type (e.g., Test Set A, IP task score). The results from the RM ANOVA indicated that there were no statistically significant higher-level interactions among the variables (e.g., two-, three-, or four-way interactions between Test Set, Test-task type, Planning Condition, and Proficiency Level).
However, the Greenhouse-Geisser statistics (used to account for the violated sphericity assumption in the data set; Field, 2009) indicated that Test-task type (F(2, 182) = 13.116, p = .000, ηp2 = .126) and Proficiency Level (F(1, 91) = 79.231, p = .000, ηp2 = .365) were statistically significant factors impacting how participants performed on the speaking tests, with moderate to high effect sizes. Planning Condition, on the other hand, appeared as a statistically non-significant factor in overall test performance (F(2, 91) = 0.009, p = .991, ηp2 = .000). In addition, it should also be noted that Test Set did not impact how participants performed to a statistically significant extent (F(2, 182) = 1.116, p = .351, ηp2 = .024). These findings further support the MFRM models and outputs reported in the previous section. The findings from the RM ANOVA are additionally captured in the line graphs in Figure 8. Here, participants' test performance followed a similar pattern regardless of the three planning conditions. That is, participants' mean speaking scores were the lowest for the IP task and the highest for the IT-RL task under all three planning conditions.

Figure 8 Mean speaking test scores by test-task type and planning condition

To further illuminate these results, I conducted a series of post-hoc pairwise comparisons of test performance across the three test tasks within each test administration (e.g., IP tasks versus IT-RL tasks in Test Set A). The primary comparison was between the IP task and the two integrated tasks. Table 20 reports the full summary statistics generated from the analysis. It can be seen that in all three test sets, performance on the IP task was statistically different from performance on both the IT-RL and IT-L tasks. In Test Set A, the IP task was scored lower than both the IT-RL task (mean difference = -0.29, 95% CI [-0.41, -0.19], p = .000, d = 0.50) and the IT-L task (mean difference = -0.26, 95% CI [-0.37, -0.15], p = .000, d = 0.42). In Test Set B, the IP task was scored the lowest compared to the IT-RL task (mean difference = -0.34, 95% CI [-0.45, -0.19], p = .000, d = 0.53) and the IT-L task (mean difference = -0.24, 95% CI [-0.35, -0.28], p = .000, d = 0.43). Likewise, in Test Set C, the IP task score differed significantly from the IT-RL task (mean difference = -0.38, 95% CI [-0.41, -0.17], p = .000, d = 0.75) and the IT-L task (mean difference = -0.30, 95% CI [-0.39, -0.19], p = .000, d = 0.50). Effect sizes for all pairwise comparisons were in the moderate range. On the other hand, score differences between the two integrated tasks were statistically non-significant across the three test sets. Overall, the findings from the RM ANOVA and the post-hoc comparisons suggest that participants' performance varied in terms of the type of test task, and not the type of planning activity. Among the test tasks, statistically significant score differences existed when comparing the IP task to the two integrated tasks. Even with small raw score differences, the inferential statistics and their effect sizes confirmed that participants scored higher on the two integrated tasks than on the IP task. As a final follow-up on the overall results, I conducted a generalized estimating equation (GEE) analysis to verify the predictive power that planning condition and test-task type each has on the level of test performance.
I specified the following as the predictor factors: (1) planning condition as a categorical variable (with UG as the reference category) and (2) test-task type as a categorical variable (with the IP task as the reference category). The binary dependent variable was participants' level of test performance (low- versus high-scorers). The GEE results confirmed that planning condition was not a significant predictor of higher test scores (Wald χ2 (2) = 0.49, p = .824). More specifically, neither the GW (b = 0.27, p = .524, 95% CI = [0.120, 14.32]) nor the GT planning condition (b = 0.31, p = .452, 95% CI = [0.212, 19.215]) was associated with higher performance relative to the UG condition. On the other hand, there was a statistically significant effect of test-task type on predicting higher test performance (Wald χ2 (2) = 4.958, p = .003). In particular, participants' higher ratings were likely to occur on the IT-RL task (b = 4.958, p = .000, 95% CI = [11.26, 20.71]) and the IT-L task (b = 4.994, p = .000, 95% CI = [5.32, 19.41]) relative to the baseline IP task. This indicates that the two integrated tasks may have contributed to better test performance relative to the IP task condition.

Table 20
Summary statistics of pairwise comparisons for Test Sets A, B, and C

                 Test Set A                         Test Set B                         Test Set C
Pair             Mean diff.  SD    t       p       Mean diff.  SD    t       p       Mean diff.  SD    t       p
IP vs. IT-RL     -0.30       0.55  -5.333  .000    -0.34       0.55  -6.155  .000    -0.38       0.52  -7.385  .000
IP vs. IT-L      -0.26       0.55  -4.755  .000    -0.24       0.52  -4.300  .000    -0.30       0.54  -5.174  .000
IT-RL vs. IT-L   0.03        0.51  0.702   .484    0.04        0.43  0.820   .355    0.06        0.43  0.865   .342
Note. IP, IT-RL, and IT-L refer to the Independent, Integrated-Reading and Listening, and Integrated-Listening tasks, respectively. Mean diff. refers to the difference between the mean scores of the two test tasks.

3.2 Research question 2: Speech quality
3.2.1 Inter-coder reliability
Prior to examining the descriptive and inferential statistics on speech quality, I inspected the coders' inter-coder reliability by calculating intra-class correlation coefficients (ICCs). It should be noted that the variables included in the ICC analysis were the raw codings assigned by each coder (as opposed to the quantified values of the speech-quality measures that I subsequently generated from the raw coding, such as speech rate or lexical errors per 100 words). I applied the ICCs to the portion of the dataset that the two coders were assigned to analyze in common. This corresponded to a subset of speech samples comprising 30% of the entire dataset. Because there were two specific coders to be assessed, I selected a two-way mixed-effects model with an absolute agreement definition (Shrout & Fleiss, 1979). Absolute agreement in the current analysis refers to the extent to which the two coders overlap in evaluating a particular quality dimension for an individual participant. In addition, an absolute agreement definition for the ICC model is generally used with repeated observations in the dataset, as in the current study (Koo & Li, 2016). Tables 21, 22, and 23 summarize the ICCs (and their 95% confidence intervals) for the fluency, accuracy, and complexity dimensions generated by the two coders within each test set. Table 24 additionally reports the ICCs between the two coders for their assessments based on the 4-point scale of the CAF rubric. For the basic descriptive statistics for all raw coding data, see Appendix H.
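For readers who wish to reproduce this kind of index outside SPSS, a two-way, absolute-agreement ICC can be obtained, for example, with the pingouin package in Python. This is an illustrative sketch: the file and column names are hypothetical, and ICC2 is the closest match in pingouin's output to the absolute-agreement, single-measures index used here.

    import pandas as pd
    import pingouin as pg

    # Hypothetical long-format data: one row per (speech sample, coder) observation
    ratings = pd.read_csv("joint_coding_subset.csv")   # columns: sample_id, coder, value

    icc = pg.intraclass_corr(data=ratings, targets="sample_id",
                             raters="coder", ratings="value")
    print(icc[icc["Type"] == "ICC2"])   # two-way, absolute agreement, single measures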
As the tables report, the agreement level between the two coders is moderate to high for most of the measures, within the .75 to .99 range (Cicchetti, 1994). In terms of the fluency indices, the ICCs were the lowest for replacements in Test Set B (ICC = .75, 95% CI [0.61, 92 0.83]), and the highest for the filled pauses measures in Test Set C (ICC = .99, 95% CI [0.99, 0.99]). Excluding certain measures having an ICC value lower than .80 in Test Sets A (e.g., false starts) and B (e.g., reformulations, replacements, and false starts), all of the measures had high ICC values in Test Set C. Yet it is noticeable that variations in coding were identified in the repair fluency measures. Likewise, high levels of ICCs were found for both accuracy and complexity measures. For accuracy, lexical errors in Test Set B had the lowest ICC (ICC = .82, 95% CI [0.83, 0.87]) while error-free clauses in Test Set A had the highest ICC (ICC = .91, 95% CI [0.87, 0.94]). For complexity, subordinate clauses in Test Set C had the lowest ICC value (ICC = .85, 95% CI [0.80, 0.90]), and AS-units in Test Set A had the highest value (ICC = .95, 95% CI [0.91, 0.97]). The ICCs for CAF ratings were also within the acceptable range of agreement level. The two coders agreed upon the least for fluency rating in Test Set A (ICC = .80, 95% CI [0.77, 0.84], and the most for accuracy rating in Test Set B (ICC = .92, 95% CI [0.90, 0.98]). 93 Table 21 Intra-class correlation coefficients for fluency measures Fluency indices Test Set A Test Set B Test Set C ICC 95% Confidence Interval ICC 95% Confidence Interval ICC 95% Confidence Interval Lower Bnd Upper Bnd Lower Bnd Upper Bnd Lower Bnd Upper Bnd Filled Pauses (Num) Unfilled Pauses (Num) .96*** .93*** Reformulations (Num) .85*** Repetitions (Num) Replacements (Num) Hesitations (Num) False starts (Num) Note. *** p < .001 .84*** .88*** .83*** .77*** .94 .90 .77 .76 .82 .74 .65 .98 .96 .90 .90 .90 .89 .85 .96*** .99*** .77*** .91*** .75*** .91*** .76*** .94 .97 .70 .86 .61 .86 .68 .97 .99 .78 .94 .83 .94 .78 .99*** .99*** .80*** .87*** .86*** .93*** .80*** .99 .98 .76 .74 .84 .90 .78 .99 .99 .81 .93 .86 .98 .87 94 Table 22 Intra-class correlation coefficients for accuracy measures Accuracy indices Test Set A Test Set B Test Set C ICC 95% Confidence Interval ICC 95% Confidence Interval ICC 95% Confidence Interval Lower Bnd Upper Bnd Lower Bnd Upper Bnd Lower Bnd Upper Bnd Error-free clauses (Num) Lexical errors (Num) .91*** .86*** .87 .78 .94 .90 .89*** .82*** .87 .83 .90 .87 .90*** .88*** .87 .84 .94 .95 Note. *** p < .001 Table 23 Intra-class correlation coefficients for complexity measures Complexity indices Test Set A Test Set B Test Set C ICC 95% Confidence Interval ICC 95% Confidence Interval ICC 95% Confidence Interval Lower Bnd Upper Bnd Lower Bnd Upper Bnd Lower Bnd Upper Bnd .95*** .92*** .91 .88 .97 .95 .93*** .88*** .89 .77 .96 .90 .89*** .85*** .86 .80 .90 .90 AS-units (Num) Subordinate clauses (Num) Note. *** p < .001 95 Table 24 Intra-class correlation coefficients for CAF ratings CAF ratings Test Set A Test Set B Test Set C ICC 95% Confidence Interval ICC 95% Confidence Interval ICC 95% Confidence Interval Lower Bnd Upper Bnd Lower Bnd Upper Bnd Lower Bnd Upper Bnd Fluency Accuracy Complexity .80*** .83*** .85*** .77 .80 .77 .84 .87 .90 .88*** .92*** .83*** .85 .90 .80 .89 .98 86 .84*** .89*** .85*** .78 .87 .80 .87 .94 .89 Note. *** p < .001. Ratings are based on Elder & Iwashita’s (2005) CAF rubric. 
96 3.2.2 Factor analysis Prior to examining participants’ oral performance in detail, I inspected how the qualitative measures represent distinct dimension of fluency, accuracy, and complexity. I collapsed the datasets from the three test sets for the current analysis as well as the analyses presented in the following sections. I conducted nine separate principal component analyses (PCA) for each planning condition (and further broken down by three test-task types) on the 19 CAF measures (for which I quantified the raw coding data). For all nine analyses, the Kaiser- Meyer-Olkin (KMO) measure ensured the sampling adequacy (i.e., sufficient number of observations for reliable analysis): the KMO values ranged from .649 (UG condition, IP task) to .728 (UG condition, IT-L task), which was above the threshold limit of .5 (Field, Miles, & Field, 2012). The Barlett’s test of sphericity for all nine analyses were statistically significant (p < .001), demonstrating acceptable degree of correlations among variables for running PCAs. The factor structures were markedly similar across all nine conditions; thus, I present here the results with the highest KMO value: UG condition, IT-L task. As in Table 25, Factor 1 mostly encompassed the breakdown fluency measures. Factor 2 showcased an interesting mix of features of all three dimensions. Speed fluency measures all clustered on Factor 2, with an addition of mean length of run displaying moderate strength of factor loading. All accuracy measures loaded on to Factor 2 with high factor loadings. Additionally, subordinate clauses, which is a sub-dimension of syntactic complexity, displayed moderate factor loadings to Factor 2. Factor 3 consisted of repair fluency measures, with a moderate factor loading of time spent before articulation. Factors 4 and 5 each included syntactic complexity and lexical diversity measures, respectively. Overall, PCA informed that the measures adopted in the present study mostly conformed 97 to the overall CAF conceptualizations. At the same time, accuracy measures indicated a possible link within and across sub-dimensions. Table 25 Factor analysis for IT-L task under UG planning condition Factor 1 Factor 2 Factor 3 Factor 4 Factor 5 Measures Tot. Num. of Syllables Speech rate Time spent before articulation Filled pauses per minute Unfilled pauses per minute Mean length of run Repetitions ratio Replacements ratio Reformulations ratio Hesitations ratio False starts ratio Error free clauses Lexical errors per 100 words AS-unit length Subordinate clause ratio Tot. Num. of unique words -.414 Type-token ratio .740 Conjunctions Sentence Linking .835 Note. Bartlett’s test of sphericity: χ2(136) = 1344.76, p < .001. Results are based on orthogonal rotation (varimax). -.607 -.895 .810 .430 .726 .841 .793 .771 .876 .885 -.560 .440 .773 -.803 .428 .524 -.402 .681 .707 .818 .709 .770 Factor loadings below .30 are not reported. 98 3.2.3 Descriptive statistics and comparison analysis In this section, I report the basic descriptive statistics for the 19 CAF measures. I additionally provide comparison statistics (e.g., RM MANOVAs, pairwise comparisons) among the variables of interest. 3.2.3.1 Fluency 3.2.3.1.1 Speed fluency Speed fluency measures included: total number of syllables per minute produced by participants (within the response time), speech rate (total number of syllables divided per articulation time), and time spent before articulation (total amount of silence time before participants started to respond). 
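As an illustration of how these three speed-fluency indices can be derived from per-response annotations, the sketch below uses hypothetical column names (syllables, response_sec, articulation_sec, onset_sec, planning, task); it is not the study's actual coding template.

```python
# A minimal sketch: one row per response, with the syllable count and timing
# annotations assumed to have been extracted beforehand.
import pandas as pd

resp = pd.read_csv("fluency_annotations.csv")   # hypothetical file name

# Syllables produced per minute of the allotted response time
resp["syllables_per_min"] = resp["syllables"] / (resp["response_sec"] / 60.0)

# Speech rate: syllables divided by the time actually spent articulating
resp["speech_rate"] = resp["syllables"] / resp["articulation_sec"]

# Time spent before articulation: silence between the response cue and the first word
resp["time_before_articulation"] = resp["onset_sec"]

# Cell means by planning condition and test-task type, mirroring Table 26
print(resp.groupby(["planning", "task"])[
    ["syllables_per_min", "speech_rate", "time_before_articulation"]].mean())
```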
A general tendency depicted in Table 26 was that across the three planning conditions, participants were able to produce substantially more syllables in the two integrated tasks; within the two integrated conditions, it was under the IT-L task (the integrated task with listening only) that participants spoke more. The speech rate seemed to be slightly higher in the IP task as opposed to the two integrated tasks; this suggests that participants generally had to speak faster in the IP task condition. Interestingly, in all planning conditions, time spent before articulation was slightly highest in the IP task condition, while the two integrated tasks did not show much difference between one another. The three bar graphs in Figure 9 additionally depict the trend found in the data: participants spoke more in the two integrated tasks, while taking some more time before responding for the IP task regardless of the three planning conditions. 99 Table 26 Descriptive statistics for speed fluency by planning conditions and test tasks Measure UG (N = 98) GW (N = 98) GT (N = 98) Pairwise significant differences IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L Planning types Task types Tot. Num. of 101.46 133.24 136.27 99.47 132.43 138.66 105.13 132.82 142.84 ns IP < IT-RL syllables (Num) (24.17) (33.40) (33.29) (25.10) (32.72) (36.45) (23.64) (33.79) (36.16) Speech rate (Sec) 2.34 2.27 2.33 2.30 2.26 2.36 2.38 2.26 2.32 (0.54) (0.56) (0.55) (0.56) (0.55) (0.61) (0.52) (0.56) (0.60) Time spent before 1.70 1.33 1.54 1.75 1.44 1.41 1.78 1.33 1.45 art. (sec) (1.22) (0.65) (1.19) (1.25) (0.88) (0.71) (1.07) (0.66) (0.99) IP < IT-L IT-L < IT-RL IT-RL < IP IT-RL < IT-L IT-RL < IP IT-L < IP ns ns Note. Standard deviations are in parenthesis. Pairwise comparisons (subsequent analysis of RM MANOVA) are based on adjusted alpha level of 0.5/9. Ns indicates no statistically significant difference between paired variables. 100 Mean Syllables (Num) 160 140 120 100 80 60 40 20 0 Mean Speech Rate (Sec) UG GW GT IP IT-RL IT-L Test-task types 2.5 2 1.5 1 0.5 0 UG GW GT IP IT-RL IT-L Test-task types Mean Speaking Time before Articulation (Sec) Figure 9 Speed fluency measures 2 1.5 1 0.5 0 UG GW GT IP IT-RL Test-task type IT-L 101 A follow-up RM MANOVA revealed that there was no interaction effect between planning condition and test-task types for speed fluency (Greenhouse-Giesser corrections; total number of syllables: F4, 582 = 1.641, p = .162, ηp2 = .011; time spent before articulation: F4, 582 = 0.670, p = .586, ηp2 = .005), although speech rate seemed to exert a tendency towards significance (F4, 582 = 2.035, p = .091, ηp2 = .014). In terms of the main effects, planning condition again did not have statistical significance on affecting speed fluency. Table 26 reports that pairwise comparisons (using the Bonferroni corrections: adjusted alpha level = .05/9) amongst the three planning conditions in terms of speed fluency measures did not have statistically significant pairs. Test-task types, on the other hand, had a statistically significant main effect on all three measures (total number of syllables: F2, 582 = 469.512, p = .000, ηp2 = .617; speech rate: F2, 582 = 11.516, p = .000, ηp2 = .038; time spent before articulation: F2, 582 = 14.531, p = .000, ηp2 = .048). Pairwise comparisons in Table 26 further revealed differences of speed fluency measures across the three test tasks, which supported the overall trends displayed in Figure 9. 
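The adjusted alpha used for these follow-up comparisons corresponds to a Bonferroni correction of .05 divided by the nine comparisons in the family. A minimal sketch of that adjustment is given below; the p-values are placeholders, not values from the study.

```python
# Bonferroni adjustment over a family of nine pairwise comparisons
# (three planning conditions x three speed-fluency measures, as described above).
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.020, 0.450, 0.003, 0.600, 0.012, 0.300, 0.049, 0.008]  # placeholders
reject, p_adj, _, alpha_bonf = multipletests(pvals, alpha=0.05, method="bonferroni")

print(f"Per-comparison alpha: {alpha_bonf:.4f}")   # 0.05 / 9, roughly .0056
for raw, adj, sig in zip(pvals, p_adj, reject):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f} ({'sig.' if sig else 'ns'})")
```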
IT-L task exerted the highest total number of syllables relative to the IP task (mean difference = 37.310, 95% CI [33.968, .40.650], p = .000, d = 1.23) and the IT-RL task (mean difference = 6.425, 95% CI [3.503, 9.347], p = .000, d = 0.18). Speech rate was the least in IT-RL task relative to the IP task (mean difference = -.089, 95% CI [-.146, -.032], p = .001, d = 0.11) and the IT-L task (mean difference = -.113, 95% CI [-.162, -.063], p = .000, d = 0.19). This indicated that participants took more time in responding to IT-RL tasks. On the other hand, participants took the most time before breaking the silence to respond in the IP task more so than the IT-RL task (mean difference = .339, 95% CI [.187, .492], p = .000, d = 0.35) and the IT-L task (mean difference = .239, 95% CI [.052, .427], p = .000, d = 0.22). 102 3.2.3.1.2 Breakdown fluency As in Table 27, breakdown fluency measures included: filled pauses per minute, unfilled pauses per minute, and mean length of run (total number of syllables divided by total number of unfilled pauses). Again, differences were marginal relative to planning conditions, but noticeable across test-task types. Participants produced relatively small amount of filled pauses and unfilled pauses per minute, with small differences noticed by planning condition (although the graphs in Figure 10 suggests that unfilled pauses seemed to be slightly greater in the GT condition). Mean length of run was the highest in the IT-L task; this suggests that participants produced the lengthiest utterance between pause boundaries in the IT-L task. From a follow-up RM MANOVA, it was found that there was no interaction effect between planning condition and test-task type for breakdown fluency (Greenhouse-Giesser corrections; filled pauses per minute: F4, 582 = 0.720, p = .578, ηp2 = .005; unfilled pauses per minute: F4, 582 = 0.238, p = .853, ηp2 = .002; mean length of run: F4, 582 = 0.561, p = .681, ηp2 = .004). Likewise, planning condition was not a statistically significant factor affecting the three breakdown fluency measures. Test-task types had a statistically significant main effect for unfilled pauses (F2, 582 = 45.440, p = .000, ηp2 = .136) and mean length of run (F2, 582 = 14.725, p = .000, ηp2 = .048), but not for filled pauses (F2, 582 = 1.603, p = .202, ηp2 = .006). Pairwise comparisons (using the Bonferroni corrections) (see Table 27) within sub-levels of planning condition and test-task types further confirmed that it was in the IT-L task condition that participants produced denser utterance than the IP task (mean difference = 5.379, 95% CI [2.680, 8.078], p = .000, d = 0.32) and IT-RL task (mean difference = 3.556, 95% CI [1.322, .5.791], p = .000, d = 0.18). 103 Table 27 Descriptive statistics for breakdown fluency by planning conditions and test tasks Measure UG (N = 98) GW (N = 98) GT (N = 98) Pairwise significant differences IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L Planning types Task type Filled pauses per 0.08 0.08 0.07 0.08 0.07 0.08 0.08 0.08 0.08 minute (0.06) (0.07) (0.06) (0.06) (0.64) (0.07) (0.07) (0.05) (0.06) Unfilled pauses 0.14 0.11 0.15 0.15 0.11 0.15 0.15 0.12 0.16 per minute (0.07) (0.05) (0.08) (0.06) (0.05) (0.09) (0.05) (0.04) (0.09) Mean length of 17.10 19.78 23.32 17.86 19.37 21.47 16.47 17.70 22.85 run (in syllables) (9.71) (5.68) (6.92) (7.61) (7.20) (7.39) (9.53) (10.70) (10.05) ns ns ns ns IT-RL < IP IP < IT-L IP < IT-L IT-RL < IT-L Note. Standard deviations are in parenthesis. 
Pairwise comparisons (subsequent analysis of RM MANOVA) are based on adjusted alpha level of 0.5/9. Ns indicates no statistically significant difference between paired variables. 104 Mean Filled Pauses per min 0.1 0.08 0.06 0.04 0.02 0 IP IT-RL IT-L Test-Task Type Mean Unfilled Pauses per min UG GW GT 0.16 0.12 0.08 0.04 0 UG GW GT IP IT-RL IT-L Test-Task Type Mean Length of Run (syllables) Figure 10 Breakdown fluency measures 25 20 15 10 5 0 UG GW GT IP IT-RL IT-L Test-Task Type 105 3.2.3.1.3 Repair fluency Five measures denoted for the two dimensions of repair fluency; namely, verbatim repair (repetitions, replacements; no linguistic modification on the repaired features) and substitutive repair (reformulations, hesitations, and false starts; linguistic modification made on repaired features). I derived the ratio of each repairing phenomenon to the total amount of articulation time for each individual participant. From Table 28, it was apparent that participants did not showcase a great extent of repair fluency in their speech; in fact, the values were close to 0, suggesting that participants’ speech did not contain a number of repairing phenomenon. Yet a follow-up RM MANOVA revealed a main effect of planning condition on repetitions (F1, 289 = 8.122, p = .000, ηp2 = .053); at the same time, there was also a borderline effect on reformulations (F1, 289 = 2.954, p = .054, ηp2 = .020). Test-task types, on the other hand, was not found to have statistically significant main effect on all repair fluency measures. As shown in Table 28, post-hoc pairwise comparisons (Bonferroni corrections applied) on the sub- levels of planning condition revealed that there were small yet significant differences in participants’ production of repetitions and reformulations. In particular, participants tended to produce the most number of repetitions in the GT condition relative to the UG condition (mean difference = .008, 95% CI [.003, .013], p = .000, d = 0.02), and the GW condition (mean difference = .005, 95% CI [.000, .010], p = .002, d = 0.02). Likewise, participants produced more reformulations in the GT condition than the UG condition (mean difference = .004, 95% CI [.000, .007], p = .047, d = 0.01). On the other hand, repair fluency did not vary in accordance to the type of test tasks. 106 Overall, findings from the fluency dimensions denoted a stronger effect of test-task types than the type of planning activities on the amount of language produced and the pausing phenomena. 107 Table 28 Descriptive statistics for repair fluency by planning conditions and test tasks Measure UG (N = 98) GW (N = 98) GT (N = 98) Pairwise significant differences IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L Planning Task type Verbatim Repetitions 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.03 UG < GT (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) GW < GT Replacements 0.01 0.01 0.02 0.01 0.02 0.01 0.02 0.02 0.02 ns (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.01) Substitutive Reformulations 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 UG < GT (0.02) (0.02) (0.02) (0.02) (0.01) (0.01) (0.02) (0.01) (0.01) Hesitations 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) False starts 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) (0.01) ns ns ns ns ns ns ns Note. Standard deviations are in parenthesis. 
Pairwise comparisons (subsequent analysis of RM MANOVA) are based on adjusted alpha level of 0.5/15. Ns indicates no statistically significant difference between paired variables. 108 3.2.3.2 Accuracy Measures for accuracy included error-free clauses per AS-unit and lexical errors per 100 words. As seen in Table 29, although with slight differences, the ratio of error-free clauses to the total number of clauses was generally the highest in the GT condition. The differences were clearer between the IP task and the two integrated tasks; speech produced in the IP task condition contained the least error-free clauses. In terms of lexical errors, the trend was that participants produced fewer errors in the integrated tasks than the IP task. However, there was a sudden increase in the lexical errors in the IT-L task under the GT condition. Such a difference in the lexical errors is well illustrated in Figure 11. Subsequently, a follow-up RM MANOVA confirmed a significant interaction effect between planning conditions and test-task types (Greenhouse Giesser corrections: F4, 572 = 3.443, p = .006, ηp2 = .025). Figure 12 further supports this result by depicting the increase in lexical errors in the IT-L task performed in the GT condition. I subsequently conducted a post-hoc one- way ANOVA with planning condition as an independent variable and lexical errors per 100 words for IT-L task as a dependent variable. The finding was that there was a main effect for planning condition (F2, 290 = 9.676, p = .000, ηp2 = .007). Furthermore, pairwise comparison (with Bonferonni correction) provided that participants made more lexical errors in the GT condition relative to the UG (mean difference = .451, 95% CI [.130, .774], p = .002, d = 0.46) and the GW conditions (mean difference = .551, 95% CI [.228, .873], p = .000, d = 0.63). 109 Table 29 Descriptive statistics for accuracy by planning conditions and test tasks Measure UG (N = 98) GW (N = 98) GT (N = 98) Pairwise significant differences IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L Planning Task type Error-free clauses 0.39 0.45 0.53 0.42 0.49 0.52 0.35 0.53 0.55 ns IP < IT-RL ratio (0.76) (0.67) (0.73) (0.76) (0.57) (0.69) (0.85) (0.72) (0.65) IP < IT-L Lexical errors per 1.10 1.08 1.15 1.13 1.17 1.22 1.08 1.17 1.47 Within IT-L: IP < IT-L 100 words (0.97) (1.02) (1.04) (1.27) (0.90) (0.82) (1.00) (0.94) (0.93) UG < GT GW < GT Note. Standard deviations are in parenthesis. Pairwise comparisons (subsequent analysis of RM MANOVA) are based on adjusted alpha level of 0.5/6. Ns indicates no statistically significant difference between paired variables. 110 Mean error-free clauses ratio 0.6 0.5 0.4 0.3 0.2 0.1 0 Mean lexical errors per 100 words UG GW GT IP IT-RL IT-L Test-Task Types 1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0 UG GW GT IP IT-RL IT-L Test-Task Types Figure 11 Accuracy measures Figure 12 Interaction effect on lexical error per 100 words 111 3.2.3.2 Complexity 3.2.3.2.1 Syntactic complexity For syntactic complexity, I examined AS-unit length (i.e., the number of syllables produced within an AS-unit) and ratio of subordinate clause to AS-unit. The results summarized in Table 30 seemed to suggest that the AS-unit length was the greatest in the GT condition, and the shortest in the GW condition. In particular, it was in the GW condition that the IT-RL task exerted the largest difference in the AS-unit length to that of the IT-L task (IT-RL task: M = 32.99, SD = 13.03; IT-L task: M = 38.67; SD = 21.43). 
In the remaining planning conditions, the differences between the two integrated tasks were less noticeable. The bar graphs depicted in Figure 13 for AS-unit length further provided that it was especially within the IT-RL task condition that the three planning conditions differed. There seemed to be a sudden drop in the AS-unit length for the GW condition within the IT-RL task. Variations within the three planning conditions in relations to the test-task types were more implied for the results for subordinate clauses. As in Table 30, the ratio did not vary to a great extent amongst the three test-task types in the GW condition, while variations were clearer in both UG and GT condition. Figure 13 depicted that similar to the trends in the AS-unit length, the differences amongst the planning conditions were identifiable within the integrated task conditions. A follow-up RM MANOVA exhibited a statistically significant interaction effect between planning condition and test-task type only on subordinate clauses (Greenhouse-Giesser corrections: F4, 572 = 2.266, p = .038, ηp2 = .018). In terms of AS-unit length, the RM MANOVA exerted a statistically significant main effect for test-task type (Greenhouse-Giesser corrections: F2, 578 = 12.232, p = .000, ηp2 = .041). I subsequently performed a post-hoc one-way ANOVA with planning condition as an independent variable and subordinate clauses in each three test 112 tasks as the dependent variables. Planning condition had a statistically significant main effect on the IT-RL task (F2, 293 = 5.893, p = .003, ηp2 = .004). Pairwise comparisons (with Bonferonni correction) revealed that within the IT-RL task, subordinate clauses were the greatest in the UG condition relative to the GW condition (mean difference = .159, 95% CI [.041, .276], p = .004, d = 0.49) and the GT conditions (mean difference = .126, 95% CI [.009, .244], p = .031, d = 0.35). There was a borderline main effect of planning condition on the IT-L task (F2, 293 = 3.050, p = .051, ηp2 = .002); within the IT-L task, the GT condition facilitated more subordinate clauses than the GW condition (mean difference = .103, 95% CI [-.001, .206], p = .053, d = 0.39). Figure 14 corroborates the post-hoc analyses in that participants produced more subordinate clauses in the UG condition when responding to the IT-RL task as opposed to the two guided- planning conditions. 113 Table 30 Descriptive statistics for syntactic complexity by planning conditions and test tasks Measure UG (N = 98) GW (N = 98) GT (N = 98) Pairwise significant differences IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L Planning Task type AS-Unit length 33.42 37.89 39.50 31.32 32.99 38.67 33.59 38.49 39.73 ns IP < IT-RL (18.97) (22.79) (19.21) (19.16) (13.03) (21.43) (21.01) (19.16) (16.71) IT -RL < IT-L Subordinate clause 0.52 0.60 0.55 0.49 0.43 0.47 0.49 0.47 0.58 Within IT-RL: ns per AS-unit (0.53) (0.42) (0.33) (0.46) (0.26) (0.25) (0.42) (0.50) (0.31) GW < UG GT < UG Note. Standard deviations are in parenthesis. Pairwise comparisons (subsequent analysis of RM MANOVA) are based on adjusted alpha level of 0.5/6. Ns indicates no statistically significant difference between paired variables. 
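For illustration, the simple-effects follow-up described above (a one-way ANOVA on the subordinate-clause ratio within a single task type, followed by Bonferroni-adjusted pairwise tests) could be set up as in the sketch below. The file and column names (caf_measures_long.csv, task, planning, subclause_ratio) are hypothetical, and the code is not the study's analysis script.

```python
# A minimal sketch of a simple-effects analysis after a planning x task interaction:
# restrict the data to one task type, then test the effect of planning condition.
import pandas as pd
import pingouin as pg

caf = pd.read_csv("caf_measures_long.csv")   # hypothetical long-format CAF data
itrl = caf[caf["task"] == "IT-RL"]           # restrict to the IT-RL task

# One-way ANOVA with planning condition as the between-group factor
print(pg.anova(data=itrl, dv="subclause_ratio", between="planning", effsize="np2"))

# Bonferroni-adjusted pairwise comparisons (UG vs. GW, UG vs. GT, GW vs. GT)
print(pg.pairwise_tests(data=itrl, dv="subclause_ratio", between="planning",
                        padjust="bonf", effsize="cohen"))
```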
114 Mean AS-Unit length 45 40 35 30 25 20 15 10 5 0 IP IT-RL IT-L Test-Task Type Mean Subordina te clause ratio to AS-Unit UG GW GT 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 UG GW GT IP IT-RL IT-L Test-Task Type Figure 13 Syntactic complexity measures Figure 14 Interaction effect on subordinate clauses 115 3.2.3.2.2 Lexical diversity For lexical diversity, I examined the number of unique words (type), type-toke ratio (TTR), and the percentage of cohesive devices such as conjunctions (e.g., and, but) and sentence linking words (e.g., nonetheless, in other words) in the speech samples. A general trend depicted in Table 31 and Figure 15 was that the number of unique words was the greatest in the IT-RL task regardless of planning conditions. On the other hand, TTR was the highest in the IP condition in all three planning conditions. Given the high number of unique words produced in the integrated tasks, the corresponding lower TTR values could suggest that there were more total number of words generated in these tasks. In terms of the cohesive devices, participants’ speeches relatively contained more sentence linking words as opposed to conjunctions. Differences amongst the three planning conditions seemed to be less noticeable; yet as suggested in Figure 15, there were some extent of variations in IP task. A follow-up RM MANOVA demonstrated a statistically significant interaction between planning condition and test-task type on sentence linking words (Greenhouse-Giesser corrections: F4, 572 = 9.903, p = .000, ηp2 = .059) (see Figure 16). For remaining variables, there was only a significant main effect of test-task type (unique words: F2, 572 = 254.275, p = .000, ηp2 = .467; TTR: F2, 572 = 96.475, p = .000, ηp2 = .250; conjunctions: F2, 572 = 33.076, p = .000, ηp2 = .102). I subsequently conducted a post-hoc one-way ANOVA with planning condition as independent variable and sentence linking words from each test-task type as dependent variables. It was found that a statistically significant main effect of planning condition was only on IP task condition (F2, 291 = 13.421, p = .000, ηp2 = .303). A pairwise comparison between planning conditions revealed that participants used significantly less sentence linking words in the UG 116 condition relative to the GW (mean difference = -.026, 95% CI [-.038, -.013], p = .000, d = 0.72) and the GT condition (mean difference = -.022, 95% CI [-.034, .-.009], p = .000, d = 0.49). 117 Table 31 Descriptive statistics for lexical diversity by planning conditions and test tasks Measure UG (N = 98) GW (N = 98) GT (N = 98) Pairwise significant differences IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L Planning Task type Num. of Unique 51.77 63.80 58.01 50.59 63.55 57.57 52.51 63.58 58.36 ns IP < IT-RL words (10.48) (12.31) (10.20) (9.86) (12.10) (10.16) (9.08) (12.59) (12.15) IP < IT-L IT-L < IT-RL Type-toke ratio 0.54 0.52 0.49 0.55 0.52 0.49 0.54 0.52 0.48 ns IT-RL < IP (TTR) (0.06) (0.06) (0.05) (0.08) (0.06) (0.06) (0.07) (0.06) (0.05) IT-L < IP IT-L < IT-RL Conjunctions 0.04 0.04 0.04 0.03 0.03 0.04 0.04 0.03 0.04 ns IP < IT-RL (0.02) (0.02) (0.02) (0.02) (0.02) (0.02) (0.19) (0.02) (0.02) IT-RL < IT-L Sentence linking 0.07 0.08 0.09 0.10 0.08 0.09 0.09 0.08 0.09 UG < GW IT-RL < IP devices (0.05) (0.03) (0.02) (0.03) (0.03) (0.02) (0.03) (0.03) (0.02) UG < GT IT-RL < IT-L Note. Standard deviations are in parenthesis. Pairwise comparisons (subsequent analysis of RM MANOVA) are based on adjusted alpha level of 0.5/6. 
Ns indicates no statistically significant difference between paired variables. 118 Mean Num. of unique words 70 60 50 40 30 20 10 0 Mean Conjunctio ns UG GW GT UG GW GT IP IT-RL IT-L Test-Task Type 0.05 0.04 0.03 0.02 0.01 0 IP IT-RL IT-L Test-Task Type Mean Type-token ratio 0.6 0.5 0.4 0.3 0.2 0.1 0 Mean Sentence Linking 0.12 0.1 0.08 0.06 0.04 0.02 0 Figure 15 Lexical diversity measures 119 UG GW GT UG GW GT IP IT-RL IT-L Test-Task Type IP IT-RL IT-L Test-Task Type Figure 16 Interaction effect on sentence linking devices 120 3.2.4 Relationship between test scores and speech quality To explore whether speech quality from each planning condition predicted test performance, I conducted three separate GEE analyses with 19 CAF measures generated from each planning condition as independent variables and level of test performance as binary dependent variable (low- and high-scorers). Results for fluency, accuracy, and complexity measures are reported in Table 32, 33, and 34. In terms of fluency (see Table 32), speech rate measured for three planning conditions consistently contributed to higher test performance to a significant extent (UG: b = 2.83, p = .000; GW: b = 4.74, p = .014; GT: b = 2.77, p = .000); the positive b values suggested that higher-scorers produced fast and fluid speech regardless of planning types. In addition, two measures appeared to be significant factors for higher test performance in UG and GT condition; these included time before articulation (UG: b = -0.73, p = .047; GT: b = -0.31, p = .050) and mean length of utterance (UG: b = 0.08, p = .020; GT: b = 0.08, p = .012). The negative b value for time before articulation indicated that high-scorers were less likely to take time before responding. As per mean length of utterance, high-scorers tended to produce more syllables within an utterance boundary. Repair fluency measures were not significant factors leading to higher test scores. In terms of accuracy (see Table 33), lexical errors per 100 words predicted higher test scores in all planning conditions (UG: b = -0.82, p = .000; GW: b = -0.47, p = .010; GT: b = - 0.46, p = .012). Again, the negative b values in all cases suggested that higher scorers on the speaking test made fewer lexical errors in their speech, and this association appeared to be statistically significant regardless of planning types. For complexity (see Table 34), it was unique words that consistently made statistically 121 significant contribution to test performance (UG: b = 0.10, p = .000; GW: b = 0.12 p = .000; GT: b = 0.10, p = .000). The positive b values implied that higher scorers were likely to use produce more unique words in their speech across all three planning conditions. Notably, it was only within the GW condition that AS-unit length appeared as a significant factor impacting on test performance (b = 0.02; p = .024). All in all, there were at least one indices within each performance dimension that indicated a trend in significantly predicting test performance regardless of planning conditions: these included speech rate from fluency, lexical errors per 100 words from accuracy, and unique words from complexity. 122 Table 32 Summary of GEE statistics for fluency measures Measure Tot. Num. of syllables Speech rate Time spent before art. Filled pauses per min. Unfilled pauses per min. 
Mean length of utterance Repetitions Replacements Reformulations Hesitations False starts B -0.06 2.83 -0.73 1.77 7.83 0.08 1.56 -14.40 -12.16 9.44 21.34 UG (N = 98) Wald 1.75 23.24 3.51 0.32 3.34 5.39 0.03 2.32 1.65 0.73 2.91 p .186 .000 .047 .571 .060 .020 .864 .129 .199 .392 .088 GW (N = 98) Wald 1.46 6.00 1.34 2.15 1.53 2.60 0.69 1.63 0.67 0.07 0.53 B -0.02 4.74 -0.12 -4.27 -2.48 0.05 7.12 -13.85 -8.94 3.60 10.15 GT (N = 98) Wald 2.28 35.23 3.30 0.74 1.67 6.24 0.07 1.52 0.94 0.00 0.02 p .131 .000 .050 .390 .196 .012 .788 .218 .332 .990 .884 p .227 .014 .248 .142 .216 .107 .405 .202 .413 .791 .466 B -0.01 2.77 -0.31 -2.49 -5.84 0.08 -1.83 -10.32 -10.21 -0.19 -2.27 Note. Goodness of fit for UG, GW, and GT models are based on the Quasi Likelihood under Independence Model Criterion (QIC). QIC values were 425.04, 423.75, and 427.78 for UG, GW, and GT models, respectively. 123 Table 33 Summary of GEE statistics for accuracy measures Measure Error-free clauses Lexical errors per 100 words UG (N = 98) Wald 0.91 17.55 B 0.78 -0.82 GW (N = 98) Wald 0.77 6.69 p .763 .000 B 0.27 -0.47 GT (N = 98) Wald 2.60 6.27 p .107 .012 p .381 .010 B 0.48 -0.46 Note. Goodness of fit for UG, GW, and GT models are based on the Quasi Likelihood under Independence Model Criterion (QIC). QIC values were 402.13, 410.84, and 414.06 for UG, GW, and GT models, respectively. 124 Table 34 Summary of GEE statistics for complexity measures Measure UG (N = 98) Wald 0.06 1.10 51.34 0.22 0.02 0.15 B 0.02 0.40 0.10 -1.24 0.95 -1.38 p .805 .295 .000 .889 .035 .697 AS-Unit length Subordinate clause per AS-unit Unique words Type-token ratio Conjunctions Sentence linking GW (N = 98) Wald 5.07 2.58 57.54 1.50 0.74 1.73 B 0.02 0.69 0.12 0.51 5.23 8.32 GT (N = 98) Wald 2.34 0.92 42.44 0.53 0.67 0.01 p .126 .338 .000 .465 .797 .909 p .024 .108 .000 .221 .391 .188 B 0.02 0.95 0.10 -1.48 2.25 -0.67 Note. Goodness of fit for UG, GW, and GT models are based on the Quasi Likelihood under Independence Model Criterion (QIC). QIC values were 418.05, 418.65, and 415.09 for UG, GW, and GT models, respectively. 125 3.3 Research question 3: Test-takers’ survey responses For research question 3, I looked at how participants’ survey responses differed by planning conditions and test-task types. These responses concerned (a) the participants’ confidence in performance in different conditions (confidence); (b) the participants’ perceptions on the appropriateness of the length of planning time by different test tasks (appropriateness of planning time); and (c) the participants evaluation of the effectiveness of the type of planning condition on different test tasks (effectiveness of type of planning). I further broke down the results in terms of levels of test performance (low-scorers: N = 49; high-scorers: N = 50). 3.3.1 Confidence in performance Figures 17, 18, and 19 graphically summarize the results for confidence (rated on a 5- point scale with 1 being “completely unconfident” to 5 being “completely confident”). Each bar represents frequency counts for a scale category. For the IP task, 40.8% of low-scorers felt fairly (N = 18) or completely confident (N = 2) that they had performed well in the GW condition, while 65.3% of low-scorers in the GT condition were fairly (N = 28) and completely confident (N = 4) in their performance. On the other hand, high-scorers only gave the highest confidence rating in the GW condition as opposed to low-scorers. For the IT-RL task, the majority gave moderate to high ratings. 
One difference was that a subset of low-scorers tended to give lower ratings in the UG condition (N = 14; 28.6%) while such phenomenon for high-scorers was noticeable in the GT condition (N = 13; 26%). For the IT-L task, low-scorers exerted similar patterns across planning conditions. Yet high-scorers gave contrasting response patterns for the GW and GT conditions. While vast majority of them strongly felt they performed well in the GW condition (N = 32; 64%), their responses were more scattered in the GT condition; their responses for fairly unconfident (rating category “2”) were the highest of all planning conditions. 126 Figure 17 Confidence ratings for IP task 127 Figure 18 Confidence ratings for IT-RL task 128 Figure 19 Confidence ratings for IT-L task 129 3.3.2 Appropriateness of planning time In terms of appropriateness of planning time, Figures 21 and 22 demonstrated that participants’ responses generally clustered around higher ratings of appropriateness for the two integrated tasks. On the other hand, as in Figure 20, responses were more likely to disperse for the IP task; more specifically, increased responses on fairly inappropriate (“2” rating) to neutral (“3” rating) in the GW condition were noticeable for both low- and high-scorers. This indicated that both groups of participants felt that planning time for the IP task was not sufficient especially under the GW condition. For low-scorers, stark differences in perceptions were identifiable in the GT condition; these participants thought that planning time in the IP task is both fairly appropriate (N = 21; 42.9%) and fairly inappropriate (N = 21; 42.9%). For the IT-RL task, the majority of high-scorers strongly agreed in the GW condition that planning time was fairly sufficient (N = 30; 60%). In fact, this trend of tight clustering of higher ratings was found from both groups in both GW and GT conditions. On the other hand, low-scorers tended to give diverse ratings under the UG condition; they gave contrasting responses that planning time was both fairly inappropriate (N = 17; 34.7%) and fairly appropriate (N = 20; 40.8%). For the IT-L task, response patterns were quite similar from both groups across planning condition; yet low- scorers strongly felt that planning time felt sufficient under the GW condition (N = 30; 61.2%). 130 Figure 20 Appropriateness of planning time in IP task 131 Figure 21 Appropriateness of planning time in IT-RL task 132 Figure 22 Appropriateness of planning time in IT-L task 133 3.3.3 Effectiveness of types of planning As depicted in Figures 23, 24, and 25, participants’ response patterns for effectiveness of types of planning were generally similar across test-task types. In most cases, participants gave moderate to high ratings of effectiveness regardless of planning conditions. Interestingly, high- scorers’ ratings on “5” slightly increased from the IP (N = 10; 20%) to the IT-L (N = 16; 32%) task for GW condition. At the same time, relatively fewer high-scorers seemed to be completely satisfied by GT condition when performing the IT-RL task (N = 6; 12%), while more of them favored the condition when performing the IP (N = 13; 26%) and IT-L task (N = 11; 22%). 
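The frequency counts and percentages reported in this section can be tabulated directly from the raw survey responses; the sketch below is a minimal illustration with hypothetical file and column names (post_test_survey.csv, scorer_group, planning, task, confidence), not the study's survey file.

```python
# A minimal sketch: cross-tabulate 5-point confidence ratings by scorer group
# and planning condition for one test-task type.
import pandas as pd

survey = pd.read_csv("post_test_survey.csv")   # hypothetical file name
ip = survey[survey["task"] == "IP"]            # e.g., the independent task

counts = pd.crosstab([ip["scorer_group"], ip["planning"]], ip["confidence"])
percent = (counts.div(counts.sum(axis=1), axis=0) * 100).round(1)

print(counts)    # frequency of each rating category per group and condition
print(percent)   # row percentages, comparable to those reported in the text
```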
134 Figure 23 Effectiveness of planning for IP task 135 Figure 24 Effectiveness of planning for IT-RL task 136 Figure 25 Effectiveness of planning for IT-L task 137 3.3.4 Perceptual differences by planning and task type To further reveal differences in perceptions by testing conditions, I conducted a repeated measures (RM) MANOVA with three survey response categories as dependent variables, test- task type as within-subjects variable, and planning condition and test performance level as between-subjects variables. There was a statistically significant interaction effect between planning condition and test-task type on appropriateness of planning time (Greenhouse-Geisser correction: F2, 538 = 2.84, p = .027, ηp2 = .016) and a borderline effect on effectives of type of planning (F2, 538 = 2.30, p = .060, ηp2 = .015). On the other hand, there were no factors contributing significantly to confidence ratings. To follow-up on the main effect found from RM MANOVA results, I conducted a series of Wilcoxon paired samples tests for appropriateness of planning and effectiveness of planning, while comparing the results by planning condition. In terms of appropriateness of planning, the finding was that it was only within the two guided planning conditions that participants had differences in perceptions. Upon performing within the GW condition, participants regarded the planning time on the IP task was relatively less sufficiently given than that of the IT-RL task (Z = -3.56, p = .000) and the IT-L task (Z = -2.58, p = .010). Similarly, within the GT condition, appropriateness rating for planning time of the IP task was lower than that of the IT-RL task (Z = -2.33, p = .020) and the IT-L task (Z = -2.15, p = .025). There were no differences between the two integrated tasks on participants’ perceptions. The results corroborated the graphical results illustrated in Figures 21 and 22. For effectiveness of planning, it was only within the GT condition that participants had different perceptions with regard to how the planning condition effectively contributed to the performance of different test-task types. For instance, participants thought that the GT condition 138 was more helpful when performing the IP task (Z = -2.47 p = .013) and the IT-L task (Z = -2.17, p = .030) relative to the IT-RL task. This result aligned with the trend depicted in Figures 23 and 25. Participants’ perceptions on effectiveness of planning did not differ across test-task types for both UG and GW conditions. Overall, the results indicate that participants (both low and high-scorers) had similar opinions across the three categories of survey questions. They were mostly confident in their performance and agreed on a moderate to high effectiveness of all planning conditions. In addition, they considered that IP tasks were in need of more planning time than the integrated tasks. 3.4 Research question 4: Test-takers’ interview responses To address research question 4, I individually interviewed participants for gauging their perceptions on the effectiveness of the planning conditions (type and given amount of time) in association with test-task types. In this section, I focus on reporting the response patterns pertaining to two interview questions from which the most unexpected, yet interesting responses were generated. I further break down the results by the two sub-groups of high- and low-test performance. 
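As an illustration of the Wilcoxon paired-samples comparisons reported in Section 3.3.4, the sketch below runs one such comparison (appropriateness ratings for the IP versus the IT-RL task within the GW condition) in Python with scipy; the file and column names are hypothetical, and complete paired data are assumed.

```python
# A minimal sketch of a Wilcoxon signed-rank test on paired ordinal ratings.
import pandas as pd
from scipy.stats import wilcoxon

survey = pd.read_csv("post_test_survey.csv")       # hypothetical file name
gw = survey[survey["planning"] == "GW"]            # restrict to the GW condition

# One row per participant, one column per task, holding appropriateness ratings
wide = gw.pivot(index="participant", columns="task", values="appropriateness")

stat, p = wilcoxon(wide["IP"], wide["IT-RL"])
print(f"Wilcoxon W = {stat:.2f}, p = {p:.3f}")
```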
It should be noted that the results are reported with 90 participants’ interview responses (due to missing audio files of 4 participants, and inaudible quality of audio files of 5 participants). 3.4.1 Interview question 1: Which type of planning was helpful for you when responding? With Interview question 1, I opted for gauging participants’ thoughts on how each planning condition was helpful for them during test taking. As reported in Table 35, 57.8% of all participants (N = 52), which included the most of both low- and high-scorers, favored the GW 139 condition. On the other hand, neither UG nor GT conditions were mentioned greatly by most of the participants. Interestingly, about a quarter of all participants did not opt for a specific planning condition. About 24% of participants (N = 22) agreed that planning condition is effective when used for a particular test-task type, but not useful for another. A small subset of participants had contrasting opinions: 7.78% of participants (N = 7) thought that all three planning types were helpful while 6.64% of participants (N = 6) thought none were effective. Table 35 Frequency of comments on helpful planning conditions Category Low-scorers High-scorers Total Number of comments (N = 44) (N = 46) GW condition Depends on test-task type All were helpful None were helpful UG/GT condition Total 24 12 2 4 1 44 28 10 5 2 1 46 52 22 7 6 2 90 Participants favoring the GW condition mentioned that GW allowed them to use writing strategies that they were mostly familiar with or had been using in most testing contexts. Although participants did not have extensive experiences in taking oral language tests, GW condition made it possible for them to transfer their strategies in non-testing situations (e.g., 140 regular studying, reading, etc.) as well as conventional paper-based testing settings. This is well illustrated in participant 2077’s response below: [Participant 2077, female, low-scorer] When I am taking reading tests or just generally any paper-based tests, I like to write down key words or paraphrased sentences of the given text. This gives me a complete sense of understanding the text. I think I was trying to do the same thing in this speaking test, which helped me organize the information given from the question. In a similar vein, because most participants were used to scribbling or making notes in non-testing situations, they found it rather difficult not to do so in the GT condition. Participant 2068’s response gave a good contrast of the two conditions. He mentioned that GT is only helpful for retrieving information given in the tasks, while GW facilitates language production. [Participant 2068, male, high-scorer] I personally keep a diary and like to organize my thoughts by writing them out. I think this made me kind of not like the GT condition. Although GT was helpful in tracing back to the information I was given from the task, I don't think it helped me be prepared of the type of language I would use for responding. Recalling the information does not necessarily lead to fluent speaking, I would say. In GW, it was helpful to read off of the notes I took during planning time when responding. Even just briefly looking at the key words I wrote down helped me think of what I would say next. 141 Yet some participants assessed the effectiveness of planning by certain test-task types. For instance, for Participant 2065, GW was not effective in the IP task condition due to the short amount of time given. 
For him, there was not sufficient time to make written planning; instead, GT was found more useful. [Participant 2065, male, high-scorer] There was just simply not enough time to write down what I would like to say. I think there must be more effective ways to make notes, or it may be that I am just a slow writer. This is why I think I did better in the GT condition for the IP task. It helped me better to get at the two points I would like to raise. Participant 2013 thought that the abstractness of the statements given in the IP task makes it difficult to come up with concrete written planning. [Participant 2013, Female, high-scorer] I think the information given in the IT-RL or IT-L tasks are pretty much structured: statement of the problem, and the two interlocutors’ opinions on that. IP task is not structured in any way because it deals with abstract statements that you have not thought of before; almost like whether you like your mom or dad better. Because of this nature, you need to think hard about your own opinion on that, and even good supporting examples or reasons, which is quite difficult to achieve within 15 seconds. By the time you are ready to jot down some ideas or even key words for reference, the planning time is up. 142 In a similar vein, some participants discussed about the effectiveness of GW and GT with regard to the availability of the input resources in the IP tasks versus the two integrated tasks. For instance, GW aligned to the goal of the two integrated tasks: to be able to make use of the key words and to rightfully integrate the information in the response. For that, a certain extent of writing was needed to accurately capture the key information from the input source. For Participant 2083, GW was more useful for IT-RL task in making a visual map that structured the given information. The graphic information helped him on efficiently conveying a good chunk of information within the limited time. [Participant 2083, male, high-scorer] Because you get a lot of information from IT-RL task, I always found it better to first re- structure the information in the following order: the issue at hand posed by the University, and the woman’s (or the man’s) opinion on that. I found it useful to use graphic organizers for that. When responding, that graphic information made it easy for me to give a structured response. I would not be able to achieve that by merely outlining and summarizing the information in my head. On the other hand, participants who had mixed feelings toward the planning conditions discussed about the provision of guidance for planning. Those who favored the guided conditions thought that it is especially helpful for novice test takers. These participants, as in the responses of Participants 2087 and 2031, seemed to appreciate even a brief statement given to them before taking the test. 143 [Participant 2087, male, high-scorer] I personally never prepared for TOEFL or any equivalent test. I think this makes it more appreciative to the fact that there were some outlines given for us for planning. I don’t know if it had actually improved my speaking or not, but it did give me a sense of feeling that the test developers are actually caring about the test takers. I always thought that they would want us to fail to some extent so that we could keep on taking the tests. [Participant 2031, female, high-scorer] Overall, the guidance was helpful for me to realize certain strategies when taking speaking tests. 
I never thought of thinking silently would help better organize the scrambled information I have in my head. I always thought I needed to grab a piece of paper and make a script out of what I would say, and read it off from it; especially for presenting something in English. But it did actually work when I had tried it. Yet there were others that were not affected by the guidance provided for planning. Participant 2055, for instance, thought that it would only affect novice test takers, and not more experienced ones. Ultimately, she thought that test performance is not a matter of planning the response well or not; it is simply a matter of whether one has higher proficiency in English speaking. [Participant 2055, female, low-scorer] 144 You would have already internalized a strategy of your own regardless of elaborated instruction on planning. Of course, I personally would benefit from lengthened time to think before I start speaking, but I would think those who are experienced and are proficient would not even need so much time either. They would be capable in making something up as they go. Overall, as opposed to the results reported for research questions 1 and 2, writing as the planning seemed to be favored by participants; yet still a good number of participants thought the effectiveness of a particular planning is enhanced when used for a certain test-task type. Again, this result coincided with the findings reported in previous sections that directs to the overall impact of test-task characteristics on both test performance and speech quality. 3.4.2 Interview question 2: Were the times allotted for planning sufficient to you? From interview question 2, participants overwhelmingly came to a consensus with which the findings in the previous section coincided: there was a strong influence of test-task characteristics. Figure 26 demonstrates such a trend (the number of responses are accounted for overlapping responses; low-scorers: N = 44; high-scorers: N = 46). In general, more than half of the two sub-groups of participants agreed that integrated tasks are given sufficient, or perhaps given too much, planning time; this was slightly more geared towards the IT-RL task (N = 53; 58.9% of participants). On the other hand, the majority of participants claimed that planning times were substantially lacking for the IP task (N = 65; 72.2% of participants). When making the sub-group comparison, low-scorers (N = 23; 52.3% of low-scorer participants) found IT-L task given the longest planning time relative to high-scorers (N = 14; 30.4% of high-scorer participants). Low-scorers (N = 36; 81.8% of low-scorer participants) were more likely to think 145 that IP task lacked planning time than high-scorers (N = 29; 63% of high-scorer participants). For these low-scorers, IT-L task shared some similarities with IP task in that it asks their opinions about the discussed issue in the listening dialogue; therefore, they thought more planning time should be allotted to IT-L task than IT-RL task, which does not require any elaboration on one’s opinion. Overall, it was apparent that participants were agreeing that IP tasks need more planning time, while integrated tasks could be given lesser planning time. 146 Figure 26 Frequency of responses: Sufficiency and lack of planning time across test tasks 147 Participants further gave reasons for the claims they had made. Figure 27 demonstrates the response patterns for why corresponding participants perceived integrated tasks were given sufficient planning time. 
Interestingly, the responses ultimately revealed what participants thought that each test-task type measured. The majority of participants pointed out that the integrated tasks provided structured input sources that can be directly incorporated in the responses. For theses participants, restructuring or summarizing contents given in the reading passage and listening dialogue made it rather unnecessary to have a long buffering time in between. Participants 2007 and 2029 touched on this very matter. In so doing, they further made connections between the length of planning time and how the integrated tasks are not necessarily testing one’s speaking ability. [Participant 2007, female, low-scorer] Because integrated tasks require the extraction of detailed information from reading and listening, you have nothing to prepare if you have not done a good job in reading and listening, or vice versa. Either way, you just sit there the whole time staring at the monitor. I would say integrated tasks evaluate your reading and listening ability, not your speaking ability. [Participant 2029, female, high-scorer] The goal is to repeat what was given in the prompts, not produce original language and content. If one remembers well or took quality notes, then 30 seconds are way too much of a time to spend for planning. In a way, integrated tasks test one’s memorizing ability and skills. If you managed to jot down a lot of information, then you are good. 148 Figure 27 Reasons given for the sufficiency of planning time in integrated tasks 149 It is interesting to note how participants perceived that the appropriateness of planning time is linked with what they feel is being measured by the tasks. From Figure 27 and the comments above, it is apparent that those who were not successful in reading and listening the input sources were also not able to use the planning time usefully; they did not have sufficient information to structure their responses during the given time. In addition to language abilities such as reading or listening, it is striking that participants perceived that memory skills are at the core of the performance of integrated tasks. Figure 28 summarizes the reasons for why participants felt IP tasks lacked sufficient planning time. Participants’ concerns with IP tasks primarily lied in the heavier burden on constructing original responses to conceptual statements relative to the integrated tasks in which such burden is much reduced with input sources for reference (e.g., reading passages and listening dialogues) and concrete requirements of task fulfillment. 150 Figure 28 Reasons given for the lack of planning time in IP tasks 151 Participant 2025 well demonstrated on the trend depicted in Figure 28. [Participant 2025, female, high-scorer] Giving 15 seconds to plan for an abstract statement that I have never thought of before is quite harsh. What do you prefer reading: newspaper, books, or magazines? Who asks you this in real life? It’s hard to think about my preference and supporting examples for such a topic, even though I know it’s a test and I can just fake my answers. I would also have hard time instantly giving an answer to it in Korean. The need to construct an original response was a burden for participants who went through the language transfer process (from Korean to English) before responding in English. 
For them, 15 seconds of planning time was insufficient for taking multiple steps in constructing a response, which encompassed understanding the statement, coming up with ideas, and transferring the ideas from Korean to English. Participant 2099 commented about this matter. [Participant 2099, male, low-scorer] I feel like I can do so much better if I was a native speaker of English. These are the type of questions that you would be better off if you were the speaker of that language, not even just being good at speaking it. You can’t take anything away from the test prompt, it’s just your ability to create an original response. And there’s always that buffering time to transfer your thoughts in Korean first then to English. 15 seconds are way too short for that. 152 Due to the reasons above, some participants further commented on the possibility of long buffering or carried-over planning in the actual response. Because participants lacked with time for effective planning in IP tasks, they were not able to give immediate responses after being cued. Participants 2044 and 2081 gave such striking comments. [Participant 2044, female, low-scorer] For integrated tasks, your job is to structure your response according to the given information. Maybe adding a bit more of your own language, but nothing new. With IP tasks, you need to come up with a new story, which could be fake, but still, you need to start from scratch. Fifteen seconds are definitely too short to accomplish that. That’s why I was not able to begin speaking immediately after the beep. I feel like I was quite numb, and that I kept thinking about what to say even after being prompted. [Participant 2081, female, low-scorer] I need to first think what I’ll say in Korean, and then transfer it to English, but time only allows me to get to the Korean part. I need sometime before I speak. Even with that, I have to make my response as I go. At times I thought: what is the point of having planning time? The spilled-over planning time essentially coincided with the time spent before articulation measured for speed fluency. This suggested that short planning time in IP task led some participants to take some time even during the actual response time, which further may have impacted on his or her speed fluency. 153 Table 36 summarizes the test-task characteristics in conjunction to what participants had commented on the associated planning time. From participants’ perspectives, IP tasks were equipped with [+time pressure], [-input resources], and [+prompt abstractness], which led to participants feel less prepared for responding. Participants generally required more time to “jump start”; consequently, they may have used some of their actual responding time for planning. On the other hand, integrated tasks exerted features such as [-time pressure], [+input resources], and [+prompt concreteness] that may have nurtured a sense of preparedness among participants. Yet participants also claimed on how non-oral skills were essential for successfully integrated tasks. 
Table 36
Properties of test-task characteristics in relation to planning time

Short planning time: IP tasks                         Extensive planning time: Integrated tasks
+ Time pressure                                       - Time pressure
- Input resources                                     + Input resources
+ Prompt abstractness                                 + Prompt concreteness
Less preparedness: need more time to "jump start"     Enhanced sense of preparedness
Spilled-over planning during response time            All or nothing: less efficient use of planning
                                                        if reading and listening was not successful

CHAPTER 4: DISCUSSION

In this chapter, I discuss the major research results of the current study in light of previous literature on pre-task planning and task complexity; I specifically draw on related perspectives from both conventional SLA and language assessment research. I organize the current chapter into four sub-sections in accordance with the four research questions guiding the present study.

4.1 Research question 1

For research question 1, I explored whether participants' test scores on the speaking tests were affected by planning conditions as well as test-task types. The major result from the MFRM analyses was that planning conditions had null effects on test performance across all three test sets. More specifically, test scores did not differ depending on whether participants were provided with detailed guidance for planning or not (i.e., Unguided vs. Guided planning conditions). Likewise, test scores did not vary between the two guided planning conditions either (i.e., GW vs. GT planning conditions). Such insignificant effects of planning conditions in all three test sets are in partial alignment with the results found in most of the language assessment literature on planning time (Elder & Iwashita, 2005; Elder et al., 2002; Wigglesworth, 1997; Wigglesworth & Elder, 2010). I note that the alignment is partial because these studies compared the effect of planning in terms of the availability of planning ([- planning] versus [+ planning] conditions). In the current study, it was the type of planning that was manipulated across testing conditions; the planning itself was always made available. Therefore, it is uncertain whether the study results would have differed had the study design followed that of the previous literature.

Nevertheless, a robust consensus among the previous line of studies is that, regardless of different study contexts, the [+ planning] condition per se does not make a substantial impact on test score variation within testing environments (Ellis, 2009); in fact, this lack of a substantial impact was consistently observed across divergent planning conditions that varied in the length of planning time, from 1 minute of planning (Wigglesworth, 1997, 2001) to 3 minutes of planning (Elder & Iwashita, 2005; Wigglesworth & Elder, 2010). If this finding applies to the current study settings, it could be that the insignificance of the varied planning conditions simply stems from the limited effects of [+ planning] in testing contexts in general. It should be noted that the same conclusion cannot be drawn from classroom- or laboratory-based SLA studies, in which performance is primarily captured through discourse measures of CAF and not quantified by subjective ratings. Yet as informed by the current study results (which will be discussed further in the following section), and by the bulk of relevant literature in language assessment, planning conditions demonstrate limited impact on overall speech quality measured by the CAF dimensions as well (cf.
see Nitta & Nakatsuhara, 2014, on how paired dynamics change between candidates in accordance with planning conditions in the context of paired oral assessment). This may be because unique factors underlying testing contexts limit the effects of the [+ planning] condition on overall oral performance. Some scholars attributed the lack of a [+ planning] effect in the testing environment to the impressionistic indicators of oral proficiency (Elder & Iwashita, 2005; Nitta & Nakatsuhara, 2014; Skehan, 2015; Wigglesworth & Elder, 2010); that is, they pointed to the subjective nature of the score assignment itself. Wigglesworth and Elder (2010), for instance, noted that with memory constraints, test takers might be able to sustain the benefits of planning only within the first few utterances of their speech (see Ortega, 2005, on how even in non-testing contexts, and hence with lengthier planning time, planned contents are limited in how completely they transfer to real-time performance). Raters, on the other hand, may tend to formulate a final impression based on the overall speech or toward the end of the speech. Raters' score assignments, therefore, are averaged over the entire stretch of speech, and presumably not informed by a couple of planned utterances, if any (Wigglesworth & Elder, 2010). Thus, the assumption is that even if differential planning time differentially affects speech, it may not affect it for a sustained amount of time, and thus the effects may not be perceived in test scores. This is because the scores are rendered after the raters listen to sustained speech, long after the effect has, in essence, melted away.

On a related note, raters may resort to a general 'first impression' during real-time rating processes. Although the difference in modes does not warrant complete generalization, research from writing assessment shows that raters are less likely to make concurrent judgments during their live rating processes (Crisp, 2012; Grainger, Purnell, & Kipf, 2008; Lumley, 2002; Wolfe, 1997). Rather, they may make a rapid, intuitive evaluation based on their mental representation of the performance, and not necessarily refer back to a specific part of the language output or the scoring criteria for deriving a final judgment (Wolfe, 1997). Researchers in educational measurement have analogously put forth that with the cognitive demands and memory constraints of real-time rating procedures, raters, especially those with less experience, may rely on a general impression of one's performance (Lance, LaPointe, & Stewart, 1994). At times, these rating behaviors extend to a phenomenon widely coined the halo effect (Fisicaro & Lance, 1990), with raters assigning similar score bands to the same individual candidate across test items that are administered under different conditions. The relatively less experienced raters in the current study, therefore, may have been less sensitive in capturing qualitative differences in test takers' speech, if any (in the current study contexts, these would be planned versus unplanned utterances). However, the picture is still complicated in that, as will be discussed shortly, results suggest that raters were indeed able to discern differences between IP tasks and integrated tasks in terms of score assignment.
Presumably, the salience of whatever differentiates planned from unplanned speech in the current study was insufficient to generate significant score differences, relative to how test-task types influenced the construction of speech. From the test taker's perspective, this speculation is also plausible when considering how stringent the amount of planning time given in testing contexts is (cf. SLA researchers have put forth that planning is effective with at least 10 minutes of time; Mehnert, 1998; Ellis, 2009). Under such timed conditions, test takers might only be able to plan one or two major points addressing the task at hand (which they manage to actually put into words in the earlier part of their responses), and consequently they may make up the rest of the content during their actual response time. Recall again that such a pattern of incomplete planning was also observed with the current study's participants, as evidenced by their interview responses (see Chapter 3). If this was truly the case in the current study, planned speech might not have constituted the entire speech sample of an individual participant to any great extent, and thereby might not have weighed substantially in raters' judgments, rendering no significant variation in scores.

In relation to the timed nature of testing contexts, researchers have also noted the possibility that test takers are less skillful in using such a short amount of time for planning (Elder & Iwashita, 2005; Wigglesworth & Elder, 2010). Test takers may be fairly familiar with planning or preparing for academic speech performance (e.g., presentations in non-testing contexts), but not within pressured, timed testing conditions. Similarly, although instructions for planning were given to test takers in the current study (e.g., GW and GT conditions), their inexperience with taking TOEFL iBT-type tests (or, generally, any timed oral proficiency tests) might have confounded the effects of planning. In this sense, the notion of authenticity in employing planning time in testing contexts may be called into question (Elder & Iwashita, 2005). The premise of this line of thought is that the option of planning in testing contexts corresponds to how learners prepare themselves for performing academic speech in classrooms. However, although authenticity in testing does not always entail close resemblance to language practices in non-testing environments (Bachman, 2002; Spolsky, 1985), it seems that the type of planning performed in testing and non-testing contexts differs considerably.

A related speculation is whether the planning time itself was ever sufficient to make a meaningful contribution to improved speech, or whether the planning time was an effective use of time. As will be discussed shortly, this speculation partially arises from the results of the current study, which showed a strong influence of test-task characteristics on whether participants benefitted from shorter or lengthier planning times. Even with the IP tasks, in which participants generally claimed that they were disadvantaged by the short amount of planning time, participants connected such limits to the unique nature of the IP tasks. Therefore, as previous researchers uniformly concluded (Elder & Iwashita, 2005; Ortega, 2005; Wigglesworth, 1997; Wigglesworth & Elder, 2005), the length of planning time is not the sole factor explaining the limited effects of [+ planning] conditions in testing contexts in general.
Going back to the discussion of raters' score assignment, previous researchers have interestingly suggested that a discrepancy may exist between what test takers take into consideration when planning and what is prioritized by raters when they judge (Ellis, 2005; Nitta & Nakatsuhara, 2014; Skehan & Foster, 1997, 1998; Wigglesworth & Elder, 2010). The demanding nature of testing contexts likely directs test takers toward accurate language output, or perhaps helps them focus on form during their planning. Indeed, Ortega (2005) noted that in non-testing situations, learners tend not to fixate on accuracy. On the other hand, if raters were able to adequately discriminate the given speech samples according to the sub-dimensions of the rating scale, they would have looked globally at a number of qualitative measures (e.g., fluency, pronunciation, topic development) in addition to accuracy of language. In terms of the current study, if test-takers' awareness of being situated in a testing situation took over the effects of planning conditions, they may have consciously (or unconsciously) exerted selective behaviors in planning. These possibilities cannot be confirmed without further investigation into rater cognition as well as into what test takers actually planned during the planning processes (e.g., analysis of participants' written planning).

In fact, what appeared to be more transparent in eliciting score variation from the MFRM analyses (and the GEE analysis) was the influence of test-task types. Regardless of the planning conditions, participants performed relatively poorly on the impromptu IP tasks compared with the input-based integrated tasks. This result could be seen as somewhat counter-intuitive given the widely attested complexity and difficulty of integrated tasks, which are said to be difficult because they place dual demands on skill application (more elements are required to be accomplished) as opposed to speaking-only test-tasks (e.g., Brown et al., 1984; Iwashita, McNamara, & Elder, 2001; Skehan, 1998). The lengthened planning and responding times given in operational integrated tasks may accordingly have resulted from such attested task characteristics (see Plakans, 2010, for similar claims made in the context of L2 writing assessment, in which reading-writing integrated tasks are compared to writing-only independent tasks). However, speaking-only IP tasks are equally capable of eliciting not only perceived difficulty but also an enhanced sense of test anxiety during task performance among test takers (Huang & Hung, 2010; Hong et al., 2016). This finding, in addition to the current study's results, seems to point to the possibility that inherent test-task characteristics uniquely contribute to an over-ride of whichever planning activity was used (if any). Especially within the context of the IP task condition, there is a possibility of within-task planning (i.e., planning that takes place during real-time performance) taking over pre-task planning (Elder & Iwashita, 2005; Nitta & Nakatsuhara, 2014; Wigglesworth & Elder, 2010). Regarding this process, also termed online planning, previous researchers argued that when pressured (by time limits or other cognitive constraints), speakers are likely to prioritize different levels of planning processes (Ellis & Yuan, 2003, 2005; Skehan & Foster, 2005).
In such a context, speakers may not have enough time during pre-task planning to access the necessary linguistic information in their working memory to facilitate language production. Likewise, if the stringent amount of planning time (15 seconds) and the IP-specific task features (e.g., abstractness in test prompts; see Chapter 3 for supporting interview responses) indeed affected test-taking, participants in the current study may not have been granted the opportunity to fully exploit the given amount of time to plan their responses. That is, the underlying task condition may even have forced participants to make use of constant short-term planning and monitoring of their speech during the actual response times. Recall again that in Chapter 3, the existence of time spent before articulation and of within-task planning was revealed in the speech quality results and in test takers' interview responses on the effectiveness of the planning time in IP tasks. In testing contexts specific to IP-type tasks, therefore, test takers might benefit from ongoing planning, while pre-task planning is less likely to be used efficiently; in such cases, pre-task planning activities, whether supported with guidance or not, would have less significance.

But did participants similarly benefit from within-task planning for integrated tasks? One can speculate, but the survey and interview responses demonstrated that participants were less likely to feel, for the integrated tasks, the kind of pressure from the length of planning time that might lead to extensive use of online planning. In fact, the majority thought that the planning times in integrated tasks were sufficient or, at times, excessive. Note the relatively small difference in the length of planning times from the IT-L and IT-RL tasks to the IP tasks (5 and 10 seconds of difference from IP tasks). What might have caused these perceived differences (even with such small differences in length of planning time across tasks), and also the score variation between IP and integrated tasks? One plausible explanation could be that for integrated tasks, note-taking during reading and listening might have sufficed for constructing responses. Although Ellis (2005) pointed out that note-taking is not a core feature of what constitutes a "testing context," most high-stakes oral proficiency tests indeed provide participants with the option to take notes during testing (see Cubilo & Winke, 2010); thus, it is not a feature pertaining only to "classroom, laboratory studies" of planning (Ellis, 2005, p. 218). In fact, the nature of the information given in TOEFL iBT-type integrated tasks is highly demanding in terms of the quantity and quality of material to be stored in and retrieved from working memory; for instance, IT-RL presents both a paragraph-long reading text and a listening dialogue between two interlocutors. Under such a condition, the tasks require a certain portion of the information to be incorporated in responses, which, hypothetically, places more demand on the quality of the immediate recording of the presented information. But because integrated tasks do not require substantial transformation of language in responses (i.e., summarization of content is required rather than originality of responses) (Prabhu, 1987; Skehan & Foster, 1997; Skehan, 1998), notes of key words or contents would suffice to construct a response (see Huang & Hung, 2010, and Hong et al., 2016, for similar suggestions).
On the contrary, as some participants noted in the interviews, if reading and listening were not successful (resulting in unsuccessful uptake of the required information in note-taking), then they might not have had enough information to make use of during the given planning time. In such cases, pre-task planning would not have exerted a substantial impact on subsequent performance. Presumably, this precise feature of integration at a variety of levels (e.g., integration of both language skills and given information) seems to prompt participants to prioritize the instant recording of information and then to retrieve it a few seconds later. If note-taking on the given input sources operated as one variety of pre-task planning (and hence provided a sense of preparedness in responding), participants might have felt the planning time was repetitive, something they had already engaged in. Overall, the findings from research question 1 correspond to Ellis (2009): planning alone does not account for the underlying factors of task performance; rather, it should be understood in conjunction with the fundamental characteristics of tasks. In the current study, I found the lack of any significant effect of [+ planning] to be associated with how planning times were manipulated across the test-tasks and with how the test-tasks each required different approaches to preparing responses.

4.2 Research question 2

For research question 2, I analyzed the quality of the speech samples generated in each planning condition in terms of the CAF dimensions. In terms of the preliminary factor analysis, the factor loadings of all dimensions were partially in line with previous researchers' findings (Skehan & Foster, 2005; Tavakoli & Skehan, 2005). For instance, those researchers also found that different measures across the three dimensions loaded onto the same factor. In the present study, error-free clauses loaded onto the same factor as fluency measures (e.g., mean length of run) and complexity measures. Yet the high factor loadings of the accuracy measures are an interesting finding.

In terms of the three performance dimensions, there was an alignment with the study results from research question 1. The overall finding was that the effects of planning conditions were mostly over-ridden by test-task characteristics, yet there were still interesting interaction effects between the two variables on a few measures. Fluency was generally influenced by the type of test-task that participants performed. Participants produced lengthier discourse in integrated tasks, while demonstrating less immediacy in actually articulating their responses in IP tasks. Such an effect of test-tasks was not revealed for repair fluency; however, there were hardly any instances of repairs in the raw coding (see Appendix H) to render any subsequent significance. Instead, planning conditions had significant main effects on the number of repetitions produced, especially in the GT condition. Yet the small effect sizes and mean differences found among planning conditions for repetitions (GT vs. UG: mean difference = .008, 95% CI [.003, .013], p < .001, d = 0.02; GT vs. GW: mean difference = .005, 95% CI [.000, .010], p = .002, d = 0.02) indicate that the effect of planning conditions might not be practically meaningful (Field, 2012). Thus, for fluency, I conclude that there were statistically significant main effects of test-task types on speed and breakdown fluency, which generally took over the effects of pre-task planning.
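To make the distinction between statistical and practical significance concrete, the sketch below shows how a paired mean difference, its 95% confidence interval, and a Cohen's d of the kind reported for the repetition measure can be computed for two planning conditions. This is a minimal illustration in Python using hypothetical repetition counts (not the study's data or analysis script), and it assumes one common variant of Cohen's d, namely the mean difference divided by the pooled standard deviation of the two conditions; the exact variant used in the study is not specified here.

```python
import numpy as np
from scipy import stats

def paired_comparison(a, b, alpha=0.05):
    """Mean difference, 95% CI, Cohen's d, and paired t-test p-value for two
    repeated-measures conditions (e.g., repetitions under GT vs. UG)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    diff = a - b
    n = len(diff)
    mean_diff = diff.mean()
    # Confidence interval for the mean difference, based on the t distribution
    se = diff.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    ci = (mean_diff - t_crit * se, mean_diff + t_crit * se)
    # Cohen's d: mean difference over the pooled SD of the two conditions
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = mean_diff / pooled_sd
    p = stats.ttest_rel(a, b).pvalue
    return mean_diff, ci, d, p

# Hypothetical repetition counts for 99 participants under two conditions
rng = np.random.default_rng(0)
gt, ug = rng.poisson(2.0, 99), rng.poisson(1.9, 99)
print(paired_comparison(gt, ug))
```

With effects as small as d = 0.02, even a conventionally significant p-value says little about practical importance, which is the point made above.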
For accuracy, there was a significant interaction between planning conditions and test-task types on lexical errors per 100 words; the effects were revealed within the IT-L task, in which lexical errors were most frequent under the GT condition. In terms of grammatical errors, on the other hand, there was only a main effect of test-task types; specifically, the integrated tasks contained fewer clauses embedding grammatical errors than the IP task. Therefore, the results indicate that the GT condition had an effect on accuracy, though only with respect to lexical errors.

In terms of complexity, most lexical diversity measures demonstrated clear effects of test-task types; for instance, more unique words and conjunctions were produced in the integrated tasks than in the IP tasks. On the other hand, IP tasks generated higher TTR, which is indicative of the measure's association with the length of the language produced across tasks. It was only for subordination, among the syntactic complexity measures, that planning conditions interacted with test-task types; specifically, subordination of clauses was greatest in the UG condition compared to the GW and GT conditions within the IT-RL task. A main effect of planning was also found only for sentence linking devices among the lexical diversity measures. In all cases, the two guided conditions contained more sentence linking devices than UG.

The overall finding departs from an extensive line of SLA research regarding the role of [+ planning] in leading to greater fluency in spoken performance, with mixed effects on accuracy and complexity (Crookes, 1989; Foster & Skehan, 1996; Gilabert, 2007; Skehan & Foster, 2005; Tavakoli & Skehan, 2005; Yuan & Ellis, 2003). The current study is also unique in that such a null effect of planning on fluency was found in a context in which planning time was always made available. Moreover, it was the test-task types that steadily influenced both speed and breakdown fluency in the absence of an effect of pre-task planning varieties. Several features of the study design that differ from the previous line of studies need to be discussed. First of all, the previous studies tended to focus on discerning the effects of planning itself while controlling for other variables such as task types (e.g., Elder & Iwashita, 2005; Kawauchi, 2005; Ortega, 1999; Mochizuki & Ortega, 2008; Rutherford, 2001; Skehan & Foster, 1997; Tajima, 2003; Wendel, 1997; Yuan & Ellis, 2003). In such cases, the same type of narrative or picture-description task was used across [+ planning] and [- planning] conditions. However, positive effects of planning on fluency have also been found in a few studies conducted by Foster and Skehan in which task types were manipulated in the study design (e.g., Foster, 1996; Foster & Skehan, 1996; Skehan & Foster, 2005). Another possibility, then, is that the fluency measures employed in the current study were fundamentally capturing a different aspect of fluency than that conceptualized in the previous literature (see Tavakoli & Skehan, 2005, for a discussion of the analytic measures adopted by SLA researchers for investigating the construct of fluency). Even if this is true, fluency itself has inevitably been operationalized through countless measures in the SLA literature; it is essentially a multi-faceted construct (Freed, 2000; Koponen & Riggenbach, 2000).
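Because fluency can be operationalized through so many measures, it may help to show schematically how a few of the speed and breakdown indices referred to in this chapter (time spent before articulation, speech rate, mean length of run, unfilled pauses) could be derived from a time-aligned transcript. The sketch below is only an illustration under simplifying assumptions of my own (word-level timestamps are available, and a silent gap of at least 0.25 seconds counts as an unfilled pause); it does not reproduce the coding scheme used in this study.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Word:
    text: str
    onset: float   # seconds from the start of the response window
    offset: float

def fluency_measures(words: List[Word], response_duration: float,
                     pause_threshold: float = 0.25) -> Dict[str, float]:
    """Illustrative speed/breakdown fluency indices from a time-aligned transcript."""
    if not words:
        return {}
    time_before_articulation = words[0].onset            # silence before the first word
    speech_rate = len(words) / response_duration * 60.0  # words per minute
    # Unfilled pauses: silent gaps between adjacent words above the threshold
    gaps = [nxt.onset - cur.offset for cur, nxt in zip(words, words[1:])]
    n_pauses = sum(1 for g in gaps if g >= pause_threshold)
    # Mean length of run: mean number of words produced between pauses
    runs, current_run = [], 1
    for g in gaps:
        if g >= pause_threshold:
            runs.append(current_run)
            current_run = 1
        else:
            current_run += 1
    runs.append(current_run)
    return {
        "time_before_articulation": time_before_articulation,
        "speech_rate_wpm": speech_rate,
        "n_unfilled_pauses": n_pauses,
        "mean_length_of_run": sum(runs) / len(runs),
    }

# Hypothetical 45-second IP response with three words and one long pause
sample = [Word("well", 3.2, 3.5), Word("I", 3.6, 3.7), Word("think", 4.4, 4.8)]
print(fluency_measures(sample, response_duration=45.0))
```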
Yet the fact that [+ planning] has continuously been found to make a significant difference to fluency in oral performance in the SLA literature suggests that the current study can speak to either (1) differences in how planning operates across contexts (testing versus non-testing environments) (Ellis, 2009) and/or (2) the intrinsic characteristics of test-task types prevailing over pre-task planning varieties (among many other possibilities). In terms of (1), previous researchers (Ellis, 2005, 2009; Elder et al., 2002; Nitta & Nakatsuhara, 2014) asserted that the high-stakes nature of testing contexts leads test takers to attend to the accuracy of the speech outcome relative to its delivery. That is, based on their previous test-taking experience, test takers might have unconsciously (or consciously) developed an internalized conception that answers to tests are essentially dichotomous in nature (e.g., "right" or "wrong") (Kohn, 2000). If such a perception governs an individual test taker during the test-taking process, he or she would hope to provide a "right" answer and attend less to how it is expressed, even in constructed-response tests (recall that accuracy is also operationalized as "a concern to avoid error"; Skehan, 2009, p. 510). For instance, test takers might produce errors in suprasegmental features of speech (e.g., speech rate, intonation, word stress, or accent, which may make them appear less engaging) yet still strive to deliver error-free speech. Subsequently, such test-takers' sensitivity toward providing accurate responses would lead them to engage in careful online monitoring of their speech articulation even during the actual response time (Ellis, 2005).

In fact, the results for certain sub-measures of the CAF dimensions potentially speak to the task-specific effects of within-task planning raised in the previous section. Although the different allotments of response time across test-tasks should be taken into consideration, the tendency in the IP tasks was that participants generally spoke less within a shorter stretch of articulation time (reflected in the increased time spent before articulation and the decreased speech rate and mean length of run) and demonstrated pausing or hesitation phenomena more frequently (suggested by the number of unfilled pauses and the time spent before articulation). On the other hand, they produced fewer lexical errors while generating more cohesive devices (such as sentence linking) in the IP tasks than in the IT-RL tasks. Within the integrated task conditions, however, the CAF dimensions showed a potential case of parallel gains. The differentiated output for fluency and accuracy across test-task types suggests that participants may have had difficulty directing equivalent and simultaneous attentional resources to fluency and accuracy, particularly in the IP tasks; the high-stakes nature of testing may have placed more emphasis on language usage during live performance, so that participants engaged more in pausing to monitor (e.g., unfilled pauses) or to buy time to plan the language for what would be said (e.g., time spent before articulation) (Skehan & Foster, 2005), at the expense of fluidity in speech. In particular, the significant increase in the time spent before articulation in the IP tasks relative to the two integrated tasks necessitates further exploration.
It is an empirical question what such a measure generally signifies: is it a reflection of test anxiety specific to the timed nature of speaking tests, with added pressure stemming from the unfamiliarity of talking to a computer (see Lee & Winke, 2018, for discussion of how computerized test features such as a timing device can affect test takers' cognitive operations when performing oral test-tasks)?2 Drawing on its factor loading with the speed fluency measures (see Table 25), the increased time spent before articulation specific to IP tasks seems to at least indicate a decreased sense of immediacy or promptness (i.e., hesitance) in providing instant responses. When further probing participants' interview responses, the measure also implies carried-over, incomplete pre-task planning. In fact, seminal researchers in the cognitive psychology of spontaneous speech have long put forth a relevant proposition describing this very phenomenon (Butterworth, 1975; Goldman-Eisler, 1968; Griffin & Bock, 2000; Tannenbaum et al., 1965). Goldman-Eisler (1968) claimed that spontaneous speech is both cyclic and incremental in nature; that is, although with divergent variation in duration and length, speech consists of a hesitant phase (chiefly consisting of silence; Butterworth, 1975) preceding a fluent phase of speech (i.e., ongoing utterance). For instance, even in real-time speech settings, speakers tend to take at least a brief second or two to respond to their interlocutors' questions (see Griffin & Bock, 2000, on how speakers, even while gazing at relevant objects, take 500 milliseconds before properly naming the objects in their L1). Within such contexts, these researchers posited that the preceding hesitant phase functions as the cognitive processing time, or the planning time, necessary for the subsequent fluent phase to take place (Butterworth, 1975). Therefore, the silence during the hesitant phase is psychological in nature in that, by delaying the onset of speech, speakers are either consciously or unconsciously striving to retrieve sufficient information from their working memory and prepare for the upcoming articulation (Butterworth, 1975; Goldman-Eisler, 1968). An extensive amount of time taken during the hesitant phase then illustrates that speakers are in need of longer buffering time (Levelt, 1989). In the current study, the buffered time used in addition to the given planning time would suggest that participants had to engage in extra planning owing to the inefficient and ineffective use of the original planning time given for the IP task (this will be discussed further in the following section). While time spent before articulation reflects the cyclical processes of naturalistic speaking contexts, it might function differently in time-pressured settings such as testing contexts. In fact, the amount of such buffered time was negatively associated with test performance, indicating that the longer participants spent "jump starting," the lower their test scores.

2 This is less likely, as participants did not show the same pattern of behaviors for the integrated tasks. A novelty effect of the tasks does not seem to be the answer either, since all task conditions (orders of test-tasks, planning conditions) were counter-balanced across study groups.
This suggests that, at least for the lower scorers, planning may not have been completed, or was carried out inefficiently, during the original pre-task planning time, which subsequently led these individuals to carve a portion of additional preparation time out of the actual responding time. This may have caused them to engage more frequently in inner speech monitoring while speaking and eventually to run out of time for providing complete responses. While further investigation is needed, the factor-analysis finding that time spent before articulation is negatively related to some fluency and syntactic complexity measures also potentially suggests an unfavorable effect on subsequent speech production (although the directionality of the effects on the different measures should be further examined). Overall, regardless of which pre-task planning condition participants were under, the effects could be limited owing to the inevitable spilled-over planning as well as the within-task planning taking place.

Going back to the general trend in the data, the findings seem to lend moderate support to the Limited Capacity Hypothesis (otherwise coined the Trade-off Hypothesis; Bygate, 1999; Skehan, 1998; Skehan & Foster, 1999), because there was a trade-off between fluency and accuracy in oral performance, and generally there was competition among the three performance dimensions. However, the fact that such a phenomenon was observed only within the IP tasks, and not uniformly across test-tasks or planning conditions, calls for speculation. Alternatively, the results could be understood through a finer categorization of task characteristics and conditions (Skehan, 2001). As previous researchers have advocated, the following task features are likely to advantage both accuracy and fluency: (1) tasks presenting concrete or familiar information; and (2) tasks containing clear structure (Skehan, 2009). In addition, tasks requiring information manipulation (e.g., summarizing or describing pictures) lead to higher complexity. From what participants elaborated, IP tasks are quite the opposite of these descriptions owing to the abstractness of the statements and the absence of cued information; rather, such features seem to pertain to the integrated tasks. If this was the case, then the integrated tasks may be equipped with characteristics that essentially free up limited working memory constraints during test taking and eventually advantage all or most of the CAF dimensions simultaneously (which is connected to the higher test scores participants received on these tasks). These may include [+ concreteness] of test prompts, [+ input resources] through the presentation of structured information for reference, and [- time pressure] with increased planning and responding time (see Table 36 in Chapter 3). Additionally, the aid of note-taking operating as pre-task planning for integrated tasks could have boosted these effects. On the other hand, IP tasks may inherently challenge test takers' capacity to attend concurrently to all three CAF constructs. The elements constituting IP tasks may have contributed to greater constraints on working memory (Huang & Hung, 2010), which led to the commitment of attention to one area of speech quality (Skehan, 2009).
Subsequently, this result may indicate that the IP tasks and the integrated tasks are not only tapping into different processing constraints for test taking, but also mapping onto different constructs of spoken performance (this will be discussed in the following section in relation to participants' perceived differences in the effectiveness of the planning conditions inherent in IP versus integrated tasks) (cf. see Brown et al., 2005, on how the two types of tasks did not exert significant qualitative differences; but see Kyle et al., 2016, for their take on the two task types by means of a variety of linguistic characteristics). Meanwhile, the results from the integrated tasks may partially corroborate Robinson's Cognition Hypothesis (Robinson, 2001, 2002) concerning the simultaneous improvement of the CAF dimensions. The premise of Robinson's claim is that accuracy and complexity, in particular, increase in contexts of increased task complexity; that is, the complex nature of the tasks pushes speakers to generate speech of enhanced quality (Swain, 1993). Based on Robinson's taxonomy of task complexity (2001, p. 294), integrated tasks do indeed possess several elements contributing to an added layer of complexity in the task conditions, such as the extra demands on language ability (e.g., reading, listening, then speaking) as well as the integration of relevant information into the performance. However, it turns out that in the present study, such inherent elements of the integrated tasks were perceived by participants to function positively for language production. That is, the input resources (e.g., reading passage, listening dialogue) served as references for participants. Although it is unknown to what exact extent participants incorporated the input sources directly into their responses, the possibility of such text integration taking place could explain the concurrent enhancement of accuracy and complexity in the integrated task performance. Recently, Crossley et al. (2014) found evidence of a positive influence of text integration practices on overall speech quality in the context of the IT-L task condition. They discovered that the amount of source-text wording integrated into the response was the strongest predictor of test takers' overall speaking test scores on the IT-L task. Implicit in Crossley et al. (2014) (and also in the extensive line of research on text integration in integrated writing tasks; see, for instance, Plakans & Gebril, 2016) is that through integration and repetition of source-text features, test takers are able to reproduce key contents of the sources in their responses, which in turn leads to the construction of an overall coherent, rich response. While Crossley et al. (2014) based their observations on usages of individual content words, it might also be the case that test takers go beyond word-level integration to exploit further grammatical and lexical properties of the language used in the source texts (e.g., verb phrases, collocations). To some extent, the responses to integrated tasks might then be summaries or paraphrases of the presented source texts at best (cf. IT-L tasks require a summary as well as the test taker's personal evaluation of the given problem in the task).
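To illustrate the kind of source-integration index discussed in connection with Crossley et al. (2014), one could compute the proportion of a response's content-word types that also occur in the source material. The snippet below is a rough sketch under assumptions of my own (regex tokenization, a small illustrative stop-word list, type-level overlap); it is not Crossley et al.'s actual procedure, and the example texts are hypothetical.

```python
import re

# Small illustrative stop-word list; a real analysis would use a fuller one
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
              "was", "were", "it", "that", "this", "i", "you", "they"}

def content_types(text: str) -> set:
    """Lowercased content-word types from a text (very rough tokenization)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

def source_overlap(response: str, source: str) -> float:
    """Proportion of the response's content-word types that also occur in the source."""
    resp, src = content_types(response), content_types(source)
    return len(resp & src) / len(resp) if resp else 0.0

# Hypothetical IT-L-style example
source_text = "The professor describes two strategies animals use to survive the winter."
response_text = "The professor says animals use two strategies to survive during winter."
print(round(source_overlap(response_text, source_text), 2))
```

A higher value on such an index would indicate heavier reuse of source wording in the spoken response, which is the pattern Crossley et al. (2014) found to predict IT-L scores.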
If the originality of language is not a major concern of evaluation (as implied in the TOEFL iBT Speaking rubric for integrated tasks), then such text-integration practices might actually be one of the key factors that make the overall performance appear stronger in dimensions such as accuracy and/or complexity (cf. see Crossley et al., 2014, for their concerns about the construct validity of integrated tasks owing to test takers' active text-integration practices). IP tasks, on the other hand, may appear to resemble the kind of simpler monologue tasks that previous researchers had identified (Foster & Skehan, 1996; Robinson, 2001), but in effect the elements they contain may impose higher cognitive demands on test takers, resulting in lower test performance and imbalanced development of the speech quality dimensions. While further empirical investigation is needed to confirm these speculations, participants' uniform way of discerning the task characteristics specific to IP and integrated tasks suggests that the widely held task complexity dichotomy in the field (e.g., Robinson, 2001) may not apply well to the current study's context, or generally to testing contexts such as the TOEFL iBT. Although task participants' perceptions (also defined as task difficulty; Robinson, 2001) cannot be relied upon entirely, the current study found plausible connections among the results pertaining to test performance, speech quality, and the perceived differences in test-taking conditions across test-task types.

The final point that should be addressed is the small yet significant effects of planning conditions found on three specific measures; namely, lexical errors per 100 words, subordination of clauses, and sentence linking devices. In particular, the two guided conditions lend mixed results; for instance, while the GT condition seemed to render the most lexical errors (within the IT-L task condition), both the GW and GT conditions generated more cohesive features in speech than the UG condition. The UG condition, on the other hand, facilitated subordination of clauses more so than the two guided conditions. In terms of the results pertaining to lexical errors and cohesive features in speech, Foster and Skehan (1996) found similar results in their study comparing the effects of detailed and undetailed planning conditions (which basically correspond to the guided and unguided planning conditions in the present study). In their study, error-free clauses (which encompassed both grammatical and lexical errors) were least likely to occur in the undetailed planning condition. On the other hand, complexity measured through the clause-to-C-unit ratio was greater in the detailed planning condition. The researchers hence suggested that when given explicit suggestions or guidance on how to use the given planning time, or on relevant pre-task planning strategies, task participants may direct their attention to the content of the message to be expressed (e.g., coherence of the speech) rather than to language. When also drawing on the interview responses, the GT condition in the current study may have caused selective channeling of resources in responses, facilitating global idea generation at the expense of the retrieval of appropriate language forms. Conversely, the condition that simply gave planning time and no directions as to tasks might have allowed participants to speak more freely.
Especially when considering that these effects were found on the integrated tasks, the inherent task characteristics such as [- planning time pressure] and [+ input resources] may have equipped participants with an enhanced sense of preparedness (or confidence in speaking), which in turn encouraged them to make more attempts at elaborating and hence to produce lengthier subordinate clauses. It could also be the case that the text-integration practices (mentioned above) enabled in the IT-RL task condition gave participants a sufficient source of information to "play around" with the language and eventually pushed them to try out more.

4.3 Research question 3

For research question 3, I analyzed the survey response data in terms of participants' confidence in their performance per test-task type, participants' perceptions of the appropriateness of planning time per test-task type, and participants' evaluation of the effectiveness of the type of planning. In accordance with the previous studies conducted within testing contexts, neither participants' self-assessment of task difficulty (indirectly gauged as confidence in task performance in the current study) nor their evaluation of the effectiveness of [+ planning] was found to differ significantly across planning conditions (e.g., Elder & Iwashita, 2005; Elder et al., 2002; Wigglesworth & Elder, 2010). Researchers considered such results to stem generally from the consistent null effect of [+ planning] found in testing conditions. For the various reasons given in the previous sections (the over-riding influence of within-task planning, note-taking, and test-task characteristics), the current study's finding concurs with what previous researchers have asserted. However, the results from the RM MANOVA on the survey responses revealed that participants' perceptions of the appropriateness of planning time per test-task type differed statistically significantly across planning conditions (Greenhouse-Geisser correction: F(2, 538) = 2.84, p = .027, ηp² = .016). The follow-up post-hoc tests showed that the planning time for IP tasks was considered by participants to be less sufficient than that for the two integrated tasks, but only under the two guided planning conditions. In other words, the planning time inherent in the IP tasks was perceived to be shorter when participants were given guidance in planning (either writing to plan or thinking to plan). This might indicate that neither GW nor GT operated positively in reducing the time pressure imposed on participants, particularly under the IP task condition. Two possibilities can be raised from this result: (1) the provision of guidance did not help participants use the planning time in the IP tasks efficiently; and (2) participants are likely to be less pressured, even by the short amount of planning time in the IP tasks, when not given any guidance in planning. In terms of (1), the physical act of writing (as pointed out by participants in the interviews), despite its advantages, takes up a good portion of the planning time; in such cases, it might not function efficiently in the most time-pressured planning condition, namely the IP tasks. The GT condition, similarly, may have restrained participants from freely resorting to their own strategies for planning.
This connects to point (2), in that the instructions given for planning may have operated as an additional layer of pressure for participants; presumably, they felt obliged to follow the instructions rather than resort to their own strategies for planning. Especially within the unique contexts of testing (enhanced time pressure, the high-stakes nature of tests, etc.), test takers are inclined to develop and use whichever strategy works best for them (Wigglesworth & Elder, 2010). In real-life settings, if test-takers have prepared extensively for such high-stakes tests, they are likely to have internalized a certain set of strategies that show them at their best (Swain, 1984). Thus, unlike non-testing contexts in which students are eager to learn and receive support from their teachers, careful guidance in planning may not be effective, or may even be distracting, for test-takers who have already developed their own way of responding to the test-tasks.

4.4 Research question 4

The quantified performance data observed in answering research questions 1 and 2 indirectly touched on the possible influence of test-task characteristics on how participants managed the given planning times per test-task type. The interview data, however, directly addressed the possible discrepancy between test developers and test takers in terms of the immediate impact of test-task characteristics on real-time test-taking and performance (Ockey, Koyama, & Setoguchi, 2013). There were essentially three recurring themes consistently maintained throughout the interview data.

First, in line with the survey responses, participants felt that they had benefitted the most under the GW planning condition. This was due to their familiarity with the act of writing to prepare for a task in both testing and non-testing situations. With the GT planning condition, on the other hand, some participants agreed that it had hampered their effective planning. It is quite interesting to note that, in reality, participants' test performance did not differ significantly between the two planning conditions. It could be that the familiarity factor inherent in the GW planning condition masked some of the advantageous features pertaining to the GT planning condition. This is plausible because some participants were indeed able to point out the efficiency of GT in certain instances. For instance, participants noted that GW is directly helpful in retrieving linguistic information, while GT is useful for idea generation. In addition, they also pointed out that GT is helpful for test-tasks that are given a shorter amount of planning time, which is indicative of a possible interaction effect between planning conditions and test-task types. Perhaps such an effect may be realized with participants with more diverse profiles (e.g., low or intermediate levels of English proficiency). Kawauchi (2005) is an example of this case: learners with lower levels of English proficiency in that study preferred and benefitted from the option of reviewing reading resources over making notes for planning. Such diverging patterns between learner groups were not clearly observable in the current study owing to the background of the participants, which is a limitation in the current study design that should be addressed by future studies.

Second, participants uniformly stated that the planning times should be offered in reverse between the IP and integrated test-task types.
This result further suggests that the IP test-tasks are not in any way 'simpler' than integrated test-tasks; it could even be that test takers perceive the IP test-tasks to be more difficult owing to how the tasks are designed. Presumably, the perceptions that different groups of stakeholders hold in terms of task complexity and difficulty vary considerably. Although it is not clearly articulated in the TOEFL iBT testing manual, the task-based language testing scholarship (or, largely, the language testing field) has shaped an assumption that integrated test-tasks requiring an added demand on language skills are inherently complex (e.g., Kyle et al., 2015; Plakans, 2010; Wigglesworth & Foster, 2016). Yet according to the participants, this precise dual demand on language skills was what assisted their test performance. The IP test-tasks, on the other hand, were equipped with task characteristics signifying a higher degree of complexity (Skehan, 1998): [+ time pressure] from the short amount of planning time, [+ abstractness] from the test prompt, and [- input resources] to refer to. On the basis of this categorization, participants' responses further reflected how they thought about the constructs of the test-tasks. Interestingly, they held diverging views on what the IP and integrated test-tasks each measure.

Third, participants' responses supported the possibility that spilled-over planning influenced the immediacy of performance on the IP test-tasks. In addition to the disadvantage of the short amount of planning time, participants offered that how they process and decode the test prompts is the reason their pre-task planning carries over into the actual responding time. If this is indeed a valid phenomenon, the short amount of planning time might, in some cases, operate destructively in subsequent responding processes. In fact, the trend found for research question 2 regarding time spent before articulation was that lower scorers engaged in more silence before they began to articulate their responses. To obtain a more comprehensive account of what time spent before articulation really entails, introspective investigations (e.g., retrospective think-aloud protocols or exploration of participants' written planning sheets) could be carried out in future research. Yet the negative relationship between test performance and increased amounts of carried-over time could at least raise a concern about the effectiveness of the short planning time given in IP test-tasks.

These study results, therefore, indicate two further points. First, observing how test takers used the given planning times can further tap into the inherent test-task characteristics and their influence on test performance. Second, test takers are capable of providing insights on the properties of test-task conditions and characteristics, and hence of revealing their perceptions of the validity of test constructs. This suggests the need to regularly monitor widely practiced task implementation methods and conditions, as well as to revisit test developers' assumptions about how test takers would perform on a given task.

CHAPTER 5: CONCLUSION

5.1 Implication

Task-based performance testing has its significance in promoting authenticity and practicality in assessment, with the aim of replicating the kinds of activities, as well as the language ability, that candidates are likely to (and need to) demonstrate in real-world contexts (Long, 2015; Wigglesworth & Elder, 2010).
However, such merits often disguise the hard reality that "as a test method…it remains one of the most expensive approaches to assessment and, in terms of development and delivery, one of the most complex" (Wigglesworth & Foster, 2017, p. 129). The fact that the test-task is the central unit underlying a number of relevant practices, such as test development, scoring, and language pedagogy, lends implications for different areas of task-based language assessment and pedagogy scholarship (Norris, 2016). In concluding the dissertation, I first draw on the purpose and the results of the current study to address its precise contribution to the existing and growing body of task-based language testing research, and then elaborate on the broader implications this research has for language pedagogy (Long, 2015; Van Gorp & Deygers, 2013).

First and foremost, the study adds to the literature of task-based language testing research by undertaking a joint investigation of both a test-task implementation condition (e.g., planning conditions) and test-task characteristics (e.g., IP and integrated test-task types) in a context closely resembling the operational testing of the TOEFL iBT. As a result, the study findings could be interpreted to address the theoretical and empirical underpinnings of an existing assessment, not to mention the ecological validity of the research. Studies of planning time thus far have been carried out in closed laboratory settings (e.g., Wigglesworth, 1997) or in real testing settings where planning times are not ordinarily provided (e.g., Elder & Iwashita, 2005; Elder et al., 2002; Tavakoli & Skehan, 2005; Wigglesworth & Elder, 2010); in such contexts, it may often be difficult to discern whether the effects rightfully pertain to the speech-processing mechanisms specific to the unique environment of test-taking. In this dissertation, the null effects of planning were consistent with the previous planning literature in language testing (Wigglesworth & Elder, 2010). However, the reasons for the precise discrepancy from conventional task-based research findings were discussed in light of the mediating influence of test-task types, which fundamentally stemmed from the unique contribution of the testing environment of the TOEFL iBT speaking section. Therefore, the study results not only addressed why there are divergent effects of planning in testing research, which in turn could also inform existing task-based research (Ellis, 2009; Skehan, 2016; Tavakoli & Skehan, 2005), but also tapped into the test-takers' processing constraints stemming from the imposed conditions and characteristics specific to the TOEFL iBT test-task types. Such an investigation has the potential to raise either concerns about, or validation of, how the employed test-tasks relate to the construct of interest, as well as to identify which factors may potentially obstruct valid and reliable testing practices (Wigglesworth & Foster, 2017). In this sense, the study can also be seen as an effort to bridge two lines of research: namely, the more theoretically underpinned general SLA research and language assessment. One aspect of the current study that adds to such a link is its grounding in the theoretical framework of conventional task-based research in addition to its use of methodologies from language testing research.
In this dissertation, I attempted to triangulate multiple data sources by conducting analyses linking the quantified scoring data to the CAF measures (e.g., the GEE analysis of the relationship between test performance and the quality of speech as measured through the CAF measures). This was to account for the arbitrariness of the three-way categorization of performance constructs, which is a critical concern from a measurement standpoint. Two relevant points can be raised. First of all, task-based research has mostly, and almost solely, conceptualized performance in terms of the extent to which higher and lower degrees of the CAF measures are identified; that is, higher degrees of the three dimensions (e.g., high accuracy, complexity, and fluency reflected in speech) generally imply greater spoken performance. However, performance as understood from the perspective of language assessment is a holistic concept that needs to be interpreted on the basis of multiple yardsticks and sources of evidence. Validation of an assessment product is essentially an ongoing pursuit of evidence as to whether the implemented task conditions or test-task features lead test takers to demonstrate, and tap into, their true ability. Second, the distinctions among the three constructs are not so much of a concern in language assessment. Oftentimes scoring rubrics may or may not treat each dimension as a separate entity; holistic rubrics, for instance, may have raters assign a broad band of performance that subsumes varying degrees of the performance dimensions (which might have been the case in the current study, as discussed in Chapter 4). Therefore, lower degrees of a particular dimension are not seen as detrimental to overall test performance. In addition, it is likely that raters are trained, or inclined, to assign a score based on their global impression of a test-taker's speech, paying less attention to specific aspects of performance. But at the same time, language testing research has not fully exploited what the three performance dimensions (CAF) can offer in illuminating performance differences across planning conditions and task characteristics (Wigglesworth & Frost, 2017). Presumably, the three CAF constructs are the most researched and supported 'watchdogs' of language development from both theoretical and empirical standpoints. They account for processing competence (Skehan, 1998), or the mechanisms related to response processes (AERA, APA, & NCME, 2014, Standards for Educational and Psychological Testing), which is an under-researched domain in language testing but one that is vital for comprehending the multi-faceted setting of testing. Thus, the CAF framework could give valuable insight into both test development and research practices, refining the rating scales and rubric on a narrower level and, ultimately, the test construct on a broader level.

The study results also pinpoint the discrepancy between speaking test developers and speaking test takers (Ockey et al., 2013). Oftentimes, test takers' voices operate as a supplemental source of reference in test validation (Cheng & DeLuca, 2011; Hamp-Lyons, 2000). However, the current study's findings provide validation evidence for the TOEFL iBT test-task types from the test takers owing to their evaluation of "test constructs and the interaction between these interpretations...and test design" (Fox & Cheng, 2007, p. 9).
The test takers in this study shed new light on a number of task conditions and characteristics whose theoretical and practical rationales may not have been clearly articulated. This demonstrates that test takers' perceptions and orientations are equally valid sources of evidence, prompting test developers to revisit hunches and assumptions about practices that have been taken for granted for an extensive period of time (Moss, Girard, & Haniford, 2006).

From a pedagogical standpoint, task-based testing research such as the current study can help teachers make informed decisions about their own assessments and instructional practices. As a starting point, the intertwined effects of task conditions and inherent task characteristics can inform syllabus design, teaching material development, and teachers' professional development more generally. More fundamentally, the findings concerning the discrepancy in planning effects between conventional task-based research and testing research can encourage teachers to connect assessment and instructional practices in their classrooms. Teachers can first draw on research such as the current study to learn what is unique to testing settings, and can then revise how they design and administer test-tasks for assessment purposes in their classrooms. For instance, they could try out different lengths of planning time or different planning activities for various in-class and assessment tasks (e.g., presentations, pair/group discussions, one-on-one interviews) and observe under which conditions students best demonstrate their spoken abilities. Exploring evidence-based findings that point to an intertwined effect of task design and language performance would strengthen teachers' assessment literacy and their awareness of how to gauge student performance appropriately in classrooms.

5.2 Limitations and future research

There are a number of limitations of the current study that could be addressed by future research. First, the study design could be refined to better ascertain the effects of planning time and test-task types. In the current study, the test-task conditions were adopted directly from the operational testing setting, but the findings could differ in a more tightly controlled environment in which the variables are manipulated in other ways. For instance, future work could investigate whether test takers truly benefit from shorter planning time for integrated test-tasks and longer planning time for IP test-tasks. Likewise, the planning conditions devised in the present study might not have been divergent enough from one another to generate significant differences in test performance. Given that participants showed a distinctive orientation toward a particular planning condition (e.g., GW), especially when performing a particular test-task, future studies could examine whether such effects hold with different types of test-tasks (e.g., decision-making tasks).

Second, participants may have performed better on integrated tasks because they could take notes before the planning time began. In this sense, the effect of pre-task planning might have been masked to some extent by this pre-planning activity. Although note-taking was allowed in order to simulate operational testing, future studies might consider removing the note-taking option to tease out the effect of pre-task planning.
Third, the current study collected data from speakers with a specific profile: they were relatively proficient speakers of English and they did not have extensive experience taking the TOEFL iBT test (or other English oral proficiency tests). The rationale for recruiting such participants was twofold: first, to ensure that a meaningful unit of speech was elicited for data analysis, and second, to avoid confounding participants' reactions to the test-task conditions with previous test-taking or test-preparation practices. However, the findings could differ for test takers with more diverse profiles; in particular, the limited effects of planning found in the current study could be reversed for speakers with lower levels of English proficiency (e.g., Kawauchi, 2005).

Finally, the study results pertaining to the CAF indices were based on subjective, manual coding. Although the coding procedures involved multiple steps and intervention sessions to maintain rigor and reliability, the labor-intensive nature of manual coding could have influenced how coders interpreted the data. Future studies could make use of automatic text-analysis tools (e.g., TAACO; Crossley et al., 2016; Kyle et al., 2015) to avoid the potential impact of coder subjectivity (and fatigue) on the speech data. However, such automatic tools should be used with caution, as they are primarily designed to analyze written texts and hence may not be sensitive enough to capture certain speech phenomena (e.g., pausing). In such cases, a certain amount of manual coding could be a beneficial supplement.

APPENDICES

Appendix A
Language learning and test-taking background questionnaire (in English)

Thank you for participating in this survey. This survey is distributed as part of a larger study conducted by Shinhye Lee. Please send her an email at leeshin2@msu.edu if you have any questions. This will take 10 to 15 minutes in total.

General Information
1. Name: ______
2. Gender: ¨ Female ¨ Male
3. Date and year of birth:
4. Name of university that you are enrolled in now:
5. Year in college: ¨ Freshman ¨ Sophomore ¨ Junior ¨ Senior ¨ MA/Ph.D.
6. Major field of study: _______________________
7. What is the main language you speak at home? _______________________
8. What other languages do you speak at home? _______________________

Language Learning Background
9. At what age did you first start to study English? _______________________
10. How long have you been studying English? __________(years) __________(months)
11. In which contexts/situations did you study English? Check all that apply.
¨ At home (from parents, caregivers)
¨ At school (primary, secondary, high school)
¨ At private institutions
¨ After immigrating to an English-speaking country
¨ At language courses during my study abroad in an English-speaking country
¨ Other (specify): _______________________________________
12. How often are you engaged in the following activities in English? (Daily / Weekly / A few times a month / Once a month or less)
Listening to news broadcasts or music ¨ ¨ ¨ ¨
Watching TV or movies ¨ ¨ ¨ ¨
Reading books/magazines ¨ ¨ ¨ ¨
Writing emails ¨ ¨ ¨ ¨
Speaking with friends outside the class ¨ ¨ ¨ ¨
13. Please rate on a scale of 1-6 your current ability in English reading, writing, speaking, and listening (circle the number below).
(1= Very poor; 2= Poor; 3= Fair; 4= Good; 5= Very good; 6= Native-like) Reading 1 2 3 4 5 6 Writing 1 2 3 4 5 6 Speaking 1 2 3 4 5 6 Listening 1 2 3 4 5 6 14. Please rate on a scale of 1-6 your interest in studying English (circle the number below). (1=Not interested to 6=Strongly interested) Strongly interested 1 Test-Taking Background 15. Please mark your relevance of the listed English language proficiency tests below. 4 5 2 3 Not interested 6 Awareness of the tests Test I don’t know the test at all I am somewhat familiar with the test I am very familiar with the test Test Preparation I am not preparing for I am currently preparing for the test the test TOEIC TOEIC Speaking TEPS TOEFL iBT IELTS OPIc ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ 16. Please mark the relevant box below regarding your test-taking experiences on the listed English language proficiency tests. Test-taking experience Number of test-taking Test I have never taken the TOEIC test ¨ I have taken the test ¨ 189 ¨ 1-3 times ¨ 4-6 times ¨ 7-10 times Table for question 16 (cont’d) TOEIC Speaking TEPS TOEFL iBT IELTS OPIc ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ 1-3 times ¨ 4-6 times ¨ 7-10 times ¨ 1-3 times ¨ 4-6 times ¨ 7-10 times ¨ 1-3 times ¨ 4-6 times ¨ 7-10 times ¨ 1-3 times ¨ 4-6 times ¨ 7-10 times ¨ 1-3 times ¨ 4-6 times ¨ 7-10 times 17. Please indicate your score on the relevant English language proficiency test below. If you have an OPIc score, please indicate the band level you were assigned. TOEIC TOEIC Speaking TEPS TOEFL iBT IELTS OPIc Test Scores/Band level 18. In what way did you prepare for the test(s) that you have indicated above? Test A week – A A month – 3 3 months – 6 6 months – 1 Test preparation method month ¨ months ¨ months ¨ year ¨ Test preparation courses offered at my university 190 Table for question 18 (cont’d) Test preparation courses at a private academy/institute Private study groups Online resources Self-study Other (specify): _____________ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ 19. If you have answered items on “test preparation courses,” which type of test-taking strategies were you instructed on? Please mark everything that applies below. ¨ Taking tests on computer ¨ Constructing a key template of speaking response ¨ Constructing a key template of speaking response ¨ Note-taking strategies ¨ Time management strategies (e.g., strategic use of responding & planning time) ¨ Other (specify): _______________________________________ 191 Appendix B Elicited imitation task Ortega, L., Iwashita, N., Norris, J. M., & Rabie, S. (2002, October 3-6). An investigation of elicited imitation tasks in crosslinguistic SLA research. Paper presented at the Second Language Research Forum, Toronto. Instruction (given in Korean): In this task, you’ll be asked to repeat some sentences in Korean and some sentences in English. Please follow the instructions carefully. Please do not take any notes during this exercise. Now let’s begin. 이 시험에서는 한국어 문장과 영어 문장을 듣고 문장을 소리내어 따라하는 능력을 테스트합니다. 주어진 설명을 잘 듣고 따라해 보세요. 시험을 치는 동안 노트필기는 할 수 없습니다. You are going to hear several sentences in Korean. After each sentence, there will be a short pause, followed by a tone sound {TONE}. Your task is to try to repeat exactly what you hear. You will be given sufficient time after the tone to repeat the sentence. Repeat as much as you can. Remember, DON'T START REPEATING THE SENTENCE UNTIL YOU HEAR THE TONE SOUND {TONE}. Now let's begin. 지금부터 한국어 문장 6개를 듣게 됩니다. 각 문장이 끝난 후, 삐소리가 나면 들었던 문장을 따라 말해보세요. 문장을 부분적으로라도 최대한 많이 따라하는 것이 중요합니다. 
문장을 반복할 시간은 삐소리 후 충분히 주어집니다. 삐소리를 듣기 전 문장을 반복하면 안됩니다. 지금부터 시험을 시작합니다. 나는 꽃이 좋다. (6 syllables) 3.9 seconds pause [translation: I like flowers] 나는 편지를 쓴다. (7 syllables) 4.1 s [translation: I write a letter] 나는 큰 차가 필요하다. (9 syllables) 5.6 s [translation: I need a big car] 비가와서 밖에 안 나간다. (10 syllables) 6 s [translation: As it is raining, I don't go out] 여자 아이는 넘어져서 다쳤다. (12 syllables) 7.2 s [translation: The girl fell down and got hurt] 나는 집에 돌아오자마자 밥을 먹었다. (15 syllables) 9.5 s 192 [translation: As soon as I returned home, I ate meal] That was the last Korean sentence 한국어 문장은 여기까지 입니다. Now, you are going to hear 30 sentences in English. Once again, after each sentence, there will be a short pause, followed by a tone sound {TONE}. Your task is to try to repeat exactly what you hear in English. You will be given sufficient time after the tone to repeat the sentence. Repeat as much as you can. Remember, DON'T START REPEATING THE SENTENCE UNTIL YOU HEAR THE TONE SOUND {TONE}. Now let's begin. 지금부터는 30개의 영어문장을 듣게 됩니다. 이전과 마찬가지로 각 문장을 들은 후 삐소리가 나면, 들었던 영어 문장을 따라 말해 보세요. 문장을 부분적으로라도 최대한 많이 따라하는 것이 중요합니다. 삐소리를 듣기 전 문장을 반복하면 안됩니다. 지금부터 시험을 시작합니다. 1. I have to get a haircut (7) 3.6 seconds pause 2. The red book is on the table (8) 4.55 s 3. The streets in this city are wide (8) 4.8 s 4. He takes a shower every morning (9) 5.37 s 5. What did you say you were doing today? (10) 5.45 s 6. I doubt that he knows how to drive that well (10) 6.2 s 7. After dinner I had a long, peaceful nap (11) 6.8 s 8. It is possible that it will rain tomorrow (12) 6.74 s 9. I enjoy movies which have a happy ending (12) 7.22 s 10. The houses are very nice but too expensive (12) 7.92 s 11. The little boy whose kitten died yesterday is sad (13) 8.6 s 12. That restaurant is supposed to have very good food (13) 8.1 s 13. I want a nice, big house in which my animals can live (14) 9.3 s 14. You really enjoy listening to country music, don't you (14) 8.7 s 15. She just finished painting the inside of her apartment (14) 8.7 s 16. Cross the street at the light and then just continue straight ahead (15) 9.85 s 17. The person I'm dating has a wonderful sense of humor (15) 9.1 s 18. She only orders meat dishes and never eats vegetables (15/16) 10.1 s 19. I wish the price of town houses would become affordable (15) 9.4 s 193 20. I hope it will get warmer sooner this year than it did last year (16) 10.1 s 21. A good friend of mine always takes care of my neighbor’s three children (16) 10.6 s 22. The black cat that you fed yesterday was the one chased by the dog(16) 10.8 s 23. Before he can go outside, he has to finish cleaning his room (16) 10.4 s 24. The most fun I've ever had was when we went to the opera (16) 10.2 s 25. The terrible thief whom the police caught was very tall and thin (17) 12.5 s 26. Would you be so kind as to hand me the book which is on the table? (17) 10.8 s 27. The number of people who smoke cigars is increasing every year (17/18) 11.1 s 28. I don't know if the 11:30 train has left the station yet (18) 11.2 s 29. The exam wasn't nearly as difficult as you told me it would be (18) 11.2 s 30. There are a lot of people who don’t eat anything at all in the morning (19) 12 s This is the end of the repetition task. Thank you. 문장반복시험이 끝났습니다. 감사합니다. 
194 Scoring guidelines for Elicited Imitation task SCORE 0 Criteria • Nothing (Silence) • Garbled (unintelligible, usually transcribed as XXX) • Minimal repetition, then item abandoned: - Only 1 word repeated - Only 1 content word plus function word(s) - Only 1 content word plus function word(s) plus extraneous words that weren’t in the original stimulus - Only function word(s) repeated NOTE: with only, just, yet (meaningful adverbs), score 1 SCORE 1 Criteria • When only about half of idea units are represented in the string but a lot of important information in the original stimulus is left out • When barely half of lexical words get repeated and meaningful content results that is unrelated (or opposed) to stimulus, frequently with hesitation markers • Or when string doesn’t in itself constitute a self-standing sentence with some (targetlike or nontargetlike) meaning (This may happen more often with shorter items, where if only 2 of 3 content words are repeated and no grammatical relation between them is attempted, then score 1) • Also when half of a long stimulus is left out, and the sentence produced is incomplete Examples - The- the street in... in... street... hmm (16/#2) - I wish... comfta-portable (19/#1) - I watch a movie (9/#22) - You don’t... don’t you? (14/#1) - He just finished (15/#23) (Closed word + Adv + lexical word) (score 1) Examples - Cross the cross--cross the street ahead and. (16/#4) - I don’t have nap (7/#1) - I ...the last year (20/#4) - I have to hair-haircu (1/#24) - Would you... the book on the table (26/#7) - I wonder... why he... drive... well (6/#9) - He just finished painting... inside the park (15/#11) - I enjoy movie what shew have a... have a (9/#3) - She only eats vegetables and have xx- never eat vegetables (18/#4) - I want to big nice house.(13/#25) - A good frien of my take a good my chilren (21/#25) - I wannata ....... animalslive (13/#26) - Zu book .... table (2/#26) - I doubt he how to drive (6/#25) -The little boy the kitten... no.. is sad... I can’t remember (11/#8) - Before... before he can go outside for (23/#11) 195 Examples - The gooda friend take care o- chi- children (left out that it was the neighbor’s children, and that they were three) (21/#1) - After dinner I have a long piece [peace?] of a nap ( smoking - apartment >house/room - he<>she - sense of humor> humor - finished cleaning>cleaned - order> eat - nice,big > big 196 - AUX cannot be omitted (can go> go) - a lot of Noun> 0 Noun -too Adj > 0 Adj • Changes in grammar that don’t affect meaning should be scored as 3. For instance, failure to supply past tense (had>have) and missing articles should be considered grammar change only (score 3). • By contrast, cases of extra marking or more marked morphology should be considered as meaning change. For example, a present tense repeated as past or as future should be scored as meaning change (score 2). • Similarly, singular/plural differences between stimulus and repeated string change the meaning, not only the grammar (score 2). • Changes of person (he for she or she for he) change the meaning; but problems of agreement (she...her versus she...his) should be considered grammatical change, not meaning change. • Ambiguous changes in grammar that COULD be interpreted as meaning changes from a NS perspective should be scored as 2. That is, as a general principle in case of doubt about whether meaning has changed or not, score 2. 
- Before he get outside...he must clean his room (23/#9) (Score 2)
- She always eat...meat...nev-never eat vegetable (18/#5) (Score 2)
- After dinner I have a long peaceful nap. (7/#17) (Score 3)
- The restaurant was supposed to have ve- good food. (12/#24) (Score 2)
- After the dinner I will have a long... sp- peaceful nap. (7/#8) (Score 2)
- The street in the city is wide (3/#8) (Score 2)
- She just finished painting ...his room inside (15/#14) (Score 2) (apartment is missing)
- The streets on the city is wide (3/#23) (Score 2) (We can't know whether the number agreement is just a grammar problem or an interpretation problem, but the string is ambiguous in meaning: (a) a generic plural statement or (b) a statement about one street; score 2.)

SCORE 4
Criteria
• Exact repetition: String matches stimulus exactly. Both form and meaning are correct without exception or doubt.
Examples

Appendix C
Test tasks

Test Set A

Task 1
Directions: You will now be asked to give your opinion about a familiar topic. Give yourself 15 seconds to prepare your response. Then record yourself speaking for 45 seconds.
Choose a place you go to often that is important to you and explain why it is important. Please include specific details in your explanation.

Task 2
Directions: You will now read a short passage and then listen to a conversation on the same topic. You will then be asked a question about them. After you hear the question, you will have 30 seconds to prepare your response and 60 seconds to speak. Give yourself 45 seconds to read the article.

Bus Service Elimination Planned
The university has decided to discontinue its free bus service for students. The reason given for this decision is that few students ride the buses and the buses are expensive to operate. Currently, the buses run from the center of campus past university buildings and through some of the neighborhoods surrounding the campus. The money saved by eliminating the bus service will be used to expand the over-crowded student parking lots.

Directions: Now listen to two students discussing the article.
Male student: I don’t like the university’s plan.
Female student: Really? I’ve ridden those buses, and sometimes there were only a few people on the bus.
Male student: I see your point. But I think the problem is the route’s out of date. It only goes through the neighborhoods that’ve gotten too expensive for students to live in. It’s ridiculous that they haven’t already changed the route – you know, so it goes where most off-campus students live now. I bet if they did that, they’d get plenty of students riding those buses.
Female student: Well, at least they’re adding more parking. It’s gotten really tough to find a space.
Male student: That’s the other part I don’t like, actually. Cutting back the bus service and adding parking’s just gonna encourage more students to drive on campus. And that’ll just add to the noise around campus and create more traffic…and that’ll increase the need for more parking spaces.
Female student: Yeah, I guess I can see your point. Maybe it would be better if more students used the buses instead of driving.
Male student: Right. And the university should make it easier to do that, not harder.

Directions: Give yourself 30 seconds to prepare your response to the following question. Then record yourself speaking for 60 seconds.
The man expresses his opinion of the university’s plan to eliminate the bus service. State his opinion and explain the reasons he gives for holding that opinion.
Task 3
Directions: Now listen to part of a lecture in an economics class.
Professor: So let’s talk about money. What is money? Well, typically people think of coins and paper “bills” as money, but that’s using a somewhat narrow definition of the term. A broad definition is this: money is anything that people can use to make purchases with. Since many things can be used to make purchases, money can have many different forms. Certainly, coins and bills are one form of money. People exchange goods and services for coins or paper bills, and they use this money, these bills to obtain other goods and services. For example, you might give a taxi driver five dollars to purchase a ride in his taxi. And he in turn gives the five dollars to a farmer to buy some vegetables. But, as I said, coins and bills aren’t the only form of money under this broad definition. Some societies make use of a barter system. Basically, in a barter system people exchange goods and services directly for other goods and services. The taxi driver, for example, might give a ride to a farmer in exchange for some vegetables. Since the vegetables are used to pay for a service, by our broad definition the vegetables are used in barter as a form of money. Now, as I mentioned, there’s also a second, a narrower definition of money. In the United States only coins and bills are legal tender—meaning that by law, a seller must accept them as payment. The taxi driver must accept coins or bills as payment for a taxi ride. OK? But in the U.S., the taxi driver is not required to accept vegetables in exchange for a ride. So a narrower definition of money might be whatever is legal tender in a society, whatever has to be accepted as payment.

Directions: Give yourself 20 seconds to prepare your response to the following question. Then record yourself speaking for 60 seconds.
Using the points and examples from the lecture, explain the two definitions of money presented by the professor.

Test set B

Task 1
Directions: You will now be asked to give your opinion about a familiar topic. Give yourself 15 seconds to prepare your response. Then record yourself speaking for 45 seconds.
What kind of reading material, such as novels, magazines, or poetry, do you most like to read in your free time? Explain why you find this kind of reading material interesting.

Task 2
Directions: Read a passage about a topic in psychology. You will have 45 seconds to read the passage. Begin reading now.

Actor-observer
People account for their own behavior differently from how they account for the behavior of others. When observing the behavior of others, we tend to attribute their actions to their character or their personality rather than to external factors. In contrast, we tend to explain our own behavior in terms of situational factors beyond our own control rather than attributing it to our own character. One explanation for this difference is that people are aware of the situational forces affecting them but not of situational forces affecting other people. Thus, when evaluating someone else’s behavior, we focus on the person rather than the situation.

Directions: Now listen to part of a lecture in a psychology class.
Professor: So we encounter this in life all the time, but many of us are unaware that we do this. Even psychologists who study it, like me. For example, the other day I was at the store and I was getting in line to buy something. But just before I was actually in line, some guy comes out of nowhere and cuts right in front of me.
Well, I was really annoyed and thought, “That was rude!” I assumed he was just a selfish, inconsiderate person when, in fact, I had no idea why he cut in line in front of me or whether he even realized he was doing it. Maybe he didn’t think I was actually in line yet…But my immediate reaction was to assume he was a selfish or rude person. OK, so a few days after that, I was at the store again. Only this time I was in a real hurry—I was late for an important meeting—and I was frustrated that everything was taking so long. And what’s worse, all the checkout lines were long, and it seemed like everyone was moving so slowly. But then I saw a slightly shorter line! But some woman with a lot of stuff to buy was walking toward it, so I basically ran to get there first, before her, and well, I did. Now, I didn’t think of myself as a bad or rude person for doing this. I had an important meeting to get to—I was in a hurry, so, you know, I had done nothing wrong.

Directions: Give yourself 30 seconds to prepare your response to the following question. Then record yourself speaking for 60 seconds.
Explain how the two examples discussed by the professor illustrate differences in the ways people explain behavior.

Task 3
Directions: Now listen to a conversation between a student and her advisor.
Advisor: OK, Becky, so, you’ve chosen all your courses for next term?
Student: Well, not really, professor. Actually, I’ve got a problem.
Advisor: Oh?
Student: Yeah, well, I still need to take an American literature course; it’s required for graduation. But I’ve been putting it off. But since my next term is my last…
Advisor: Yeah, you can’t put it off any longer!
Student: Right. The thing is though, it’s not offered next term.
Advisor: I see. Hmm. Ah, how about, ah, taking the course at another university?
Student: I thought about that. It’s offered at City College, but, that’s so far away. Commuting back and forth would take me a couple of hours, you know, a big chunk of time with all my other studies and everything.
Advisor: True, but it’s been done. Or, ah, there are a couple of graduate courses in American literature. Why not take one of those?
Student: Yeah, but, wouldn’t that be hard, though? I mean, it’s a graduate course; that’d be pretty intense.
Advisor: Yeah, it’d probably mean more studying than you’re used to, but I’m sure it’s not beyond your abilities.

Directions: Give yourself 20 seconds to prepare your response to the following question. Then record yourself speaking for 60 seconds.
The speakers discuss two possible solutions to the woman’s problem. Briefly summarize the problem. Then state which solution you recommend and explain why.

Test set C

Task 1
Directions: You will now be asked to give your opinion about a familiar topic. Give yourself 15 seconds to prepare your response. Then record yourself speaking for 45 seconds.
Some students prefer to work on class assignments by themselves. Others believe it is better to work in a group. Which do you prefer? Explain why.

Task 2
Directions: The university’s Dining Services Department has announced a change. Read an announcement about this change. You will have 45 seconds to read the announcement. Begin reading now.

Hot Breakfasts Eliminated
Beginning next month, Dining Services will no longer serve hot breakfast foods at university dining halls. Instead, students will be offered a wide assortment of cold breakfast items in the morning.
These cold breakfast foods, such as breads, fruit, and yogurt, are healthier than many of the hot breakfast items that we will stop serving, so health-conscious students should welcome this change. Students will benefit in another way as well, because limiting the breakfast selection to cold food items will save money and allow us to keep our meal plans affordable. Direction: Now listen to two students discussing the announcement. Female Student: Do you believe any of this? It’s ridiculous. Male Student: What do you mean? It is important to eat healthy foods. Female Student: Sure it is, but they’re saying yogurt’s better for you than an omelet, or than hot cereal? I mean whether something’s hot or cold, that shouldn’t be the issue. Except maybe on a really cold morning, but in that case, which is going to be better for you—a bowl of cold cereal or a nice warm omelet? It’s obvious; there’s no question. Male Student: I’m not going to argue with you there. Female Student: And this whole thing about saving money. Male Student: What about it? Female Student: Well, they’re actually going to make things worse for us, not better. ‘Cause if they start cutting back and we can’t get what we want right here, on campus, well, we’re going to be going off campus and pay off-campus prices, and you know what? That will be expensive. Even if it’s only two or three mornings a week, it can add up. Directions: Give yourself 30 seconds to prepare your response to the following question. Then record yourself speaking for 60 seconds. 202 The woman expresses her opinion of the change that has been announced. State her opinion and explain her reasons for holding that opinion. Task 3 Directions: Now listen to part of a lecture in a psychology class. The professor is discussing advertising strategies. Professor: In advertising, various strategies are used to persuade people to buy product. In order to sell more products, advertisers will often try to make us believe that a product will meet our needs or desires perfectly, even if it’s not true. The strategies they use can be subtle, uh, “friendly” forms of persuasion that are sometimes hard to recognize. In a lot of ads, repetition is a key strategy. Research shows that repeated exposure to a message, even something meaningless or untrue, is enough to make people accept it or see it in a positive light. You’ve all seen the car commercials on TV, like, uh, the one that refers to its “roomy” cars, over and over again. You know which one I mean. This guy is driving around and he keeps stopping to pick up different people—he picks up 3 or 4 people. And each time, the narrator says, “Plenty of room for friends, plenty of room for family, plenty of room for everybody.” The same message is repeated several times in the course of the commercial. Now, the car, uh, the car actually looks kind of small. It’s not a very big car at all, but you get the sense that it’s pretty spacious. You’d think that the viewer would reach the logical conclusion that the slogan, uh, misrepresents the product. Instead, what usually happens is that when the statement “plenty of room” is repeated often enough, people are actually convinced it’s true. Um, another strategy they use is to get a celebrity to advertise a product. It turns out that we’re more likely to accept an advertising claim made by somebody famous—a person we admire and find appealing. We tend to think they’re trustworthy. So, um, you might have a car commercial that features a well-known race car driver. 
Now, it may not be a very fast car—uh, it could even be an inexpensive vehicle with a low performance rating. But if a popular race car driver is shown driving it, and saying, “I like my cars fast!” then people will believe the car is impressive for its speed. Directions: Give yourself 20 seconds to prepare your response to the following question. Then record yourself speaking for 60 seconds. Using points and examples from the lecture, explain how persuasive strategies are used in advertising. 203 Appendix D Post questionnaire After Guided Planning Conditions 1. Please mark your confidence in performing each of the three test-tasks below. Confidence Test I was not confident at all. I was somewhat not confident. Independent task Integrated task: Reading & Listening Integrated task: Listening ¨ ¨ ¨ ¨ ¨ ¨ I was confident on an average I was fairly level. ¨ ¨ ¨ confident. I was completely confident. 2. Please mark your perceptions on the appropriateness of the planning time per test-task type. Appropriateness of planning time It was It was just Test It was not sufficient at all. Independent task Integrated task: Reading & Listening Integrated task: Listening ¨ ¨ ¨ somewhat not sufficient. ¨ ¨ ¨ 204 It was fairly sufficient. It was excessively sufficient. right. ¨ ¨ ¨ 3. To what extent did you use the particular planning activity for which test-task type? Test I did not use in all cases. Usefulness I fairly used the planning activity. I used the planning activity a lot. Independent task Integrated task: Reading & Listening Integrated task: Listening ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ 4. To what extent was the particular planning activity effective/useful for performing which test-task type? Test Independent task Integrated task: Reading & Listening Integrated task: Listening It was not useful at all. It was somewhat not useful. ¨ ¨ ¨ ¨ ¨ ¨ Usefulness It was useful on an average level. ¨ ¨ ¨ It was fairly useful. It was completely useful. ¨ ¨ ¨ ¨ ¨ ¨ 5. What are the pros and cons for using the particular planning activity? ___________________________________________________________________________ ___________________________________________________________________________ ____________________________________________________ 6. Have you practiced the particular planning activity you have just done before? ¨ Yes ¨ No 7. Have any of your teachers taught you how to plan before speaking? ¨ Yes ¨ No If yes, was the planning activity you just did now taught by your teachers? ¨ Yes ¨ No 205 After Unguided Planning Conditions 1. Please mark your confidence in performing each of the three test-tasks below. Confidence Test I was not confident at all. I was somewhat not confident. Independent task Integrated task: Reading & Listening Integrated task: Listening ¨ ¨ ¨ ¨ ¨ ¨ I was confident on an average I was fairly level. ¨ ¨ ¨ confident. I was completely confident. 2. Please mark your perceptions on the appropriateness of the planning time per test-task type. Test It was not sufficient at all. Independent task Integrated task: Reading & Listening Integrated task: Listening ¨ ¨ ¨ Appropriateness of planning time It was It was just somewhat not sufficient. ¨ ¨ ¨ right. ¨ ¨ ¨ It was fairly sufficient. It was excessively sufficient. 3. To what extent did you use the particular planning activity for which test-task type? Test I did not use in all cases. Usefulness I fairly used the planning activity. I used the planning activity a lot. 
Independent task Integrated task: Reading & Listening Integrated task: Listening ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ 206 4. To what extent was the particular planning activity effective/useful for performing which test-task type? Test Independent task Integrated task: Reading & Listening Integrated task: Listening It was not useful at all. It was somewhat not useful. ¨ ¨ ¨ ¨ ¨ ¨ Usefulness It was useful on an average level. ¨ ¨ ¨ It was fairly useful. It was completely useful. ¨ ¨ ¨ ¨ ¨ ¨ 5. Indicate (by ticking all relevant boxes) which of the following things you did during your planning time before you started speaking. I thought about grammar in my head. I practiced useful sentences or phrases in my head. I wrote down useful sentences or phrases on paper. I made a list of vocabulary in my head. I wrote down vocabulary in my head. I made a list of useful organizing and/or linking language in my head. I wrote down useful organizing and/or linking language on paper. I practiced pronunciation in my head. I tried to decide what topic I would talk about. I thought about how to organize my ideas. I thought about the content and ideas needed for the question. With 15 seconds ¨ ¨ With 20 seconds ¨ ¨ With 30 seconds ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ ¨ 207 Table for question 5 (cont’d) ¨ ¨ ¨ ¨ ¨ I wrote down ideas in my first language & then translated them. I thought about nothing. I did other things (please tell me what you did) Other (specify): ______________________________________________________________________________ ______________________________________________________________________________ ¨ ¨ ¨ ¨ 208 Appendix E One-on-one interview questions 1. Which planning activities were most helpful in performing the task? 2. Did you find the planning times appropriate for each test-task type? Say why or why not. 3. Do you think you used the planning time as well as you could have? Say why/ why not. 4. Have you ever been given instruction/training on how to use planning time? If yes, how useful was it? If no, do you think it would help to have this kind of training? 209 Appendix F Scoring rubric for speaking tasks Task 1 (Independent task) 210 Task 2 and Task 3 (Integrated task) 211 Appendix G Elder and Iwashita’s (2005) rating scales on fluency, accuracy, and complexity Fluency 5 4 3 2 1 Accuracy 5 4 3 2 1 Speaks without hesitation; speech is generally of a speed similar to a native speaker. Speaks fairly fluently with only occasional hesitation, false starts and modification of attempted utterance. Speech is only slightly slower than that of a native speaker. Speaks more slowly than a native speaker due to hesitations and word-finding delays. A marked degree of hesitation due to word-finding delays or inability to phrase utterances easily. Speech is quite disfluent due to frequent and lengthy hesitations or false starts. Errors are barely noticeable. Errors are not unusual, but rarely major. Manages most common forms, with occasional errors; major errors present. Limited linguistic control: major errors frequent. Clear lack of linguistic control even of basic forms. 212 Complexity 5 Confidently attempts a variety of verb forms (e.g., passives, modals, tense and aspect), even if the use is not always correct. Regularly takes risks grammatically in the service of expressing complex meaning. Routinely attempts the use of coordination and subordination to convey ideas that cannot be expressed in a single clause, even if the result is occasionally awkward or incorrect. 
Attempts a variety of verb forms, even if the use is not always correct. Takes risks grammatically in the service of expressing complex meaning. Regularly attempts the use of coordination and subordination to convey ideas that cannot be expressed in a single clause, even if the result is awkward or incorrect. Mostly relies on simple verb forms, with some attempt to use a greater variety of forms. Some attempts to use coordination and subordination to convey ideas that cannot be expressed in a single clause. Produces numerous sentence fragments in a predictable set of simple clause structures. If coordination and/or subordination are attempted to express more complex clause relations, this is hesitant and done with difficulty. Produces mostly sentence fragments and simple phrases. Little attempt to use any grammatical means to connect idea across clauses. 213 4 3 2 1 Appendix H Basic descriptive statistics for the raw coding data Test Set A for fluency indices Fluency UG (N = 32) IT-RL IP IT-L IP Test Set A GW (N = 33) IT-RL GT (N = 33) IT-RL IT-L IT-L IP Filled Pauses (Num) 3.36 (1.92) 4.80 (3.14) 4.27 (2.66) 4.20 (3.22) 4.80 (4.37) 5.18 (4.36) 3.03 (2.84) 4.10 (3.10) 4.16 (3.53) Unfilled Pauses (Num) 4.83 (2.41) 5.13 (2.03) 5.89 (2.42) 6.27 (2.45) 6.00 (3.07) 5.73 (2.04) 5.82 (2.14) 5.92 (2.44) 5.86 (2.38) Reformulations (Num) 0.78 (0.78) 1.13 (0.78) 1.03 (0.87) 0.79 (0.87) 1.02 (0.84) 0.79 (0.75) 0.76 (0.71) 1.27 (0.83) 1.19 (0.73) Repetitions (Num) 0.70 (0.86) 1.14 (1.32) 1.10 (1.02) 0.83 (0.97) 1.20 (1.17) 0.97 (0.94) 1.18 (0.88) 1.15 (1.00) 1.33 (1.13) Replacements (Num) 0.50 (0.67) 0.77 (0.87) 1.02 (0.88) 0.56 (0.95) 1.03 (1.12) 0.61 (0.69) 0.68 (0.76) 1.05 (1.12) 1.00 (0.92) Hesitations (Num) False starts (Num) 0.50 (0.77) 0.55 (0.78) 0.65 (0.71) 0.26 (0.44) 0.59 (0.72) 0.35 (0.66) 0.27 (0.45) 0.58 (0.94) 0.48 (0.65) 0.33 (0.53) 0.27 (0.44) 0.31 (0.49) 0.26 (0.61) 0.39 (0.54) 0.21 (0.48) 0.27 (0.45) 0.36 (0.53) 0.25 (0.42) 214 Fluency Test Set B for fluency indices UG (N = 33) IT-RL IP IT-L IP Test Set B GW (N = 33) IT-RL GT (N = 33) IT-RL IT-L IT-L IP Filled Pauses (Num) 3.49 (2.46) 4.80 (3.04) 4.27 (2.93) 4.06 (3.23) 5.44 (3.88) 4.92 (4.49) 3.82 (2.25) 4.17 (3.82) 3.30 (2.95) Unfilled Pauses (Num) 6.25 (3.13) 6.09 (2.84) 6.39 (2.57) 7.24 (2.54) 7.73 (2.27) 7.48 (2.30) 7.09 (3.38) 7.18 (3.18) 7.18 (3.32) Reformulations (Num) 0.70 (0.68) 0.97 (0.85) 1.12 (0.64) 0.73 (0.67) 1.12 (0.78) 1.02 (0.81) 1.11 (0.73) 1.30 (1.23) 1.17 (0.67) Repetitions (Num) 0.91 (0.74) 1.15 (1.03) 1.15 (1.00) 0.88 (1.06) 1.23 (1.22) 1.09 (1.03) 0.80 (0.84) 0.58 (0.71) 0.64 (1.85) Replacements (Num) 0.67 (0.65) 1.18 (1.41) 0.96 (0.92) 0.58 (0.66) 0.92 (1.10) 0.73 (0.79) 0.80 (0.84) 0.58 (0.71) 0.82 (1.07) Hesitations (Num) False starts (Num) 0.20 (0.39) 0.44 (0.66) 0.36 (0.60) 0.17 (0.41) 0.42 (0.57) 0.41 (0.57) 0.33 (0.54) 0.70 (1.01) 0.41 (0.63) 0.10 (0.26) 0.33 (0.46) 0.18 (0.41) 0.10 (0.29) 0.29 (0.45) 0.27 (0.52) 0.14 (0.34) 0.27 (0.45) 0.18 (0.37) 215 Fluency Test Set C for fluency indices UG (N = 32) IT-RL IP IT-L IP Test Set C GW (N = 33) IT-RL GT (N = 32) IT-RL IT-L IT-L IP Filled Pauses (Num) 3.59 (3.23) 4.80 (3.00) 4.44 (3.04) 3.97 (3.43) 5.21 (5.15) 5.09 (4.71) 3.03 (2.60) 3.47 (3.86) 4.03 (4.10) Unfilled Pauses (Num) 6.23 (2.24) 6.70 (2.86) 6.14 (2.95) 6.73 (2.08) 7.12 (3.04) 5.99 (2.80) 6.47 (2.76) 7.13 (2.78) 6.41 (2.85) Reformulations (Num) 0.89 (0.78) 1.23 (0.81) 1.03 (0.82) 1.05 (0.54) 1.02 (0.78) 1.26 (1.06) 0.78 (0.61) 1.11 (0.90) 1.00 (0.72) Repetitions (Num) 1.00 (1.02) 1.39 (0.92) 1.48 
(1.17) 0.71 (0.84) 0.89 (0.79) 0.68 (0.79) 0.42 (0.71) 1.09 (0.89) 0.73 (0.72) Replacements (Num) 0.80 (0.90) 1.13 (1.29) 0.98 (0.85) 0.58 (0.79) 0.67 (0.89) 0.80 (0.83) 0.47 (0.66) 0.64 (0.89) 0.75 (1.02) Hesitations (Num) False starts (Num) 0.36 (0.65) 0.30 (0.52) 0.44 (0.67) 0.42 (0.90) 0.42 (0.74) 0.39 (0.83) 0.25 (0.66) 0.52 (0.83) 0.45 (0.73) 0.22 (0.55) 0.19 (0.40) 0.20 (0.40) 0.12 (0.33) 0.08 (0.25) 0.14 (0.34) 0.03 (0.18) 0.13 (0.34) 0.34 (0.48) 216 Accuracy Test Set A for accuracy indices UG (N = 32) Test Set A GW (N = 33) GT (N = 33) IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L Error-free clauses (Num) 2.39 (1.34) 2.28 (1.46) 2.90 (1.25) 2.05 (1.40) 2.05 (1.39) 2.61 (1.29) 2.05 (1.34) 2.15 (1.56) 2.03 (1.49) Lexical errors (Num) 1.58 (0.76) 1.70 (0.94) 1.69 (0.73) 1.77 (1.07) 1.79 (1.20) 1.76 (1.00) 1.35 (1.06) 1.50 (1.00) 1.42 (0.88) Accuracy Test Set B for accuracy indices UG (N = 33) Test Set B GW (N = 33) GT (N = 33) IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L Error-free clauses (Num) 2.18 (1.26) 2.27 (1.46) 3.03 (1.47) 2.58 (1.80) 2.21 (1.39) 3.12 (2.06) 2.85 (1.40) 2.94 (1.30) 3.58 (1.92) Lexical errors (Num) 1.24 (0.93) 1.58 (0.84) 1.70 (0.82) 1.30 (0.93) 1.47 (0.95) 1.47 (0.93) 1.03 (0.84) 1.38 (1.19) 1.56 (1.37) 217 Accuracy Test Set C for accuracy indices UG (N = 32) Test Set C GW (N = 33) GT (N = 32) IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L Error-free clauses (Num) 3.13 (1.63) 3.05 (1.30) 3.14 (1.95) 2.41 (1.18) 2.97 (1.57) 3.49 (1.53) 3.16 (1.71) 3.23 (1.95) 3.28 (1.67) Lexical errors (Num) 1.25 (0.95) 1.22 (1.07) 1.44 (1.03) 1.00 (0.97) 1.23 (1.23) 0.97 (1.34) 0.72 (0.89) 0.83 (0.99) 0.95 (1.23) Complexity Test Set A for complexity indices UG (N = 32) Test Set A GW (N = 33) GT (N = 33) IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L AS-Units (Num) 2.75 (1.24) 3.47 (1.53) 3.31 (1.24) 2.99 (1.34) 3.85 (1.71) 3.67 (1.68) 2.85 (1.14) 3.35 (1.43) 3.11 (1.18) Subordinate clauses (Num) 1.66 (1.00) 2.03 (1.07) 2.03 (1.05) 1.47 (0.98) 1.73 (1.16) 2.02 (1.42) 1.64 (0.86) 1.68 (1.14) 1.80 (1.03) 218 Complexity Test Set B for complexity indices UG (N = 33) Test Set B GW (N = 33) GT (N = 33) IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L AS-Units (Num) 2.91 (1.07) 3.61 (1.32) 4.08 (1.74) 3.09 (1.38) 3.46 (1.49) 4.23 (1.93) 2.77 (1.08) 3.49 (1.18) 4.11 (1.46) Subordinate clauses (Num) 1.27 (0.84) 1.38 (0.82) 1.86 (1.25) 1.42 (1.00) 1.65 (1.12) 2.35 (1.15) 1.29 (1.22) 1.71 (1.06) 2.09 (1.22) Accuracy Test Set C for complexity indices UG (N = 32) Test Set C GW (N = 33) GT (N = 32) IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L AS-Units (Num) 2.88 (0.98) 3.39 (1.17) 3.89 (1.24) 2.85 (0.97) 3.30 (1.47) 3.74 (1.47) 2.73 (1.07) 3.54 (1.10) 3.52 (1.06) Subordinate clauses (Num) 0.81 (0.78) 1.47 (1.04) 2.20 (1.42) 1.20 (1.09) 1.76 (1.25) 1.97 (1.38) 1.17 (0.85) 1.56 (1.01) 1.58 (0.87) 219 Rating category Accuracy Complexity Fluency Rating category Accuracy Complexity Fluency Test Set A for CAF ratings UG (N = 32) IT-RL IP IT-L IP Test Set A GW (N = 33) IT-RL GT (N = 33) IT-RL IT-L IT-L IP 3.48 (1.01) 3.45 (0.96) 3.42 (1.02) 3.38 (1.08) 3.24 (0.90) 3.27 (0.91) 3.59 (0.80) 3.79 (0.89) 3.72 (1.02) 3.34 (1.04) 3.38 (1.03) 3.31 (1.24) 3.27 (1.04) 3.59 (0.98) 3.38 (1.00) 3.73 (0.79) 3.79 (0.84) 3.75 (0.92) 3.53 (1.13) 3.50 (1.11) 3.34 (1.16) 3.35 (1.24) 3.55 (1.07) 3.30 (1.13) 3.68 (0.94) 3.99 (0.84) 3.81 (0.87) Test Set B for CAF ratings UG (N = 33) Test Set A GW (N = 33) GT (N = 33) IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L 3.45 (0.71) 3.44 (0.77) 3.49 (0.66) 3.44 (0.87) 3.38 (0.96) 3.47 (0.90) 3.41 (0.78) 3.59 
(0.91) 3.73 (0.81) 3.52 (0.76) 3.61 (0.76) 3.74 (0.77) 3.41 (0.97) 3.36 (1.02) 3.55 (0.91) 3.38 (0.76) 3.56 (0.82) 3.73 (0.74) 3.42 (0.88) 3.65 (0.70) 3.64 (0.77) 3.29 (1.07) 3.29 (1.09) 3.49 (0.95) 3.44 (0.84) 3.53 (0.85) 3.77 (0.70) 220 Test Set C for CAF ratings UG (N = 33) Test Set C GW (N = 33) GT (N = 33) IP IT-RL IT-L IP IT-RL IT-L IP IT-RL IT-L 3.48 (1.01) 3.45 (0.96) 3.42 (1.02) 3.38 (1.08) 3.24 (0.90) 3.27 (0.91) 3.59 (0.80) 3.79 (0.89) 3.72 (1.02) 3.34 (1.04) 3.38 (1.03) 3.31 (1.24) 3.27 (1.04) 3.59 (0.98) 3.38 (1.00) 3.73 (0.79) 3.79 (0.84) 3.75 (0.92) 3.53 (1.13) 3.50 (1.11) 3.34 (1.16) 3.35 (1.24) 3.55 (1.07) 3.30 (1.13) 3.68 (0.94) 3.99 (0.84) 3.81 (0.87) Rating category Accuracy Complexity Fluency 221 REFERENCES 222 REFERENCES AERA, APA, NCME. (2014). Standards for educational and psychological testing. Washington, DC: American Education Research Association. Bachman, L. F. (2002). Some reflections on task-based language performance assessment. Language Testing, 19(4), 453-476. Bachman, L. F. (2004). Statistical analyses for language assessment. Cambridge, England: Cambridge University Press. Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford: Oxford University Press. Batstone, R. (2005). Planning as discourse activity. In R. Ellis (Ed.), Planning and Task Performance in a Second Language (pp. 277-296). Amsterdam: John Benjamins. Barkaoui, K., Brooks, L., Swain, M., & Lapkin, S. (2012). Test-takers’ strategic behaviors in independent and integrated speaking tasks. Applied Linguistics, 34, 304-324. Bechger, T. M., Maris, G., & Hsiao, Y. P. (2010). Detecting halo effects in performance-based examinations. Applied Psychological Measurement, 34, 607–619. Bley-Vroman, R. (1983). The comparative fallacy in interlanguage studies: the case of systematicity. Language Learning, 33(1), 1-17. Bond, T. G., & Fox, C. M. (2007). Applying the Rasch model: Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum. Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89-110. Bowles, M. A. (2011). Measuring implicit and explicit knowledge: What can heritage learners contribute? Studies in Second Language Acquisition, 33, 247-271. Brown, A., Iwashita, N., & McNamara, T. (2005). An examination of rater orientations and test- taker performance on English-for-academic purposes speaking tasks (Monograph No. 29). Educational Testing Service. Brown, G., Anderson, A., Shilcock, R., & Yule, G. (1984). Teaching talk: Strategies for production and assessment. Cambridge: Cambridge University Press. Bulté, B., & Housen, A. (2012). Defining and operationalising L2 complexity. In A. Housen, V. Kuiken & I. Vedder (Eds.), Dimensions of L2 performance and proficiency: Complexity, 223 accuracy and fluency in SLA (pp. 21-46). Amsterdam: John Benjamins. Butler, F., Eignor, D., Jones, S., McNamara, T., & Suomi, B. (1999). TOEFL 2000 speaking framework: A working paper. (TOEFL Monograph Series Report No. 20). Princeton, NJ: Educational Testing Service. Butterworth, B. (1975). Hesitation and semantic planning in speech. Journal of Psycholinguistic Research, 4, 75–87. Bygate, M. (1996). Effects of task repetitions: Appraising the developing language of learners. In J. Willis & D. Willis (Eds.), Challenge and change in language teaching (pp. 136-146). Oxford: Heinemann. Cicchetti, D. V. (1994). 
Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6, 284– 290. Cheng, L., & DeLuca, C. (2011) Voices from test-takers: further evidence for language assessment validation and use. Educational Assessment, 16(2), 104-122. Choi, I.-C. (2008). The impact of EFL testing on EFL education in Korea. Language Testing, 25(1), 39-62. Cox, T. L., Bown, J., & Burdis, J. (2015). Exploring proficiency-based vs. performance-based items with elicited imitation assessment. Foreign Language Annals, 48(3), 350-371. Crisp, V. (2012). An investigation of rater cognition in the assessment of projects. Educational Measurement: Issues and Practice, 31(3), 10-20. Crookes, G. (1989). Planning and interlanguage variation. Studies in Second Language Acquisition, 11, 367-383. Crossley, S., Clevinger, A., & Kim, Y. (2014). The role of lexical properties and cohesive devices in text integration and their effect on human ratings of speaking proficiency. Language Assessment Quarterly, 11(3), 250–270. Crossley, S. A., Kyle, K., & McNamara, D. S. (2016). The tool for the automatic analysis of text cohesion (TAACO): Automatic assessment of local, global, and text cohesion. Behavior Research Methods, 48(4), 1227-1237. Cublio, J., & Winke, P. (2013). Redefining the L2 listening construct within an integrated writing task: Considering the impacts of visual-cue interpretation and note-taking. Language Assessment Quarterly, 10(4), 371-397 Davis, A., Brown, A., Elder, C., Hill, K., Lumley, T., & McNamara, T. (1999). Studies in language testing: Dictionary of language testing. Cambridge: Cambridge University Press. 224 De Bot, K. (1992). A bilingual production model: Levelt’s speaking model adapted. Applied Linguistics, 13, 1–24. De Jong, N. H., Groenhout, R., Schoonen, R., & Hulstijn, J. H. (2013). Second language fluency: Speaking style or proficiency? Correcting measures of second language fluency for first language behavior. Applied Psycholinguistics, 36(2), 223-243. Eckes, T. (2009). Many-facet Rasch measurement. Reference supplement to the manual for relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment. Educational Testing Service (2012). The official guide to the TOEFL® test. Princeton, NJ: Educational Testing Service. Elder, C., & Iwashita, N. (2005). Planning for test performance: What difference does it make? In R. Ellis (Ed.), Planning and task performance in a second language (pp. 219-238). Amsterdam: John Benjamins. Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiency tasks: What does the test-taker have to offer? Language Testing, 19, 347-368. Ellis, R. (1987). Interlanguage variability in narrative discourse: Style shifting in the use of the past tense. Studies in Second Language Acquisition, 9, 1–20. Ellis, R. (2005). Planning and task performance in a second language. Amsterdam: John Benjamins. Ellis, R. (2009). The differential effects of three types of task planning on the fluency, complexity and accuracy in L2 oral production. Applied Linguistics, 30(4), 474-509. Erlam, R. (2006). Elicited imitation as a measure of L2 implicit knowledge: An empirical validation study. Applied Linguistics, 27(3), 464–491. Ferrari, S. (2012). A longitudinal study of complexity, accuracy and fluency variation in second language development. In A. Housen, F. Kuiken, & I. 
Vedder (Eds.), Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in SLA (pp. 277–297). Philadelphia: John Benjamins. Field, A., Miles, J., Field, Z. (2012). Discovering statistics using R. London: Sage Publications Limited. Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical transactions of the Royal Society of London, A222, 309-368. Fisicaro, S. A., & Lance, C. E. (1990). Implications of three causal models for the measurement of halo error. Applied Psychological Measurement, 14, 419-429. Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: John Wiley. 225 Foster, P. (1996). Doing the task better: How planning time influences students’ performance. In J. Willis & D. Willis (Eds.), Challenge and change in language teaching (pp. 126–135). London: Heinemann. Foster, P., & Skehan, P. (1996). The influence of planning on performance in task-based learning. Studies in Second Language Acquisition, 18(3), 299-324. Foster, P., Tonkyn, A., & Wigglesworth, G. (2000). Measuring spoken language: A unit for all reasons. Applied Linguistics, 21, 354-375. Fox, J., & Cheng, L. (2007). Did we take the same test? Differing accounts of the Ontario Secondary School Literacy Test by first and second language test-takers. Assessment in Education: Principles, Policy and Practice, 14(1), 9–26. Freed, B. (2000). Is fluency, like beauty, in the eyes (and ears) of the beholder? In H. Riggenbach (Ed.), Perspectives on fluency (pp. 243-265). Ann Arbor: The University of Michigan Press. Friedman, A. (2012). How to collect and analyze qualitative data. In A. Mackey & S. M. Gass (Eds.), Research methods in second language acquisition: A practical guide (pp. 180- 200). Chichester: Wiley-Blackwell. Ghisletta, P., & Spini, D. (2004). An introduction to generalized estimating equations and an application to assess selectivity effects in a longitudinal study on very old individuals. Journal of Educational and Behavioral Statistics, 29, 421-437. Gilabert, R. (2007). The simultaneous manipulation of task complexity along planning time and (+/_ Here-and-Now): Effects on L2 oral production. In M. Garcia-Mayo (Ed.), Investigating Tasks in Formal Language Learning. (pp. 44–68). Clevedon: Multilingual Matters. Goldman-Eisler, F. (1968). Psycholinguistics: Experiments in spontaneous speech. New York: Academic Press. Grainger, P., Purnell, K., & Kipf, R. (2008). Judging quality through substantive conversations between markers. Assessment and Evaluation in Higher Education, 33, 133–142. Griffin, Z. M., & Bock, K. (2000). What the eyes say about speaking. Psychological Science, 11(4), 274–279. Harvill, L. M. (1991). NCME instructional module: Standard error of measurement. Educational Measurement: Issues and Practice, 10(2), 33–41. Housen, A., & Kuiken, F. (2009). Complexity, accuracy, and fluency in second language acquisition. Applied linguistics, 30(4), 461-473. Huang, H. T. D., & Hung, S. T. A. (2010). Examining the practice of a reading-to-speak test task: anxiety and experience of EFL students. Asia Pacific Education Review, 11(2), 235- 226 242. Hong, H. T. V., Huang, H. T. D., & Hung, S. T. A. (2016). Test-taker characteristics and integrated speaking test performance: A path-analytic study. Language Assessment Quarterly, 13(4), 283-301. Hunt, K. W. (1965). Differences in Grammatical Structures Written at Three Grade Levels. National Council of Teachers of English, Urbana, IL. Research Report No. 3. 
Iwashita, N., McNamara, T., & Elder, C. (2001). Can we predict task difficulty in an oral proficiency test? Exploring the potential of information processing approach to task design. Language Learning, 51, 401-436. Kahng, J. (2014). Exploring utterance and cognitive fluency of L1 and L2 English speakers: Temporal measures and stimulated recall. Language Learning, 64(4), 809-854. Kawauchi, C. (2005). The effects of strategic planning on the oral narratives of learners with low and high intermediate proficiency. In R. Ellis (Ed.), Planning and task-performance in a second language (pp. 143-164). Amsterdam: John Benjamins. Koponen, M., & Riggenbach, H. (2000). Overview: Varying perspectives on fluency. In H. Riggenbach (Ed.), Perspectives on fluency (pp. 5–24). Ann Arbor: University of Michigan Press. Kormos, J., & Denes, M. (2004). Exploring measures and perceptions of fluency in the speech of second language learners. System, 32, 145-164. Kormos, J. (2006). Speech production and second language acquisition. Mahwah, NJ: Lawrence Erlbaum Associates. Kormos, J. (2011). Speech production and the Cognition Hypothesis. In P. Robinson (Ed.), Second Language Task Complexity: Researching the Cognition Hypothesis of Language Learning and Performance (pp. 39-60). Amsterdam: John Benjamins. Kuiken, F., & Vedder, I. (2008). Cognitive task complexity and written output in Italian and French as a foreign language. Journal of Second Language Writing, 17, 48-60. Kyle, K., Crossley, S. A., & McNamara, D. S. (2016). Construct validity in TOEFL iBT speaking tasks: Insights from natural language processing. Language Testing, 33, 319- 340. Lance, C. E., LaPointe, J. A., & Stewart, A. M. (1994). A test of the context dependency of three causal models of halo rater error. Journal of Applied Psychology, 79, 332-340. Laufer, B. (1991). The development of L2 lexis in the expression of the advanced learner. The Modern Language Journal, 75, 440-448. Laufer, B., & Nation, P. (1995). Vocabulary size and use: Lexical richness in L2 written 227 production. Applied Linguistics, 16, 307-322. Lazaraton, A. (2002). A qualitative approach to the validation of oral language tests. Cambridge: Cambridge University Press. Lee, S., & Winke, P. (2018). Young language learners’ response processes when taking computerized tasks for speaking assessment. Language Testing, 35(2), 239-269. Lennon, P. (1990). Investigating fluency in EFL: A quantitative approach. Language Learning, 40, 387-417. Levelt, W. (1989). Speaking: From intention to articulation. Cambridge, MA: The MIT Press. Linacre, J. M. (1989). Many-faceted Rasch measurement. Chicago, IL: MESA Press. Linacre, J. M. (2000). Item discrimination and infit mean-squares. Rasch Measurement: Transactions of the Rasch Measurement SIG, 14(2), 743. Linacre, J. M. (2017). FACETS Rasch-model computer program (Version 3.80.3) [Computer software]. Chicago, IL: Winsteps.com. Linacre, J. M., & Wright, B. D. (2002). Construction of measures from many-facet data. Journal of Applied Measurement, 3(4), 484–509. Long, M. (2015). Second language acquisition and task-based language teaching. Maiden: Wiley-Blackwell. Lumely, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19, 246-276. Malvern, D., & Richards, B. (2002). Investigating accommodation in language proficiency interviews using a new measure of lexical diversity. Language Testing, 19, 85-104. McGraw, K. O., & Wong, S. P. (1996). 
McNamara, T. (1995). Modelling performance: Opening Pandora's box. Applied Linguistics, 16, 159-179.
Mehnert, U. (1998). The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition, 20, 83–108.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–67.
Mochizuki, N., & Ortega, L. (2008). Balancing communication and grammar in beginning-level foreign language classrooms: A study of guided planning and relativization. Language Teaching Research, 12, 11–37.
Moss, P. A., Girard, B. J., & Haniford, L. C. (2006). Validity in educational assessment. Review of Research in Education, 30, 109–162.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189-227.
Nitta, R., & Nakatsuhara, F. (2014). A multifaceted approach to investigating pre-task planning effects on paired oral test performance. Language Testing, 31(2), 147-175.
Norris, J. M. (2009). Task-based teaching and testing. In M. H. Long & C. J. Doughty (Eds.), The handbook of language teaching (pp. 578-594). Chichester, UK: Wiley-Blackwell.
Norris, J. M. (2016). Current uses for task-based language assessment. Annual Review of Applied Linguistics, 36, 230-244.
Norris, J. M., & Ortega, L. (2009). Towards an organic approach to investigating CAF in instructed SLA: The case of complexity. Applied Linguistics, 30, 555-578.
Ockey, G. J., Koyama, D., & Setoguchi, E. (2013). Stakeholder input and test design: A case study on changing the interlocutor familiarity facet of the group oral discussion test. Language Assessment Quarterly, 10(1), 1–17.
Ortega, L. (1999). Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition, 21, 109-148.
Ortega, L. (2000). Understanding syntactic complexity: The measurement of change in the syntax of instructed L2 Spanish learners. Unpublished doctoral dissertation, University of Hawai'i at Manoa.
Ortega, L. (2005). What do learners plan? Learner-driven attention to form during pre-task planning. In R. Ellis (Ed.), Planning and task performance in a second language (pp. 77-109). Amsterdam: John Benjamins.
Ortega, L., Iwashita, N., Norris, J. M., & Rabie, S. (2002, October 3-6). An investigation of elicited imitation tasks in crosslinguistic SLA research. Paper presented at the Second Language Research Forum, Toronto.
Papageorgiou, S., Stevens, R., & Goodwin, S. (2012). The relative difficulty of dialogic and monologic input in a second-language listening comprehension test. Language Assessment Quarterly, 9(4), 375-397.
Park, H. I. (2015). Language and cognition in monolinguals and bilinguals: A study of spontaneous and caused motion events in Korean and English. Unpublished doctoral dissertation, Department of Linguistics, Georgetown University, Washington, DC.
Plakans, L. (2010). Independent versus integrated writing tasks: A comparison of task representation. TESOL Quarterly, 44(1), 185–194.
Plakans, L., & Gebril, A. (2016). Exploring the relationship of organization and connection with scores in integrated writing assessment. Assessing Writing, 31, 98-112.
Polio, C. (1997). Measures of linguistic accuracy in second language writing research. Language Learning, 47, 101-143.
Prabhu, N. S. (1987). Second language pedagogy. Oxford: Oxford University Press.
Richards, J., Platt, J., & Platt, H. (1996). Dictionary of language teaching and applied linguistics. London: Longman.
Riggenbach, H. (1991). Toward an understanding of fluency: A microanalysis of nonnative speaker conversation. Discourse Processes, 14, 423-441.
Robinson, P. (1995). Task complexity and second language narrative discourse. Language Learning, 45, 99-140.
Robinson, P. (2001). Task complexity, task difficulty, and task production: Exploring interactions in a componential framework. Applied Linguistics, 22, 27-57.
Robinson, P. (2003). Attention and memory. In C. J. Doughty & M. H. Long (Eds.), The handbook of second language acquisition (pp. 631-678). Oxford: Blackwell.
Robinson, P. (2005). Cognitive complexity and task sequencing: Studies in a componential framework for second language task design. International Review of Applied Linguistics, 43, 1-32.
Robinson, P. (2011). Second language task complexity, the Cognition Hypothesis, language learning, and performance. In P. Robinson (Ed.), Second language task complexity: Researching the Cognition Hypothesis of language learning and performance (pp. 3-37). Amsterdam: John Benjamins.
Rutherford, K. (2001). An investigation into the effects of planning on oral production in a second language. Unpublished master's thesis, University of Auckland, Auckland, New Zealand.
Sangarun, J. (2005). The effects of focusing on meaning and form in strategic planning. In R. Ellis (Ed.), Planning and task performance in a second language (pp. 111-141). Amsterdam: John Benjamins.
Segalowitz, N. (2010). Cognitive bases of second language fluency. New York: Routledge.
Skehan, P. (1996). A framework for the implementation of task-based instruction. Applied Linguistics, 17, 38-62.
Skehan, P. (1998). A cognitive approach to language learning. Oxford: Oxford University Press.
Skehan, P. (2009). Modelling second language performance: Integrating complexity, accuracy, fluency, and lexis. Applied Linguistics, 30(4), 510-532.
Skehan, P. (2016). Tasks versus conditions: Two perspectives on task research and their implications for pedagogy. Annual Review of Applied Linguistics, 36, 34-49.
Skehan, P., & Foster, P. (1997). Task type and task processing conditions as influences on foreign language performance. Language Teaching Research, 1, 185-211.
Skehan, P., & Foster, P. (1999). The influence of task structure and processing conditions on narrative retellings. Language Learning, 49, 93-120.
Skehan, P., & Foster, P. (2005). Strategic and on-line planning: The influence of surprise information and task time on second language performance. In R. Ellis (Ed.), Planning and task performance in a second language (pp. 193-216). Amsterdam: John Benjamins.
Spolsky, B. (1985). The limits of authenticity in language testing. Language Testing, 2(1), 31–40.
Stiger, T. R., Kosinski, A. S., Barnhart, H. X., & Kleinbaum, D. G. (1998). ANOVA for repeated ordinal data with small sample size: A comparison of ANOVA, MANOVA, WLS and GEE methods by simulation. Communications in Statistics - Simulation and Computation, 27, 357–375.
Strauss, A. L., & Corbin, J. (1998). Basics of qualitative research: Techniques and procedures for developing grounded theory (2nd ed.). London, England: SAGE.
Swain, M. (1984). Large-scale communicative testing: A case study. In S. J. Savignon & M. Berns (Eds.), Initiatives in communicative language teaching (pp. 185-201). Reading, MA: Addison-Wesley.
Swain, M. (1993). The output hypothesis: Just speaking and writing aren't enough. Canadian Modern Language Review, 50(1), 158-164.
Tajima, M. (2003). The effects of planning on oral performance of Japanese as a foreign language. Unpublished doctoral dissertation, Purdue University, West Lafayette, IN.
Tavakoli, P., & Skehan, P. (2005). Strategic planning, task structure and performance testing. In R. Ellis (Ed.), Planning and task performance in a second language (pp. 239–273). Amsterdam: John Benjamins.
Tracy-Ventura, N., McManus, K., Norris, J., & Ortega, L. (2013). "Repeat as much as you can": Elicited imitation as a measure of oral proficiency in L2 French. In P. Leclercq, H. Hilton, & A. Edmonds (Eds.), Proficiency assessment issues in SLA research: Measures and practices. Clevedon, UK: Multilingual Matters.
Trofimovich, P., & Baker, W. (2006). Learning second language suprasegmentals: Effect of L2 experience on prosody and fluency characteristics of L2 speech. Studies in Second Language Acquisition, 28, 1-30.
Van Gorp, K., & Deygers, B. (2013). Task-based language assessment. In A. J. Kunnan (Ed.), The companion to language assessment: Vol. 2. Approaches and development (pp. 578-593). Oxford, UK: Wiley-Blackwell.
Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing, 29(3), 325-344.
VanPatten, B. (1990). Attending to form and content in the input: An experiment in consciousness. Studies in Second Language Acquisition, 12, 287-301.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287.
Weigle, S. C. (2004). Integrating reading and writing in a competency test for non-native speakers of English. Assessing Writing, 9, 28–47.
Wendel, J. (1997). Planning and second language narrative production. Unpublished doctoral dissertation, Temple University, Japan.
West, D. E. (2012). Elicited imitation as a measure of morphemic accuracy: Evidence from L2 Spanish. Language and Cognition, 4(3), 203–222.
Wigglesworth, G. (1997). An investigation of planning time and proficiency level on oral test discourse. Language Testing, 14(1), 21-44.
Wigglesworth, G. (2001). Influences on performance in task-based oral assessments. In M. Bygate, P. Skehan, & M. Swain (Eds.), Researching pedagogic tasks, second language learning, teaching and testing (pp. 186-209). London: Longman.
Wigglesworth, G., & Elder, C. (2010). An investigation of the effectiveness and validity of planning time in speaking test tasks. Language Assessment Quarterly, 7(1), 1-24.
Wigglesworth, G., & Frost, K. (2017). Task and performance-based assessment. In E. Shohamy, I. Or, & S. May (Eds.), Language testing and assessment (pp. 121-133). Springer.
Winke, P., Gass, S., & Myford, C. (2012). Raters' L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231–252.
Wolfe, E. W. (1997). The relationship between essay reading style and scoring proficiency in a psychometric scoring system. Assessing Writing, 4(1), 83-106.
Xi, X. (2005). Do visual chunks and planning impact performance on the graph description task in the SPEAK exam? Language Testing, 22, 463–508.
Yuan, F., & Ellis, R. (2003). The effects of pre-task and on-line planning on fluency, complexity and accuracy in L2 monologic oral production. Applied Linguistics, 24(1), 1–27.